- Inorder pipeline: Fetch, Decode, Execute, Memory access, Register writeback.
- Hazards
- Structural hazards: two instructions contend for the same hardware resource, e.g., the ALU or a memory port (not an issue in the classic five-stage inorder pipeline, where each stage has its own dedicated resource)
- Data hazards: the consumer instruction must wait for the producer instruction to generate the data, e.g., a memory load followed by a dependent arithmetic instruction; even with pipelining and forwarding, the arithmetic instruction must stall until the load finishes its memory access (a short sketch follows this list)
- Control hazards: the processor does not know which path to fetch instructions from until the control instruction resolves; the next instruction cannot be fetched before then. Example: a conditional branch followed by an arithmetic instruction; the outcome of the branch is known only after the Execute stage, so two clock cycles are wasted (stalled) in this case
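A minimal C sketch of the load-use case described above (hypothetical variables): the pointer dereference becomes a load, and the next statement consumes the loaded value, so even with forwarding the dependent add must wait for the load's memory access.

#include <stdio.h>

int main(void) {
    int data = 41;
    int *p = &data;

    int x = *p;      /* memory load (value available only after the Memory stage) */
    int y = x + 1;   /* dependent arithmetic: even with forwarding, it stalls     */
                     /* until the loaded value arrives                            */

    printf("%d\n", y);
    return 0;
}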
Performance is always measured with respect to a program: e.g., program 1 may run faster on computer A than on computer B, while program 2 runs faster on computer B than on computer A.
Time equation: Time = (num-instns / program) * (cycles / instn) * (seconds / cycle)
Performance equation: Speed = 1 / Time = (program / num-instns) * (instns / cycle) * (cycles / second)
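Worked example with hypothetical numbers: a program of 10^9 instructions running at 1.5 cycles per instruction on a 2 GHz clock (0.5 ns per cycle) takes 10^9 * 1.5 * 0.5 ns = 0.75 s, i.e., a speed of about 1.33 such programs per second.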
num-instns/program depends on the programmer and on the compiler
seconds/cycle is the cycle time, the reciprocal of the clock frequency; it depends on the physical designer of the processor.
- It is possible to increase the frequency by making cycles/instruction bigger, that is, making the operation performed in each cycle smaller (and the pipeline longer). But the cycle time is limited by the slowest pipeline stage.
- Typically frequency is a direct function of transistor technology (out of scope for this course).
- Further, increasing frequency increases power by roughly a cubic factor, and the thermal power that must be dissipated grows in proportion to the consumed power.
- Shorter pipeline stages also require more forwarding logic.
- In the limit, we are constrained by the delay imposed by a latch that is needed to transfer the contents of one stage to another.
cycles/instruction depends on the architecture, also called CPI (cycles per instruction), or more commonly its reciprocal IPC (instructions per cycle)
Inorder pipeline: IPC = 1 if there are no stalls. In general, CPI = CPI_ideal + stall_rate * stall_penalty (a small worked sketch follows below)
- With an increasing number of pipeline stages, the stall penalty increases; the stall rate remains largely the same
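A small sketch, with made-up numbers, showing how the equations above combine: the effective CPI is computed from the ideal CPI, stall rate, and stall penalty, and then plugged into the time equation.

#include <stdio.h>

int main(void) {
    /* Hypothetical workload and machine parameters */
    double instructions  = 1e9;    /* num-instns / program                             */
    double cpi_ideal     = 1.0;    /* inorder pipeline with no stalls                  */
    double stall_rate    = 0.2;    /* fraction of instructions that stall              */
    double stall_penalty = 3.0;    /* cycles lost per stall; grows with pipeline depth */
    double cycle_time    = 0.5e-9; /* seconds / cycle (2 GHz clock)                    */

    double cpi  = cpi_ideal + stall_rate * stall_penalty; /* CPI = CPI_ideal + stall_rate * stall_penalty */
    double time = instructions * cpi * cycle_time;        /* Time = instns * cycles/instn * seconds/cycle */

    printf("effective CPI = %.2f, execution time = %.2f s\n", cpi, time);
    return 0;
}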
Increasing IPC:
- Issue more instructions per cycle: 2, 4, or 8 instructions
- Make it a superscalar processor: a processor that can execute multiple instructions per cycle (diagram: parallel five-stage pipelines)
- Different instructions may be executing on different resources
- The processor needs to schedule and track data and control dependencies.
In-order superscalar processor
- There can be dependencies between instructions.
- An n-issue processor needs O(n^2) forwarding paths.
- Complex logic for detecting dependencies, hazards and forwarding.
- However:
- Even with all this hardware, most of the parallel issue slots sit idle if the n consecutive instructions depend on one another.
- In fact, this is the common case: dependent instructions tend to be close to each other in the program (often consecutive).
Out of order superscalar processor
movl %r1,%r2        # chain 1: r2 <- r1
addl %r2,%r3        # chain 1: r3 += r2 (needs r2 from the previous instruction)
subl (%r3),%r4      # chain 1: r4 -= mem[r3] (needs r3)
movl %r5,%r6        # chain 2: r6 <- r5 (independent of chain 1)
subl $1,%r6         # chain 2: r6 -= 1
addl %r7,(%r6)      # chain 2: mem[r6] += r7
The last three instructions can execute in parallel with the first three instructions.
Basic idea:
- Create a pool of instructions
- Find instructions that are mutually independent and have all their operands ready.
- Execute them out of order
Out of order superscalar processor basics
- Fetch a pool of instructions into a buffer, known as the reorder buffer.
- The reorder buffer (ROB) size needs to be large enough that the desired number of mutually independent instructions can be found. Typical sizes: 64 to 192 entries. A larger ROB exposes more parallelism but needs more complex circuitry (a rough sketch follows below).
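A rough C sketch of a reorder buffer as a circular queue (field names and the size are illustrative, not taken from any particular design): instructions enter at the tail in fetch order and commit from the head in the same order.

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 128                /* illustrative; the notes cite typical sizes of 64-192 */

/* One in-flight instruction tracked by the ROB */
typedef struct {
    uint64_t pc;                    /* address of the instruction                */
    int      dest_reg;              /* destination register, or -1 if none       */
    uint64_t result;                /* value produced once executed              */
    bool     executed;              /* operands were ready and it has run        */
    bool     exception;             /* faulted; handled when it reaches the head */
} rob_entry_t;

typedef struct {
    rob_entry_t entries[ROB_SIZE];
    int head;                       /* oldest instruction (next to commit)       */
    int tail;                       /* slot for the next fetched instruction     */
    int count;                      /* number of in-flight instructions          */
} rob_t;

int main(void) {
    rob_t rob = {0};                /* empty ROB; scheduling/commit logic omitted */
    return rob.count;
}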
Problem: how do we create a large pool of instructions in a program with branches? We need to be sure that the instructions are on the correct path.
for (i = 1; i < m; i++) {              // loop test: one conditional branch per iteration
    for (j = 1; j < i; j++) {          // another conditional branch per iteration
        if (j % 2 == 0) continue;      // a third conditional branch
        ....
    }
}
Typically: one in five instructions is a branch. Solution: predict the direction of the branches, and their targets. (will discuss more later).
Dependencies
- RAW (read after write): true dependency
- WAW (write after write): output dependency
- WAR (write after read): anti-dependency (a short example of the three data dependencies follows this list)
- Control
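A tiny C example with hypothetical variables, labelling the three data-dependency types between statements.

#include <stdio.h>

int main(void) {
    int x = 1, y = 2, z = 3;

    int a = x + y;   /* (1) writes a                                               */
    int b = a + 1;   /* (2) RAW with (1): reads the value of a written by (1)      */
    a = z * 2;       /* (3) WAW with (1): both write a (output dependency);        */
                     /*     WAR with (2): (2) must read a before (3) overwrites it */

    printf("%d %d\n", a, b);
    return 0;
}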
Inorder processors respect program order, so they automatically respect all data and control dependencies. Out of order processors do not respect program order, and hence need to ensure that they respect data and control dependencies.
Output and anti-dependencies can be handled by creating extra "registers" on the fly, also called register renaming
- Every time there is a write, just create a new register; all reads will now read from this new register.
Recycle these extra registers gradually (once you are sure they are no longer in use, or after committing their values to the original architectural register). The number of such extra virtual registers must be limited. With enough virtual registers, all WAW and WAR hazards are eliminated, but RAW dependencies still remain.
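A simplified C sketch of renaming with a map table and a free list (sizes and function names are illustrative): every write gets a fresh physical register, so WAW and WAR hazards disappear, while RAW is preserved because sources read the latest mapping. Recycling of physical registers is omitted.

#include <stdio.h>

#define NUM_ARCH_REGS 16
#define NUM_PHYS_REGS 64        /* the limited pool of extra "virtual" registers */

static int map_table[NUM_ARCH_REGS];          /* arch reg -> current phys reg */
static int free_list[NUM_PHYS_REGS];
static int free_count;

/* Rename one instruction: sources read the current mapping,
   the destination gets a fresh physical register. */
static void rename(int src1, int src2, int dest,
                   int *psrc1, int *psrc2, int *pdest) {
    *psrc1 = map_table[src1];                 /* RAW preserved: read the latest mapping  */
    *psrc2 = map_table[src2];
    *pdest = free_list[--free_count];         /* fresh register removes WAW/WAR          */
    map_table[dest] = *pdest;                 /* later readers see the new name          */
}

int main(void) {
    for (int i = 0; i < NUM_ARCH_REGS; i++) map_table[i] = i;
    free_count = 0;
    for (int p = NUM_ARCH_REGS; p < NUM_PHYS_REGS; p++)
        free_list[free_count++] = p;

    int s1, s2, d;
    rename(1, 2, 3, &s1, &s2, &d);            /* e.g., r3 <- r1 + r2 */
    printf("r3 renamed to p%d\n", d);
    return 0;
}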
Semantics vs. implementation:
- The programmer should not care (from a correctness standpoint) whether the processor is inorder or OOO. The meaning of the program should remain identical
- What if there is an interrupt or an exception in the middle of execution?
Consider the following example:
add ...   // older instruction: must appear to have completed before the exception is reported
div ...   // causes an exception
sub ...   // younger instruction: must appear to have not executed at all
Consider another example where an external interrupt is received. What are some valid responses?
Precise exceptions: the processor must make it look as if every instruction older than the faulting (or interrupted) instruction has completed and no younger instruction has executed, so the out-of-order implementation stays consistent with the program's sequential semantics.
Branch prediction: the general structure is a function that takes the PC as input and returns a bool (taken / not taken) as output; for indirect branches, it also predicts the branch target. Basic idea: past behaviour predicts future behaviour (a small sketch follows this list).
- Remember the type of branch: unconditional, conditional, function call, function return.
- Make a fast hardware structure to remember (say 10x faster than actual execution): can give wrong answers but should have a high probability of giving the right answer.
- Example: index the structure using the last n bits of the PC value
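A minimal sketch of such a structure: a table of single-bit entries indexed by the last n bits of the PC, remembering each branch's last outcome (the table size and the one-bit-per-entry choice are illustrative, not from the notes).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define INDEX_BITS 10                           /* use the last n = 10 bits of the PC */
#define TABLE_SIZE (1u << INDEX_BITS)

/* One bit per entry: the last observed outcome of branches mapping here. */
static bool last_outcome[TABLE_SIZE];

static unsigned index_of(uint64_t pc) {
    return (unsigned)(pc & (TABLE_SIZE - 1));   /* last n bits of the PC */
}

/* Prediction: assume the branch does what it did last time (past = future). */
static bool predict(uint64_t pc) {
    return last_outcome[index_of(pc)];
}

/* After the branch resolves, remember the actual outcome. */
static void update(uint64_t pc, bool taken) {
    last_outcome[index_of(pc)] = taken;
}

int main(void) {
    uint64_t pc = 0x400a3c;                     /* hypothetical branch address */
    update(pc, true);                           /* the branch was taken once   */
    printf("prediction: %s\n", predict(pc) ? "taken" : "not taken");
    return 0;
}

Branches whose addresses share the same low-order bits alias to the same entry; real predictors use saturating counters and branch history to reduce such mispredictions, which is part of the "discuss more later" material.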