Performance is always with respect to a program. e.g., program 1 runs faster on computer A than computer B. program 2 runs faster on computer B than computer A.

Time equation: Time = num-instns/program * num-cycles/instn * seconds/cycle

Performance equation: Speed = program/num-instns * instn/num-cycles * cycle/seconds

num-instns/program depends on the programmer and on the compiler

seconds/cycle is the frequency and depends on the physical designer of the processor. It is also possible to increase frequency by making the cycles/instruction bigger, that is the operation performed in each cycle smaller (and making the pipeline longer). But we are limited by the slowest operation. Typically frequency is a direct function of transistor technology (out of scope for this course). Further, increasing frequency increases power by a cubic factor. Similarly the thermal power dissipation increases proportionally to consumed power. Shorter pipeline stages would also require more forwarding logic

cycles/instruction depends on the architecture, also called CPI (cycles per instruction), or more commonly its reciprocal IPC (instructions per cycle)

Inorder pipeline: IPC = 1 if there are no stalls. In general, CPI = CPI_ideal + stall_rate*stall_penalty

Increasing IPC:

  • In-order superscalar processor

    Out of order superscalar processor

    movl %r1,%r2
    addl %r2,%r3
    subl (%r3),%r4
    movl %r5,%r6
    subl $1,%r6
    addl %r7,(%r6)
    
    The last three instructions can execute in parallel with the first three instructions.

    Basic idea:

    Out of order superscalar processor basics

    Problem: how do we create a large pool of instructions in a program with branches? We need to be sure that the instructions are on the correct path.

    for (i = 1; i < m; i++) {
    	for (j = 1; j < i; j ++ ) {
    		if (j % 2 == 0) continue;
    		....
    	}
    }
    

    Typically: one in five instructions is a branch. Solution: predict the direction of the branches, and their targets. (will discuss more later).

    Dependencies

    Inorder processors respect program order, so they automatically respect all data and control dependencies. Out of order processors do not respect program order, and hence need to ensure that they respect data and control dependencies.

    Output and anti-dependencies can be handled by creating extra "registers" on the fly, also called register renaming

    Slowly recycle these extra registers (if you are sure that they are not used; or commit them to their original register). Need to have a limit on the number of these extra virtual registers. If there are enough virtual registers, all WAW and WAR hazards are eliminated; but RAW still remain.

    Semantics vs. implementation:

    Consider the following example:
    add ...
    div ... //causes an exception
    sub ...
    
    Consider another example where an external interrupt is received. What are some valid responses?

    Precise exceptions

    Branch prediction: general structure is a function that takes the PC as input and returns a bool (taken/not-taken) as output. For indirect branches, it also predicts the branch target. Basic idea: past = future