Local Optimization

Some statements can be deleted

x := x + 0
x := x * 1
Eliminating these instructions is a great optimization, as we remove an entire instruction.

Some statements can be simplified

x := x * 0  =>  x := 0
Maybe the instruction on the right is faster than the instruction on the left. More importantly, assigning a constant enables more cascading optimizations, as we will see later.

Another example:

y := y ** 2  => y := y * y
Typically, an architecture will not have an exponentiation instruction, so this operator may have to be implemented by a loop in software. For special cases (e.g., exponent = 2), we can simply replace exponentiation with multiplication.

Another example:

x := x * 8  => x := x << 3
x := x * 15  => t := x << 4; x := t - x
Replace multiplication by a power of two with a shift-left operation. This can also be done for non-powers-of-two, as in the second example. On some machines, a left shift is faster than a multiply, but not on all! Modern machines have their own "rewriting logic" in hardware that can handle these special cases very fast. In other words, some of these local compiler optimizations may become less relevant on modern hardware.

All these transformations are examples of algebraic simplifications.
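
To make this concrete, here is a minimal Python sketch of algebraic simplification, assuming a made-up tuple IR of the form (dst, op, src1, src2); the representation and the simplify function are illustrative, not from any real compiler:
# Minimal sketch of algebraic simplification on a tiny, made-up IR.
# An instruction is a tuple (dst, op, a, b); operands are variable names
# (strings) or integer constants.

def simplify(inst):
    """Return a simplified instruction, or None if it can be deleted."""
    dst, op, a, b = inst
    if op == "+" and b == 0 and a == dst:
        return None                          # x := x + 0  --> delete
    if op == "*" and b == 1 and a == dst:
        return None                          # x := x * 1  --> delete
    if op == "*" and b == 0:
        return (dst, "const", 0, None)       # x := y * 0  --> x := 0
    if op == "**" and b == 2:
        return (dst, "*", a, a)              # y := y ** 2 --> y := y * y
    if op == "*" and isinstance(b, int) and b > 0 and b & (b - 1) == 0:
        return (dst, "<<", a, b.bit_length() - 1)   # x := x * 8 --> x := x << 3
    return inst

if __name__ == "__main__":
    print(simplify(("x", "+", "x", 0)))    # None (instruction deleted)
    print(simplify(("x", "*", "x", 8)))    # ('x', '<<', 'x', 3)
    print(simplify(("y", "**", "y", 2)))   # ('y', '*', 'y', 'y')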

Operations on constants can be computed at compile time

Examples:
x := 2 + 2  =>  x := 4
y := 3 * 4  =>  y := 12

This class of optimizations is called constant folding. It is one of the most common and consequential optimizations performed by compilers.
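
Here is a minimal sketch of constant folding over the same kind of made-up tuple IR (dst, op, src1, src2); the representation is hypothetical:
# Minimal sketch of constant folding: if both operands are compile-time
# constants, evaluate the operation now and emit a constant assignment.

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "<<": operator.lshift}

def fold(inst):
    dst, op, a, b = inst
    if op in OPS and isinstance(a, int) and isinstance(b, int):
        return (dst, "const", OPS[op](a, b), None)
    return inst

if __name__ == "__main__":
    print(fold(("x", "+", 2, 2)))    # ('x', 'const', 4, None)
    print(fold(("y", "*", "a", 2)))  # unchanged: operand 'a' is not a constant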

Another important optimization: eliminate unreachable basic blocks (dead code):

Called dead-code elimination.

Why would unreachable basic blocks occur? For example, a branch condition may become constant after other optimizations (such as constant folding), or code guarded by a debugging flag may be disabled in a particular build.
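
A minimal sketch of unreachable-block elimination, assuming a hypothetical CFG representation (a dict mapping each block label to the labels of its successors); blocks not reachable from the entry are dropped:
# Minimal sketch of unreachable-basic-block elimination on a made-up CFG.

def remove_unreachable(cfg, entry):
    reachable, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        worklist.extend(cfg.get(block, []))
    return {label: succs for label, succs in cfg.items() if label in reachable}

if __name__ == "__main__":
    cfg = {"entry": ["L1"], "L1": ["exit"], "dead": ["L1"], "exit": []}
    print(remove_unreachable(cfg, "entry"))   # the 'dead' block is gone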

Some optimizations are simplified if each register occurs only once on the left-hand side of an assignment.

More complicated in general, due to loops

If a basic block in single-assignment form contains
x := y + z
....
....
w := y + z
then, because each register is assigned only once, we can be sure that the values of x, y, and z do not change between the two computations. Hence, we can replace the assignment to w as follows:
x := y + z
....
....
w := x
This optimization is called common-subexpression elimination. It is another very common and consequential compiler optimization.
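
A minimal sketch of local common-subexpression elimination, assuming a basic block in single-assignment form represented as a list of (dst, op, src1, src2) tuples (the same made-up IR as above):
# Minimal sketch of local CSE.  Because the block is in single-assignment
# form, operands are never redefined, so a previously computed expression
# remains available for the rest of the block.

def local_cse(block):
    available = {}          # (op, a, b) -> variable that already holds the value
    out = []
    for dst, op, a, b in block:
        key = (op, a, b)
        if key in available:
            out.append((dst, "copy", available[key], None))   # w := x
        else:
            available[key] = dst
            out.append((dst, op, a, b))
    return out

if __name__ == "__main__":
    block = [("x", "+", "y", "z"),
             ("u", "*", "y", 2),
             ("w", "+", "y", "z")]
    for inst in local_cse(block):
        print(inst)   # the last instruction becomes ('w', 'copy', 'x', None)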

If an assignment w := x appears in a block, we can replace subsequent uses of w with uses of x

Example:
b := z + y
a := b
x := 2 * a
This can be replaced with
b := z + y
a := b
x := 2 * b
This is called copy propagation. It is useful mainly for enabling other optimizations.
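
A minimal sketch of copy propagation within a basic block, again assuming a single-assignment block over the made-up tuple IR, where "copy" denotes a plain assignment a := b:
# Minimal sketch of copy propagation.  After seeing a := b, later uses of
# a are rewritten to b.  Chains of copies resolve because the source of a
# copy has already been rewritten when it is recorded.

def copy_propagate(block):
    copies = {}    # variable -> the variable it is a copy of
    out = []
    for dst, op, a, b in block:
        a = copies.get(a, a)     # rewrite operands through known copies
        b = copies.get(b, b)
        if op == "copy":
            copies[dst] = a
        out.append((dst, op, a, b))
    return out

if __name__ == "__main__":
    block = [("b", "+", "z", "y"),
             ("a", "copy", "b", None),
             ("x", "*", 2, "a")]
    for inst in copy_propagate(block):
        print(inst)   # the last instruction becomes ('x', '*', 2, 'b')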

Example:

a := 5
x := 2 * a
y := x + 6
t := x * y
gets transformed to
a := 5
x := 10
y := 16
t := x << 4
We could have also assigned 160 to t. That would be even better.

If w := rhs appears in a basic block and w is not used anywhere else in the program, then the statement w := rhs is dead and can be eliminated (assuming rhs has no side effects).

In our copy-propagation example, the assignment a := b becomes dead (if a is not used later) and can be eliminated.
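
A minimal sketch of dead-assignment elimination for a single basic block, using a backward liveness scan over the made-up tuple IR; it assumes assignments have no side effects, and live_out names the variables assumed to be live at block exit:
# Minimal sketch of dead-assignment elimination via a backward pass.

def eliminate_dead(block, live_out):
    live = set(live_out)
    kept = []
    for dst, op, a, b in reversed(block):
        if dst not in live:
            continue                     # result is never used: drop it
        live.discard(dst)
        live.update(x for x in (a, b) if isinstance(x, str))
        kept.append((dst, op, a, b))
    kept.reverse()
    return kept

if __name__ == "__main__":
    block = [("a", "const", 5, None),
             ("x", "const", 10, None),
             ("y", "const", 16, None),
             ("t", "<<", "x", 4)]
    for inst in eliminate_dead(block, live_out={"t"}):
        print(inst)   # a := 5 and y := 16 are dropped; x := 10 is kept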

Why Static Single Assignment (SSA)?

Converting to SSA

Show two acyclic examples: one straight-line and one with a diamond structure.

The need for phi nodes: show a diamond structure where two different versions of the same variable reach the same point. Define y3 := phi(y1, y2) at the beginning of the meet point.

Inserting PHI nodes: Path-convergence criterion: we need a Phi node for a variable a at node z iff: (1) there is a block x containing a definition of a, (2) there is a block y (y != x) containing a definition of a, (3) there are nonempty paths from x to z and from y to z, (4) the two paths have no node in common other than z, and (5) z does not appear in the interior of both paths (though it may appear in the interior of one of them).
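
Here is a minimal sketch of the renaming step for straight-line code (no control flow, so no phi nodes are needed), over the made-up tuple IR; a real SSA construction would also insert phi nodes at join points according to the criterion above:
# Minimal sketch of SSA renaming for straight-line code: each assignment
# creates a fresh version of the destination, and uses refer to the
# latest version.  Variables never assigned in the block (inputs) are
# left unversioned.

def to_ssa(block):
    version = {}                 # variable -> current version number
    def use(x):
        return f"{x}{version[x]}" if isinstance(x, str) and x in version else x
    out = []
    for dst, op, a, b in block:
        a, b = use(a), use(b)
        version[dst] = version.get(dst, 0) + 1
        out.append((f"{dst}{version[dst]}", op, a, b))
    return out

if __name__ == "__main__":
    block = [("x", "+", "y", "z"),
             ("x", "*", "x", 2),
             ("w", "+", "x", "y")]
    for inst in to_ssa(block):
        print(inst)   # x1 := y + z ; x2 := x1 * 2 ; w1 := x2 + y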

Example

a := x ** 2
b := 3
c := x
d := c * c
e := b * 2
f := a + d
g := e * f

Final form

a := x * x
f := a + a
g := 6 * f
It is possible to simplify further, but the compiler may get stuck in a "local minimum", e.g., had it rewritten a + a as 2 * a, it would have had a better optimization opportunity (folding the constants to get g := 12 * a and eliminating the computation of f). This is also called the phase-ordering problem.

Peephole optimization

An optimization can be expressed as a pattern-replacement rule, or a peephole optimization, which says that if a certain code pattern is seen in the input program, it may be replaced with the corresponding replacement code. In general, the pattern and the replacement can be arbitrary. However, a peephole optimization typically limits the shapes of patterns and replacements to ensure that it remains easy (and possible) to check for a pattern and compute a replacement. Here is an example of the shape of a pattern-replacement rule:
i1; i2; i3; ..; in --> j1; j2; ..; jm
This is the most basic shape of pattern, where exact instructions are used for specifying the pattern (and the replacement), and there is no precondition. Here is an example of a peephole optimization of this shape:
move $a $b; move $b $a --> move $a $b   (first operand is the destination)
In this case, the names of the registers are abstracted using identifiers a and b, i.e., a could match any register name r1, r2, ... It may be desirable to generalize a so that it represents either a register or a memory location. But that would make it harder to ensure that the peephole optimization rules are correct and, more importantly, harder for the compiler to apply the peephole optimization --- complex patterns are harder to match. A preferable alternative is to use multiple peephole optimizations, e.g., an optimization that tackles memory accesses could be:
move 0($a) $b; move $b 0($a) --> move 0($a) $b
The rule above requires that the immediate offset in the memory address must be zero. This can be generalized to:
move $i($a) $b; move $b $i($a) --> move $i($a) $b
where $i is an arbitrary immediate constant permitted by the move instruction with a memory operand. For example, the immediate operand may be constrained to be a 16-bit signed constant.
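
A minimal sketch of how a peephole pass might apply the first rule above (move $a $b; move $b $a --> move $a $b) on a sliding two-instruction window; the tuple encoding of instructions ("move", dst, src) is hypothetical:
# Minimal sketch of a peephole pass over a list of assembly-like
# instruction tuples.  A single pass is shown; for maximum effect such
# passes are applied repeatedly.

def peephole(code):
    out = []
    i = 0
    while i < len(code):
        if i + 1 < len(code):
            first, second = code[i], code[i + 1]
            if (first[0] == "move" and second[0] == "move"
                    and first[1] == second[2]      # $a matches
                    and first[2] == second[1]):    # $b matches
                out.append(first)                  # keep only move $a $b
                i += 2
                continue
        out.append(code[i])
        i += 1
    return out

if __name__ == "__main__":
    code = [("move", "r1", "r2"), ("move", "r2", "r1"), ("add", "r3", "r1", 4)]
    print(peephole(code))    # the second (redundant) move is removed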

Some peephole optimizations are strictly more general than others, in which case, the general versions should be preferred. Consider the following example peephole optimization:

add $a $a 1; add $a $a 1 --> add $a $a 2
A more general version of the optimization above is:
add $b $a 1; add $b $b 1 --> add $b $a 2
Notice that it is possible for both a and b to pattern match with the same register at application time.

Patterns over constants allow constant-folding optimizations to be specified as peephole optimizations, e.g.:

move $a $i; move $b $a --> move $a $i; move $b $i
This transformation removes a dependency on $a and could potentially enable dead-code elimination later. In fact, if we generalize the notion of a peephole optimization so that it also includes a postcondition --- a condition that specifies whether a value can potentially be used in the future, i.e., whether it is live at the end of the pattern's instruction sequence --- then the first move can be removed outright:
post: $a is not live (cannot be used in the future)
move $a $i; move $b $a --> move $b $i
Such generalizations can be used for more nuanced postconditions, e.g., we could track liveness at bit granularity, i.e., whether bit b of register $a is live or not. Here is an example of a peephole optimization that employs liveness at bit granularity:
post: bit 0 of $a is not live (cannot be used in the future)
add $a $a $a; add $a $a 1 --> shl $a 1
In general, a postcondition can be made arbitrarily sophisticated, e.g., one could specify that a transformation is legal if the value of $a+$b can be used in the future but the individual values of $a and $b are not required --- i.e., (1,2) and (3,0) are indistinguishable as the output values of $a,$b. However, such sophisticated postconditions become increasingly hard for the compiler to ascertain when deciding whether a peephole optimization is applicable. In practice, current systems either do not use postconditions, or restrict themselves to liveness (either for full registers or for individual bits). Just as a peephole optimization can be generalized using a postcondition, a peephole optimization may also require a precondition. Consider the following example:
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
This transformation allows i and j to pattern-match with arbitrary 16-bit constants (assuming addiu allows 16-bit constants). Further, it allows the replacement to specify functions over the input constants, e.g., addition. However, this optimization is incorrect as written, because i+j may overflow the bitwidth of an immediate constant (16 bits); thus, we need a precondition stating that i+j does not overflow.
precondition: i+j does not overflow
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
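
A minimal sketch of applying this rule with its precondition, assuming a hypothetical tuple encoding ("addiu", dst, src, imm) and 16-bit signed immediates:
# Minimal sketch of the addiu-combining rule with its overflow precondition.

def fits_in_16_bits_signed(value):
    return -(1 << 15) <= value < (1 << 15)

def combine_addiu(first, second):
    """Apply  addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
    if the pattern matches and the precondition holds; otherwise return None."""
    op1, b1, a, i = first
    op2, b2, src2, j = second
    if (op1 == op2 == "addiu" and b1 == b2 and src2 == b1
            and fits_in_16_bits_signed(i + j)):       # precondition: i+j fits
        return ("addiu", b1, a, i + j)
    return None

if __name__ == "__main__":
    print(combine_addiu(("addiu", "r2", "r1", 100), ("addiu", "r2", "r2", 23)))
    # -> ('addiu', 'r2', 'r1', 123)
    print(combine_addiu(("addiu", "r2", "r1", 30000), ("addiu", "r2", "r2", 30000)))
    # -> None: 60000 does not fit in a signed 16-bit immediate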

In general, a peephole optimization could use both a precondition and a postcondition, and could become quite sophisticated. In practice, rules are mostly limited to a sequence of a few instructions in the source pattern and a sequence of a few instructions in the target replacement, with easily checkable preconditions and postconditions. Postconditions are usually limited to liveness.

A peephole optimization can be applied at any stage of the compilation pipeline. For example, while compiling a high-level language like TensorFlow, there are multiple representations: a graph representation (that represents the program as a graph of operators), the MLIR affine dialect (that represents high-level operators like matrix multiplication and affine access operators that are amenable to polyhedral analysis), LLVM IR, MachineIR (MIR-x86), and x86 assembly. A peephole optimization can be specified at any of these levels. However, applying a peephole optimization at a higher level carries the danger that it may preclude certain optimizations in later translation and optimization passes of the compiler pipeline --- a peephole optimization may encode a variety of rules, and a compiler does not typically analyze them to see whether they could preclude future optimizations. For this reason, peephole optimizations were originally proposed for the assembly level. Today, hundreds of peephole optimizations are used in LLVM, both at the IR level and at the assembly (MIR-x86) level.

Many (but not all) of the basic block optimizations can be cast as peephole optimizations

As for local optimizations, peephole optimizations must be applied repeatedly for maximum effect

Many simple optimizations can still be applied on assembly language.

"Program optimiztion" is grossly misnamed