Some statements can be deleted
x := x + 0
x := x * 1
Eliminating these instructions is a great optimization, since we remove an entire instruction.
Some statements can be simplified
x := x * 0 => x := 0
Maybe the instruction on the right is faster than the instruction on the left. More importantly, assigning a constant enables more cascading optimizations, as we will see later.
Another example:
y := y ** 2 => y := y * y
Typically, an architecture will not have an exponentiation instruction, and so this operator may have to be implemented by an exponentiation loop in software. For special cases (e.g., exponent = 2), we can simply replace it with multiplication.
Another example:
x := x * 8 => x := x << 3
x := x * 15 => t := x << 4; x := t - x
Replace multiplication by a power of two with a shift-left operation. This can also be done for some non-powers-of-two, as the second example shows. On some machines, left-shift is faster than multiply, but not on all! Modern machines have their own "rewriting logic" in hardware that can handle these special cases very fast. In other words, some of these local compiler optimizations may become less relevant on modern hardware.
All these transformations are examples of algebraic simplifications.
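As a concrete illustration, here is a minimal sketch of how a pass might apply such algebraic simplifications to one instruction at a time. The (dest, op, arg1, arg2) tuple encoding of three-address instructions is made up for this example:

def simplify(inst):
    """Return a simplified instruction, or None if the instruction can be deleted."""
    dest, op, a, b = inst
    # x := x + 0 and x := x * 1 are no-ops and can be deleted
    if dest == a and ((op == "+" and b == 0) or (op == "*" and b == 1)):
        return None
    # x := a * 0  =>  x := 0
    if op == "*" and b == 0:
        return (dest, "const", 0, None)
    # y := y ** 2  =>  y := y * y
    if op == "**" and b == 2:
        return (dest, "*", a, a)
    # x := a * 2^k  =>  x := a << k
    if op == "*" and isinstance(b, int) and b > 0 and b & (b - 1) == 0:
        return (dest, "<<", a, b.bit_length() - 1)
    return inst

print(simplify(("x", "+", "x", 0)))   # None: the instruction can be deleted
print(simplify(("x", "*", "x", 8)))   # ('x', '<<', 'x', 3)
print(simplify(("y", "**", "y", 2)))  # ('y', '*', 'y', 'y')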
Operations on constants can be computed at compile time
If x := y op z and both y and z are constants, then y op z can be computed at compile time.
Examples
x := 2 + 2 can be changed to x := 4.
if 2 < 0 jump L can be deleted.
if 2 > 0 jump L can be replaced by jump L.
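A minimal constant-folding sketch, using the same made-up (dest, op, arg1, arg2) encoding, where integer operands stand for constants; the same idea lets the compiler delete or shortcut conditional jumps whose condition folds to a constant:

def fold(inst):
    """Fold an operation whose operands are both compile-time constants."""
    dest, op, a, b = inst
    if isinstance(a, int) and isinstance(b, int):
        results = {"+": a + b, "-": a - b, "*": a * b, "<": int(a < b)}
        if op in results:
            # y op z is computed at compile time
            return (dest, "const", results[op], None)
    return inst

print(fold(("x", "+", 2, 2)))   # ('x', 'const', 4, None)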
Another important optimization: eliminate unreachable basic blocks (dead code):
Why would unreachable basic blocks occur?
if (DEBUG) then { .... }
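Eliminating such blocks is a reachability computation over the control-flow graph: keep only the blocks reachable from the entry. A minimal sketch, with a made-up CFG encoding (a dictionary mapping each block to its successor list):

def reachable_blocks(cfg, entry):
    """Return the set of blocks reachable from the entry block."""
    seen, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block not in seen:
            seen.add(block)
            worklist.extend(cfg[block])
    return seen

cfg = {"entry": ["body"], "body": ["exit"], "debug": ["exit"], "exit": []}
print(reachable_blocks(cfg, "entry"))   # {'entry', 'body', 'exit'}: 'debug' is unreachable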
Some optimizations are simplified if each register occurs only once on the left-hand side of an assignment.
x := z + y
a := x
x := 2 * x
Change to:
b := z + y
a := b
x := 2 * b
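Here is one possible sketch of rewriting a block into this form; the tuple encoding and the fresh names t0, t1, ... are made up. The last definition of each register keeps its original name so that code after the block is unaffected:

from itertools import count

def to_single_assignment(block):
    fresh = (f"t{i}" for i in count())
    last_def = {}                          # register -> index of its last definition
    for i, (dest, *_) in enumerate(block):
        last_def[dest] = i
    current, out = {}, []                  # current[v] = name currently holding v's value
    for i, (dest, op, *args) in enumerate(block):
        args = [current.get(a, a) for a in args]              # rewrite uses
        new_dest = dest if last_def[dest] == i else next(fresh)
        current[dest] = new_dest
        out.append((new_dest, op, *args))
    return out

print(to_single_assignment([("x", "+", "z", "y"), ("a", "copy", "x"), ("x", "*", 2, "x")]))
# [('t0', '+', 'z', 'y'), ('a', 'copy', 't0'), ('x', '*', 2, 't0')]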
If x := y + z is the first use of x in a block:
x := y + z
....
....
w := y + z
Because each register is assigned only once in the block, we can be sure that the values of x, y, and z do not change between the two statements. Hence, we can replace the assignment to w as follows:
x := y + z
....
....
w := x
This optimization is called common-subexpression elimination. It is another very common and consequential compiler optimization.
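A minimal sketch of common-subexpression elimination within one block, assuming the block is already in single-assignment form (so a recorded right-hand side cannot be invalidated by a later redefinition); the tuple encoding is again made up:

def cse(block):
    seen, out = {}, []                 # seen[(op, arg1, arg2)] = register holding that value
    for dest, op, a, b in block:
        key = (op, a, b)
        if key in seen:
            out.append((dest, "copy", seen[key], None))   # w := y + z  =>  w := x
        else:
            seen[key] = dest
            out.append((dest, op, a, b))
    return out

print(cse([("x", "+", "y", "z"), ("u", "*", "y", 2), ("w", "+", "y", "z")]))
# [('x', '+', 'y', 'z'), ('u', '*', 'y', 2), ('w', 'copy', 'x', None)]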
If we see that w := x appears in a block, replace subsequent uses of w with uses of x.
b := z + y
a := b
x := 2 * a
This can be replaced with:
b := z + y
a := b
x := 2 * b
This is called copy propagation. This is useful for enabling other optimizations.
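A sketch of copy propagation over a single-assignment block, using the same made-up encoding with "copy" standing for w := x:

def copy_propagate(block):
    copies, out = {}, []               # copies[w] = x for each copy w := x seen so far
    for dest, op, a, b in block:
        a = copies.get(a, a)           # replace uses of w with uses of x
        b = copies.get(b, b)
        if op == "copy":
            copies[dest] = a
        out.append((dest, op, a, b))
    return out

print(copy_propagate([("b", "+", "z", "y"), ("a", "copy", "b", None), ("x", "*", 2, "a")]))
# [('b', '+', 'z', 'y'), ('a', 'copy', 'b', None), ('x', '*', 2, 'b')]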
Example:
a := 5
x := 2 * a
y := x + 6
t := x * y
gets transformed to:
a := 5
x := 10
y := 16
t := x << 4
We could have also assigned 160 to t. That would be even better.
If w := rhs appears in a basic block and w does not appear anywhere else in the program, then the statement w := rhs is dead and can be eliminated.
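A sketch of dead-code elimination for one block. Here "w does not appear anywhere else" is approximated by a backward liveness walk; live_out (the registers that may still be read after the block) is assumed to come from a separate analysis, and instructions are assumed to have no side effects other than writing their destination:

def eliminate_dead(block, live_out):
    live, out = set(live_out), []
    for dest, op, a, b in reversed(block):     # walk backwards, tracking liveness
        if dest not in live:
            continue                           # w := rhs is dead; drop it
        live.discard(dest)
        live.update(x for x in (a, b) if isinstance(x, str))
        out.append((dest, op, a, b))
    return list(reversed(out))

print(eliminate_dead([("a", "const", 5, None), ("w", "*", "a", 2), ("y", "+", "a", 6)],
                     live_out={"y"}))
# [('a', 'const', 5, None), ('y', '+', 'a', 6)]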
Why Static Single Assignment (SSA)?
Converting to SSA
The need for phi nodes: consider a diamond-shaped control-flow graph in which two different versions of the same variable (say y1 and y2) reach the same point. At the beginning of the meet point, we define y3 := phi(y1, y2).
More precisely, a phi function is needed for variable a at node z iff:
1. There are two nodes x and y, with y != x, each containing a definition of a.
2. There are nonempty paths Pxz from x to z and Pyz from y to z.
3. Pxz and Pyz do not have any node in common other than z.
4. z does not appear within both Pxz and Pyz prior to the end. It is ok if z appears in only one.
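This criterion can be checked directly on a small control-flow graph. The sketch below encodes the diamond example as a dictionary of successor lists and enumerates simple paths; the node names and the set of defining nodes are made up for illustration:

from itertools import permutations

cfg = {
    "entry": ["then", "else"],
    "then":  ["join"],
    "else":  ["join"],
    "join":  [],
}
defs_of_y = {"then", "else"}     # y1 is assigned in 'then', y2 in 'else'

def simple_paths(src, dst, graph):
    """All non-repeating paths with at least one edge from src to dst."""
    paths, stack = [], [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst and len(path) > 1:
            paths.append(path)
            continue
        for succ in graph[node]:
            if succ not in path:
                stack.append((succ, path + [succ]))
    return paths

def needs_phi(z, defs, graph):
    for x, y in permutations(defs, 2):               # two distinct defining nodes
        for pxz in simple_paths(x, z, graph):
            for pyz in simple_paths(y, z, graph):
                if set(pxz) & set(pyz) == {z}:       # the paths share only the join point z
                    return True
    return False

print(needs_phi("join", defs_of_y, cfg))   # True: insert y3 := phi(y1, y2) at 'join'
print(needs_phi("then", defs_of_y, cfg))   # False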
Example
a := x ** 2
b := 3
c := x
d := c * c
e := b * 2
f := a + d
g := e * f
Final form
a := x * x
f := a + a
g := 6 * f
It is possible to simplify further, but the compiler may get stuck in a "local minimum". E.g., had it changed a + a to 2 * a, it would have had a better optimization opportunity (rewriting g as 12 * a and eliminating the computation of f). This is also called the phase-ordering problem.
i1; i2; i3; ..; in --> j1; j2; ..; jm
This is the most basic shape of a pattern, where exact instructions are used to specify the pattern (and its replacement), and there is no precondition. Here is an example of a peephole optimization of this shape:
move $a $b; move $b $a --> move $a $b (the first operand is the destination)
In this case, the names of the registers are abstracted using the identifiers a and b, i.e., a could match any register name r1, r2, ...
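Here is a sketch of how a compiler might scan an instruction list and apply this particular rule; the ("move", dest, src) tuple encoding and the register names are made up for the example:

def peephole_move_move(insts):
    out, i = [], 0
    while i < len(insts):
        if (i + 1 < len(insts)
                and insts[i][0] == insts[i + 1][0] == "move"
                and insts[i][1] == insts[i + 1][2]      # $a matches
                and insts[i][2] == insts[i + 1][1]):    # $b matches
            out.append(insts[i])                        # keep only move $a $b
            i += 2
        else:
            out.append(insts[i])
            i += 1
    return out

prog = [("move", "r1", "r2"), ("move", "r2", "r1"), ("add", "r3", "r1", "r2")]
print(peephole_move_move(prog))
# [('move', 'r1', 'r2'), ('add', 'r3', 'r1', 'r2')]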
It may be desirable to generalize a so that it represents either a register or a memory location. But that would make it harder to ensure that the peephole optimization rules are correct, and more importantly, make it harder for the compiler to apply the peephole optimization --- complex patterns are harder to match. A preferable alternative is to use multiple peephole optimizations, e.g., an optimization that tackles memory accesses could be:
move 0($a) $b; move $b 0($a) --> move 0($a) $b
The rule above requires that the immediate offset in the memory address be zero. This can be generalized to:
move $i($a) $b; move $b $i($a) --> move $i($a) $b
where $i is an arbitrary immediate constant permitted by the move instruction with a memory operand. For example, the immediate operand may be constrained to be a 16-bit signed constant.
Some peephole optimizations are strictly more general than others, in which case, the general versions should be preferred. Consider the following example peephole optimization:
add $a $a 1; add $a $a 1 --> add $a $a 2
A more general version of the optimization above is:
add $b $a 1; add $b $b 1 --> add $b $a 2
Notice that it is possible for both a and b to pattern-match with the same register at application time.
Patterns for constants allow constant-folding optimizations to be specified as peephole optimizations, e.g.:
move $a $i; move $b $a --> move $a $i; move $b $i
This transformation removes a dependency on $a and could potentially enable dead-code elimination in the future.
In fact, we can generalize the notion of a peephole optimization so that it also includes a postcondition, which specifies whether a value can potentially be used in the future or not --- in other words, whether a value is live at the end of the pattern's instruction sequence:
post: $a is not live (cannot be used in the future)
move $a $i; move $b $a --> move $b $i
Such generalizations allow more nuanced postconditions, e.g., we could track liveness at bit granularity, i.e., whether bit b of register a is live or not. Here is an example of a peephole optimization that employs liveness at bit granularity:
post: bit 0 of $a is not live (cannot be used in the future)
addu $a $a $a; addiu $a $a 1 --> shl $a 1
The original sequence computes 2*$a + 1, while the shift computes 2*$a; the two results differ only in bit 0, so the replacement is valid only when bit 0 of $a is not used afterwards.
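Returning to the register-level rule above (post: $a is not live), such a postcondition check is easy to mechanize. Here is a sketch, where live_after is assumed to come from a separate liveness analysis and the ("move", dest, src) encoding is made up:

def apply_with_postcondition(first, second, live_after):
    """Apply  move $a $i; move $b $a --> move $b $i  when $a is not live afterwards."""
    op1, a, i = first
    op2, b, src2 = second
    if op1 == op2 == "move" and src2 == a and a not in live_after:
        return [("move", b, i)]                 # $a's value is never needed again
    return [first, second]                      # postcondition fails; leave unchanged

print(apply_with_postcondition(("move", "r1", 7), ("move", "r2", "r1"), live_after={"r2"}))
# [('move', 'r2', 7)]
print(apply_with_postcondition(("move", "r1", 7), ("move", "r2", "r1"), live_after={"r1", "r2"}))
# [('move', 'r1', 7), ('move', 'r2', 'r1')]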
In general, a postcondition can be made arbitrarily sophisticated, e.g., one could potentially want to specify that a transformation is legal if the value of $a+$b can be used in the future but the individual values of $a and $b are not required --- i.e., (1,2) and (3,0) are indistinguishable as the output values of $a,$b. However, such sophisticated postconditions become increasingly hard to ascertain for the compiler while deciding whether a peephole optimization is applicable. In practice, current systems either do not use postconditions, or restrict themselves to liveness (either for full registers or for individual bits).
Just like a peephole optimization can be generalized using a postcondition, a peephole optimization
may require a precondition. Consider the following example:
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
This transformation allows i and j to pattern-match with arbitrary 16-bit constants (assuming addiu allows 16-bit constants). Further, it allows the replacement to specify functions over the input constants, e.g., addition.
However, this optimization is incorrect as written: it is possible for i+j to overflow the bitwidth of an immediate constant (16 bits); thus, we need a precondition that the constants be such that i+j does not overflow.
precondition: i+j does not overflow
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
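A sketch of this rule together with its precondition check; the ("addiu", dest, src, imm) encoding is made up, and the immediate range is assumed to be a signed 16-bit constant:

def fits_imm16(v):
    return -(1 << 15) <= v < (1 << 15)

def fold_addiu(first, second):
    op1, b1, a, i = first
    op2, b2, src2, j = second
    if (op1 == op2 == "addiu" and b1 == b2 == src2    # pattern matches
            and fits_imm16(i + j)):                   # precondition: i+j does not overflow
        return [("addiu", b1, a, i + j)]
    return [first, second]                            # rule not applicable

print(fold_addiu(("addiu", "r2", "r1", 4), ("addiu", "r2", "r2", 6)))
# [('addiu', 'r2', 'r1', 10)]
print(fold_addiu(("addiu", "r2", "r1", 30000), ("addiu", "r2", "r2", 10000)))
# [('addiu', 'r2', 'r1', 30000), ('addiu', 'r2', 'r2', 10000)]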
In general, a peephole optimization could use both a precondition and a postcondition, and can become arbitrarily sophisticated. In practice, the level of sophistication of the rules is mostly limited to a sequence of a few instructions in the source pattern and a sequence of a few instructions in the target replacement, with easily checkable specifications of preconditions and postconditions. Postconditions are usually limited to liveness.
A peephole optimization can be applied at any stage of the compilation pipeline. For example, while compiling a high-level framework like TensorFlow, there are multiple representations: the graph representation (which represents the program as a graph of operators), the MLIR affine dialect (which represents high-level operators like matrix multiplication and affine access operators that are amenable to polyhedral analysis), LLVM IR, MachineIR (MIR-x86), and x86 assembly. A peephole optimization can be specified at any of these levels. However, applying a peephole optimization at a higher level is fraught with the danger that it may preclude certain optimizations in later translation and optimization passes of the compiler pipeline --- a peephole pass may encode a large variety of rules, and a compiler does not typically analyze them to see whether applying one could preclude a future optimization. For this reason, peephole optimizations were originally proposed for the assembly level. Today, hundreds of peephole optimizations are used in LLVM, both at the IR level and at the assembly (MIR-x86) level.
Many (but not all) of the basic block optimizations can be cast as peephole optimizations:
addiu $a $b 0 --> move $a $b
move $a $a --> (deleted)
Together, these two eliminate addiu $a $a 0.
As with local optimizations, peephole optimizations must be applied repeatedly for maximum effect.
Many simple optimizations can still be applied on assembly language.
"Program optimiztion" is grossly misnamed