Some statements can be deleted
x := x + 0
x := x * 1
Eliminating these instructions is a great optimization, since we remove an entire instruction.
Some statements can be simplified
x := x * 0 => x := 0
Maybe the instruction on the right is faster than the instruction on the left. More importantly, assigning a constant enables more cascading optimizations, as we will see later.
Another example:
y := y ** 2 => y := y * y
Typically, an architecture will not have an exponentiation instruction, and so this operator may have to be implemented by an exponentiation loop in software. For special cases (e.g., exponent = 2), we can simply replace it with multiplication.
Another example:
x := x * 8 => x := x << 3
x := x * 15 => t := x << 4; x := t - x
Replace multiplication by a power of two with a shift-left operation. This can also be done for some non-powers-of-two, as the second example shows. On some machines, left-shift is faster than multiply, but not on all! Modern machines have their own "rewriting logic" in hardware that can handle these special cases very fast. In other words, some of these local compiler optimizations may become less relevant on modern hardware.
All these transformations are examples of algebraic simplifications.
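As a concrete illustration, here is a minimal sketch of how a pass might apply such algebraic simplifications to one instruction at a time. The (dest, op, arg1, arg2) tuple encoding of three-address instructions is made up for this example:

def simplify(inst):
    """Return a simplified instruction, or None if the instruction can be deleted."""
    dest, op, a, b = inst
    # x := x + 0 and x := x * 1 are no-ops and can be deleted
    if dest == a and ((op == "+" and b == 0) or (op == "*" and b == 1)):
        return None
    # x := a * 0  =>  x := 0
    if op == "*" and b == 0:
        return (dest, "const", 0, None)
    # y := y ** 2  =>  y := y * y
    if op == "**" and b == 2:
        return (dest, "*", a, a)
    # x := a * 2^k  =>  x := a << k
    if op == "*" and isinstance(b, int) and b > 0 and b & (b - 1) == 0:
        return (dest, "<<", a, b.bit_length() - 1)
    return inst

print(simplify(("x", "+", "x", 0)))   # None: the instruction can be deleted
print(simplify(("x", "*", "x", 8)))   # ('x', '<<', 'x', 3)
print(simplify(("y", "**", "y", 2)))  # ('y', '*', 'y', 'y')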
Operations on constants can be computed at compile time
If x := y op z and both y and z are constants, then y op z can be computed at compile time.
Examples
x := 2 + 2 can be changed to x := 4.
if 2 < 0 jump L can be deleted.
if 2 > 0 jump L can be replaced by jump L.
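A minimal constant-folding sketch, using the same made-up (dest, op, arg1, arg2) encoding, where integer operands stand for constants; the same idea lets the compiler delete or shortcut conditional jumps whose condition folds to a constant:

def fold(inst):
    """Fold an operation whose operands are both compile-time constants."""
    dest, op, a, b = inst
    if isinstance(a, int) and isinstance(b, int):
        results = {"+": a + b, "-": a - b, "*": a * b, "<": int(a < b)}
        if op in results:
            # y op z is computed at compile time
            return (dest, "const", results[op], None)
    return inst

print(fold(("x", "+", 2, 2)))   # ('x', 'const', 4, None)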
Another important optimization: eliminate unreachable basic blocks (dead code):
Why would unreachable basic blocks occur?
if (DEBUG) then { .... }
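Eliminating such blocks is a reachability computation over the control-flow graph: keep only the blocks reachable from the entry. A minimal sketch, with a made-up CFG encoding (a dictionary mapping each block to its successor list):

def reachable_blocks(cfg, entry):
    """Return the set of blocks reachable from the entry block."""
    seen, worklist = set(), [entry]
    while worklist:
        block = worklist.pop()
        if block not in seen:
            seen.add(block)
            worklist.extend(cfg[block])
    return seen

cfg = {"entry": ["body"], "body": ["exit"], "debug": ["exit"], "exit": []}
print(reachable_blocks(cfg, "entry"))   # {'entry', 'body', 'exit'}: 'debug' is unreachable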
Some optimizations are simplified if each register occurs only once on the left-hand side of an assignment.
x := z + y
a := x
x := 2 * x
Change to:
b := z + y
a := b
x := 2 * b
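Here is one possible sketch of rewriting a block into this form; the tuple encoding and the fresh names t0, t1, ... are made up. The last definition of each register keeps its original name so that code after the block is unaffected:

from itertools import count

def to_single_assignment(block):
    fresh = (f"t{i}" for i in count())
    last_def = {}                          # register -> index of its last definition
    for i, (dest, *_) in enumerate(block):
        last_def[dest] = i
    current, out = {}, []                  # current[v] = name currently holding v's value
    for i, (dest, op, *args) in enumerate(block):
        args = [current.get(a, a) for a in args]              # rewrite uses
        new_dest = dest if last_def[dest] == i else next(fresh)
        current[dest] = new_dest
        out.append((new_dest, op, *args))
    return out

print(to_single_assignment([("x", "+", "z", "y"), ("a", "copy", "x"), ("x", "*", 2, "x")]))
# [('t0', '+', 'z', 'y'), ('a', 'copy', 't0'), ('x', '*', 2, 't0')]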
If x := y + z is the first use of x in a block:
x := y + z
....
....
w := y + z
Because each register is assigned only once in the block, we can be sure that the values of x, y, and z do not change between the two statements. Hence, we can replace the assignment to w as follows:
x := y + z
....
....
w := x
This optimization is called common-subexpression elimination. It is another very common and consequential compiler optimization.
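A minimal sketch of common-subexpression elimination within one block, assuming the block is already in single-assignment form (so a recorded right-hand side cannot be invalidated by a later redefinition); the tuple encoding is again made up:

def cse(block):
    seen, out = {}, []                 # seen[(op, arg1, arg2)] = register holding that value
    for dest, op, a, b in block:
        key = (op, a, b)
        if key in seen:
            out.append((dest, "copy", seen[key], None))   # w := y + z  =>  w := x
        else:
            seen[key] = dest
            out.append((dest, op, a, b))
    return out

print(cse([("x", "+", "y", "z"), ("u", "*", "y", 2), ("w", "+", "y", "z")]))
# [('x', '+', 'y', 'z'), ('u', '*', 'y', 2), ('w', 'copy', 'x', None)]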
If we see that w := x appears in a block, replace subsequent uses of w with uses of x.
b := z + y
a := b
x := 2 * a
This can be replaced with:
b := z + y
a := b
x := 2 * b
This is called copy propagation. This is useful for enabling other optimizations.
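A sketch of copy propagation over a single-assignment block, using the same made-up encoding with "copy" standing for w := x:

def copy_propagate(block):
    copies, out = {}, []               # copies[w] = x for each copy w := x seen so far
    for dest, op, a, b in block:
        a = copies.get(a, a)           # replace uses of w with uses of x
        b = copies.get(b, b)
        if op == "copy":
            copies[dest] = a
        out.append((dest, op, a, b))
    return out

print(copy_propagate([("b", "+", "z", "y"), ("a", "copy", "b", None), ("x", "*", 2, "a")]))
# [('b', '+', 'z', 'y'), ('a', 'copy', 'b', None), ('x', '*', 2, 'b')]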
Example:
a := 5
x := 2 * a
y := x + 6
t := x * y
gets transformed to:
a := 5
x := 10
y := 16
t := x << 4
We could have also assigned 160 to t. That would be even better.
If w := rhs appears in a basic block and w does not appear anywhere else in the program, then the statement w := rhs is dead and can be eliminated.
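A sketch of dead-code elimination for one block. Here "w does not appear anywhere else" is approximated by a backward liveness walk; live_out (the registers that may still be read after the block) is assumed to come from a separate analysis, and instructions are assumed to have no side effects other than writing their destination:

def eliminate_dead(block, live_out):
    live, out = set(live_out), []
    for dest, op, a, b in reversed(block):     # walk backwards, tracking liveness
        if dest not in live:
            continue                           # w := rhs is dead; drop it
        live.discard(dest)
        live.update(x for x in (a, b) if isinstance(x, str))
        out.append((dest, op, a, b))
    return list(reversed(out))

print(eliminate_dead([("a", "const", 5, None), ("w", "*", "a", 2), ("y", "+", "a", 6)],
                     live_out={"y"}))
# [('a', 'const', 5, None), ('y', '+', 'a', 6)]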
Why Static Single Assignment (SSA)?
Converting to SSA
The need for phi nodes: consider a diamond-shaped control-flow graph in which two different versions of the same variable (say y1 and y2) reach the same point. At the beginning of the meet point, we define y3 := phi(y1, y2).
More precisely, a phi function is needed for variable a at node z iff:
1. There are two nodes x and y, with y != x, each containing a definition of a.
2. There are nonempty paths Pxz from x to z and Pyz from y to z.
3. Pxz and Pyz do not have any node in common other than z.
4. z does not appear within both Pxz and Pyz prior to the end. It is ok if z appears in only one.
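This criterion can be checked directly on a small control-flow graph. The sketch below encodes the diamond example as a dictionary of successor lists and enumerates simple paths; the node names and the set of defining nodes are made up for illustration:

from itertools import permutations

cfg = {
    "entry": ["then", "else"],
    "then":  ["join"],
    "else":  ["join"],
    "join":  [],
}
defs_of_y = {"then", "else"}     # y1 is assigned in 'then', y2 in 'else'

def simple_paths(src, dst, graph):
    """All non-repeating paths with at least one edge from src to dst."""
    paths, stack = [], [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst and len(path) > 1:
            paths.append(path)
            continue
        for succ in graph[node]:
            if succ not in path:
                stack.append((succ, path + [succ]))
    return paths

def needs_phi(z, defs, graph):
    for x, y in permutations(defs, 2):               # two distinct defining nodes
        for pxz in simple_paths(x, z, graph):
            for pyz in simple_paths(y, z, graph):
                if set(pxz) & set(pyz) == {z}:       # the paths share only the join point z
                    return True
    return False

print(needs_phi("join", defs_of_y, cfg))   # True: insert y3 := phi(y1, y2) at 'join'
print(needs_phi("then", defs_of_y, cfg))   # False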
Example
a := x ** 2
b := 3
c := x
d := c * c
e := b * 2
f := a + d
g := e * f
Final form
a := x * x
f := a + a
g := 6 * f
It is possible to simplify further, but the compiler may get stuck in a "local minimum". E.g., had it changed a + a to 2 * a, it would have had a better optimization opportunity (rewriting g as 12 * a and eliminating the computation of f). This is also called the phase-ordering problem.
i1; i2; i3; ..; in --> j1; j2; ..; jm
This is the most basic shape of a pattern, where exact instructions are used to specify the pattern (and its replacement), and there is no precondition. Here is an example of a peephole optimization of this shape:
move $a $b; move $b $a --> move $a $b (the first operand is the destination)
In this case, the names of the registers are abstracted using the identifiers a and b, i.e., a could match any register name r1, r2, ...
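Here is a sketch of how a compiler might scan an instruction list and apply this particular rule; the ("move", dest, src) tuple encoding and the register names are made up for the example:

def peephole_move_move(insts):
    out, i = [], 0
    while i < len(insts):
        if (i + 1 < len(insts)
                and insts[i][0] == insts[i + 1][0] == "move"
                and insts[i][1] == insts[i + 1][2]      # $a matches
                and insts[i][2] == insts[i + 1][1]):    # $b matches
            out.append(insts[i])                        # keep only move $a $b
            i += 2
        else:
            out.append(insts[i])
            i += 1
    return out

prog = [("move", "r1", "r2"), ("move", "r2", "r1"), ("add", "r3", "r1", "r2")]
print(peephole_move_move(prog))
# [('move', 'r1', 'r2'), ('add', 'r3', 'r1', 'r2')]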
It may be desirable to generalize a so that it represents either a register or a memory location. But that would make it harder to ensure that the peephole optimization rules are correct, and more importantly, make it harder for the compiler to apply the peephole optimization --- complex patterns are harder to match. A preferable alternative is to use multiple peephole optimizations, e.g., an optimization that tackles memory accesses could be:
move 0($a) $b; move $b 0($a) --> move 0($a) $b
The rule above requires that the immediate offset in the memory address be zero. This can be generalized to:
move $i($a) $b; move $b $i($a) --> move $i($a) $b
where $i is an arbitrary immediate constant permitted by the move instruction with a memory operand. For example, the immediate operand may be constrained to be a 16-bit signed constant.
Some peephole optimizations are strictly more general than others, in which case, the general versions should be preferred. Consider the following example peephole optimization:
add $a $a 1; add $a $a 1 --> add $a $a 2
A more general version of the optimization above is:
add $b $a 1; add $b $b 1 --> add $b $a 2
Notice that it is possible for both a and b to pattern-match with the same register at application time.
Patterns for constants allow constant-folding optimizations to be specified as peephole optimizations, e.g.:
move $a $i; move $b $a --> move $a $i; move $b $i
This transformation removes a dependency on $a and could potentially enable dead-code elimination in the future.
In fact, we can generalize the notion of a peephole optimization so that it also includes a postcondition, which specifies whether a value can potentially be used in the future or not --- in other words, whether a value is live at the end of the pattern's instruction sequence:
post: $a is not live (cannot be used in the future)
move $a $i; move $b $a --> move $b $i
Such generalizations allow more nuanced postconditions, e.g., we could track liveness at bit granularity, i.e., whether bit b of register a is live or not. Here is an example of a peephole optimization that employs liveness at bit granularity:
post: bit 0 of $a is not live (cannot be used in the future)
addu $a $a $a; addiu $a $a 1 --> shl $a 1
The original sequence computes 2*$a + 1, while the shift computes 2*$a; the two results differ only in bit 0, so the replacement is valid only when bit 0 of $a is not used afterwards.
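Returning to the register-level rule above (post: $a is not live), such a postcondition check is easy to mechanize. Here is a sketch, where live_after is assumed to come from a separate liveness analysis and the ("move", dest, src) encoding is made up:

def apply_with_postcondition(first, second, live_after):
    """Apply  move $a $i; move $b $a --> move $b $i  when $a is not live afterwards."""
    op1, a, i = first
    op2, b, src2 = second
    if op1 == op2 == "move" and src2 == a and a not in live_after:
        return [("move", b, i)]                 # $a's value is never needed again
    return [first, second]                      # postcondition fails; leave unchanged

print(apply_with_postcondition(("move", "r1", 7), ("move", "r2", "r1"), live_after={"r2"}))
# [('move', 'r2', 7)]
print(apply_with_postcondition(("move", "r1", 7), ("move", "r2", "r1"), live_after={"r1", "r2"}))
# [('move', 'r1', 7), ('move', 'r2', 'r1')]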
In general, a postcondition can be made arbitrarily sophisticated, e.g., one could potentially want to specify that a transformation is legal if the value of $a+$b can be used in the future but the individual values of $a and $b are not required --- i.e., (1,2) and (3,0) are indistinguishable as the output values of $a,$b. However, such sophisticated postconditions become increasingly hard to ascertain for the compiler while deciding whether a peephole optimization is applicable. In practice, current systems either do not use postconditions, or restrict themselves to liveness (either for full registers or for individual bits).
Just like a peephole optimization can be generalized using a postcondition, a peephole optimization
may require a precondition. Consider the following example:
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
This transformation allows i and j to pattern-match with arbitrary 16-bit constants (assuming addiu allows 16-bit constants). Further, it allows the replacement to specify functions over the input constants, e.g., addition.
However, this optimization is incorrect as written: it is possible for i+j to overflow the bitwidth of an immediate constant (16 bits); thus, we need a precondition that the constants be such that i+j does not overflow.
precondition: i+j does not overflow
addiu $b $a $i; addiu $b $b $j --> addiu $b $a $(i+j)
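A sketch of this rule together with its precondition check; the ("addiu", dest, src, imm) encoding is made up, and the immediate range is assumed to be a signed 16-bit constant:

def fits_imm16(v):
    return -(1 << 15) <= v < (1 << 15)

def fold_addiu(first, second):
    op1, b1, a, i = first
    op2, b2, src2, j = second
    if (op1 == op2 == "addiu" and b1 == b2 == src2    # pattern matches
            and fits_imm16(i + j)):                   # precondition: i+j does not overflow
        return [("addiu", b1, a, i + j)]
    return [first, second]                            # rule not applicable

print(fold_addiu(("addiu", "r2", "r1", 4), ("addiu", "r2", "r2", 6)))
# [('addiu', 'r2', 'r1', 10)]
print(fold_addiu(("addiu", "r2", "r1", 30000), ("addiu", "r2", "r2", 10000)))
# [('addiu', 'r2', 'r1', 30000), ('addiu', 'r2', 'r2', 10000)]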
In general, a peephole optimization could use both a precondition and a postcondition, and can become arbitrarily sophisticated. In practice, the level of sophistication of the rules is mostly limited to a sequence of a few instructions in the source pattern and a sequence of a few instructions in the target replacement, with easily checkable specifications of preconditions and postconditions. Postconditions are usually limited to liveness.
A peephole optimization can be applied at any stage of the compilation pipeline. For example, while compiling a high-level framework like TensorFlow, there are multiple representations: the graph representation (which represents the program as a graph of operators), the MLIR affine dialect (which represents high-level operators like matrix multiplication and affine access operators that are amenable to polyhedral analysis), LLVM IR, MachineIR (MIR-x86), and x86 assembly. A peephole optimization can be specified at any of these levels. However, applying a peephole optimization at a higher level is fraught with the danger that it may preclude certain optimizations in later translation and optimization passes of the compiler pipeline --- a peephole pass may encode a large variety of rules, and a compiler does not typically analyze them to see whether applying one could preclude a future optimization. For this reason, peephole optimizations were originally proposed for the assembly level. Today, hundreds of peephole optimizations are used in LLVM, both at the IR level and at the assembly (MIR-x86) level.
Many (but not all) of the basic block optimizations can be cast as peephole optimizations:
addiu $a $b 0 --> move $a $b
move $a $a --> (deleted)
Together, these two eliminate addiu $a $a 0.
As with local optimizations, peephole optimizations must be applied repeatedly for maximum effect.
Many simple optimizations can still be applied on assembly language.
"Program optimiztion" is grossly misnamed