Affine loop transformations through a polyhedral framework

Focus on techniques for optimizing the class of numerical applications that use arrays as data structures and access them with simple regular patterns, i.e., affine array accesses with respect to surrounding loop indices.
Affine function (or linear function) is a function of one or more variables x1, .., xn, that can be expressed as a sum of a constant plus constant multiples of the variables: sum_i (c_i * x_i) + c0
A simple example:
```
int A[200];
for (i = 0; i < 100; i++) {
  A[2*i] = 4;
}
```
Here the array address is an affine function of the loop iteration variable i.
- Can perform each iteration in parallel with other iterations in this example
- If there was another access A[j] = 0, then we cannot parallelize this loop because we cannot say if A[2*i] may be the same location (may alias) with A[j]. We have to make a conservative assumption; notice that A[j] is not an affine function of the loop iteration variables.
- Similarly, if our body consists of:
```
A[2*i] = (A[2*i - 1] * 31) % 23;
```
  We cannot parallelize this loop either (even though all accesses are affine) because there are data-dependencies across different iterations of the loop.

We use linear algebra and integer linear programming techniques to reason about these things and optimize for parallelism and locality.

We assume in this discussion that all loop iteration variables have a lower and an upper bound. Also, we assume that the iteration variables increment by 1 in each iteration (if the variable iterates by more than 1, we can introduce a new iteration variable, and write the old iteration variable in terms of the new variable, such that we satisfy this property). The space of the loop iteration variables can be represented as a convex polyhedron (in d-dimensional space, where d is the loop-nest depth and the number of loop iteration variables). We can reason about the relationships between different iterations by reasoning about relationships between points in the polyhedral space.

Two categories of transformations: affine and blocking.

Affine partitioning splits up the polyhedra of iterations into components that can be potentially executed in parallel, or in general can be reordered with respect to each other.
Blocking creates a hierarchy of iterations, e.g., blocking in our matrix multiplication example above.

Linear algebra techniques help determine best affine partitions and blocking.

Loop-level parallelism example:

for (i = 0; i < n; i++) {
  Z[i] = X[i] * 2;
}

Each iteration is logically independent of other iterations. If there are M processors, each processor p can execute the following code:

b = ceil(n/M);
for (i = b*p; i > min(n,b*(p+1)); i++) {
  Z[i] = X[i] * 2;
}

This type of a program is also called SPMD (single program multiple data) program, i.e., the same code is executed by all the processors but it is parameterized by an identifier (p) unique to each processor. Typically, one processor, known as the master executes all the serial part of the computation, including synchronizing with all the other processors (including itself). In this example, we need barrier synchronization.

Affine Transform Theory

Three spaces

Iteration space : set of dynamic execution instances in a computation, i.e., the set of combinations of values taken on by loop indices.
Data space: set of array elements accessed.
Processor space: set of processors in the system, each assigned a unique id.

Iteration spaces

Often, iteration spaces are rectangular, e.g., matrix multiply. For d-depth loop next, each iteration variable has a lower bound and upper bound (potentially as functions of iteration variables at lower depths); we can identify these bounds, and represent the space as a convex polyhedron. Convex polyhedron: if two points are in the polyhedron, then all points on the line connecting the two points are also in the polyhedron. Example:

for (i = 0; i < 9; i++) {
  for (j = i; j < 7 && j < i + 4; j++) {
    A[i,j] = 0; //(0,0),(0,1),(0,2),(0,3),(1,1),(1,2),(1,3),(1,4),(2,2),(2,3),(2,4),(2,5),(3,3),(3,4),(3,5),(3,6),(4,4),(4,5),(4,6),(5,5),(5,6),(6,6)
  }
}

Draw the polygon that represents the iteration space.

Execution order of loop nests: A sequential execution of the loop nest executes the points in the space in ascending lexicographic order. A vector i=[i1,..in] is < i'=[i'1,..i'n'] iff \exists{m} s.t. m >= 0 and m<min(n,n') and [i1..im]=[i'1..i'm] and i_(m+1)<i'_(m+1).

Matrix formulation: The iteration space can be formulated in matrix form as:

{i \in Z^d | Bi + b >= 0}

B is a dxd integer matrix
b is an integer vctor of length d
0 is a vector of d 0s

Loop invariant variables (aka symbolic constants). Example:

for (i = 0; i <= n; i++) {
  ...
}

We need to consider n as another loop iteration variable:

[ -1 1 ] [ i ]  >= [ 0 ]
[ 1  0 ] [ n ]        [ 0 ]

Notice that while n is a loop iteration variable, its only constraint is that it is greater than or equal to i.

This representation of the iteration space has no notion of the order in which the iteration space needs to be sweeped. If we can determine that the loop iterations are independent (for example), we can completely alter the order of sweeping, e.g., by swapping the inner and outer loops of the previous example:

for (j = 0; j <= 6; i++) {
  for (i = max(j-3, 0); i <= j; j++) {
    A[i,j] = 0; //(0,0),(0,1),(0,2),(0,3),(1,1),(1,2),(1,3),(1,4),(2,2),(2,3),(2,4),(2,5),(3,3),(3,4),(3,5),(3,6),(4,4),(4,5),(4,6),(5,5),(5,6),(6,6)
  }
}

To do this, we need to first identify the potential values that can be taken by j, and then identify the potential values that can be taken by i for a given value of j. The is done by projecting the convex polyhedron onto the outer dimension of the space. e.g., the projection of the 2D space in the example above onto the j axis is the line segment from 0 to 7. When we project a 3-dimensional polyhedron to a 2D space, we get a polygon in the 2D space. The concept of projection is quite useful in many of our analyses, including loop bound generation.

Fourier-Motzkin elimination algorithm:

Input: A polyhedron S with variables x1..xn, specified through a set of linear constraints (conjunction). One given variable xm is to be eliminated.
Output: A polyhedron S' with all variables except xm that is the projection of S on dimensions other than mth.
Algorithm:
- For every pair of lower bound and upper bound on xm in C such that:
```
L <= c1*xm
c2*xm <= U
```
  create the new constraint:
```
c2*L < c1*U
```
- S' is the set of constraints S-C, plus all the constraints generated in the step above.

Discuss the example above with this algorithm.

Loop bounds generation: All the inequalities involving the innermost loop index variables are written as the variable's lower and upper bounds. Then project away the dimension representing the innermost loop. In the example above (where i is the outer loop index), by considering i as the innermost loop index, we get:

Li = j-3, 0
Ui = j
Lj = 0
Uj = 7

Now we can visit the iteration space either "horizontally" (lexicographic in (i, j)) or "vertically" (lexicographic in (j, i)).

Changing axes: we can also define a new variable k = j - i and sweep the iteration space in lexicographic order with respect to k and j.

j-k >= j-3
j-k >= 0
j-k <= j
j >= 0
j <= 7

This can be simplified to:

k <= 3,j
k >= 0
j >= 0
j <= 7

This can easily be expressed with j as the outer loop index. If we now want to make k as the outer loop index, we can first specify j's bounds in terms of k, and then project out j:

Lj = 0,k
Uj = 7
Lk = 0
Uk = 3

We get:

for (k = 0; k < 3; k++) {
  for (j = k; j <= 7; j++) {
    A[j-k,j] = 0;
  }
}

Affine array accesses

Array access in a loop is affine if

The bounds of the loop are expressed as affine expressions of the surrounding loop variables and symbolic constants
The index for each dimension of the array is also an affine expression of surrounding loop variables and symbolic constants

Formally, we represent an array access in a loop nest that uses a vector of index variables i by the four-tuple T = <F,f,B,b>; it maps a vector i within the bounds

Bi + b > 0

to the array element location

Fi + f

What are F,f (variables i,j) for

X[i-1] : F = [1 0], f = [-1 0]^T (transpose)
Y[i,j]: two accesses. F =
```
[1 0]
[0 1]
```
f = [0 0]^T
Y[j,j+1]
Y[1,2]
Z[1,i,2*i+j]

In linearized-array representation, each multi-dimensional array access is represented in linearized form, e.g. A[i,j] becomes A[n*i+j] which is non-affine (quadratic). To deal with these cases, we can convert the linear access into a multidimensional access if every access can be decomposed into separate dimensions with the guarantee that none of the components exceeds its bound.