Data Reuse

From array access functions, we can derive reuse information. Reuse can be categorized into self reuse --- multiple iterations of the same static access touch the same data --- and group reuse --- two (potentially identical) iterations of different static accesses touch the same data.

Reuse is temporal if the exact same data is accessed, and spatial if data on the same cache line is accessed.

Example

float Z[n+2];   /* j+2 reaches index n+1, so n elements would overflow */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    Z[j+1] = (Z[j] + Z[j+1] + Z[j+2])/3;
  }
}
Here, each of Z[j], Z[j+1], and Z[j+2] has self-spatial reuse. Each also has self-temporal reuse, because the exact same elements are reused over and over again (once per iteration of the outer loop).
In addition, they have group reuse, both temporal and spatial.
Even though this loop makes 4*n^2 accesses, we only need to bring about n/c cache lines into the cache, where c is the number of array elements per cache line: we drop a factor of n due to self-temporal reuse across the outer loop, a factor of c due to spatial locality, and a factor of 4 due to group reuse among the four references.
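The arithmetic, dividing out one factor at a time:

4*n^2 / 4 = n^2   (group reuse among the four references)
n^2 / n   = n     (self-temporal reuse across the outer loop)
n / c     = n/c   (spatial reuse, c elements per cache line)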

We discuss how to use linear algebra to extract this reuse.

Self reuse

If the data referenced by a static access has k dimensions and the access is nested in a loop nest of depth d > k, then the same data can be reused n^(d-k) times, where n is the number of iterations of each loop. The dimensionality of an access corresponds to the rank of the coefficient matrix of the access, and we can find which iterations refer to the same location by computing the null space of that matrix.

Two iterations i and i' of the loop nest access the same element of an access <F,f> exactly when

Fi + f = Fi' + f

Hence,

F(i - i') = 0

Thus two iterations refer to the same element if and only if their difference i - i' belongs to the null space of F. The trivial solution i = i' is not meaningful; we are interested in the nonzero solutions. In particular, if F has full rank, the null space is trivial and the access has no self-temporal reuse.
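As a small worked instance (the array X and the loop bounds here are illustrative, not from the text above), consider a one-dimensional access in a two-deep nest:

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    X[i+j] = 0;   /* access <F,f> with F = [1 1], f = [0] */
  }
}

Here k = 1 and d = 2. F(i - i', j - j') = 0 reduces to (i - i') + (j - j') = 0, so the null space is spanned by (1, -1): iterations (i, j) and (i+1, j-1) touch the same element, and each element is reused about n = n^(d-k) times.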

Self-spatial reuse

Approximation: we consider two array elements to share a cache line if and only if they share the same row of a 2-D array; in general, if they differ only in the last dimension of a d-dimensional array. Along this last dimension the accesses are cheap anyway (a miss at most once every c accesses), so the approximation is reasonable.

Trick: drop the last row of the coefficient matrix F; if the rank of the resulting truncated matrix is less than the depth of the loop nest, then we can ensure spatial locality by making the innermost loop vary only the last coordinate of the array.

Example: A[3,2*i,7*i+j], nested in a loop nest over i and j. If we truncate the row for the last dimension, the rank of the truncated coefficient matrix is only 1, which is less than the loop depth of 2, and hence there is an opportunity for spatial reuse. In this case, if the inner loop iterates on j, then we exploit spatial reuse (but not if the inner loop iterates on i).
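A minimal sketch of the two orderings in C syntax (the bounds of A are chosen only to make the indices legal; row-major layout assumed):

float A[4][2*n][8*n];

/* Good: j innermost. Successive iterations vary only the last
   coordinate, walking along one row of A, so consecutive accesses
   often fall in the same cache line. */
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    A[3][2*i][7*i+j] = 0.0f;

/* Bad: i innermost. Successive iterations change the middle and last
   coordinates, jumping between rows, so there is no spatial reuse. */
for (j = 0; j < n; j++)
  for (i = 0; i < n; i++)
    A[3][2*i][7*i+j] = 0.0f;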

General rule for spatial reuse: in order for there to be spatial reuse, the vector [0, 0, ..., 0, 1] must be in the null space of the truncated matrix. The reason is that if this vector is in the null space, then when we fix all loop indices but the innermost one, all dynamic accesses during one run of the inner loop vary only in the last array index. If the array is stored in row-major order, these elements are all near one another, perhaps in the same cache line.
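Checking this rule on the example above: dropping the last row of the coefficient matrix of A[3,2*i,7*i+j] leaves

[0 0]
[2 0]

and

[0 0] [0]   [0]
[2 0] [1] = [0]

so [0, 1] (vary only the innermost index j) is in the null space of the truncated matrix, confirming that spatial reuse is available when j is the innermost loop.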

Group Reuse

We compute group reuse only across accesses that share the same coefficient matrix. If the coefficient matrices differ, the amount of reuse is typically too small to be meaningful.

Given two accesses <F,f1> and <F,f2>, reuse of the same data requires

F*i1 + f1 = F*i2 + f2
or
F(i1-i2) = f2-f1
If f2-f1 is not in the column space of F, there is no solution and hence no group reuse. Otherwise, if v is one solution to the equation, then all solutions can be represented as v+w, where w is in the null space of F.

Example:

for (i = 1; i <= n; i++) {
  for (j = 1; j <= n; j++) {
    Z[i,j] = Z[i-1,j];
  }
}
This loop has two array accesses, both of which have the same coefficient matrix
[1 0]
[0 1]
Each access exhibits self-spatial reuse --- if we erase the bottom row, we get [1 0], and [0 1] is in the null space of this matrix. There is no self-temporal reuse, since the identity matrix has full rank. But there is group temporal reuse; formally, ignoring the loop bounds, there is reuse if
[1 0] [i1]   [0]    [1 0] [i2]   [-1]
[0 1] [j1] + [0]  = [0 1] [j2] + [ 0]
We get j1 = j2 and i2 = i1 + 1; since the identity matrix has a trivial null space, this solution is unique. Notice, however, that the reuse distance is very high (n inner-loop iterations) because i is not the innermost loop index; the reuse only helps if an entire row of n elements fits in cache.
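As a sketch of shortening this reuse distance (the interchange is legal here: the dependence distance (1,0) remains lexicographically positive after swapping the loops), we can make i the innermost loop:

for (j = 1; j <= n; j++) {
  for (i = 1; i <= n; i++) {
    Z[i,j] = Z[i-1,j];   /* the value written here is read on the very next i iteration */
  }
}

Now the group temporal reuse distance is a single iteration, but the self-spatial reuse along rows is lost (consecutive inner iterations touch different rows of Z). Which version wins depends on the relative significance of the two kinds of reuse and on the cache size.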

Scalar/Loop Pass Interaction

Example:
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    A[i] += i*B[j];
  }
}
After loop-invariant code motion (LICM), in particular register promotion, this becomes:
for (i = 0; i < n; i++) {
  reg = A[i];
  for (j = 0; j < m; j++) {
    reg += i*B[j];
  }
  A[i] = reg;
}

However, the loop interchange transformation cannot be applied to the latter (while it could be applied to the former): after register promotion the nest is no longer perfectly nested, since reg = A[i] and A[i] = reg sit between the two loop headers. If we apply loop interchange to the former, we get:

for (j = 0; j < m; j++) {
  for (i = 0; i < n; i++) {
    A[i] += i*B[j];
  }
}
Performing global value numbering (GVN), in particular load partial redundancy elimination (LoadPRE), on this --- the load of B[j] is now invariant in the inner loop --- we get:
for (j = 0; j < m; j++) {
  tmp = B[j];
  for (i = 0; i < n; i++) {
    A[i] += i*tmp;
  }
}