Reuse is temporal if the exact same data is accessed again, and spatial if the same cache line is accessed.
Example
float Z[n+2];   /* n+2 so that Z[j+2] stays in bounds */
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    Z[j+1] = (Z[j] + Z[j+1] + Z[j+2])/3;
  }
}

Here, each of Z[j], Z[j+1], and Z[j+2] has self-spatial reuse. Each also has self-temporal reuse, because the exact same elements are reused over and over again (once for each iteration of the outer loop).
We discuss how to use linear algebra to extract this reuse.
If the data referenced by a static access has k dimensions and the access is nested in a loop nest of depth d, for some d > k, then the same data can be reused on the order of n^(d-k) times, where n is the number of iterations in each loop. The dimensionality of an access is the rank of its coefficient matrix, and we can find which iterations refer to the same location by computing the null space of that matrix.
For an access <F,f>, two iterations i and i' refer to the same location when

F*i + f = F*i' + f

Hence,

F*(i - i') = 0

Thus two iterations refer to the same element if their difference i - i' belongs to the null space of F. The solution i = i' is trivial (not meaningful); we are interested in the other, non-zero solutions. If F has full rank (rank equal to the loop depth), the null space contains only the zero vector and there is no self-temporal reuse.
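As a concrete (hypothetical, not taken from the example above) instance of this test, consider a 1-dimensional access nested in a 2-deep loop:

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    A[i+j] += 1;   /* access <F,f> with F = [1 1] and f = [0] over the iteration vector (i,j) */
  }
}

Here rank(F) = 1 while the loop depth is d = 2, so each element of A is reused on the order of n^(2-1) = n times. The null space of F = [1 1] is spanned by (1,-1): iterations (i,j) and (i+1,j-1) differ by a null-space vector and therefore touch the same element A[i+j].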
Trick: drop the last row of the coefficient matrix F; if the rank of the resulting truncated matrix is less than the depth of the loop nest, then we can ensure spatial locality by making sure that the innermost loop varies only the last coordinate of the array.
Example: consider the access A[3,2*i,7*i+j]. If we truncate the last dimension (drop the last row of the coefficient matrix), the rank of the truncated matrix is only 1, and hence there is an opportunity for spatial reuse. In this case, if the inner loop iterates on j, then we exploit spatial reuse (but not if the inner loop iterates on i).
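Writing this example out, with rows ordered by array coordinate and columns by the loop indices (i, j), the coefficient matrix and its truncation are

F =
[0 0]
[2 0]
[7 1]

truncated F =
[0 0]
[2 0]

The truncated matrix has rank 1, which is less than the loop depth of 2, and its null space contains [0, 1]: holding i fixed and varying j changes only the last array coordinate, so consecutive inner-loop iterations touch adjacent elements of a row-major array. The vector [1, 0] is not in this null space (varying i changes the second coordinate by 2), which is why making i the inner loop gives no spatial reuse.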
General rule for spatial reuse: in order for there to be spatial reuse, the vector [0,0,...,0,1] must be in the null space of the truncated matrix. The reason is that if this vector is in the null space, then when we fix all loop indices but the innermost one, all dynamic accesses during one run through the inner loop vary only in the last array index. If the array is stored in row-major order, then these elements are all near one another, perhaps in the same cache line.
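A minimal sketch of this rule (an illustrative contrast, not from the example above): with row-major storage, only the first of the two nests below enjoys spatial reuse in its inner loop.

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum += A[i][j];   /* truncated matrix is [1 0]; [0,1] is in its null space, so one pass of the inner loop walks along a single row */
  }
}

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    sum += A[j][i];   /* truncated matrix is [0 1]; [0,1] is not in its null space, so every inner iteration jumps to a different row */
  }
}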
Given two accesses <F,f1> and <F,f2>, reuse of the same data requires
F*i1 + f1 = F*i2 + f2

or

F*(i1 - i2) = f2 - f1

If v is one solution to this equation, then all solutions can be written as v + w, where w is in the null space of F.
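As a worked instance, consider the accesses Z[j] and Z[j+2] from the running-average example at the start of this section. Over the iteration vector (i,j), both have F = [0 1], with f1 = [0] and f2 = [2], so the reuse condition becomes

F*(i1 - i2) = f2 - f1 = [2]

A particular solution is v = (0, 2), and the null space of F is spanned by (1, 0), so the complete solution set is i1 - i2 = (c, 2) for any c: iteration (i1, j1) touches through Z[j] the same element that iteration (i2, j2) touches through Z[j+2] whenever j1 = j2 + 2, for any values of the outer index i.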
Example:
for (i = 1; i <= n; i++) {
  for (j = 1; j <= n; j++) {
    Z[i][j] = Z[i-1][j];
  }
}

This loop has two array accesses, both of which have the same coefficient matrix
[1 0]
[0 1]

Each access exhibits self-spatial reuse: if we erase the bottom row, we get [1 0], and [0 1] is in the null space of this matrix. There is no self-temporal reuse, because the coefficient matrix has full rank and hence a trivial null space. But there is group-temporal reuse; formally, discounting the loop bounds, there is reuse if

[1 0] [i1]   [0]   [1 0] [i2]   [-1]
[0 1] [j1] + [0] = [0 1] [j2] + [ 0]
We get j1 = j2 and i2 = i1 + 1. Notice, however, that the reuse distance is very high (about n inner-loop iterations) because i is not the innermost loop index; the reuse only helps if an entire row of n elements fits in the cache.
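One way to shorten this reuse distance (a sketch, not part of the original example) is to interchange the two loops so that i becomes the innermost index:

for (j = 1; j <= n; j++) {
  for (i = 1; i <= n; i++) {
    Z[i][j] = Z[i-1][j];
  }
}

Now the element written in one inner iteration is read back in the very next iteration, so the group-temporal reuse distance drops to 1; the price is that consecutive inner iterations step through different rows of a row-major array, so the self-spatial reuse of the inner loop is lost.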
Relative significance of types of reuse
for (i = 0; i < n; i++) {
  for (j = 0; j < m; j++) {
    A[i] += i*B[j];
  }
}

After loop-invariant code motion (LICM), in particular register promotion of A[i], this becomes:
for (i = 0; i < n; i++) {
  reg = A[i];
  for (j = 0; j < m; j++) {
    reg += i*B[j];
  }
  A[i] = reg;
}

This keeps A[i] in a register for the entire inner loop, exploiting its self-temporal reuse.
However, the loop interchange transformation cannot be applied to the latter, because the loops are no longer perfectly nested (it could still be applied to the former). If we apply loop interchange to the former, we get:
for (j = 0; j < m; j++) {
  for (i = 0; i < n; i++) {
    A[i] += i*B[j];
  }
}

Performing global value numbering (GVN), in particular load partial redundancy elimination (loadPRE), on this, we get:
for (j = 0; j < m; j++) {
  tmp = B[j];
  for (i = 0; i < n; i++) {
    A[i] += i*tmp;
  }
}

This keeps B[j] in a register (tmp) for the entire inner loop, exploiting its self-temporal reuse after the interchange.