### Affine Transformations for Optimizing Parallelism and Locality

The need for locality of access:
• Registers : 1 cycle : 256-8000 bytes
• Cache(L1,L2,L3) : 1-6 cycles : 256K-1M bytes
• Main memory: 20-100 cycles : 32M-256G bytes
• Disk: 0.5-5M cycles : 1-100Tbytes

Consider the loop

```for (j=1; j<10; j++) {
for (i=1; i<10000000; i++) {
a[i] *= b[i];
}
}
```
By the time the second outer-loop iteration takes place, the cache does not contain the needed values of the first outer-loop iteration. In other words, the reuse distance is very high. Change this to the following code with much lower reuse distance:
```for (i=1; i<10000000; i++) {
for (j=1; j<10; j++) {
a[i] *= b[i];
}
}
```
Can be up to 10x faster than the original loop. This optimization is called loop interchange.

### Matrix-Vector multiplication

```for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
y[i] = A[i,j]*x[j]
}
}
```
Would work well if `A` is stored in row-major format. Would work badly if `A` is stored in column-major format. If the compiler has a choice, it should decide the data layout of `A` wisely. Some programming languages (like Matlab) allow you this choice. Others (like C where a 2D array is an array of arrays) do not. Even if this choice is not available in the programming language, if the global use of the 2-D array can be captured fully by the compiler, it can switch the indices of the global variable (e.g., A to A).

### Matrix-Matrix multiplication

Naive implementation
```for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
//reads C(i,j) (can be register-allocated in the inner-most loop if we know that C, A, and B cannot alias)
for (k = 0; k < n; k++) {
//reads row i, column k of A into cache. cache hit (i,k-1) was read in previous iteration which was adjacent in memory.
//reads row k, column j of B into cache. CACHE MISS (k-1,j) is far away from (k,j) giving little spatial locality
C[i,j] = C[i,j] + A[i,k]*B[k,j]
}
}
}
```
Number of memory accesses (cache misses) = n^3 (due to B)

Blocked (tiled) matrix multiply. Consider A, B, C to be NxX matrices of bxb sub-blocks where b=n/N is the block-size.

```for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
//reads block at C(i,j) into cache. Likely to have O(b) misses; one for each row in the block
for (k = 0; k < N; k++) {
//reads row i of block at A(i,k) into cache. Likely to have O(b) misses; one for each row in the block.
//reads column j of block at B(k,j) into cache. Likely to have O(b) misses (one for each row in the block). Notice that the access pattern is still column-major, but the memory-load pattern can be row-major.
BLOCK_MULTIPLY(C[i,j], A[i,k], B[k,j], b)
}
}
}
```
Number of memory accesses: 2 * N^3 * n/N * n/N (read each block of A and B N^3 times) + 2 * N^2 * n/N *n/N (read and write each block of C once in the second nested loop N^2 times). This transformation is called loop tiling.

The improvement = `n^3/N*n^2 = n/N = b`. In general, increasing `b` sounds like a good idea, but only until all three arrays can fit in the cache. Thus if the cache-size is `M`, then the maximum block-size we can have is `sqrt(M/3)` (which is also the maximum speedup we can have).