for (int i = 0; i < n; i++) {
  for (int j = 0; j < n; j++) {
    //reads C(i,j) (can be register-allocated in the inner-most loop if we know that C, A, and B cannot alias)
    for (int k = 0; k < n; k++) {
      //reads row i, column k of A into cache: usually a cache hit, since (i,k-1) was read in the previous iteration and is adjacent in memory.
      //reads row k, column j of B into cache: CACHE MISS, since (k-1,j) is far away from (k,j), giving little spatial locality.
      C[i][j] = C[i][j] + A[i][k]*B[k][j];
    }
  }
}
Number of memory accesses (cache misses) = n^3 (due to B)
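The aliasing remark above can be made concrete. A minimal sketch (illustrative, not from the notes) of the register-allocated form, assuming row-major double matrices stored flat and using the C99 restrict qualifier to express the no-alias guarantee:

/* Illustrative sketch (assumed helper, not from the notes): with the no-alias
   guarantee, C(i,j) can be kept in a register (scalar replacement) instead of
   being re-loaded and re-stored on every k iteration. Matrices are assumed
   row-major, stored flat in arrays of n*n doubles. */
void matmul_scalar_replaced(int n, double *restrict C,
                            const double *restrict A, const double *restrict B) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      double cij = C[i*n + j];           /* one load of C(i,j) per (i,j) pair  */
      for (int k = 0; k < n; k++)
        cij += A[i*n + k] * B[k*n + j];  /* only A and B touch memory here     */
      C[i*n + j] = cij;                  /* one store of C(i,j) per (i,j) pair */
    }
  }
}

With C(i,j) held in a register, only the A and B accesses remain in the inner loop, which is what the miss count above considers.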
Blocked (tiled) matrix multiply. Consider A, B, C to be NxN matrices of bxb sub-blocks, where b = n/N is the block size.
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    //reads the block at C(i,j) into cache. Likely to have O(b) misses: one for each row in the block.
    for (int k = 0; k < N; k++) {
      //reads the block at A(i,k) into cache. Likely to have O(b) misses: one for each row in the block.
      //reads the block at B(k,j) into cache. Likely to have O(b) misses (one for each row in the block). Notice that the access pattern is still column-major, but the memory-load pattern can be row-major.
      BLOCK_MULTIPLY(C[i][j], A[i][k], B[k][j], b);
    }
  }
}
Number of memory accesses: 2 * N^3 * (n/N) * (n/N) (the loop nest performs N^3 iterations, each reading one (n/N) x (n/N) block of A and one of B) + 2 * N^2 * (n/N) * (n/N) (each block of C is read once and written once across the N^2 iterations of the two outer loops). This transformation is called loop tiling.
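BLOCK_MULTIPLY is used above without being spelled out. A minimal sketch, under assumed conventions (row-major n x n doubles stored flat, with the block coordinates passed as indices rather than sub-matrix references), could be:

/* Hypothetical helper (assumed, not from the notes): multiplies the b x b
   sub-block of A at block coordinates (bi,bk) with the sub-block of B at
   (bk,bj), accumulating into the sub-block of C at (bi,bj). */
static void block_multiply(int n, int b, double *C, const double *A,
                           const double *B, int bi, int bj, int bk) {
  for (int i = bi*b; i < (bi+1)*b; i++)
    for (int j = bj*b; j < (bj+1)*b; j++)
      for (int k = bk*b; k < (bk+1)*b; k++)
        C[i*n + j] += A[i*n + k] * B[k*n + j];
}

In the loop nest above, the call would then read block_multiply(n, b, C, A, B, i, j, k).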
The improvement = n^3 / (N*n^2) = n/N = b. In general, increasing b sounds like a good idea, but only as long as all three bxb blocks (one each from A, B, and C) fit in the cache simultaneously. Thus, if the cache size is M (in elements), the maximum block size we can have is sqrt(M/3) (which is also the maximum speedup we can have).
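As an illustrative sanity check (the cache size and element type are assumptions, not from the notes), the sqrt(M/3) bound can be evaluated for a concrete cache:

#include <math.h>
#include <stdio.h>

/* Illustrative only: largest block size b such that three b x b blocks of
   doubles fit in a cache of the given size, i.e., the sqrt(M/3) bound with
   M measured in elements. */
static int max_block_size(size_t cache_bytes) {
  size_t elems = cache_bytes / sizeof(double);   /* M in elements */
  return (int)floor(sqrt((double)elems / 3.0));
}

int main(void) {
  printf("%d\n", max_block_size(32 * 1024));   /* 32 KB cache of doubles -> b = 36 */
  return 0;
}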
Serial execution of matrix multiplication:
for (int i = 0; i < n; i++) {
  for (int j = 0; j < n; j++) {
    A[i][j] = 0;
    for (int k = 0; k < n; k++) {
      A[i][j] += X[i][k]*Y[k][j];
    }
  }
}
If c is the number of data elements that fit in one cache line, and if n columns of Y can survive in the cache simultaneously, then one iteration of the outer loop (e.g., i = 0) takes n^2/c cache misses (n^2 elements of Y are touched, and spatial locality gives a factor of c).
Parallelizing across p processors: assign n/p rows of A to each processor. With this, each processor needs to access only n/p rows of A and X, but the entire Y. Each processor computes n^2/p elements of A, performing n^3/p multiply-and-add operations to do so.
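A minimal sketch of this row-partitioned scheme, using OpenMP threads to stand in for the p processors (an assumption; the notes do not name a particular mechanism):

#include <omp.h>

/* Illustrative sketch: each of p threads gets a contiguous band of n/p rows
   of A (and the matching rows of X); every thread reads all of Y. Assumes
   row-major doubles stored flat and that p divides n. */
void matmul_rowpar(int n, int p, double *A, const double *X, const double *Y) {
  #pragma omp parallel num_threads(p)
  {
    int t = omp_get_thread_num();
    int lo = t * (n / p), hi = lo + (n / p);  /* this thread's band of rows */
    for (int i = lo; i < hi; i++) {
      for (int j = 0; j < n; j++) {
        double aij = 0.0;
        for (int k = 0; k < n; k++)
          aij += X[i*n + k] * Y[k*n + j];     /* the entire Y is read per thread */
        A[i*n + j] = aij;
      }
    }
  }
}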
While computation time decreases in proportion to p, the communication cost (total number of cache misses) rises in proportion to p, because Y needs to be loaded in its entirety into each processor's cache. The total number of cache misses is n^2/c + p*n^2/c (the first term covers the rows of X, split across processors; the second covers the full copy of Y loaded by each of the p processors).
for (int ii = 0; ii < n; ii += B) {
  for (int jj = 0; jj < n; jj += B) {
    for (int kk = 0; kk < n; kk += B) {
      for (int i = ii; i < ii+B; i++) {
        for (int j = jj; j < jj+B; j++) {
          for (int k = kk; k < kk+B; k++) {
            A[i][j] += X[i][k]*Y[k][j];
          }
        }
      }
    }
  }
}
We have re-ordered the iteration space significantly, but because the order of iterations does not matter (addition is commutative and associative), this remains correct.
To bring a block of X or Y into the cache, we require B^2/c cache misses. The number of times we need to bring a block into the cache is (n/B)^3. Total number of cache misses = 2n^3/(Bc). If B = n, we get 2n^2/c cache misses; on the other hand, if B = 100 such that all three blocks fit in the cache, we get a reduction in cache misses by a factor of O(Bc) [compared to the worst-case O(n^3) cache misses].
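For a rough feel of the numbers (the values of n, c, and B below are illustrative assumptions, not from the notes), the estimate can be evaluated directly:

#include <stdio.h>

int main(void) {
  /* Illustrative values: n x n doubles, c elements per cache line, tile size B. */
  double n = 1e4, c = 8, B = 100;
  double tiled = 2*n*n*n/(B*c);  /* 2n^3/(Bc): about 2.5e9 misses            */
  double naive = n*n*n;          /* ~n^3 misses for the untiled loop above   */
  printf("tiled=%.3g naive=%.3g improvement=%.0fx\n", tiled, naive, naive/tiled);
  return 0;                      /* improvement is Bc/2 = 400x here          */
}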
This blocking can be done at each level of the cache hierarchy, with a different B for each level, e.g., B = 8 for L1, B = 80 for L2, and so on.
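A sketch of what two levels of blocking might look like (the block sizes B1 and B2, the min() bounds, and the function name are illustrative additions, not from the notes):

static inline int min(int a, int b) { return a < b ? a : b; }

/* Illustrative two-level tiling: B2-sized tiles target L2, B1-sized sub-tiles
   target L1. Assumes row-major doubles stored flat and that A is zeroed by
   the caller; the min() bounds handle sizes that do not divide evenly. */
void matmul_two_level(int n, int B1, int B2, double *A,
                      const double *X, const double *Y) {
  for (int ii2 = 0; ii2 < n; ii2 += B2)
    for (int jj2 = 0; jj2 < n; jj2 += B2)
      for (int kk2 = 0; kk2 < n; kk2 += B2)
        /* sub-tile each L2 tile into B1 x B1 tiles for L1 */
        for (int ii = ii2; ii < min(ii2+B2, n); ii += B1)
          for (int jj = jj2; jj < min(jj2+B2, n); jj += B1)
            for (int kk = kk2; kk < min(kk2+B2, n); kk += B1)
              for (int i = ii; i < min(ii+B1, n); i++)
                for (int j = jj; j < min(jj+B1, n); j++)
                  for (int k = kk; k < min(kk+B1, n); k++)
                    A[i*n + j] += X[i*n + k] * Y[k*n + j];
}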
In practice: blocking can improve performance by a factor of about 3 on a modern uniprocessor (beyond a point the code becomes compute-bound; recall Amdahl's law); for multiprocessors, where a set of blocks is farmed out to each processor, the speedup can be linear in the number of processors.
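One simple way to farm blocks out to processors (a sketch only; the notes do not prescribe a mechanism) is to parallelize the outer tile loops, for example with OpenMP:

#include <omp.h>

/* Illustrative: distribute the (ii, jj) tiles across threads. Each thread
   then owns disjoint B x B tiles of A, so no synchronization on A is needed;
   X and Y are read-only. Assumes A is zero-initialized and B divides n. */
void matmul_blocked_parallel(int n, int B, double *A,
                             const double *X, const double *Y) {
  #pragma omp parallel for collapse(2)
  for (int ii = 0; ii < n; ii += B)
    for (int jj = 0; jj < n; jj += B)
      for (int kk = 0; kk < n; kk += B)
        for (int i = ii; i < ii + B; i++)
          for (int j = jj; j < jj + B; j++)
            for (int k = kk; k < kk + B; k++)
              A[i*n + j] += X[i*n + k] * Y[k*n + j];
}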