Need for Locality

The need for locality of access:

Consider the loop

for (j=1; j<10; j++) {
  for (i=1; i<10000000; i++) {
    a[i] *= b[i];
  }
}
By the time the second outer-loop iteration takes place, the cache does not contain the needed values of the first outer-loop iteration. In other words, the reuse distance is very high. Change this to the following code with much lower reuse distance:
for (i=1; i<10000000; i++) {
  for (j=1; j<10; j++) {
    a[i] *= b[i];
  }
}
Can be up to 10x faster than the original loop. This optimization is called loop interchange.

Matrix-Vector multiplication

for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    y[i] = A[i,j]*x[j]
  }
}
Would work well if A is stored in row-major format. Would work badly if A is stored in column-major format. If the compiler has a choice, it should decide the data layout of A wisely. Some programming languages (like Matlab) allow you this choice. Others (like C where a 2D array is an array of arrays) do not. Even if this choice is not available in the programming language, if the global use of the 2-D array can be captured fully by the compiler, it can switch the indices of the global variable (e.g., A[100][2] to A[2][100]).