for (i = 1; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = f(X[i, j] + X[i - 1, j]); //s1
  }
}
for (i = 0; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = g(X[i, j] + X[i, j-1]); //s2
  }
}

In this example, when we draw the PDG, we get edges from s1-to-s1, s2-to-s2, and s1-to-s2. Thus, we can insert a barrier between the first loop nest and the second loop nest. Further, when we apply our synchronization-free algorithm to each loop nest, we get

s1: p = j
s2: p = i

which parallelizes both loops appropriately.
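For concreteness, here is a sketch of the code this mapping generates: one parallel processor loop per nest, with a barrier in between. The barrier() call is only a placeholder for whatever synchronization primitive is available; the loop bounds follow directly from the mappings above.

for (p = 1; p < n; p++) {             // parallel: processor p runs column j = p of s1
  for (i = 1; i < n; i++) {
    X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
  }
}
barrier();                            // placeholder: wait for all instances of s1 to finish
for (p = 0; p < n; p++) {             // parallel: processor p runs row i = p of s2
  for (j = 1; j < n; j++) {
    X[p, j] = g(X[p, j] + X[p, j-1]); //s2
  }
}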
Now, let's consider the case when the second loop nest accesses Y instead of X:
for (i = 1; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = f(X[i, j] + X[i - 1, j]); //s1
  }
}
for (i = 0; i < n; i++) {
  for (j = 1; j < n; j++) {
    Y[i, j] = g(Y[i, j] + Y[i, j-1]); //s2
  }
}

In this case, simply applying the sync-free parallelism algorithm (without constructing the PDG) would result in the following solution

s1: p = j
s2: p = i

and the following generated code:
for (p = 0; p < n; p++) {
  if (p >= 1) {                         // s1 has no instances on processor 0 (its j index starts at 1)
    for (i = 1; i < n; i++) {
      X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
    }
  }
  for (j = 1; j < n; j++) {
    Y[p, j] = g(Y[p, j] + Y[p, j-1]); //s2
  }
}

Comment: actually, we could have got 2n parallelism (instead of n parallelism as in the above solution) by spawning separate threads for both s1 and s2. Notice that our formulation only works towards finding the maximum-rank solution for the processor space and does not worry about constants; an ideal algorithm should probably have found the following mapping:

s1: p = j
s2: p = n + i

We currently do not worry about finding such solutions, as we usually expect n to be much larger than the number of processors.
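As an illustration of that mapping (a sketch only, with the processor numbering p = j for s1 and p = n + i for s2 taken from the comment above), the 2n-way version could look like:

for (p = 1; p < 2*n; p++) {             // every iteration of this loop is independent
  if (p < n) {
    // processor p runs column j = p of s1
    for (i = 1; i < n; i++) {
      X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
    }
  } else {
    // processor p runs row i = p - n of s2
    for (j = 1; j < n; j++) {
      Y[p - n, j] = g(Y[p - n, j] + Y[p - n, j-1]); //s2
    }
  }
}

This gives roughly 2n-way parallelism in a single synchronization-free loop, at the cost of the constant offset n that the rank-based formulation ignores.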
Another example:
for (i = 0; i < n; i++) {
  Z[i] = Z[i] / W[i]; //s1
  for (j = i; j < n; j++) {
    X[i,j] = Y[i,j] * Y[i,j]; //s2
    Z[j] = Z[j] + X[i,j]; //s3
  }
}

In this example, we also have a self edge s3-to-s3 (because there are dependencies between instances with different values of i and the same value of j: they all update the same Z[j]), in addition to the dependencies specified in the book.
After splitting the instances based on the PDG, we get a separate loop nest for s2 and a separate loop nest for s1 and s3. The loop nest for s2 is fully parallelizable, as sketched below.
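For reference, the split-off loop nest for s2 (reconstructed directly from the original loop, as a sketch) is just:

for (i = 0; i < n; i++) {
  for (j = i; j < n; j++) {
    X[i,j] = Y[i,j] * Y[i,j]; //s2
  }
}

Every instance of s2 is independent, so any processor mapping, for example p = i, gives synchronization-free parallelism.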
The loop nest for s1 and s3 looks like the following:
for (i = 0; i < n; i++) {
  Z[i] = Z[i] / W[i]; //s1
  for (j = i; j < n; j++) {
    Z[j] = Z[j] + X[i,j]; //s3
  }
}

However, if we consider the dependencies across s1 and s3, and of s3 with itself, we get that the (i, 0)th instance of s1 must execute on the same processor as the (k, i)th instance of s3 (for any k). This gives the following processor assignment for s1 and s3:

s1: p = i
s3: p = j

Generating code, we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < n; i++) {
    if (p == i) Z[i] = Z[i] / W[i]; //s1
    for (j = i; j < n; j++) {
      if (p == j) Z[j] = Z[j] + X[i,j]; //s3
    }
  }
}

Eliminating empty i iterations for s1 and s3 through Fourier-Motzkin elimination (for s1, the guard p == i forces i = p; for s3, the guard p == j together with j >= i implies i <= p), we get the following ranges for s1 and s3:

s1: i >= p, i <= p
s3: i >= 0, i <= p

Based on this, we get two cases:

i >= 0, i <= p - 1: only s3 executes
i >= p, i <= p: both s1 and s3 execute

If we generate the code (by pasting the original code and removing the non-executing statements), we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < p; i++) {
    for (j = i; j < n; j++) {
      if (p == j) Z[j] = Z[j] + X[i,j]; //s3
    }
  }
  Z[p] = Z[p] / W[p]; //s1
  for (j = p; j < n; j++) {
    if (p == j) Z[j] = Z[j] + X[p,j]; //s3
  }
}

Now applying Fourier-Motzkin elimination to the two inner loops on j, we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < p; i++) {
    Z[p] = Z[p] + X[i,p]; //s3
  }
  Z[p] = Z[p] / W[p]; //s1
  Z[p] = Z[p] + X[p,p]; //s3
}
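Note that each iteration of the outer p loop now touches only Z[p], W[p], and column p of X, so the p loop is synchronization-free. As a sketch only (the function name, array types, and the OpenMP pragma are illustrative assumptions), the final nest could be written in plain C as:

/* Sketch: the final loop nest as plain C, with the processor loop p
 * run in parallel via OpenMP (compile with -fopenmp).  Iteration p
 * only reads and writes Z[p], W[p], and column p of X, so no
 * synchronization between iterations is required. */
void run_parallel(int n, double Z[n], double W[n], double X[n][n])
{
  #pragma omp parallel for
  for (int p = 0; p < n; p++) {
    for (int i = 0; i < p; i++) {
      Z[p] = Z[p] + X[i][p]; /* s3, instances with i < p */
    }
    Z[p] = Z[p] / W[p];      /* s1 */
    Z[p] = Z[p] + X[p][p];   /* s3, instance with i == p */
  }
}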