for (i = 1; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = f(X[i, j] + X[i - 1, j]); //s1
  }
}
for (i = 0; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = g(X[i, j] + X[i, j-1]); //s2
  }
}

In this example, when we draw the PDG, we get edges from s1-to-s1, s2-to-s2, and s1-to-s2. Thus, we can insert a barrier between the first loop nest and the second loop nest. Further, when we apply our synchronization-free algorithm to each loop nest, we get

s1: p = j
s2: p = i

which parallelizes both loops appropriately.
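For concreteness, here is a sketch of the code this mapping generates: one parallel processor loop per nest, with a barrier in between. The barrier() call is only a placeholder for whatever synchronization primitive is available; the loop bounds follow directly from the mappings above.

for (p = 1; p < n; p++) {             // parallel: processor p runs column j = p of s1
  for (i = 1; i < n; i++) {
    X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
  }
}
barrier();                            // placeholder: wait for all instances of s1 to finish
for (p = 0; p < n; p++) {             // parallel: processor p runs row i = p of s2
  for (j = 1; j < n; j++) {
    X[p, j] = g(X[p, j] + X[p, j-1]); //s2
  }
}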
Now, let's consider the case when the second loop nest accesses Y instead of X:
for (i = 1; i < n; i++) {
  for (j = 1; j < n; j++) {
    X[i, j] = f(X[i, j] + X[i - 1, j]); //s1
  }
}
for (i = 0; i < n; i++) {
  for (j = 1; j < n; j++) {
    Y[i, j] = g(Y[i, j] + Y[i, j-1]); //s2
  }
}

In this case, simply applying the sync-free parallelism algorithm (without constructing the PDG) would result in the following solution

s1: p = j
s2: p = i

and the following generated code:
for (p = 0; p < n; p++) {
  if (p >= 1) {                         // s1 has no instances on processor 0 (its j index starts at 1)
    for (i = 1; i < n; i++) {
      X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
    }
  }
  for (j = 1; j < n; j++) {
    Y[p, j] = g(Y[p, j] + Y[p, j-1]); //s2
  }
}

Comment: actually, we could have got 2n parallelism (instead of n parallelism as in the above solution) by spawning separate threads for both s1 and s2. Notice that our formulation only works towards finding the maximum-rank solution for the processor space and does not worry about constants; an ideal algorithm should probably have found the following mapping:

s1: p = j
s2: p = n + i

We currently do not worry about finding such solutions, as we usually expect n to be much larger than the number of processors.
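As an illustration of that mapping (a sketch only, with the processor numbering p = j for s1 and p = n + i for s2 taken from the comment above), the 2n-way version could look like:

for (p = 1; p < 2*n; p++) {             // every iteration of this loop is independent
  if (p < n) {
    // processor p runs column j = p of s1
    for (i = 1; i < n; i++) {
      X[i, p] = f(X[i, p] + X[i - 1, p]); //s1
    }
  } else {
    // processor p runs row i = p - n of s2
    for (j = 1; j < n; j++) {
      Y[p - n, j] = g(Y[p - n, j] + Y[p - n, j-1]); //s2
    }
  }
}

This gives roughly 2n-way parallelism in a single synchronization-free loop, at the cost of the constant offset n that the rank-based formulation ignores.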
Another example:
for (i = 0; i < n; i++) {
  Z[i] = Z[i] / W[i]; //s1
  for (j = i; j < n; j++) {
    X[i,j] = Y[i,j] * Y[i,j]; //s2
    Z[j] = Z[j] + X[i,j]; //s3
  }
}

In this example, we also have a self edge s3-to-s3 (because there are dependencies between instances with different values of i and the same value of j: they all update the same Z[j]), in addition to the dependencies specified in the book.
After splitting the instances based on the PDG, we get a separate loop nest for s2 and a separate loop nest for s1 and s3. The loop nest for s2 is fully parallelizable, as sketched below.
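For reference, the split-off loop nest for s2 (reconstructed directly from the original loop, as a sketch) is just:

for (i = 0; i < n; i++) {
  for (j = i; j < n; j++) {
    X[i,j] = Y[i,j] * Y[i,j]; //s2
  }
}

Every instance of s2 is independent, so any processor mapping, for example p = i, gives synchronization-free parallelism.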
The loop nest for s1 and s3 looks like the following:
for (i = 0; i < n; i++) {
  Z[i] = Z[i] / W[i]; //s1
  for (j = i; j < n; j++) {
    Z[j] = Z[j] + X[i,j]; //s3
  }
}

However, if we consider the dependencies across s1 and s3, and of s3 with itself, we get that the (i, 0)th instance of s1 must execute on the same processor as the (k, i)th instance of s3 (for any k). This gives the following processor assignment for s1 and s3:

s1: p = i
s3: p = j

Generating code, we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < n; i++) {
    if (p == i) Z[i] = Z[i] / W[i]; //s1
    for (j = i; j < n; j++) {
      if (p == j) Z[j] = Z[j] + X[i,j]; //s3
    }
  }
}

Eliminating empty i iterations for s1 and s3 through Fourier-Motzkin elimination (for s1, the guard p == i forces i = p; for s3, the guard p == j together with j >= i implies i <= p), we get the following ranges for s1 and s3:

s1: i >= p, i <= p
s3: i >= 0, i <= p

Based on this, we get two cases:

i >= 0, i <= p - 1: only s3 executes
i >= p, i <= p: both s1 and s3 execute

If we generate the code (by pasting the original code and removing the non-executing statements), we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < p; i++) {
    for (j = i; j < n; j++) {
      if (p == j) Z[j] = Z[j] + X[i,j]; //s3
    }
  }
  Z[p] = Z[p] / W[p]; //s1
  for (j = p; j < n; j++) {
    if (p == j) Z[j] = Z[j] + X[p,j]; //s3
  }
}

Now applying Fourier-Motzkin elimination to the two inner loops on j, we get:
for (p = 0; p < n; p++) {
  for (i = 0; i < p; i++) {
    Z[p] = Z[p] + X[i,p]; //s3
  }
  Z[p] = Z[p] / W[p]; //s1
  Z[p] = Z[p] + X[p,p]; //s3
}
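Note that each iteration of the outer p loop now touches only Z[p], W[p], and column p of X, so the p loop is synchronization-free. As a sketch only (the function name, array types, and the OpenMP pragma are illustrative assumptions), the final nest could be written in plain C as:

/* Sketch: the final loop nest as plain C, with the processor loop p
 * run in parallel via OpenMP (compile with -fopenmp).  Iteration p
 * only reads and writes Z[p], W[p], and column p of X, so no
 * synchronization between iterations is required. */
void run_parallel(int n, double Z[n], double W[n], double X[n][n])
{
  #pragma omp parallel for
  for (int p = 0; p < n; p++) {
    for (int i = 0; i < p; i++) {
      Z[p] = Z[p] + X[i][p]; /* s3, instances with i < p */
    }
    Z[p] = Z[p] / W[p];      /* s1 */
    Z[p] = Z[p] + X[p][p];   /* s3, instance with i == p */
  }
}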