COL729 Lab 1 : Anticipated (Needed) Expressions Analysis

Due date - 23:55, 10 March 2023

Weight: 40%

Overview

In this assignment, you will implement Anticipated (Needed) Expressions Analysis (notes, video). Make sure you understand the analysis before proceeding ahead.

In order to implement a data-flow analysis, (broadly) two things are required: (1) a framework to represent the target program as a flow graph and, (2) a DFA frawework within which the schema can be implemented.

The CompilerAI framework -- developed in-house here at IITD -- provides both, and you will be implementing your analyses in this framework for this and future labs. The next subsection gives a (very) brief conceptual overview of the framework.

CompilerAI framework

TFG representation

A program is represented as Control Flow Graph (CFG) and is referred to as TFG (Transfer Function Graph) in the code.

Constituents of a TFG

A TFG is associated with a set of nodes and a set of edges.

A node of the TFG represents a program point and is referred to as a PC in the code. A PC is uniquely identified by its label, which can be serialized as a string, e.g., Lfor.cond%1%1.
More on PC
A PC label is a tuple of three components: a block-ID (AKA index), a statement-ID (AKA subindex), and a sub-statement-ID (AKA subsubindex). The string represenation of label is obtained by concatenating the three components, separated by a % and prefixed by L. For example, the string representation of PC label (for.cond,1,1) is "Lfor.cond%1%1".

Further, a PC label has a useful interpretation when the source program for the TFG is an LLVM program. In this interpretation, the block-ID/index corresponds to an LLVM basic block ID, the statement-ID/subindex corresponds to the basic block's statement index, and sub-statement/subindex corresponds to "sub-statement" index.
```
       +----  ----> for.cond :  Block-ID         -- corresponds to an LLVM basic block
       |      +---> 1        :  Statement-ID     -- corresponds first statement of a basic block
       |      | +-> 1        :  Sub-statement-ID -- corresponds first "sub-statement" of a statement
       |      | |
  Lfor.cond%  1%1
              
```
This interpretation allows a reverse mapping to source program (i.e., LLVM) location. In the above example, Lfor.cond%1%1 refers to the location just before the first statement of the LLVM basic block for.cond (note that in the textual representation of LLVM IR a % is pre-prepended before the basic block label, i.e. for.cond will be referred to as %for.cond).
An edge of the TFG represents a guarded basic block with three* components: a <from-PC, to-PC> pair, a guard (AKA edge condition or edgecond), and a to-state mapping. The <from-PC, to-PC> pair identifies the nodes which are connected by the edge, the guard encodes the boolean condition under which the edge shall be taken, and the to-state mapping encodes the state transformation corresponding to the basic block which the edge is representing.
The to-state mapping is a map (function) from state to state where the state is the set of all variables used in the program, including the implicit memory element (the members of the state are referred to as state elements). The input state is the state corresponding to the from-PC of the program, and the output state is the state corresponding to the to-PC of the program. The to-state, thus, describes a set of assignments, or, in other words, how the program state elements are modified when the edge is taken.

Delta mapping
In the to-state map, we only maintain non-identity mappings. The transformation for the missing entries are defined to be identity.
For example, if the map has two entries { x ↦ y, p ↦ q } and the program state is { x, y, z, p, q, r, s} then the "missing" mappings for variables { y, z, q, r, s} are defined to be identity: the full mapping thus is { x ↦ y, y ↦ y, z ↦ z, p ↦ q, q ↦ q, r ↦ r, s ↦ s }.

An example
The C statement, if (i >= 0) ret += i; can be encoded as two TFG edges:
Edge 1:
1. from-PC: "LX%1%1"
2. to-PC: "LX%3%1"
3. edgecond: (i >= 0)
4. to-state: { ret ↦ (ret + i) }
Edge 2:
1. from-PC: "LX%1%1"
2. to-PC: "LX%3%1"
3. edgecond: !( i >= 0)
4. to-state: { }
Explanation:
- Both edges start at the same location and end at the same location and have edge conditions that are complements of each other, thus ensuring that they cover all possible control-flow from the from-PC to the to-PC.
- Edge 1, with the edge condition i >= 0 has a non-empty to-state corresponding to the write of value ret+i to the variable ret.
  - Note that there are two different ret here: one corresponding to the from-PC location and the other corresponding to the to-PC location.
- Edge 2, with the complementary edge condition !(i >= 0), corresponds to the control flow when i is not greater-than or equal to 0 and where ret remains unmodified.
On sequencing and order of evaluation in an edge
It is important to note that there is no sequencing in the to-state of a TFG edge; every assignment is assumed to happen simultaneously (the notion of sequencing comes from combination of edges).
In the simplest case you would have one TFG edge for each program statement (for example, one TFG edge per LLVM statement) and sequencing would look normal. However, you can also have a single TFG edge for a maximal basic block (i.e., a basic block which has multiple input edges and multiple outgoing edges) and in that case it might be somewhat difficult to imagine what the to-state would look like.
Consider the following sequence of C statements:
```
x = y+z;      #1
q = x+1;      #2
```
The to-state of the TFG edge which represents the above sequence will look like:
```
x ↦ y + z
q ↦ ((y + z) + 1);
```
Note that the y, z on the RHS of ↦ refer to the state elements corresponding to the from-PC of the edge. To obtain the mapping for q in terms of from-PC state elements, we substitute x with the RHS of = in #1. Because the mappings are expressed in terms of from-PC elements, there is no need for sequencing and the assignments can happen in any order (or even simultaneously).

A C to TFG transformation example

C code

int sum(int n)
{
  int ret = 0;
  for (int i = 0; i < n; ++i) {
    ret += i;
  }
  return ret;
}

LLVM code

define i32 @sum(i32 %n) {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %ret.0 = phi i32 [ 0, %entry ], [ %add, %for.inc ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %i.0, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %add = add nsw i32 %ret.0, %i.0
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 %ret.0
}

TFG (abbreviated)

=Input:
	%n	

=Output:
	%ret-reg	

=Nodes:

	"L0%0%d"
	"Lfor.cond%1%bbentry"
	"Lfor.cond%2%d"
	"Lfor.body%1%d"
	"Lfor.end%1%d"
	"E0%0%d"

=Edges:

	"L0%0%d" => "Lfor.cond%1%bbentry"
		Guard: true
		To-State:
			%i.0 ↦ 0
			%ret.0 ↦ 0

	"Lfor.cond%1%bbentry" => "Lfor.cond%2%d"
		Guard: true
		To-State:
			%cmp ↦ %i.0 < %n


	"Lfor.cond%2%d" => "Lfor.body%1%d"
		Guard: %cmp
		To-State:

	"Lfor.body%1%d" => "Lfor.cond%1%bbentry"
		Guard: true
		To-State:
			%i.0 ↦ %i.0 + 1
			%ret.0 ↦ %ret.0 + %i.0

	"Lfor.cond%2%d" => "Lfor.end%1%d"
		Guard: !(%cmp)
		To-State:

	"Lfor.end%1%d" => "E0%0%d"
		Guard: true
		To-State:
			%ret-reg ↦ %ret.0

Graphical representation of edges

Instructions

The source code for the framework is available in the form of a Git repository in the server machine provided to you for lab work. You will create a local copy of it and submit your changes as a patch against the reference code.

Getting started

Follow the steps below to get started.

Clone the Git repo in your home directory: git clone --recursive /home2/col729-common/superopt-project
Change directory to superopt-project.
Link-in the tars directory: ln -s /home2/col729-common/tars ..
Build the project: make.
If you get any issues in this step, make a post on Teams with full output of make.

At this point, the initial setup is complete.

Change directory to superopt. This is the top-level directory under which you will make your changes. Henceforth it is assumed that you are already in the root of superopt. The primary places where you need to make changes have been marked with COL729_LAB1 TODO. Search for them using git grep 'COL729_LAB1 TODO'.

Working with the code

TFG

Recall that a program is represented using TFG. Conceptually, a TFG is a graph with additional features (like input/output). The code follows this conceptual structure where a TFG, represented by tfg class, has an is-a relationship with a graph, represented by the graph class. The graph class is mainly concerned with "graph" properties: edges and nodes, and the tfg class and its more specialized cases, such as tfg_llvm_t represent a program (in the present case, a C function) created from LLVM IR (which itself is created from a C program). The relevant source code paths for "graph" and "TFG" related functionality are include/graph/ and include/tfg/, respectively.

DFA

The DFA framework provides a parameterized implementation of the worklist algorithm. The parameterization allows the user to reuse the same base code for implementing all data-flow analyses.

To implement an analysis in the DFA framework, one needs to subclass the data_flow_analysis and provide an overriding implementation for the xfer_and_meet() function, which accepts three parameters: (1) the corresponding edge, (2) input value (i.e. IN), (3) output value (i.e. OUT) passed as a reference. The function is supposed to compute the new OUT value using the IN value and the old OUT value (by using meet() for the latter) and indicate as the return value whether there was a change in the OUT value. The DFA framework is defined in include/graph/dfa.h. Some already existing analyses can be found in files include/graph/var_liveness_dfa.h and include/tfg/liveness_dfa.h.

Studying the implementation of xfer_and_meet of existing analyses would give you enough information to implement the anticipated expressions analysis. You may also find the generic loc (see following subsection) "DFA value" class definition (defined in include/graph/graph_per_loc_dfa_val.h) helpful. Most things remain the same in the analyses: you need the expressions generated and killed in each edge (see populate_gen_and_kill_sets_for_edge()), you need to kill those expressions which depend on a killed expression (see graph_get_locs_potentially_read_in_expr()) and so on.

Expressions and Locs

Expressions (`expr`)

We represent an (AST) expression using the expr class (see include/expr/expr.h). expr is a sum type (also known as union type) representing either a variable, a constant, or an operation over sub-expression(s), which are also exprs themselves. Each expr is associated with a sort representing the sort (or type) of the value the expr is holding.

As an expr is essentially a DAG (directed acyclic graph) of values (which could be exprs), it is serialized as a DAG. The serialization format is a sequence of statements where each statement is an expr potentially referring to an earlier expr:

      EXPR_ID : NAME or VALUE or OPERATION : SORT

Here, EXPR_ID is an integer identifier that can be used to refer to this expr, NAME is a variable name, VALUE is a constant value, OPERATION is an operation (operator applied to (references to) earlier expressions), and SORT is, well, the sort of the expr. Some examples:

a variable of bit-vector of width 32 sort:
```
1 : input.src.llvm-%n : BV:32
```
a constant of boolean sort:
```
1 : 0 : BOOL
```

a bit-vector add operation:

1 : input.src.llvm-%a : BV:32
2 : input.src.llvm-%b : BV:32
3 : bvadd(1, 2) : BV:32

expr_ref: In the code, you mostly see references to expr_ref instead of expr. It is because an expr_ref is a reference (pointer) to an immutable expr object which is reference-counted to simplify memory management.

Locs (`graph_loc_id_t`)

Conceptually, a loc represents a memory location that can have an associated expression (expr). Thus, a loc is one of the following:

an LLVM variable,
a register,
a memory region (a masked/selected region of memory, AKA a memmask),
a unit of memory whose size and location are known (a memory slot, AKA a slot).

The framework reasons about memory locations using locs. For example, a write to an LLVM variable would modify the loc corresponding to that variable, a write to a memory location would modify all memory slots whose address can potentially alias with the target of memory write, and so on.

It is possible to get the expr_ref associated with a loc (graph_loc_id_t in code) and vice-versa using two using helper functions: graph_loc_get_expr() and graph_get_locs_potentially_read_in_expr().

Some useful functions, such as populate_gen_and_kill_sets_for_edge(), return locs instead of expressions.

Skeleton code

In order to simplify things and allow you to focus on the analysis-relevant parts, some skeleton code is provided. In particular:

A class representing anticipated expressions as a set of expr_ref is defined at include/graph/anticipated_exprs.h.
A class for anticipated expressions DFA value is defined at include/graph/anticipated_exprs_dfa.h. You are expected to fill-in definitions of meet and top here (see include/graph/avail_exprs_dfa.h for a sample).
A class for defining the DFA schema for anticipated expressions analysis is defined at include/graph/anticipated_exprs_dfa.h. You are expected to fill-in the definition of xfer_and_meet and make other changes, as required, here.

Running and testing your changes

Once you have filled-in the required code, you need to uncomment the driver code in compute_anticipated_exprs() function of lib/graph/graph_with_locs.cpp to run the anticipated expressions analysis. Remember to delete the return {}; line.

Steps for building and running the code:

Run make in superopt-project to build your changes.
Now switch the directory to lab1_test and run ./c2bc.sh a.c to compile C file a.c to LLVM bitcode a.c.bc.
While still in lab1_test, run ./run.sh a.c.bc to convert the LLVM bitcode to TFG and run the analysis. The output TFG file name will be named ${INPUT_FILENAME}.etfg, e.g., a.c.bc.etfg for a.c.bc. The log (stdout) of run is automatically saved in last_run.log.
You can turn on debug output by appending --dyn_debug=DEBUG_CLASS to llvm2tfg command in run.sh. DYN_DEBUG statements in your code will get executed if their DEBUG_CLASS matches the supplied DEBUG_CLASS. For example, in order to execute this statement:
```
DYN_DEBUG(anticipated_analysis, cout << "Here" << endl);
```
you will supply --dyn_debug=anticipated_analysis to llvm2tfg.

Visual representation of TFG

For debugging, you may also find the to_dot.sh tool (present in lab1_test/to_dot.sh), helpful. It makes a Graphviz DOT file out of a TFG. The created DOT file can be visualized using xdot or a similar DOT viewer to see the graphical structure of the TFG. The tool accepts two parameters: function name and TFG file name, where the function name is the function's name for which graphviz DOT graph is to be created. The output file name is named as "<TFG file name>.<function name>.dot". Example: ./to_dot.sh sum a.c.bc.etfg will create a.c.bc.etfg.sum.dot.

FWIW, the graph in the example above was created using this tool.

Interpreting the results

The result of your anticipated expressions analysis will be available in serialized format in the output TFG file: search for text =Anticipated expressions in. Example:

=Anticipated expressions in src.llvm.sum
=anticipated expressions at L0%0%1 in src.llvm.sum
=anticipated_exprs begin
=expr
1 : 0 : BV:32
2 : input.src.llvm-%n : BV:32
3 : bvslt(1, 2) : BOOL
=expr
=anticipated_exprs end
=anticipated expressions at Lentry%1%1 in src.llvm.sum
=anticipated_exprs begin
...
...
...
=anticipated_exprs end
=anticipated expressions at Lfor.cond%1%1 in src.llvm.sum
...
...
...

The above example represents the following information: at PC L0%0%1, the expression bvslt(0, input.src.llvm-%n), i.e. (0 < %n) is anticipated (we can probably guess here that the first statement of the corresponding C code has a check with n > 0 in it, where n is a parameter).

Requirements

Your implementation should be able to find all anticipated expressions at every PC in the input program. As the framework can transparently handle things like aliasing and function calls, your implementation should not assume the absence of any program features.

Turn in

Submit your changes in the form of a compressed patch file:

git diff col729_lab1 | gzip > ${your_group_members_entry_nums}.gz
Submit ${your_group_members_entry_nums}.gz to moodle

NOTE: It is your responsibility to ensure that a sequence of gunzip -c ${your_group_members_entry_nums}.gz | patch -p1, when executed in superopt-project root with HEAD at col728_lab1, produces a tree equivalent to yours.

NOTE: Please start early in this lab. The documentation is almost non-existent, and the code could be more beginner-friendly. It may take some time to get used to it and be able to make any changes.