COL729 Lab 1 : Anticipated (Needed) Expressions Analysis

Due date - 23:55, 10 March 2023

Weight: 40%

Overview

In this assignment, you will implement Anticipated (Needed) Expressions Analysis (notes, video). Make sure you understand the analysis before proceeding ahead.

In order to implement a data-flow analysis, (broadly) two things are required: (1) a framework to represent the target program as a flow graph and, (2) a DFA frawework within which the schema can be implemented.

The CompilerAI framework -- developed in-house here at IITD -- provides both, and you will be implementing your analyses in this framework for this and future labs. The next subsection gives a (very) brief conceptual overview of the framework.

CompilerAI framework

TFG representation

A program is represented as Control Flow Graph (CFG) and is referred to as TFG (Transfer Function Graph) in the code.

Constituents of a TFG
A TFG is associated with a set of nodes and a set of edges.
A C to TFG transformation example C code
int sum(int n)
{
  int ret = 0;
  for (int i = 0; i < n; ++i) {
    ret += i;
  }
  return ret;
}
LLVM code
define i32 @sum(i32 %n) {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %ret.0 = phi i32 [ 0, %entry ], [ %add, %for.inc ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %cmp = icmp slt i32 %i.0, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %add = add nsw i32 %ret.0, %i.0
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret i32 %ret.0
}
TFG (abbreviated)
=Input:
	%n	

=Output:
	%ret-reg	

=Nodes:

	"L0%0%d"
	"Lfor.cond%1%bbentry"
	"Lfor.cond%2%d"
	"Lfor.body%1%d"
	"Lfor.end%1%d"
	"E0%0%d"

=Edges:

	"L0%0%d" => "Lfor.cond%1%bbentry"
		Guard: true
		To-State:
			%i.0 ↦ 0
			%ret.0 ↦ 0

	"Lfor.cond%1%bbentry" => "Lfor.cond%2%d"
		Guard: true
		To-State:
			%cmp ↦ %i.0 < %n


	"Lfor.cond%2%d" => "Lfor.body%1%d"
		Guard: %cmp
		To-State:

	"Lfor.body%1%d" => "Lfor.cond%1%bbentry"
		Guard: true
		To-State:
			%i.0 ↦ %i.0 + 1
			%ret.0 ↦ %ret.0 + %i.0

	"Lfor.cond%2%d" => "Lfor.end%1%d"
		Guard: !(%cmp)
		To-State:

	"Lfor.end%1%d" => "E0%0%d"
		Guard: true
		To-State:
			%ret-reg ↦ %ret.0
Graphical representation of edges

Instructions

The source code for the framework is available in the form of a Git repository in the server machine provided to you for lab work. You will create a local copy of it and submit your changes as a patch against the reference code.

Getting started

Follow the steps below to get started.

  1. Clone the Git repo in your home directory: git clone --recursive /home2/col729-common/superopt-project
  2. Change directory to superopt-project.
  3. Link-in the tars directory: ln -s /home2/col729-common/tars ..
  4. Build the project: make.
    If you get any issues in this step, make a post on Teams with full output of make.

At this point, the initial setup is complete.

Change directory to superopt. This is the top-level directory under which you will make your changes. Henceforth it is assumed that you are already in the root of superopt. The primary places where you need to make changes have been marked with COL729_LAB1 TODO. Search for them using git grep 'COL729_LAB1 TODO'.

Working with the code

TFG

Recall that a program is represented using TFG. Conceptually, a TFG is a graph with additional features (like input/output). The code follows this conceptual structure where a TFG, represented by tfg class, has an is-a relationship with a graph, represented by the graph class. The graph class is mainly concerned with "graph" properties: edges and nodes, and the tfg class and its more specialized cases, such as tfg_llvm_t represent a program (in the present case, a C function) created from LLVM IR (which itself is created from a C program). The relevant source code paths for "graph" and "TFG" related functionality are include/graph/ and include/tfg/, respectively.

DFA

The DFA framework provides a parameterized implementation of the worklist algorithm. The parameterization allows the user to reuse the same base code for implementing all data-flow analyses.

To implement an analysis in the DFA framework, one needs to subclass the data_flow_analysis and provide an overriding implementation for the xfer_and_meet() function, which accepts three parameters: (1) the corresponding edge, (2) input value (i.e. IN), (3) output value (i.e. OUT) passed as a reference. The function is supposed to compute the new OUT value using the IN value and the old OUT value (by using meet() for the latter) and indicate as the return value whether there was a change in the OUT value. The DFA framework is defined in include/graph/dfa.h. Some already existing analyses can be found in files include/graph/var_liveness_dfa.h and include/tfg/liveness_dfa.h.

Studying the implementation of xfer_and_meet of existing analyses would give you enough information to implement the anticipated expressions analysis. You may also find the generic loc (see following subsection) "DFA value" class definition (defined in include/graph/graph_per_loc_dfa_val.h) helpful. Most things remain the same in the analyses: you need the expressions generated and killed in each edge (see populate_gen_and_kill_sets_for_edge()), you need to kill those expressions which depend on a killed expression (see graph_get_locs_potentially_read_in_expr()) and so on.

Expressions and Locs

Expressions (expr)

We represent an (AST) expression using the expr class (see include/expr/expr.h). expr is a sum type (also known as union type) representing either a variable, a constant, or an operation over sub-expression(s), which are also exprs themselves. Each expr is associated with a sort representing the sort (or type) of the value the expr is holding.

As an expr is essentially a DAG (directed acyclic graph) of values (which could be exprs), it is serialized as a DAG. The serialization format is a sequence of statements where each statement is an expr potentially referring to an earlier expr:

      EXPR_ID : NAME or VALUE or OPERATION : SORT
Here, EXPR_ID is an integer identifier that can be used to refer to this expr, NAME is a variable name, VALUE is a constant value, OPERATION is an operation (operator applied to (references to) earlier expressions), and SORT is, well, the sort of the expr. Some examples:

expr_ref: In the code, you mostly see references to expr_ref instead of expr. It is because an expr_ref is a reference (pointer) to an immutable expr object which is reference-counted to simplify memory management.

Locs (graph_loc_id_t)

Conceptually, a loc represents a memory location that can have an associated expression (expr). Thus, a loc is one of the following:

The framework reasons about memory locations using locs. For example, a write to an LLVM variable would modify the loc corresponding to that variable, a write to a memory location would modify all memory slots whose address can potentially alias with the target of memory write, and so on.

It is possible to get the expr_ref associated with a loc (graph_loc_id_t in code) and vice-versa using two using helper functions: graph_loc_get_expr() and graph_get_locs_potentially_read_in_expr().

Some useful functions, such as populate_gen_and_kill_sets_for_edge(), return locs instead of expressions.

Skeleton code

In order to simplify things and allow you to focus on the analysis-relevant parts, some skeleton code is provided. In particular:

Running and testing your changes

Once you have filled-in the required code, you need to uncomment the driver code in compute_anticipated_exprs() function of lib/graph/graph_with_locs.cpp to run the anticipated expressions analysis. Remember to delete the return {}; line.

Steps for building and running the code:

  1. Run make in superopt-project to build your changes.
  2. Now switch the directory to lab1_test and run ./c2bc.sh a.c to compile C file a.c to LLVM bitcode a.c.bc.
  3. While still in lab1_test, run ./run.sh a.c.bc to convert the LLVM bitcode to TFG and run the analysis. The output TFG file name will be named ${INPUT_FILENAME}.etfg, e.g., a.c.bc.etfg for a.c.bc. The log (stdout) of run is automatically saved in last_run.log.
    You can turn on debug output by appending --dyn_debug=DEBUG_CLASS to llvm2tfg command in run.sh. DYN_DEBUG statements in your code will get executed if their DEBUG_CLASS matches the supplied DEBUG_CLASS. For example, in order to execute this statement:
    DYN_DEBUG(anticipated_analysis, cout << "Here" << endl);
    you will supply --dyn_debug=anticipated_analysis to llvm2tfg.

Visual representation of TFG

For debugging, you may also find the to_dot.sh tool (present in lab1_test/to_dot.sh), helpful. It makes a Graphviz DOT file out of a TFG. The created DOT file can be visualized using xdot or a similar DOT viewer to see the graphical structure of the TFG. The tool accepts two parameters: function name and TFG file name, where the function name is the function's name for which graphviz DOT graph is to be created. The output file name is named as "<TFG file name>.<function name>.dot". Example: ./to_dot.sh sum a.c.bc.etfg will create a.c.bc.etfg.sum.dot.

FWIW, the graph in the example above was created using this tool.

Interpreting the results

The result of your anticipated expressions analysis will be available in serialized format in the output TFG file: search for text =Anticipated expressions in. Example:
=Anticipated expressions in src.llvm.sum
=anticipated expressions at L0%0%1 in src.llvm.sum
=anticipated_exprs begin
=expr
1 : 0 : BV:32
2 : input.src.llvm-%n : BV:32
3 : bvslt(1, 2) : BOOL
=expr
=anticipated_exprs end
=anticipated expressions at Lentry%1%1 in src.llvm.sum
=anticipated_exprs begin
...
...
...
=anticipated_exprs end
=anticipated expressions at Lfor.cond%1%1 in src.llvm.sum
...
...
...
The above example represents the following information: at PC L0%0%1, the expression bvslt(0, input.src.llvm-%n), i.e. (0 < %n) is anticipated (we can probably guess here that the first statement of the corresponding C code has a check with n > 0 in it, where n is a parameter).

Requirements

Your implementation should be able to find all anticipated expressions at every PC in the input program. As the framework can transparently handle things like aliasing and function calls, your implementation should not assume the absence of any program features.

Turn in

Submit your changes in the form of a compressed patch file:

NOTE: It is your responsibility to ensure that a sequence of gunzip -c ${your_group_members_entry_nums}.gz | patch -p1, when executed in superopt-project root with HEAD at col728_lab1, produces a tree equivalent to yours.

NOTE: Please start early in this lab. The documentation is almost non-existent, and the code could be more beginner-friendly. It may take some time to get used to it and be able to make any changes.