Several interesting and intuitive languages are not regular
{ (^i )^i | i >= 0 }This is fairly representative of several programming constructs, e.g., nested arithmetic constructs, nested scopes, nested if-then-else, etc. Programming languages usually have a natural recursive structure (often this is the structure of human thinking)
Regular expressions cannot represent this language. Intuitively, this language requires maintaining an unbounded amount of memory (specified by i), while FAs only have a bounded amount of memory (the number of states)
Example input: if x == y then z = 1; else z = 2;
Example output level 1: if cond then statement else statement
Example output level 2: if-then-else statement
Example output level 3: statement
The parse tree may be implicit, may not be output explicitly by the compiler.
For parsing, we need:
Recursive structure of programming languages are a good fit for Context-Free grammars
EXPR = if EXPR then EXPR else EXPR; | while EXPR ( EXPR )
A CFG consists of:
X -> Y1Y2..Yn
Strings of balanced parantheses:
S -> (S) S -> \epsilonHere,
N = {S} T = {'(', ')'}
Productions can be read as replacement rules: the RHS can replace the LHS
In other words, at any step of this derivation we perform the following replacement:
X1X2X3..Xi X Xi+1..Xn --> X1X2..Xi Y1Y2..Yk Xi+1..Xn
A derivation is of the form:
S --> ... -> \alpha0 --> ... --> \alphaN N >= 0
Let G be a CFG with start symbol S. L(G) of G is:
{a1...an | ai \in T and S ---*--> a1...an}Terminals are called so because there are no rules for replacing them
For a parser, the terminals are the tokens
A fragment of C (this notation is a more concise form of writing the productions, and is used in actual tools)
selection_statement: IF '(' expression ')' statement ELSE statement | IF '(' expression ')' statement | SWITCH '(' expression ')' statement; iteration_statement: WHILE '(' expression ')' statement | DO statement WHILE '(' expression ')' ';' | FOR '(' expression_statement expression_statement ')' statement | FOR '(' expression_statement expression_statement expression ')' statement | FOR '(' declaration expression_statement ')' statement | FOR '(' declaration expression_statement expression ')' statement; expression_statement: ';' | expression ';'; statement: labeled_statement | compound_statement | expression_statement | selection_statement | iteration_statement | jump_statement;
Some elements of this language?
Another example: arithmetic expressions
E --> E + E E --> E * E E --> ( E ) E --> Id
The idea of a CFG is a big step.
Form of the grammar is important
A derivation can be drawn as a tree
Example: id * id + id
. Generate the derivation tree for the grammar of arithmetic expressions. This derivation tree is called the parse tree of this string. While building the tree, also write the derivation rules that have been used at each step to the left.
A parse tree has
An in-order traversal of the parse tree yields the original input
Multiple parse trees are possible for the same input (even for this example)
Two derivations may produce the same or different parse trees.
The parse tree shows the association of the operations, the original input does not.
The example we discussed is a leftmost-derivation: at each step, replace the left-most non-terminal. There is an equivalent notion of a rightmost-derivation.
A derivation defines a parse tree. But one parse tree can have multiple derivations. The leftmost and rightmost derivations are important from the point-of-view of parser implementation (as we will see later).
Note that the leftmost and rightmost derivations have the same parse tree. This is just about the order in which the branches of the parse tree were added.
Example: id*id + id
This string has two parse-trees in the grammar of arithmetic operations: (id * id) + id AND id * (id + id)
A grammar is ambiguous if it has more than one parse tree for some string.
Ambiguity may leave the meaning of the program ill-defined.
Several ways to handle ambiguity: One method is to rewrite the grammar unambiguously (by "stratifying" it)
E --> E' + E | E' E' --> id * E' | id | (E) * E' | (E)Show the unique left-most derivation for
id*id + id
using this new grammar.
Another example:
E --> if E then E | if E then E else E | ...String : if E1 then if E2 then E3 else E4
E --> MIF /* all if are matched with else */ | UIF /* some if is unmatched */ MIF --> if E then MIF else MIF | id UIF --> if E then MIF else UIF /* the then expression is MIF, because if it were UIF then this else should have matched it! */ | if E then EShow the two parse trees, and show which one is rejected by this new grammar.
Impossible to convert automatically an ambiguous grammar to an unambiguous one
Used with care, ambiguity can sometimes simplify the grammar
Most languages and parsing tools
The parser will use these declarations to decide which move to take in case of ambiguity. There are primarily three common types of ambiguity resolution in PLs: associativity (left or right), precedence (* before +), and bind-to-innermost (or bind-to-outermost). The first two types are easy to specify in parser generators; the last type is harder and requires the programmer to manually change the grammar.
Purpose of a compiler is:
Many kinds of possible errors (e.g., in C):
Error handlers should
Error handling modes
Panic mode is the simplest
Example: (1 + + 2) + 3 The parser is going to get stuck at the second +. In panic mode recovery, skip ahead to next integer and then continue. (In this example, it would restart at 2, and treat it as 1 + 2, and then it would parse fine).
Bison: use the special terminator error
to describe how much input to skip (error productions)
E --> int | E + E | (E) | error int | (error)The fourth production says that if you see an error while trying to parse E, then the
error
symbol will match all the input until an int.
Similarly, the fourth production says that if you encounter an error while you are inside a paranthesized expression, you can discard everything until the closing bracket.
Error productions
Error correction: find a "nearby" program. Notion of "edit distance"
Review
Consider the grammar
E --> int | (E) | E + EAnd the string
5 + (2 + 3)After lexical analysis, a list of tokens
int5 '+' '(' 'int2' '+' 'int3' ')'During parsing we build a parse tree.
Draw a parse tree.
root = E, second level expands E + E, third level expands the first E into int5 and the second E into (E), and so on... The parse tree is sufficient for compilation (it captures the nesting structure) but it has too much information, such as parantheses and single-successor nodes, etc. AST is a cleaned-up and more concise version of the parse tree. Parentheses are redundant because the tree structure tells us the grouping already
AST also captures the nesting structure, but it abstracts from the concrete syntax. It is more compact and easier to use. An important data structure in a compiler
(A): PLUS, (B), (C) (B): 5 (C): PLUS (D), (E) (D): 2 (E): 3AST is an important data structure for the rest of the discussion