Revert to the (unambiguous) natural grammar for our example:
E --> T + E | T
T --> int * T | int | (E)
Consider the string
int * int + int
Bottom-up parsing reduces a string to the start symbol by inverting productions
int * int + int      int * T + int      T --> int
int * T + int        T + int            T --> int * T
T + int              T + T              T --> int
T + T                T + E              E --> T
T + E                E                  E --> T + E
Read bottom-to-top, these are the productions of a derivation of the string.
What's the point? The productions, read backwards, trace a rightmost derivation, i.e., at each step we expand the rightmost non-terminal.
Important fact #1 about bottom-up parsing
Bottom-up parsing traces a right-most derivation in reverse
Picture
int  *  int  +  int
 |   |   |   |   |
 |   |   T   |   T
  \  |  /    |   |
   \ | /     |   E
     T       |  /
      \      | /
       \     |/
        \    /
         \  /
          E
Consequence: whenever we reduce \alpha\beta\omega --> \alpha{}X\omega (using X --> \beta), it must be true that \omega is a string of terminals. Why? If \alpha\beta\omega is a step of the bottom-up parse obtained via X --> \beta, then \alpha{}X\omega --> \alpha\beta\omega is a step in a rightmost derivation, and in a rightmost derivation everything to the right of the expanded non-terminal X is already a string of terminals; hence \omega is a string of terminals.
Idea: split the string into two substrings \alpha|\omega, where \alpha is a string of terminals and non-terminals and \omega is a string of only terminals. If A --> xy is a production, then a reduce step rewrites
Cbxy|ijk => CbA|ijk
Example: int * int + int
Assume an oracle tells us whether to shift or reduce.
|int * int + int     shift
int |* int + int     shift
int * |int + int     shift
int * int| + int     reduce T --> int
int * T| + int       reduce T --> int * T
T| + int             shift
T +| int             shift
T + int|             reduce T --> int
T + T|               reduce E --> T
T + E|               reduce E --> T + E
E|
The left string can be implemented with a stack, because we only ever reduce a suffix of the left string; the top of the stack is just to the left of the |.
shift pushes a terminal on the stack
reduce pops the production's RHS off the stack and pushes the production's LHS (a non-terminal) onto the stack
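To make these moves concrete, here is a minimal Python sketch of the stack machine, with the oracle replaced by the hard-coded action script from the trace above; the encoding and all names are illustrative, not from the notes.

```python
# Shift-reduce stack machine for E --> T+E | T, T --> int*T | int | (E).
def run(tokens, script):
    stack = []                             # the left string
    for action in script:
        if action == "shift":
            stack.append(tokens.pop(0))    # move | one token to the right
        else:                              # reduce: pop RHS, push LHS
            lhs, rhs = action
            assert stack[-len(rhs):] == rhs, (stack, rhs)
            del stack[-len(rhs):]
            stack.append(lhs)
    return stack, tokens

script = [
    "shift", "shift", "shift",
    ("T", ["int"]),            # int * int| + int  =>  int * T| + int
    ("T", ["int", "*", "T"]),  # int * T| + int    =>  T| + int
    "shift", "shift",
    ("T", ["int"]),
    ("E", ["T"]),
    ("E", ["T", "+", "E"]),
]
print(run(["int", "*", "int", "+", "int"], script))  # (['E'], [])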
In a given state, more than one action (shift or reduce) may lead to a valid parse.
If it is legal to either shift or reduce, there is a shift-reduce conflict; we need techniques to remove them.
If it is legal to reduce by two different productions, there is a reduce-reduce conflict. These are always bad and usually indicate a serious problem with the grammar.
In either case, the parser does not know what to do; we must either rewrite the grammar or give the parser a hint about what to do in such a situation.
Solution 1: backtracking; try both shift and reduce (exponential search space).
Solution 2: predictive parsing: LR(k).
Example: int * int + int
|int * int + int     shift
int |* int + int     shift
At this point, we could instead reduce by T --> int:
T |* int + int       reduce
But this would be a fatal mistake: there is then no way we can reduce to E (no production's RHS begins with T *).
Intuition: want to reduce only if the result can still be reduced to the start symbol (E)
Assume a right-most derivation:
S -->* \alpha X \omega --> \alpha\beta\omega
Then \alpha\beta is a handle of \alpha\beta\omega.
Is int a handle of int*int+int?
E --> T+E --> T+T --> T+int --> int*T+int --> ...
The int at the leftmost position was produced by T --> int*T, not by a reduction from just int. Hence, int is not a handle of int*int+int.
Is int*T a handle of int*int+int? Yes.
Is int*T+E a handle of int*int+int?
E --> T+E --> T+T --> ...
No: in a rightmost derivation, the rightmost E is rewritten (E --> T) at the very beginning, so the sentential form int*T+E never arises; int*T+E is not a handle of int*int+int.
Is \epsilon a handle of int*int+int?
E --> T+E --> ...
No. We never produce \epsilon at the leftmost position in this rightmost derivation.
Handles formalize the intuition: A handle is a reduction that allows further reductions back to the start symbol.
We only want to reduce at handles.
Note: we have said what a handle is, not how to find handles.
Important fact #2 about bottom-up parsing: in shift-reduce parsing, handles appear only at the top of the stack, never inside. (Provable by induction on the number of reduce moves.)
Hence shift-reduce parsers can only shift or reduce at the top of the stack. But how do we recognize handles, i.e., decide when to shift and when to reduce? We use heuristics to identify handles, and on certain classes of CFGs these heuristics are always correct.
Strategy: try and rule out non-handles. Even better: rule out prefixes that can never result in handles.
int|*int+int: with one token of lookahead, the parser knows that the next token is *. If the string begins with T*, it has no hope of being reduced to E (for any possible string of terminals that follows). In other words, T* is not a viable prefix.
If there is a valid rightmost derivation of the string
S --> ... --> \alpha{}X\omega --> \alpha\beta\omega --> ... --> string,
then \alpha\beta and all of its prefixes are viable prefixes. E.g., \epsilon is always a viable prefix.
Example:
S --> E$
E --> T
T --> int
A rightmost derivation is S --> E$ --> T$ --> int$. Here E$ is a viable prefix (production S --> E$, with X = S and \beta = E$), and T is a viable prefix (with X = E and \beta = T). But T$ is not a viable prefix, as there is nothing that generates T$ in one step; similarly, int$ is not a viable prefix.
Examples. Consider the grammar below.
S --> E$
E --> T+E | T
T --> int*T | int | (E)
Which of these are viable prefixes?
T? Yes, because S --> E$ --> T+E$ --> ... is a valid rightmost derivation.
\epsilon? Yes, always.
int*? Yes: S --> E$ --> T+E$ --> ... --> int*T+...
T*? Consider the possibilities: S --> E$ --> T+E$ --> ... is not helpful; neither is S --> E$ --> T$ --> ... So, no.
int*T+T? No. (We must first rewrite the rightmost T to something before we can expand the left T to int*T.)
E? Yes.
In other words, \alpha is a viable prefix iff there is an \omega such that \alpha|\omega is an intermediate state of a shift-reduce parser in a valid parse of some string. Here \alpha is the stack and \omega is the rest of the input: the parser can look ahead at a few tokens of \omega, but it never knows all of \omega; it does know the whole stack.
What does this mean? A viable prefix does not extend past the right end of the handle; it is called viable because it is a prefix of a handle. As long as the parser has a viable prefix on the stack, no parsing error has been detected.
Predictive shift-reduce parsing: based on the lookahead, check whether any of the shift or reduce choices fails to produce a viable prefix; if so, discard that choice. For example, with one token of lookahead, consider the state \alpha\beta\gamma|a\omega:
Shift: is \alpha\beta\gamma{}a a viable prefix?
Reduce by a production X --> \gamma: is \alpha\beta{}Xa a viable prefix?
Conversely, if a string is parseable by the bottom-up parser, then every intermediate state of the stack is a viable prefix.
Consider the string (int * int):
(int *|int) is a state of a shift-reduce parse. The stack contains prefixes of the RHSs of productions:
( is a prefix of the RHS of T --> (E)
\epsilon is a prefix of the RHS of E --> T (an interesting case: we are considering \epsilon as a prefix too!)
int * is a prefix of the RHS of T --> int * T
Venn diagram: All CFGs \superset Unambiguous CFGs \superset LR(1) CFGs \superset LALR(1) CFGs \superset SLR \superset LR(0)
LR(k) grammars are quite general, but most practical grammars are LALR(k). SLR(k) ["simple LR" grammars] are a simplification of LALR(k).
LL(1) is a subset of LR(1) but can cut across LR(0), SLR, LALR(1).
Consider the input (int) for the grammar:
E --> T + E | T
T --> int * T | int | (E)
(E|) is a state of a shift-reduce parse. (E is a prefix of the RHS of T --> (E). The item T --> (E.) says that so far we have seen (E of this production and hope to see ).
The stack may have many prefixes of RHSs:
Prefix(1) Prefix(2) Prefix(3) ... Prefix(n-1) Prefix(n)
Let Prefix(i) be a prefix of the RHS of X(i) --> \alpha(i). Then:
Prefix(i) will eventually reduce to X(i);
X(i-1) --> Prefix(i-1)X(i)\beta must be a production (for some \beta);
X(i-2) --> Prefix(i-2)X(i-1)\gamma must be a production (for some \gamma); and so on.
In general, Prefix(k+1) ... Prefix(n-1) Prefix(n) eventually reduces to the missing part of \alpha(k).
Important fact #3 about bottom-up parsing: for any grammar, the set of viable prefixes is a regular language. This language consists of strings formed by concatenating prefixes of production RHSs (items) that chain together as described above.
For example, the language of viable prefixes for the grammar
S --> \epsilon | [S]
is
\epsilon | "["* | "["+S | "["+S"]"
Notice that [[S]] and [] are not viable prefixes.
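Since, by important fact #3 above, the viable prefixes form a regular language, we can sanity-check the language just given with an ordinary regular expression. Treating S as a literal character is our own encoding for this sketch:

```python
import re

# \epsilon | "["* | "["+S | "["+S"]"  as a single regex ("[" escaped).
VIABLE = re.compile(r"\[*|\[+S\]?")

for w in ["", "[", "[[S", "[S]", "[[S]]", "[]"]:
    print(repr(w), bool(VIABLE.fullmatch(w)))
# [[S]] and [] are rejected, as claimed above.
```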
The problem in recognizing viable prefixes is that the stack holds only bits and pieces of the RHSs of productions. An item is a production with a "." somewhere on its RHS. E.g., the items for T --> (E) are:
T --> .(E)
T --> (.E)
T --> (E.)
T --> (E).
The only item for an \epsilon production X --> \epsilon is X --> . (the dot alone). Items are often called "LR(0) items".
To recognize viable prefixes, we first add a new production S' --> S to G. The new start symbol S' is used in only one place: the LHS of this production. We then build an NFA such that
NFA(stack) = yes if stack is a viable prefix, no otherwise.
The states of the NFA are the items of G, the start state is the item S' --> .S, and every state is an accepting state.
For each item E --> \alpha.X\beta (X a terminal or non-terminal), add the transition
(E --> \alpha.X\beta) --X--> (E --> \alpha{}X.\beta)
For each item E --> \alpha.X\beta where X is a non-terminal with production X --> \gamma, add
(E --> \alpha.X\beta) --\epsilon--> (X --> .\gamma)
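Here is a small Python sketch of this NFA construction for the grammar S --> E$, E --> T+E | T, T --> int*T | int | (E) used above. The dictionary encoding and the helper names (sym_edges, eps_closure, is_viable_prefix) are our own, and we simulate the NFA directly instead of converting to a DFA; since every state is accepting, a prefix is viable iff at least one state survives.

```python
from collections import defaultdict

# Items are triples (lhs, rhs, dot).
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("E", "$")],
    "E":  [("T", "+", "E"), ("T",)],
    "T":  [("int", "*", "T"), ("int",), ("(", "E", ")")],
}

sym_edges = {}                 # (item, symbol) -> item  (dot moves right)
eps_edges = defaultdict(list)  # item -> [items]         (expand non-terminal)
for lhs, rhss in GRAMMAR.items():
    for rhs in rhss:
        for d in range(len(rhs)):
            item, x = (lhs, rhs, d), rhs[d]
            sym_edges[(item, x)] = (lhs, rhs, d + 1)
            for prod in GRAMMAR.get(x, ()):   # only if x is a non-terminal
                eps_edges[item].append((x, prod, 0))

def eps_closure(states):
    todo, seen = list(states), set(states)
    while todo:
        for nxt in eps_edges.get(todo.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

def is_viable_prefix(symbols):
    # Every NFA state is accepting: viable iff some state survives.
    cur = eps_closure({("S'", ("S",), 0)})
    for s in symbols:
        cur = eps_closure({sym_edges[(it, s)]
                           for it in cur if (it, s) in sym_edges})
    return bool(cur)

print(is_viable_prefix(["(", "int", "*"]))  # True
print(is_viable_prefix(["T", "*"]))         # False: T* is not viable
```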
For example, the item stack
T --> (.E)
E --> .T
T --> int * .T
says that: we have seen ( of T --> (E); we have seen nothing yet of E --> T; and we have seen int * of T --> int * T.
In other words, every viable prefix can be represented as a stack of items, where the (n+1)th item is a production of the non-terminal that follows the "." in the nth item.
After converting the NFA to a DFA: if running the DFA on input \alpha ends in DFA state s, the items in s describe what the top of the item stack might be after reading \alpha.
An item is often valid for many prefixes. E.g., the item T --> (.E) is valid for the prefixes
( (( ((( (((( ...
We can see this by looking at the DFA, which keeps looping in the same DFA state for each additional open paren.
Need to show the NFA and DFA construction for our example grammar, and the valid items for these string prefixes. Will need a laptop and a projector!
The language of viable prefixes for the example grammar
S --> \epsilon | [S] | S.S
is
X = \epsilon | "["* | "["*S | "["*S"]"
Y = "["*S.(X | Y)*
Z = X + Y
LR(0) parsing: suppose running the DFA on the stack ends in state s, and the next input token is t.
Reduce by X --> \beta if s contains the item X --> \beta.
Shift if s contains an item X --> \beta.t\omega, i.e., if s has a transition labeled t.
If s contains both X --> \beta. and Y --> \omega., there is a reduce-reduce conflict.
If s contains both X --> \beta. and Y --> \omega.t\delta, there is a shift-reduce conflict.
Resolve conflicts by increasing lookahead. For example, in the shift-reduce conflict shown above, if the next token is not t, then the shift possibility can be discarded.
LR(1) parsing (we are not going to discuss this in detail): bake the lookahead into the items. An item now looks like X --> \beta., t to indicate that if the next terminal is t, then reduce (but do not reduce if the next terminal is not t).
SLR = "Simple LR": improves on LR(0) shift/reduce heuristics so fewer state have conflicts
Idea: assume the DFA halts on the current stack in state s and the next token is t. Then:
Reduce by X --> \beta only if s contains the item X --> \beta. and t \in Follow(X) [the only change relative to LR(0); a sketch of computing Follow appears after this list].
Shift if s contains an item X --> \beta.t\omega.
If there are conflicts under these rules, the grammar is not SLR.
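Follow sets were constructed earlier in the course; as a refresher, here is a compact fixed-point sketch. It assumes the GRAMMAR encoding from the NFA sketch above and, for brevity, a grammar with no \epsilon-productions (so First of a string is First of its first symbol):

```python
def first_sets(grammar):
    # Terminals are the RHS symbols that are not non-terminals.
    first = {a: {a} for rhss in grammar.values() for rhs in rhss
             for a in rhs if a not in grammar}
    first.update({x: set() for x in grammar})
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                # No \epsilon-productions, so First(lhs) grows only
                # from the first symbol of each RHS.
                if not first[rhs[0]] <= first[lhs]:
                    first[lhs] |= first[rhs[0]]
                    changed = True
    return first

def follow_sets(grammar, start="S'"):
    first = first_sets(grammar)
    follow = {x: set() for x in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                for i, x in enumerate(rhs):
                    if x not in grammar:   # terminal: no Follow set
                        continue
                    # The symbol after x contributes its First set; at
                    # the end of the RHS, x inherits Follow(lhs).
                    new = first[rhs[i + 1]] if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True
    return follow

print(follow_sets(GRAMMAR)["T"])  # {'+', '$', ')'} (in some order)
```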
Example of a grammar that is not SLR:
E --> E+E | E*E | (E) | int
A DFA state containing both E --> E*E. and E --> E.+E has a shift-reduce conflict even with the Follow check, because + \in Follow(E).
SLR parsing algorithm:
Let M be the DFA for viable prefixes of G.
Let |x1...xn$ be the initial configuration.
Repeat until the configuration is S|$:
Let \alpha|\omega be the current configuration.
Run M on the current stack \alpha.
If M rejects \alpha, report a parsing error: \alpha is not a viable prefix.
If M accepts \alpha with items I, let a be the next input token:
Shift if X --> \beta.a\gamma \in I.
Reduce by X --> \beta if X --> \beta. \in I and a \in Follow(X).
Otherwise, report a parsing error.
If there can be a conflict in the last step, the grammar is not SLR. To check whether there can be a conflict, check whether for some state both shifting and reducing are options.
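A sketch of one iteration of this loop, reusing GRAMMAR, sym_edges, and eps_closure from the item-NFA sketch and follow_sets from above (so this fragment is not self-contained on its own):

```python
def slr_moves(stack_symbols, lookahead, follow):
    # Run the automaton over the whole stack, collect the item set I,
    # then apply the shift/reduce rules above. More than one move here
    # means the grammar is not SLR.
    items = eps_closure({("S'", ("S",), 0)})
    for s in stack_symbols:
        items = eps_closure({sym_edges[(it, s)]
                             for it in items if (it, s) in sym_edges})
    if not items:
        return "error: stack is not a viable prefix"
    moves = []
    for lhs, rhs, d in items:
        if d < len(rhs) and rhs[d] == lookahead:
            moves.append("shift")
        if d == len(rhs) and lookahead in follow[lhs]:
            moves.append(("reduce", lhs, rhs))
    return moves

follow = follow_sets(GRAMMAR)
print(slr_moves(["int"], "*", follow))  # ['shift']: do not reduce T --> int
```

Note that this resolves the "fatal mistake" example above: with int on the stack and * as lookahead, the only SLR move is to shift.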
To avoid rerunning the DFA on the whole stack at every step, change the stack to contain pairs <Symbol, DFA state>:
<sym1,state1> ... <symn,staten>
where staten is the final state of the DFA on sym1...symn. The bottom of the stack is <any,start>, where any is an arbitrary dummy symbol and start is the start state of the DFA. Define goto[i,A] = j if statei --A--> statej; goto is just the transition function of the DFA, and it is one of the parser's two tables.
The parser moves become:
Shift x: push <a, x> on the stack, where a is the current input token and x is a DFA state.
Reduce X --> \alpha: pop |\alpha| pairs off the stack, then push <X, goto[top_state(stack), X]>.
Action table: for each state si and terminal a:
If si has item X --> \alpha.a\beta and goto[i,a] = j, then action[i,a] = shift j.
If si has item X --> \alpha. and a \in Follow(X) and X != S', then action[i,a] = reduce X --> \alpha.
If si has item S' --> S., then action[i,$] = accept.
Otherwise, action[i,a] = error.
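These rules translate directly into code. A sketch, assuming the DFA is already built: states maps a state number to its item set (triples as in the NFA sketch), goto_tbl is the DFA transition function, terminals is the terminal alphabet, and follow comes from follow_sets; all of these names are ours.

```python
def build_action(states, goto_tbl, terminals, follow):
    action = {}  # missing entries mean action[i,a] = error
    for i, items in states.items():
        for lhs, rhs, d in items:
            if d < len(rhs) and rhs[d] in terminals:    # shift rule
                action[(i, rhs[d])] = ("shift", goto_tbl[(i, rhs[d])])
            elif (lhs, rhs, d) == ("S'", ("S",), 1):    # accept rule
                action[(i, "$")] = ("accept",)          # (before reduce,
            elif d == len(rhs):                         #  enforcing X != S')
                for a in follow[lhs]:                   # reduce rule
                    # A real builder would flag a conflict if this entry
                    # is already taken by a different move (not SLR).
                    action[(i, a)] = ("reduce", lhs, rhs)
    return action
```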
SLR Improvements: with the <Symbol, DFA state> stack and the goto/action tables in place, the complete algorithm is:
Let I = x1...xn$ be the input and j = 0.
Let DFA state 1 be the start state, containing the item S' --> .S; initialize stack = <dummy, 1>.
Repeat:
case action[top_state(stack), I[j]] of
shift k: push <I[j], k> and advance j
reduce X --> \alpha: pop |\alpha| pairs, push <X, goto[top_state(stack), X]>
accept: halt normally
error: halt and report a parsing error
In LR(1)-item notation (see above), [T --> .int * T, $] means: we are working on T --> int * T, and once the dot reaches the end we reduce only if the lookahead is $.
T --> S'$
S' --> S
S --> Sa
S --> b
SLR parsers do not mind left-recursive grammars (note that S --> Sa is left-recursive).
The first state in the corresponding DFA (the \epsilon-closure of the NFA start state) is:
(state 1) S' --> .S, S --> .Sa, S --> .b
If we see a b in this state, we get another DFA state:
(state 2) S --> b.
Alternatively, if we see an S in this state, we get:
(state 3) S' --> S., S --> S.a
From state 3, if we see an a, we get:
(state 4) S --> Sa.
The only state with a shift-reduce conflict is state 3. But Follow(S') contains only $, so the conflict is resolved with one token of lookahead: reduce on $, shift on a. Hence this is an SLR grammar.
If we get rid of the non-terminal S' above by short-cutting to T --> S$, the grammar becomes LR(0).
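To tie the table-driven algorithm to this example, here is a sketch that drives the <symbol, state> stack with action/goto tables read off states 1-4 above (the tables are hand-written for illustration, not generated):

```python
# Action/goto tables for S' --> S, S --> Sa | b, from states 1-4 above.
ACTION = {
    (1, "b"): ("shift", 2),
    (2, "a"): ("reduce", "S", ["b"]),       # Follow(S) = {a, $}
    (2, "$"): ("reduce", "S", ["b"]),
    (3, "a"): ("shift", 4),
    (3, "$"): ("accept",),                  # Follow(S') = {$} resolves state 3
    (4, "a"): ("reduce", "S", ["S", "a"]),
    (4, "$"): ("reduce", "S", ["S", "a"]),
}
GOTO = {(1, "S"): 3}

def slr_parse(tokens):
    stack = [("dummy", 1)]                  # bottom of stack: <any, start>
    i = 0
    while True:
        state = stack[-1][1]
        act = ACTION.get((state, tokens[i]), ("error",))
        if act[0] == "shift":
            stack.append((tokens[i], act[1]))
            i += 1
        elif act[0] == "reduce":
            _, lhs, rhs = act
            del stack[-len(rhs):]           # pop |rhs| pairs
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
        elif act[0] == "accept":
            return True
        else:
            return False

print(slr_parse(["b", "a", "a", "$"]))  # True
print(slr_parse(["a", "$"]))            # False
```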
Another example grammar:
T --> S'$
S' --> S
S --> SaS
S --> b
Looking at the corresponding DFA, the first state is:
(state 1) S' --> .S, S --> .SaS, S --> .b
One possibility is that we see b in this state, giving:
(state 2) S --> b.
Another possibility is that we see S, giving:
(state 3) S' --> S., S --> S.aS
If we see a in this state, we get:
(state 4) S --> Sa.S, S --> .SaS, S --> .b
(notice that we formed an \epsilon-closure of the first item to add the other two items).
From state 4, if we see S, we get:
(state 5) S --> SaS., S --> S.aS
From state 5, if we see a again, we go back to state 4! And in state 4, seeing b takes us to state 2.
The only states that have conflicts are state 3 (resolved by Follow, since Follow(S') = {$}) and state 5 (a genuine shift-reduce conflict, because a \in Follow(S)).
Thus this is not an SLR grammar.