if (x == y) z = 1; else z = 2;will look like this to the lexical analyzer:
\tif (x == y)\n\t z = 1;\n\telse\n\t z = 2;\nThe lexical analyzer has to put dividers between different units. e.g., a divider between
\t
and if
and <whitespace>
and (
and so on ...
lexer :: string --> [ token ] token = <class, string>Example input:
foo = 42Example output:
<Id, "foo"> <Op, "="> <Int, "42">In our original example
\tif (x == y)\n\t z = 1;\n\telse\n\t z = 2;\nthe relevant token classes are:
A lexical analyzer needs to
VAR1is exactly the same as
VA R1Fortran idea: removing all the whitespace should not change the meaning of the program.
Loop example
DO 5 I = 1,25Here
DO
is a keyword representing a loop (similar to FOR
in C), I = 1,25
represents that the iteration variable I
ranges from 1..25
. The number 5
represents that the loop extends from the DO
statement till the statement with label 5, including all statements that are in between.
Another example:
DO 5 I = 1.25The only difference in this code from the previous code is the replacement of
,
with .
. This simply means that the variable DO5I = 1.25
, i.e., a variable has been assigned an integer (there is no loop).
The problem is that just by looking at the first three characters, I cannot tell whether DO
is a keyword or a prefix of a variable name. Hence we need a lookahead. In this example, a large lookahead is required. Ideally, the language design should ensure that the lookahead is small.
else
,
we need to lookahead to decide whether it is an identifier or a keyword. Similarly, we need lookahead to disambiguate between ==
and =
.
PL/1: keywords are not reserved.
IF ELSE THEN THEN = ELSE; ELSE ELSE = THENThis makes lexical analysis a bit more difficult -- need to decide what is a variable name and what is a keyword, and so need to look at what's going on in the rest of the expression.
In fact, there are examples where PL/1 may require unbounded lookahead!
Some of these problems exist in languages today: e.g., C++
C++ template syntax
Foo<Bar>C++ stream syntax
cin >> varFor a long time, you had to put a blank between two ">" signs in the template syntax. Most C++ compilers have fixed this now.
How much lookahead is required for a language like C? Depends on the implementation, but one idea that would work well is: look ahead till the next whitespace (outside of quotes). For example, consider a string constant in C "a" "b"
. Here, the lexer emits two tokens, one for the string constant "a"
and another for the string constant "b"
; it is left to the parser to interpret these as a single string constant.
'c' = {"c"}
\epsilon = {""}Note that \epsilon (a language that has a single element, namely the empty string) is not the same as \phi (empty language).
Union: A + B = { a | a in A} U { b | b in B}
Concatenation: AB = { ab | a in A and b in B}
Iteration (Kleene closure): A* = A0 + A1 + A2 + ... A\infinity
A0 = \epsilon
Ai = A concatenated with itself i times
Definition: The regular expressions over \Sigma are the smallest set of expressions that includes the following:
R = \epsilon | 'c' | R + R | RR | R*This is called the grammar of the regular expressions. We will learn more about grammars when we talk about parsing (not relevant at this stage).
\Sigma = {0, 1}
1* = all strings of 1s
(1 + 0)1 = {11, 01}
0* + 1* = all strings of 0s UNION all strings of 1s
(0 + 1)* = all strings of 0s and 1s (or all strings over the entire alphabet) = \Sigma* (special name for this language)
Many ways of writing any regular language
Regular expressions specify regular languages
Five constructs: Two base cases (\epsilon, single-character strings). Three compound expressions (union, concatenation, iteration)