Let \Sigma be a set of characters (an alphabet).
A language over \Sigma is a set of strings of characters drawn from \Sigma.
E.g., English: Alphabet = English characters, Language = English sentences.
Another example: Alphabet = ASCII character set, Language = C programs.
The meaning function L maps syntax to semantics, e.g., L(e) = M.
Example: e could be a regular expression, and M would be the set of strings that it represents
Meaning of regular expressions:
L(\epsilon) = { "" }
L('c') = { "c" }
L(A + B) = L(A) \cup L(B)
L(AB) = { ab | a \in L(A) and b \in L(B) }
L(A*) = \cup_{i >= 0} L(A^i)
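For example, combining these rules on a small expression:

    L(('a' + 'b') 'c') = { xy | x \in L('a' + 'b') and y \in L('c') } = { "ac", "bc" }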
Significance of the meaning function: it separates syntax from semantics. We can change the syntax without changing the semantics, and different syntaxes are good for different languages/environments. It also allows redundancy: multiple syntaxes for the same semantics, e.g., some concise, others more efficient (the basis of optimization!).
Roman numerals vs. the decimal number system: the meanings of the two number systems are the same, but the notations are extremely different.
Example of redundancy: L(0*) = L(0 + 0*) = L(\epsilon + 00*) = ...
Thus L is a many-to-one function, and this is the basis of optimization. L is never one-to-many: a given piece of syntax has exactly one meaning.
Keyword: "if" or "else" or "then" or ... : 'i''f' + 'e''l''s''e' + 't''h''e''n' + ... : 'if' + 'else' + 'then' + ... (shorthand)
Integers = non-empty string of digits
digit = '0' + '1' + '2' + ... + '9'
Integer = digit digit* (ensures that there is at least one digit; also written digit+)
Identifier = strings of letters or digits, starting with a letter
letter = 'a' + 'b' + ... + 'z' + 'A' + ... + 'Z' (shorthand: [a-zA-Z])
Identifier = letter (letter + digit)*
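As a concrete illustration, a recognizer for this definition might be hand-coded as follows (a minimal sketch; the name match_identifier is hypothetical, and C's isalpha/isalnum stand in for the letter and digit classes):

    #include <ctype.h>
    #include <stddef.h>

    /* Length of the longest prefix of s matching
       Identifier = letter (letter + digit)*, or 0 if none. */
    size_t match_identifier(const char *s) {
        if (!isalpha((unsigned char)s[0]))
            return 0;                       /* must start with a letter */
        size_t i = 1;
        while (isalnum((unsigned char)s[i]))
            i++;                            /* letters or digits may follow */
        return i;
    }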
Whitespace = a non-empty sequence of blanks, newlines, and tabs
Whitespace = (' ' + '\n' + '\t')+
Example of a real language: PASCAL
digit = '0' + '1' + '2' + ... + '9'
digits = digit+
opt_fraction = ('.' digits) + \epsilon
opt_exponent = ('E' ('+' + '-' + \epsilon) digits) + \epsilon
num = digits opt_fraction opt_exponent
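Following the same pattern, a hand-coded recognizer for num could look like this (a sketch under the definitions above; match_num is a hypothetical name, returning the length of the longest matching prefix, or 0):

    #include <ctype.h>
    #include <stddef.h>

    size_t match_num(const char *s) {
        size_t i = 0;
        while (isdigit((unsigned char)s[i])) i++;   /* digits = digit+ */
        if (i == 0) return 0;                       /* need at least one digit */
        if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
            i++;                                    /* opt_fraction = ('.' digits) + \epsilon */
            while (isdigit((unsigned char)s[i])) i++;
        }
        if (s[i] == 'E') {                          /* opt_exponent */
            size_t j = i + 1;
            if (s[j] == '+' || s[j] == '-') j++;    /* optional sign */
            if (isdigit((unsigned char)s[j])) {     /* exponent needs digits */
                while (isdigit((unsigned char)s[j])) j++;
                i = j;                              /* commit the exponent */
            }
        }
        return i;
    }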
Given a string s and a regular expression R, does s \in L(R)?
Some additional notation (for conciseness)
A+ = AA*
A | B = A + B
A? = A + \epsilon
[a-z] = 'a' + ... + 'z'
[^a-z] = complement of [a-z]
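With this shorthand, the earlier definitions become more compact:

    Integer = [0-9]+
    Identifier = [a-zA-Z][a-zA-Z0-9]*
    Whitespace = (' ' | '\n' | '\t')+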
Lexical analysis
R = Keyword + Identifier + Number + ... (in general, R = R1 + R2 + R3 + R4 + ...)
Given an input x1...xn, for 1 <= i <= n, check whether the prefix x1...xi \in L(R). If a prefix matches, remove it from the input and repeat to find the next token. This raises several questions:
Question 1: what if x1...xi \in L(R) and x1...xj \in L(R) with i != j, e.g., '=' vs. '=='? Answer: always take the longer one. This is called "maximal munch". It is also how humans process English: if we see the word 'inform', we do not read it as 'in' + 'form' but as the single word 'inform'.
Question 2: what if x1...xi \in L(Rj) and x1...xi \in L(Rk) for j != k? E.g., with Keywords = 'if' + 'else' + ..., should we treat 'if' as a keyword or an identifier?
Answer: use a priority order, e.g., Keywords have higher priority than Identifiers.
Question 3: what if x1...xi \notin L(R) for every i? This would cause the program to terminate, as it does not know what to do.
Answer: do not let this happen. Create another token class called Error and put it last in the priority order.
Error = all strings not in the lexical specification
Summary: simple lexical analysis can be achieved using regular expressions plus a few rules to resolve ambiguities and handle errors.
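To make these rules concrete, here is one possible shape of such a lexer (a sketch only: the token classes, the keyword list, and the matcher functions are illustrative assumptions; each matcher returns the length of the longest prefix in its class, or 0):

    #include <ctype.h>
    #include <stddef.h>
    #include <string.h>

    typedef size_t (*matcher_t)(const char *);

    static size_t match_keyword(const char *s) {
        static const char *kw[] = { "if", "else", "then" };
        size_t best = 0;
        for (size_t k = 0; k < 3; k++) {
            size_t n = strlen(kw[k]);
            if (strncmp(s, kw[k], n) == 0 && n > best) best = n;
        }
        return best;
    }
    static size_t match_identifier(const char *s) {
        size_t i = 0;
        if (isalpha((unsigned char)s[0]))
            for (i = 1; isalnum((unsigned char)s[i]); i++) ;
        return i;
    }
    static size_t match_number(const char *s) {
        size_t i = 0;
        while (isdigit((unsigned char)s[i])) i++;
        return i;
    }
    static size_t match_error(const char *s) {
        return s[0] ? 1 : 0;        /* Error: any single character */
    }

    /* Token classes in priority order; Error is last. */
    static const matcher_t matchers[] = {
        match_keyword, match_identifier, match_number, match_error
    };

    /* Return the index of the next token's class and store its length
       in *len. The longest match wins (maximal munch); on a tie, the
       earlier (higher-priority) class wins because we only replace the
       current best on a strictly longer match. */
    int next_token(const char *input, size_t *len) {
        size_t best_len = 0;
        int best_class = 3;         /* Error by default */
        for (int j = 0; j < 4; j++) {
            size_t l = matchers[j](input);
            if (l > best_len) { best_len = l; best_class = j; }
        }
        *len = best_len;
        return best_class;
    }

On input "ifx", for example, match_keyword returns 2 but match_identifier returns 3, so maximal munch classifies it as an identifier; on "if ", both return 2 and the keyword wins by priority.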
Good algorithms are known for doing such lexical analysis (the subject of future lectures); at their core is a loop driven by a transition table T:
    i = 0;
    state = 0;
    while (input[i]) {
        state = T[state][(unsigned char)input[i++]];
    }
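Filled out into a runnable form, the same loop might look like this (a sketch: a three-state table for Integer = digit+, where state 0 is the start state, state 1 is accepting, and state 2 is a dead state):

    #include <stdio.h>

    static int T[3][256];   /* transition table, built below */

    static int accepts(const char *input) {
        /* Default every transition to the dead state 2. */
        for (int s = 0; s < 3; s++)
            for (int c = 0; c < 256; c++)
                T[s][c] = 2;
        for (int c = '0'; c <= '9'; c++) {
            T[0][c] = 1;    /* first digit reaches the accepting state */
            T[1][c] = 1;    /* further digits keep it accepting */
        }
        int i = 0, state = 0;
        while (input[i])
            state = T[state][(unsigned char)input[i++]];
        return state == 1;  /* accept iff we end in the accepting state */
    }

    int main(void) {
        /* prints "1 0 0": "123" is an Integer; "" and "12a" are not */
        printf("%d %d %d\n", accepts("123"), accepts(""), accepts("12a"));
        return 0;
    }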
The lookahead is the maximum number of characters we may need to scan forward before membership can be decided. In other words, the maximal munch is guaranteed to match within the next lookahead characters (and no string longer than the lookahead can possibly match). For example, in C, the lookahead can often stop at the next space character or the next parenthesis.
If the size of the lookahead string is M, then the maximal munch algorithm may be forced to examine each character up to M times. Thus, the smaller the lookahead string, the better.
In addition to membership, the accepting states of the DFA/NFA can encode the token class to emit if the string ends at that state. Priority resolution between multiple matching classes can be encoded similarly.
Note, however, that this is the simplest and most general form of the algorithm. In practice, the lexer is specialized for the set of token classes of a given language (e.g., C). Moreover, the DFA is optimized so that if maximal munch ends in a non-accepting state, that state encodes the last accepting state seen and the input position from which to resume scanning (so the next token can be identified).
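One way to realize this last-accepting-state trick (a sketch: scan_token, the table T, the accept array mapping accepting states to token classes with -1 elsewhere, and the dead-state index are all illustrative assumptions):

    #include <stddef.h>

    /* Scan the longest token starting at input[*pos]. Remembers the last
       accepting state seen; on failure we rewind to just after it, so the
       next call resumes at the right input position. Returns the token
       class, or -1 (Error) if nothing matched. */
    int scan_token(const char *input, size_t *pos,
                   const int T[][256], const int accept[], int dead) {
        int state = 0;
        size_t i = *pos;
        int last_class = -1;        /* class of the last accepting state */
        size_t last_end = *pos;     /* position to resume scanning from */
        while (input[i] && state != dead) {
            state = T[state][(unsigned char)input[i++]];
            if (accept[state] >= 0) {       /* accepting: remember it */
                last_class = accept[state];
                last_end = i;
            }
        }
        *pos = last_end;            /* resume right after the longest match */
        return last_class;
    }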