Formal languages

Let \Sigma be a set of characters (an alphabet).

A language over \Sigma is a set of strings of characters drawn from \Sigma

e.g., English language: Alphabet = characters, Language = Sentences.

Another example: Alphabet = ASCII character set, Language = C programs

Meaning function L maps syntax to semantics. e.g., L(e) = M

Example: e could be a regular expression, and M would be the set of strings that it represents

Meaning of regular expressions:

L(\epsilon) = { "" }
L('c') = { "c" }
L(A + B) = L(A) U L(B)
L(AB) = {ab | a \in L(A) and b \in L(B)}
L(A*) = U_{i >= 0} L(A^i)
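The equations above can be executed directly. Here is a minimal sketch (the function names are mine, not from the notes) that computes L(R) restricted to strings of bounded length, so that the star case terminates:

```python
# A minimal sketch: computing L(R) restricted to strings of length
# <= max_len, following the semantic equations above term by term.

def L_eps():
    return {""}                       # L(epsilon) = { "" }

def L_char(c):
    return {c}                        # L('c') = { "c" }

def L_union(A, B):
    return A | B                      # L(A + B) = L(A) U L(B)

def L_concat(A, B, max_len):
    # L(AB) = {ab | a in L(A), b in L(B)}, truncated at max_len
    return {a + b for a in A for b in B if len(a + b) <= max_len}

def L_star(A, max_len):
    # L(A*) = U_{i >= 0} L(A^i); iterate until no new strings appear
    result = {""}
    while True:
        new = result | L_concat(result, A, max_len)
        if new == result:
            return result
        result = new

# L(('0' + '1')*) restricted to length <= 2
bits = L_union(L_char('0'), L_char('1'))
print(sorted(L_star(bits, 2)))   # ['', '0', '00', '01', '1', '10', '11']
```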

Significance of the meaning function: it separates syntax from semantics. We can change the syntax without changing the semantics, and different syntaxes are good for different languages and environments. It also allows for redundancy: multiple syntaxes for the same semantics, e.g., some are concise while others are more efficient (optimization!)

Roman numerals vs. the decimal number system: the meanings of the two number systems are the same, but the notations are extremely different

Example of redundancy: L(0*) = L(0 + 0*) = L(\epsilon + 00*) = ...

Thus L is a many-to-one function, and this is the basis of optimization: replace one syntax with a cheaper one that has the same meaning. L is never one-to-many (each expression has exactly one meaning)
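This redundancy is easy to check mechanically. As a sketch using Python's `re` module (where `|` is alternation and an empty alternative plays the role of \epsilon), the three expressions above agree on every test string:

```python
# A sketch: checking that 0*, 0 + 0*, and epsilon + 00* all denote the
# same language, using Python's re module.
import re

patterns = [r"0*", r"0|0*", r"|00*"]   # 0*, 0 + 0*, epsilon + 00*
for s in ["", "0", "00", "000", "1", "01"]:
    results = [re.fullmatch(p, s) is not None for p in patterns]
    assert len(set(results)) == 1      # all three agree on s
print("all three expressions have the same meaning on these inputs")
```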

Lexical Specifications

Keyword: "if" or "else" or "then" or ...
       : 'i''f' + 'e''l''s''e' + 't''h''e''n' + ...
       : 'if' + 'else' + 'then' + ... (shorthand)

Integers = non-empty string of digits

digit = '0' + '1' + '2' + ...
Integer = digit.digit*    (ensures that there is at least one digit, also represented as digit+)

Identifier = strings of letters or digits, starting with a letter

letter = 'a' + 'b' + ... + 'z' + 'A' + ... + 'Z' (shorthand: [a-zA-Z])
Identifier = letter(letter + digit)*
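These specifications transcribe directly into a mainstream regex syntax. As a sketch in Python's `re` (the notes' `+` for alternation becomes `|`, and concatenation is juxtaposition):

```python
# A sketch of the Integer and Identifier specifications in Python re
# syntax, using the [0-9] / [a-zA-Z] shorthands from the notes.
import re

integer    = re.compile(r"[0-9]+")                 # digit.digit*, i.e., digit+
identifier = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")   # letter(letter + digit)*

assert integer.fullmatch("42")
assert not integer.fullmatch("")        # Integers are non-empty
assert identifier.fullmatch("x1")
assert not identifier.fullmatch("1x")   # must start with a letter
```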

Whitespace = a non-empty sequence of blanks, newlines, and tabs

Whitespace = (' ' + '\n' + '\t')+

Example of a real language: PASCAL

digit = '0' + '1' + '2' + ...
digits = digit+
opt_fraction = ('.' digits) + \epsilon
opt_exponent = ('E'('+' + '-' + \epsilon) digits) + \epsilon
num = digits opt_fraction opt_exponent
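As a sketch, the same specification in Python `re` syntax, with opt_fraction and opt_exponent expressed as optional (`?`) groups:

```python
# The PASCAL num specification transcribed into Python re syntax:
# digits, then an optional fraction, then an optional exponent.
import re

num = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ["3", "3.14", "3.14E+2", "10E5"]:
    assert num.fullmatch(s)
assert not num.fullmatch(".5")    # digits before the fraction are required
assert not num.fullmatch("3.")    # '.' must be followed by digits
```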

Given a string s and a regular expression R, does s \in L(R)?

Some additional notation (for conciseness)

A+ = AA*
A | B = A + B
A? = A + \epsilon
[a-z] = 'a' + ... + 'z'
[^a-z] = complement of [a-z]

Lexical analysis

  1. First step: write a regular expression for each token class (e.g., Keyword, Identifier, Number, ...)
  2. Construct a giant regular expression that is formed by
       R = Keyword + Identifier + Number + ...
       or R = R1 + R2 + R3 + R4 + ....
  3. For input x1...xn

    For 1 <= i <= n, check
    x1...xi \in L(R)

  4. If this succeeds, we know that x1...xi \in L(Rj) for some j
  5. Remove x1...xi from the input and go to step 3
This algorithm has some ambiguities.
  1. How much input is used?
    x1...xi \in L(R)
    x1...xj \in L(R)
    i != j
    e.g., '=' vs. '=='. The answer is to always take the longer match; this rule is called "maximal munch". It is also how humans process English: seeing the word 'inform', we read it as the single word 'inform', not as 'in' + 'form'.
  2. Which token is used?
    x1...xi \in L(Rj)
    x1...xi \in L(Rk)
    e.g., Keywords = 'if' + 'else' + ...
    Identifiers = letter(letter + digit)*

    Should we treat 'if' as a keyword or an identifier?

    Answer: use a priority order. e.g., Keywords higher priority than Identifiers

  3. What if no rule matches?
    x1...xi \notin L(R)
    This would cause the lexer to get stuck (or crash), as it does not know what to do with the remaining input.

    Answer: do not let this happen. Create another token class called Error and put it last in the priority order.

    Error = all strings not in the lexical specification
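The first two ambiguity rules can be sketched as follows (a toy illustration, with names of my own choosing): among all token regexes, pick the longest prefix match, breaking ties by the order in which the classes are listed:

```python
# A toy sketch of the ambiguity rules: longest match wins (maximal
# munch), and ties are broken by listing token classes in priority order.
import re

# Priority order: EQ before ASSIGN, KEYWORD before IDENTIFIER.
TOKENS = [
    ("EQ",         r"=="),
    ("ASSIGN",     r"="),
    ("KEYWORD",    r"if|else|then"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
]

def match_token(s):
    """Return (name, lexeme) for the longest match; ties favor earlier rules."""
    best = ("NONE", "")
    for name, pat in TOKENS:
        m = re.match(pat, s)
        if m and len(m.group()) > len(best[1]):  # strictly longer only,
            best = (name, m.group())             # so earlier rules win ties
    return best

print(match_token("== x"))   # ('EQ', '==')       maximal munch beats '='
print(match_token("if y"))   # ('KEYWORD', 'if')  priority beats Identifier
print(match_token("ifs"))    # ('IDENTIFIER', 'ifs')  longest match again
```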

Summary: a simple lexical analysis can be achieved using regular expressions + a few rules to resolve ambiguities and errors.
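Putting the pieces together, here is a minimal lexer sketch (illustrative names, not a definitive implementation) that applies the whole recipe: one regex per token class, maximal munch, a priority order, and a catch-all Error class listed last so that every input tokenizes:

```python
# A minimal lexer sketch: maximal munch, priority order (Keyword above
# Identifier), and a last-resort Error class so no input is rejected.
import re

SPEC = [
    ("KEYWORD",    r"if|else|then"),
    ("IDENTIFIER", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("INTEGER",    r"[0-9]+"),
    ("WHITESPACE", r"[ \n\t]+"),
    ("ERROR",      r"."),            # any single (non-newline) character
]

def lex(source):
    tokens, i = [], 0
    while i < len(source):
        # Maximal munch: longest match wins; among equal lengths,
        # max() returns the first, i.e., the higher-priority class.
        name, m = max(
            ((n, re.match(p, source[i:])) for n, p in SPEC),
            key=lambda t: t[1].end() if t[1] else -1,
        )
        tokens.append((name, m.group()))
        i += m.end()                 # consume the matched prefix, repeat
    return tokens

print(lex("if x1 then 42 $"))
```

On `"if x1 then 42 $"` this yields KEYWORD, IDENTIFIER, INTEGER, and WHITESPACE tokens, with `$` reported as an ERROR token instead of halting the lexer.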

Good algorithms are known for such lexical analysis (the subject of future lectures)



Conversion from NFA to DFA

Conversion from Regular language to NFA

Implementing an automaton

In practice, lexical-analyzer generators give you configuration options to choose between DFA-based and NFA-based implementations (letting the user pick a point in the space-time tradeoff).