Regular Expressions to DFA
Learn how to formally specify tokens using Extended Regular Expressions and compile them directly into a Deterministic Finite Automata (DFA) using the Direct Method.
Extended Regular Expressions
Regular Expressions (REs) are the standard notation for specifying the patterns that lexemes must match to form a token. While basic REs only include Union (|), Concatenation, and Kleene Star (*), Extended Regular Expressions add syntactic sugar for convenience:
One or more instances (+)
r+Matches one or more occurrences of r. Equivalent to r(r*).
Zero or one instance (?)
r?Matches zero or one occurrence of r. Equivalent to (r | ε).
Character Classes
[a-z]Matches any character in the specified range. Equivalent to a|b|c...|z.
RE to DFA (Direct Method)
Instead of converting an RE to an NFA (using Thompson's construction) and then converting that NFA to a DFA (using Subset Construction), the Direct Method constructs a DFA straight from an augmented Regular Expression.
Augment the Regular Expression
Concatenate the RE with a special end marker #. For an expression r, form (r)#.
Construct a Syntax Tree
Build a syntax tree for the augmented expression. The leaves represent operands (characters and #), and internal nodes represent operators (|, •, *).
Assign unique integer positions $1, 2, ..., n$ to the leaf nodes (excluding ε).
Compute Functions
Traverse the tree bottom-up to compute three properties for each node:
- nullable(n): True if the subexpression at node n can generate the empty string ε.
- firstpos(n): Set of positions that can match the first symbol of a string generated by the subexpression at n.
- lastpos(n): Set of positions that can match the last symbol of a string generated by the subexpression at n.
Compute followpos
For every position p, compute followpos(p), which is the set of positions that can directly follow p in the final string.
- If node n is a concatenation
c1 • c2, then for every position i inlastpos(c1), all positions infirstpos(c2)are infollowpos(i). - If node n is a star node
c*, then for every position i inlastpos(c), all positions infirstpos(c)are infollowpos(i).
Construct DFA States
The initial state of the DFA is the firstpos of the root node.
For any given state subset S and input symbol a, the transition is to the state consisting of the union of followpos(p) for all p in S that correspond to symbol a.
Accepting states are those that contain the position corresponding to the end marker #.