Comdes

Regular Expressions to DFA

Learn how to formally specify tokens using Extended Regular Expressions and compile them directly into a Deterministic Finite Automata (DFA) using the Direct Method.

Extended Regular Expressions

Specification of Tokens

Regular Expressions (REs) are the standard notation for specifying the patterns that lexemes must match to form a token. While basic REs only include Union (|), Concatenation, and Kleene Star (*), Extended Regular Expressions add syntactic sugar for convenience:

One or more instances (+)

r+

Matches one or more occurrences of r. Equivalent to r(r*).

Zero or one instance (?)

r?

Matches zero or one occurrence of r. Equivalent to (r | ε).

Character Classes

[a-z]

Matches any character in the specified range. Equivalent to a|b|c...|z.

RE to DFA (Direct Method)

Instead of converting an RE to an NFA (using Thompson's construction) and then converting that NFA to a DFA (using Subset Construction), the Direct Method constructs a DFA straight from an augmented Regular Expression.

The Algorithm
1

Augment the Regular Expression

Concatenate the RE with a special end marker #. For an expression r, form (r)#.

2

Construct a Syntax Tree

Build a syntax tree for the augmented expression. The leaves represent operands (characters and #), and internal nodes represent operators (|, , *).

Assign unique integer positions $1, 2, ..., n$ to the leaf nodes (excluding ε).

3

Compute Functions

Traverse the tree bottom-up to compute three properties for each node:

  • nullable(n): True if the subexpression at node n can generate the empty string ε.
  • firstpos(n): Set of positions that can match the first symbol of a string generated by the subexpression at n.
  • lastpos(n): Set of positions that can match the last symbol of a string generated by the subexpression at n.
4

Compute followpos

For every position p, compute followpos(p), which is the set of positions that can directly follow p in the final string.

Two rules for followpos:
  1. If node n is a concatenation c1 • c2, then for every position i in lastpos(c1), all positions in firstpos(c2) are in followpos(i).
  2. If node n is a star node c*, then for every position i in lastpos(c), all positions in firstpos(c) are in followpos(i).
5

Construct DFA States

The initial state of the DFA is the firstpos of the root node.

For any given state subset S and input symbol a, the transition is to the state consisting of the union of followpos(p) for all p in S that correspond to symbol a.

Accepting states are those that contain the position corresponding to the end marker #.