Introduction to Compilation
Learn how source code is transformed into executable machine code through the sequential phases of a compiler. Understand the foundational role of Lexical Analysis.
Structure and Phases of a Compiler
A compiler translates a high-level source language into a low-level target language. This process is typically split into two main parts: the Analysis (Front-end) and the Synthesis (Back-end).
Lexical Analysis (Scanning)
Reads characters and groups them into meaningful sequences called Lexemes.
Outputs a stream of Tokens representing these lexemes.
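A minimal sketch of this grouping, assuming a tiny invented token set (the token names and patterns below are illustrative, not from any real compiler):

```python
import re

# Each entry pairs a token name with the regular expression its lexemes match.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("skip",   r"\s+"),  # whitespace is discarded, not tokenized
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def scan(source):
    """Group characters into lexemes and emit (token name, lexeme) pairs."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For example, `scan("count = 100")` yields the token stream `[("id", "count"), ("assign", "="), ("number", "100")]`.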
Syntax Analysis
Also known as Parsing. Uses the token stream to construct a Parse Tree or Syntax Tree.
Validates the grammatical structure of the program against the language's grammar.
Semantic Analysis
Checks for semantic consistency (e.g., type checking, verifying that variables are declared before use).
Uses the syntax tree and the Symbol Table.
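A toy sketch of such a check, walking a tiny hand-built expression tree and consulting a symbol table for declared types (the node shapes "num", "var", and "add" are invented for illustration):

```python
def type_of(node, symtab):
    """Return the type of an expression node, or raise a semantic error."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "var":
        name = node[1]
        if name not in symtab:
            raise NameError(f"undeclared variable: {name}")
        return symtab[name]            # type recorded in the symbol table
    if kind == "add":
        left = type_of(node[1], symtab)
        right = type_of(node[2], symtab)
        if left != right:              # reject mixed-type addition
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node: {kind}")
```

With `{"x": "int"}` as the symbol table, `("add", ("var", "x"), ("num", 3))` checks out as `int`, while adding an `int` to a `str` raises a `TypeError`.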
Intermediate Code Generation
Generates an explicit low-level, machine-independent intermediate representation.
Often takes the form of Three-Address Code (TAC).
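A sketch of how TAC is produced: each operation gets a fresh temporary, so every emitted instruction has at most one operator. The tuple-based tree format here is an assumption for illustration.

```python
from itertools import count

def gen_tac(node, code, fresh=None):
    """Flatten a nested expression tree into three-address code."""
    if fresh is None:
        fresh = count(1)               # temporary-name counter: t1, t2, ...
    if not isinstance(node, tuple):    # leaf: a constant or variable name
        return str(node)
    op, left, right = node             # e.g. ("+", "a", ("*", "b", "c"))
    l = gen_tac(left, code, fresh)
    r = gen_tac(right, code, fresh)
    t = f"t{next(fresh)}"
    code.append(f"{t} = {l} {op} {r}")
    return t

code = []
gen_tac(("+", "a", ("*", "b", "c")), code)
# code is now ["t1 = b * c", "t2 = a + t1"]
```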
Code Optimization
Improves the intermediate code so that the resulting machine code runs faster or is shorter.
Includes loop unrolling, constant folding, and dead code elimination.
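Constant folding, for instance, replaces any subexpression whose operands are all known at compile time with its computed value. A minimal sketch over the same tuple-based expression trees (an illustrative format, not a real compiler's IR):

```python
def fold(node):
    """Recursively replace constant subexpressions with their values."""
    if not isinstance(node, tuple):
        return node                    # leaf: already a constant or a name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return {"+": left + right, "*": left * right}[op]  # fold at compile time
    return (op, left, right)

fold(("+", ("*", 2, 3), "x"))  # ("+", 6, "x"): 2 * 3 folded away
```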
Code Generation
Maps the optimized intermediate representation to the target machine language.
Involves Register Allocation and instruction selection.
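A deliberately naive instruction-selection sketch: each TAC instruction is mapped to x86-like pseudo-assembly, with every value routed through a single register. A real back-end would interleave this with proper register allocation.

```python
def select(tac):
    """Translate simple 'dest = a op b' TAC lines into pseudo-assembly."""
    asm = []
    for line in tac:
        dest, _, lhs, op, rhs = line.split()
        asm.append(f"mov eax, {lhs}")                       # load left operand
        asm.append({"+": "add", "*": "imul"}[op] + f" eax, {rhs}")
        asm.append(f"mov {dest}, eax")                      # store the result
    return asm
```

For example, `select(["t1 = a + b"])` produces `["mov eax, a", "add eax, b", "mov t1, eax"]`.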
The Symbol Table
A data structure used by all phases to store identifiers (variable names, function names) along with their types, scope, and memory locations.
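One common way to organize it, sketched here under the assumption of block-structured scoping: a stack of scopes, where entering a block pushes a new scope and lookups search from the innermost scope outward.

```python
class SymbolTable:
    """Symbol table as a stack of scopes (innermost scope last)."""

    def __init__(self):
        self.scopes = [{}]                # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, info):
        self.scopes[-1][name] = info      # e.g. type, memory location

    def lookup(self, name):
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                       # undeclared identifier
```

An inner declaration shadows an outer one with the same name, and becomes invisible again once its scope is exited.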
Tokens, Lexemes, and Attributes
Lexemes
A sequence of characters in the source program that matches the pattern for a token. It is the actual "string" found in the code.
Example: count, =, 100
Tokens
A pair consisting of a Token Name and an optional Attribute Value. The token name is an abstract symbol representing a lexical unit.
Example: 〈id, ...〉, 〈assign, ...〉, 〈number, ...〉
Attributes
Additional information about the matched lexeme. For an identifier, this might be a pointer to the Symbol Table entry.
Example: 〈id, ptr-to-"count"〉
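A sketch of how that attribute arises in practice, using a symbol-table index as a stand-in for the pointer (the function and its behavior are illustrative assumptions):

```python
def tokenize_ids(words, symtab):
    """Intern each identifier in the symbol table; the token's attribute
    is its entry's index, standing in for a pointer to the entry."""
    tokens = []
    for w in words:
        if w not in symtab:
            symtab[w] = len(symtab)       # allocate a new entry
        tokens.append(("id", symtab[w]))  # 〈id, attribute〉 pair
    return tokens

symtab = {}
tokenize_ids(["count", "total", "count"], symtab)
# both occurrences of "count" share entry 0: [("id", 0), ("id", 1), ("id", 0)]
```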
Modern Compilation: LLVM & Lex
Lex: A Lexical Analyzer Generator
Writing a scanner by hand can be tedious. Lex (or Flex) is a tool that takes a set of regular expressions (patterns) and automatically generates C code for a deterministic finite automaton (DFA) that recognizes those patterns.
You specify the pattern (e.g., [a-zA-Z]+) and the action (e.g., return ID;); Lex handles the state-machine transitions.
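A minimal Flex specification showing that pattern/action pairing might look like this (the tokens.h header and the ID/NUM token codes are hypothetical; in a real project they would typically come from the parser generator):

```
%{
#include "tokens.h"   /* hypothetical header defining ID and NUM */
%}
%option noyywrap
%%
[a-zA-Z]+   { return ID; }
[0-9]+      { return NUM; }
[ \t\n]+    { /* skip whitespace */ }
%%
```

Flex compiles the patterns between the %% markers into a single DFA and emits the C function yylex(), which runs that DFA and executes the matching action.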
Introduction to LLVM
LLVM (originally an initialism for "Low Level Virtual Machine", though the project has outgrown that name) is a modern, modular compiler infrastructure. Unlike monolithic legacy compilers, LLVM strictly separates the stages of compilation around a universal Intermediate Representation (IR).
- Front-ends (like Clang for C/C++) compile source code into LLVM IR.
- Optimizers apply transformations directly to the IR.
- Back-ends compile the optimized IR into target-specific machine code (x86, ARM, etc).
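For a flavor of the IR itself, here is a simplified version of what a front-end like Clang might emit for a function that adds two 32-bit integers:

```
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}
```

Because every front-end targets this same typed, assembly-like form, one set of optimizers and back-ends serves all source languages.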