
Introduction to Compilation

Learn how source code is transformed into executable machine code through the sequential phases of a compiler. Understand the foundational role of Lexical Analysis.

Structure and Phases of a Compiler

A compiler translates a high-level source language into a low-level target language. This process is typically split into two main parts: the Analysis (Front-end) and the Synthesis (Back-end).

1. Lexical Analysis

Reads characters and groups them into meaningful sequences called Lexemes.

Outputs a stream of Tokens representing these lexemes.
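
The character-grouping step can be sketched as a small hand-written scanner. This is a minimal illustration, not production code; the token names (TOK_ID, TOK_NUMBER, and so on) are invented for the example:

```c
#include <ctype.h>

/* Token kinds for a toy scanner; the names are invented for this sketch. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_ASSIGN, TOK_EOF, TOK_UNKNOWN } TokenKind;

/* Read one lexeme starting at *src, copy its characters into buf,
   advance *src past it, and return the token kind. */
TokenKind next_token(const char **src, char *buf) {
    const char *p = *src;
    while (isspace((unsigned char)*p)) p++;            /* skip whitespace */
    int n = 0;
    if (*p == '\0') {
        buf[0] = '\0';
    } else if (isalpha((unsigned char)*p)) {           /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)*p)) buf[n++] = *p++;
        buf[n] = '\0';
    } else if (isdigit((unsigned char)*p)) {           /* number: digit+ */
        while (isdigit((unsigned char)*p)) buf[n++] = *p++;
        buf[n] = '\0';
    } else {                                           /* single-character lexeme */
        buf[0] = *p++;
        buf[1] = '\0';
    }
    *src = p;
    if (buf[0] == '\0')                 return TOK_EOF;
    if (isalpha((unsigned char)buf[0])) return TOK_ID;
    if (isdigit((unsigned char)buf[0])) return TOK_NUMBER;
    if (buf[0] == '=')                  return TOK_ASSIGN;
    return TOK_UNKNOWN;
}
```

Feeding it `count = 100` yields the lexemes "count", "=", and "100" with kinds TOK_ID, TOK_ASSIGN, and TOK_NUMBER.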

2. Syntax Analysis

Also known as Parsing. Uses the token stream to construct a Parse Tree or Syntax Tree.

Validates the grammatical structure.
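
Grammatical validation can be sketched as a recursive-descent recognizer, where each grammar rule becomes a function. The toy grammar below (sums of numbers with parentheses) is invented for this example:

```c
#include <ctype.h>
#include <stddef.h>

/* Recursive-descent recognizer for the toy grammar
     expr -> term ('+' term)*
     term -> NUMBER | '(' expr ')'
   Each function consumes input and returns a pointer just past what it
   matched, or NULL on a syntax error. */
static const char *parse_expr(const char *p);

static const char *skip_ws(const char *p) {
    while (isspace((unsigned char)*p)) p++;
    return p;
}

static const char *parse_term(const char *p) {
    p = skip_ws(p);
    if (isdigit((unsigned char)*p)) {            /* NUMBER */
        while (isdigit((unsigned char)*p)) p++;
        return p;
    }
    if (*p == '(') {                             /* '(' expr ')' */
        p = parse_expr(p + 1);
        if (!p) return NULL;
        p = skip_ws(p);
        return *p == ')' ? p + 1 : NULL;
    }
    return NULL;
}

static const char *parse_expr(const char *p) {
    p = parse_term(p);
    while (p) {                                  /* ('+' term)* */
        p = skip_ws(p);
        if (*p != '+') break;
        p = parse_term(p + 1);
    }
    return p;
}

/* Returns 1 if the whole string is a well-formed expression. */
int is_valid_expr(const char *s) {
    const char *end = parse_expr(s);
    return end != NULL && *skip_ws(end) == '\0';
}
```

Strings like `(1+2)+3` are accepted, while `1 +` or `)1(` are rejected as grammatically invalid.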

3. Semantic Analysis

Checks for semantic consistency (e.g., type checking, variable declaration).

Uses the syntax tree and the Symbol Table.
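
One such consistency rule — every variable must be declared before use — can be sketched as a check of used names against declared names (a stand-in for a real symbol-table lookup):

```c
#include <string.h>

/* Toy semantic check: is every name in `uses` present in `decls`?
   Mirrors the "variable declared before use" rule; a real compiler
   would consult the symbol table instead of a flat array. */
int all_declared(const char *decls[], int ndecls,
                 const char *uses[], int nuses) {
    for (int i = 0; i < nuses; i++) {
        int found = 0;
        for (int j = 0; j < ndecls; j++) {
            if (strcmp(uses[i], decls[j]) == 0) { found = 1; break; }
        }
        if (!found) return 0;   /* semantic error: undeclared identifier */
    }
    return 1;
}
```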

4. Intermediate Code Generation

Generates an explicit low-level or machine-independent intermediate representation.

Often takes the form of Three-Address Code (TAC).
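
TAC generation can be sketched as a post-order walk of an expression tree, assigning each intermediate result a fresh temporary. The node layout below is invented for this example:

```c
#include <stdio.h>
#include <string.h>

/* Expression-tree node: op == 0 marks a leaf holding a variable name. */
typedef struct Node { char op; const char *name; struct Node *l, *r; } Node;

/* Post-order walk: emit children first, then one TAC line
   "tN = left op right". Returns the name holding this node's value. */
static const char *emit(Node *n, char *out, int *tmp, char names[][8]) {
    if (n->op == 0) return n->name;
    const char *l = emit(n->l, out, tmp, names);
    const char *r = emit(n->r, out, tmp, names);
    int i = (*tmp)++;
    sprintf(names[i], "t%d", i + 1);
    char line[64];
    sprintf(line, "%s = %s %c %s\n", names[i], l, n->op, r);
    strcat(out, line);
    return names[i];
}

/* Emit three-address code for  a = b + c * d  into out. */
void tac_demo(char *out) {
    Node b = {0, "b", 0, 0}, c = {0, "c", 0, 0}, d = {0, "d", 0, 0};
    Node mul = {'*', 0, &c, &d};
    Node add = {'+', 0, &b, &mul};
    int tmp = 0;
    char names[8][8];
    out[0] = '\0';
    const char *res = emit(&add, out, &tmp, names);
    strcat(out, "a = ");
    strcat(out, res);
    strcat(out, "\n");
}
```

For `a = b + c * d` this produces the TAC sequence `t1 = c * d`, `t2 = b + t1`, `a = t2` — each line has at most one operator, hence "three-address".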

5. Code Optimization

Improves the intermediate code so that the resulting machine code runs faster or takes less space.

Includes loop unrolling, constant folding, and dead code elimination.
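
Constant folding is the easiest of these to sketch: when both operands of an operation are literals, the optimizer computes the result at compile time, so `2 * 8` in the source becomes the constant `16` in the generated code. A minimal sketch for one binary operation:

```c
/* Fold a binary operation over two literal operands at "compile time".
   Only a few operators are handled in this sketch; real folders also
   care about overflow and division by zero. */
int fold(char op, int a, int b) {
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    case '/': return b != 0 ? a / b : 0;  /* division by zero left aside */
    default:  return 0;                   /* operator not handled here */
    }
}
```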

6. Code Generation

Maps the optimized intermediate representation to the target machine language.

Involves Register Allocation and instruction selection.
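
A naive instruction selection for one TAC line `dst = src1 + src2` loads the first operand into a register, adds the second, and stores the result. The x86-style mnemonics and the fixed choice of `eax` are illustrative, not output from a real code generator:

```c
#include <stdio.h>

/* Map the TAC line "dst = src1 + src2" to three pseudo-assembly
   instructions. A real back-end would pick registers via register
   allocation rather than hard-coding eax. */
void gen_add(const char *dst, const char *src1, const char *src2, char *out) {
    sprintf(out, "mov eax, %s\nadd eax, %s\nmov %s, eax\n", src1, src2, dst);
}
```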

Symbol Table Manager

A data structure used by all phases to store identifiers (variable names, functions) along with their types, scope, and memory locations.
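
A minimal symbol table can be sketched as an array of entries, each recording a name, a type, and a scope depth. Real compilers use hash tables; a linear array keeps the sketch short:

```c
#include <string.h>

/* One entry per identifier: name, type tag, and scope depth. */
typedef struct { char name[32]; char type[16]; int scope; } Symbol;
typedef struct { Symbol entries[64]; int count; } SymTab;

/* Insert an identifier; returns its entry index, or -1 if the table is full. */
int symtab_insert(SymTab *t, const char *name, const char *type, int scope) {
    if (t->count >= 64) return -1;
    strcpy(t->entries[t->count].name, name);
    strcpy(t->entries[t->count].type, type);
    t->entries[t->count].scope = scope;
    return t->count++;
}

/* Look an identifier up, scanning newest entries first so that inner
   scopes shadow outer ones. Returns the entry index, or -1 if undeclared. */
int symtab_lookup(const SymTab *t, const char *name) {
    for (int i = t->count - 1; i >= 0; i--)
        if (strcmp(t->entries[i].name, name) == 0) return i;
    return -1;
}
```

The lexer typically inserts identifiers as it sees them; later phases use lookup to attach types and catch undeclared names.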

Tokens, Lexemes, and Attributes

Lexemes

A sequence of characters in the source program that matches the pattern for a token. It is the actual "string" found in the code.

Example: count, =, 100

Tokens

A pair consisting of a Token Name and an optional Attribute Value. The token name is an abstract symbol representing a lexical unit.

Example: 〈id, ...〉, 〈assign, ...〉, 〈number, ...〉

Attributes

Additional information about the matched lexeme. For an identifier, this might be a pointer to the Symbol Table entry.

Example: 〈id, ptr-to-"count"〉
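
The statement `count = 100` can thus be represented as three ⟨name, attribute⟩ pairs. In this sketch the identifier's attribute is an index into a name table (standing in for the symbol-table pointer above), and the number's attribute holds its value:

```c
/* Token as a <name, attribute> pair; the names are invented for this sketch. */
typedef enum { T_ID, T_ASSIGN, T_NUMBER } TokName;
typedef struct { TokName name; int attr; } Token;

/* Fill toks with the three tokens for "count = 100"; return the count. */
int tokens_for_count_stmt(Token toks[]) {
    toks[0] = (Token){ T_ID, 0 };       /* 0 = index of "count" in a name table */
    toks[1] = (Token){ T_ASSIGN, 0 };   /* no meaningful attribute */
    toks[2] = (Token){ T_NUMBER, 100 }; /* attribute holds the numeric value */
    return 3;
}
```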

Modern Compilation: LLVM & Lex

Lex: A Lexical Analyzer Generator

Writing a scanner by hand can be tedious. Lex (or Flex) is a tool that takes a set of regular expressions (patterns) and automatically generates C code for a deterministic finite automaton (DFA) that recognizes those patterns.

You specify the pattern (e.g. [a-zA-Z]+) and the action (e.g. return ID;). Lex handles the state machine transitions.
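
A minimal Lex/Flex specification for the tokens discussed above might look like the following sketch. The token codes (NUMBER, ID, ASSIGN) are assumptions — in a real project they would typically come from a parser-generated header:

```lex
%{
/* Token codes: assumed values; normally provided by the parser's header. */
enum { NUMBER = 258, ID, ASSIGN };
%}
%%
[0-9]+                  { return NUMBER; }
[a-zA-Z][a-zA-Z0-9]*    { return ID; }
"="                     { return ASSIGN; }
[ \t\n]+                ;   /* skip whitespace */
%%
```

Each line pairs a regular-expression pattern with a C action; Lex compiles the patterns into a single DFA and generates the scanning loop for you.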

Introduction to LLVM

LLVM (originally an acronym for Low Level Virtual Machine, though the project no longer uses that expansion) is a modern, modular compiler infrastructure. Unlike monolithic legacy compilers, LLVM strictly divides compilation around a universal Intermediate Representation (IR).

  • Front-ends (like Clang for C/C++) compile source code into LLVM IR.
  • Optimizers apply transformations directly to the IR.
  • Back-ends compile the optimized IR into target-specific machine code (x86, ARM, etc).
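
With Clang and the LLVM tools installed, these three stages can be observed directly from the command line. This is a sketch; `hello.c` is a placeholder file name:

```shell
# Front-end: C source -> human-readable LLVM IR
clang -S -emit-llvm hello.c -o hello.ll

# Optimizer: transform the IR (here at -O2), staying in IR form
opt -S -O2 hello.ll -o hello.opt.ll

# Back-end: IR -> target assembly for the host machine
llc hello.opt.ll -o hello.s
```

Because every stage reads and writes the same IR, front-ends and back-ends can be mixed and matched: any language with an LLVM front-end gains every LLVM target for free.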