Optimization & Parallelism
Unlock the full potential of hardware by optimizing for cache locality, vectorization, and instruction-level parallelism.
Memory Hierarchy & Cache Locality
Modern compilers must manage two distinct architectural "spaces" to achieve performance:
- The Iteration Space: The multi-dimensional space formed by nested loops. The compiler analyzes data dependencies here.
- The Data Space: The multi-dimensional array layouts in memory. The compiler transforms iteration spaces to match data spaces.
Processors fetch data in blocks (Cache Lines). A compiler optimizes for:
- Spatial Locality: Accessing data that is stored close together in memory (e.g., iterating a matrix row-by-row in C).
- Temporal Locality: Re-using the same data within a short time frame so it doesn't get evicted from the cache.
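The contrast can be sketched in C (the matrix size and helper names are illustrative). Both functions compute the same sum over a row-major matrix; only the traversal order, and therefore the cache behavior, differs:

```c
#include <stddef.h>

#define N 512  /* illustrative size */

/* Row-by-row: the inner loop walks memory with stride 1, so every
 * element of each fetched cache line is used (good spatial locality). */
double sum_row_major(double a[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-by-column: consecutive accesses jump N doubles apart, so each
 * cache line contributes only one element before being evicted. */
double sum_col_major(double a[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Sanity check: both orders produce the same total (the values are small
 * integers, so floating-point summation order does not matter here). */
int traversal_orders_agree(void) {
    static double a[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = (double)(i + j);
    return sum_row_major(a) == sum_col_major(a);
}
```

Both versions are correct; on typical hardware the row-major version runs several times faster purely because of how it uses cache lines.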
Loop transformations such as Loop Tiling (blocking) drastically improve cache hit rates for kernels like Matrix Multiplication.
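A minimal tiling sketch in C (the matrix size and tile size are assumed, machine-dependent constants): the loops of a matrix multiply are blocked so that each small sub-block of the operands stays hot in cache while it is reused.

```c
#include <stddef.h>
#include <string.h>

enum { M = 64, TILE = 16 };  /* illustrative sizes; TILE divides M */

/* Tiled matrix multiply: C = A * B, processed in TILE x TILE blocks. */
void matmul_tiled(double A[M][M], double B[M][M], double C[M][M]) {
    memset(C, 0, sizeof(double) * M * M);
    for (size_t ii = 0; ii < M; ii += TILE)
        for (size_t kk = 0; kk < M; kk += TILE)
            for (size_t jj = 0; jj < M; jj += TILE)
                /* inner loops touch only one block of each matrix */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++) {
                        double aik = A[i][k];  /* reused across the j loop */
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += aik * B[k][j];
                    }
}

/* Reference version for comparison. */
void matmul_naive(double A[M][M], double B[M][M], double C[M][M]) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < M; j++) {
            double s = 0.0;
            for (size_t k = 0; k < M; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* Check that tiling changes only the access order, not the result
 * (small integer inputs keep the arithmetic exact). */
int tiled_matches_naive(void) {
    static double A[M][M], B[M][M], C1[M][M], C2[M][M];
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < M; j++) {
            A[i][j] = (double)((i * 7 + j) % 5);
            B[i][j] = (double)((i + j * 3) % 7);
        }
    matmul_naive(A, B, C1);
    matmul_tiled(A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

The reordering is exactly what a compiler's tiling pass performs: the iteration space is partitioned into blocks sized to fit the cache.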
Instruction-Level Parallelism (ILP)
Modern CPUs contain multiple execution units capable of executing several instructions simultaneously, provided those instructions do not depend on each other.
Instruction Scheduling
The compiler analyzes the Data Dependency Graph of a basic block. By reordering instructions, the compiler can overlap the long latencies of memory loads or complex arithmetic with the execution of independent, shorter instructions.
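The effect can be sketched in a hypothetical RISC-style pseudo-assembly (the registers, opcodes, and latencies are illustrative, not a real ISA):

```
; before scheduling: the add stalls waiting on the load
load  r1, 0(r10)      ; long-latency memory load
add   r2, r1, r3      ; depends on r1 -> pipeline stall
mul   r4, r5, r6      ; independent of r1
sub   r7, r5, r8      ; independent of r1

; after scheduling: independent work fills the load's latency
load  r1, 0(r10)
mul   r4, r5, r6      ; moved up: no dependence on r1
sub   r7, r5, r8      ; moved up: no dependence on r1
add   r2, r1, r3      ; by now the load has likely completed
```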
Software Pipelining
In a tight loop, an iteration often waits for a previous operation to finish. Software Pipelining restructures the loop so that one iteration of the new loop executes parts of several iterations of the original loop simultaneously.
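The idea can be illustrated with a hand-pipelined C loop (a sketch of what the compiler does at the instruction level; the function and its two stages are illustrative). Each iteration of the new loop overlaps the load for element i+1 with the compute and store for element i:

```c
#include <stddef.h>

/* Computes b[i] = a[i] * c, restructured into a two-stage pipeline. */
void scale_pipelined(const double *a, double *b, double c, size_t n) {
    if (n == 0) return;
    double cur = a[0];                 /* prologue: fill the pipeline  */
    for (size_t i = 0; i + 1 < n; i++) {
        double next = a[i + 1];        /* stage 1: load iteration i+1  */
        b[i] = cur * c;                /* stage 2: compute/store i     */
        cur = next;                    /* advance the pipeline         */
    }
    b[n - 1] = cur * c;                /* epilogue: drain the pipeline */
}

/* Check that the pipelined loop computes the same result. */
int pipelined_ok(void) {
    double a[5] = {1, 2, 3, 4, 5}, b[5];
    scale_pipelined(a, b, 2.0, 5);
    for (size_t i = 0; i < 5; i++)
        if (b[i] != a[i] * 2.0) return 0;
    return 1;
}
```

The prologue and epilogue that "fill" and "drain" the pipeline are characteristic of software-pipelined loops.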
Vectorization (SIMD)
SIMD (Single Instruction, Multiple Data). The compiler automatically detects inner loops that apply the same operation to an array of data and replaces the sequential scalar instructions with special Vector Instructions (such as AVX or NEON) that process 4, 8, or 16 elements per instruction.
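A loop written to be auto-vectorizable, as a C sketch (the function name is illustrative): the iterations are independent, the accesses are unit-stride, and `restrict` rules out aliasing, so at `-O2`/`-O3` a compiler can replace the scalar body with AVX or NEON instructions processing several floats at once.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: the same operation applied uniformly across
 * the arrays, with no cross-iteration dependencies -- the pattern the
 * auto-vectorizer looks for. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Small correctness check on exact-integer inputs. */
int saxpy_ok(void) {
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40};
    saxpy(y, x, 2.0f, 4);
    return y[0] == 12.0f && y[3] == 48.0f;
}
```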
SSA Form
Static Single Assignment (SSA) form is an intermediate representation property. It requires that every variable be assigned a value exactly once.
If x is modified multiple times in the source code, SSA renames the definitions to x1, x2, x3. When control-flow paths merge (e.g., after an if-else), a special Φ (Phi) function is inserted to choose the correct version of the variable depending on which path was taken. SSA makes data-flow analysis vastly simpler and faster.
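The renaming and the Φ insertion can be sketched in a simple pseudo-IR (the labels and instruction syntax are illustrative):

```
; before SSA (x assigned twice):
    x = 1
    if cond goto then
    goto join
then:
    x = x + 2
join:
    y = x * 3

; after SSA (every name assigned exactly once):
    x1 = 1
    if cond goto then
    goto join
then:
    x2 = x1 + 2
join:
    x3 = Φ(x1, x2)   ; x1 from the fall-through edge, x2 from then
    y1 = x3 * 3
```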
DSLs for Optimization
Domain Specific Languages (DSLs) play a huge role in modern optimization architectures (like Halide for image processing or TVM for Machine Learning).
Instead of relying on the compiler to magically guess the best way to parallelize C/C++ loops, DSLs separate the Algorithm (what is computed) from the Schedule (how it is executed).
This allows developers to explicitly instruct the compiler to unroll, vectorize, and tile specific loops for maximum parallel performance on the target hardware (GPUs/TPUs).
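The separation can be sketched in Halide-style pseudocode (the pipeline and the concrete schedule parameters are illustrative, not a verified build):

```
// Algorithm: WHAT to compute
blur(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;

// Schedule: HOW to compute it -- chosen separately, per target
blur.tile(x, y, xi, yi, 64, 64)   // block for cache locality
    .vectorize(xi, 8)             // SIMD across the inner x dimension
    .parallel(y);                 // threads across tile rows
```

Retargeting the pipeline to different hardware means rewriting only the schedule; the algorithm stays untouched.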