The "Holy Bible" for embedded engineers
Understanding Instruction Execution Pipelining
Comprehensive coverage of pipeline stages, hazards, forwarding, and pipeline optimization techniques
Pipeline architecture is a fundamental technique in modern processors that divides instruction execution into multiple stages, allowing multiple instructions to be processed simultaneously. This approach significantly improves instruction throughput by overlapping the execution of different instructions, effectively reducing the average time per instruction.
The pipeline concept is inspired by assembly line manufacturing, where different workers perform different tasks on different products simultaneously. In a CPU pipeline, each stage performs a specific function on a different instruction, creating a continuous flow of instruction processing.
The pipeline philosophy centers on the principle of instruction-level parallelism (ILP), where multiple instructions are in different stages of execution at the same time. The key benefits are higher instruction throughput, better utilization of processor hardware, and higher achievable clock frequencies, since each stage does less work per cycle. The difference is easiest to see by comparing sequential and pipelined execution:
Traditional Sequential Execution:
Instruction 1: [Fetch] → [Decode] → [Execute] → [Memory] → [Writeback]
Instruction 2: [Fetch] → [Decode] → [Execute] → [Memory] → [Writeback]
Instruction 3: [Fetch] → [Decode] → [Execute] → [Memory] → [Writeback]
Pipelined Execution:
Cycle 1: [Fetch 1]
Cycle 2: [Decode 1] [Fetch 2]
Cycle 3: [Execute 1] [Decode 2] [Fetch 3]
Cycle 4: [Memory 1] [Execute 2] [Decode 3] [Fetch 4]
Cycle 5: [Writeback 1] [Memory 2] [Execute 3] [Decode 4] [Fetch 5]
This overlapping execution allows the pipeline to complete one instruction per cycle after the initial fill period, significantly improving overall performance.
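The speedup is easy to quantify: N instructions take stages × N cycles sequentially, but only stages + (N − 1) cycles in an ideal hazard-free pipeline. A quick sketch of that arithmetic:

#include <stdio.h>

// Ideal cycle counts for n instructions on a k-stage machine:
// sequential execution finishes one instruction before starting the next;
// a hazard-free pipeline completes one instruction per cycle after filling.
int main(void) {
    const long k = 5, n = 1000;
    long sequential = k * n;         // 5000 cycles
    long pipelined  = k + (n - 1);   // 1004 cycles
    printf("speedup = %.2fx\n", (double)sequential / pipelined); // ~4.98x
    return 0;
}

As n grows, the speedup approaches the number of stages, which is why deeper pipelines are attractive until hazards and clock-skew overheads intervene.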
The classic RISC pipeline consists of five main stages, each performing a specific function:
Detailed Pipeline Stage Operations:
┌─────────────────────────────────────────────────────────────────┐
│ Fetch Stage (F) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ PC Update │ Instruction │ Next PC Calculation │ │
│ │ │ Fetch │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Decode Stage (D) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Instruction │ Register │ Control Signal │ │
│ │ Decode │ Read │ Generation │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Execute Stage (E) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ALU │ Address │ Branch Target │ │
│ │ Operation │ Calculation │ Calculation │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Memory Stage (M) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Load Data │ Store Data │ Memory Access │ │
│ │ Read │ Write │ Control │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Writeback Stage (W) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Register │ Result │ Status Update │ │
│ │ Write │ Selection │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Pipeline registers store the results of each stage and pass them to the next stage. These registers maintain the pipeline state and ensure proper data flow between stages.
Pipeline Register Organization:
┌─────────────────────────────────────────────────────────────────┐
│ F/D Register (Fetch/Decode) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Instruction │ PC Value │ Branch Prediction │ │
│ │ Word │ │ Information │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ D/E Register (Decode/Execute) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Decoded │ Register │ Control │ │
│ │ Instruction │ Values │ Signals │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ E/M Register (Execute/Memory) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ALU Result │ Branch │ Memory │ │
│ │ │ Target │ Control │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ M/W Register (Memory/Writeback) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Memory │ ALU Result │ Write │ │
│ │ Data │ │ Control │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
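In software terms, each pipeline register is just a bundle of state latched between stages. A minimal sketch with illustrative field names for a MIPS-like core (the names are assumptions, not tied to any particular design):

#include <stdint.h>

// Illustrative pipeline-register structs for a five-stage core.
typedef struct {            // F/D: fetched instruction plus its PC
    uint32_t instruction;
    uint32_t pc;
    int predicted_taken;    // branch prediction carried along with the fetch
} fd_reg_t;

typedef struct {            // D/E: operand values plus decoded control
    uint32_t rs_value, rt_value;
    int rd;
    int alu_op;
    int mem_read, mem_write, reg_write;
} de_reg_t;

typedef struct {            // E/M: ALU result, branch target, memory control
    uint32_t alu_result, branch_target;
    int rd;
    int mem_read, mem_write, reg_write;
} em_reg_t;

typedef struct {            // M/W: data to write back plus write control
    uint32_t mem_data, alu_result;
    int rd;
    int reg_write;
} mw_reg_t;

// On each clock edge the outputs of one stage are latched into the next
// register, so every in-flight instruction carries its own state and
// control signals with it down the pipe.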
Pipeline hazards are situations that prevent the pipeline from operating at full efficiency, causing stalls or incorrect execution. Hazards occur when the pipeline cannot proceed with the next instruction due to dependencies or resource conflicts.
There are three main types of pipeline hazards: structural hazards, data hazards, and control hazards. Each type requires different handling strategies and can significantly impact pipeline performance.
Structural hazards occur when hardware resources are not available to support the desired pipeline operation. This typically happens when multiple instructions need the same functional unit simultaneously.
Structural Hazard Example (one memory port shared by instruction fetch and data access):
Cycle 1: [Fetch lw]
Cycle 2: [Fetch i2] [Decode lw]
Cycle 3: [Fetch i3] [Decode i2] [Execute lw]
Cycle 4: [Stall] [Decode i3] [Execute i2] [Memory lw] ← lw occupies the memory port, so instruction fetch must wait
Cycle 5: [Fetch i4] [Bubble] [Execute i3] [Memory i2] [Writeback lw]
Common causes of structural hazards include:
- a single memory port shared by instruction fetch and data access
- too few register-file read or write ports
- non-pipelined multi-cycle functional units, such as dividers
A minimal check for the shared-memory-port case is sketched below.
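This is a minimal sketch of that check, with hypothetical field names; a real core would fold it into the pipeline control logic:

// Structural hazard check for a single shared memory port: the Fetch
// stage must stall whenever the instruction in the Memory stage needs
// the port. Field names here are illustrative.
typedef struct {
    int is_load;
    int is_store;
} mem_stage_info_t;

int fetch_must_stall(const mem_stage_info_t *mem_stage) {
    return mem_stage->is_load || mem_stage->is_store;
}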
Data hazards occur when an instruction depends on the result of a previous instruction that hasn’t completed yet. There are three types of data hazards:
- Read-after-write (RAW): an instruction needs a value that an earlier instruction has not yet produced
- Write-after-read (WAR): an instruction overwrites a register that an earlier instruction still needs to read
- Write-after-write (WAW): two instructions write the same register and the writes could complete in the wrong order
In a simple in-order five-stage pipeline only RAW hazards occur; WAR and WAW become relevant once instructions can execute out of order.
Data Hazard Example (RAW):
add r1, r2, r3 # r1 = r2 + r3
sub r4, r1, r5 # r4 = r1 - r5 (depends on r1 from previous instruction)
Pipeline Execution:
Cycle 1: [Fetch add]
Cycle 2: [Fetch sub] [Decode add]
Cycle 3: [Decode sub] [Execute add] ← sub reads r1 here
Cycle 4: [Execute sub] [Memory add]
Cycle 5: [Memory sub] [Writeback add] ← add writes r1 here
Cycle 6: [Writeback sub]
Problem: sub reads r1 in Decode (cycle 3), but add does not write r1 until Writeback (cycle 5)
Control hazards occur when the pipeline encounters a branch instruction and must decide which instruction to fetch next. Since the branch target is not known until the execute stage, the pipeline may fetch the wrong instructions.
Control Hazard Example:
beq r1, r2, target # Branch if r1 == r2
add r3, r4, r5 # This instruction may be wrong if branch is taken
sub r6, r7, r8 # This instruction may be wrong if branch is taken
Pipeline Execution:
Cycle 1: [Fetch beq]
Cycle 2: [Fetch add] [Decode beq]
Cycle 3: [Fetch sub] [Decode add] [Execute beq] ← branch outcome known here
Cycle 4: [Fetch target or fall-through] [Decode sub] [Execute add] ...
If the branch is taken, add and sub were fetched down the wrong path and must be flushed, wasting two cycles.
Data forwarding (also called bypassing) is a technique that allows the result of an instruction to be forwarded to a subsequent instruction before it reaches the writeback stage. This technique can eliminate many data hazards by providing data as soon as it’s available.
Forwarding works by detecting when an instruction needs a value that is being computed by a previous instruction in the pipeline. Instead of waiting for the value to be written back to the register file, the pipeline forwards it directly from the appropriate pipeline stage.
Forwarding Paths in Pipeline:
The classic five-stage design needs only a handful of forwarding paths:
- EX/MEM → Execute: an ALU result feeds the ALU input of the immediately following instruction
- MEM/WB → Execute: load data, or an ALU result that is now two instructions old, arrives one cycle later
- MEM/WB → Memory: load data feeds the store-data input of a store that immediately follows the load
- Writeback → Decode: the register file is written in the first half of the cycle and read in the second, so the value effectively passes "through" the register file
The forwarding logic must detect when forwarding is needed and select the correct data source. This involves comparing register numbers and determining the timing of data availability.
Forwarding Detection Logic:
if (EX/MEM.RegWrite and EX/MEM.RegisterRd ≠ 0 and
    EX/MEM.RegisterRd = ID/EX.RegisterRs) then
    ForwardA = "10" // Forward from EX/MEM
else if (MEM/WB.RegWrite and MEM/WB.RegisterRd ≠ 0 and
    MEM/WB.RegisterRd = ID/EX.RegisterRs) then
    ForwardA = "01" // Forward from MEM/WB

if (EX/MEM.RegWrite and EX/MEM.RegisterRd ≠ 0 and
    EX/MEM.RegisterRd = ID/EX.RegisterRt) then
    ForwardB = "10" // Forward from EX/MEM
else if (MEM/WB.RegWrite and MEM/WB.RegisterRd ≠ 0 and
    MEM/WB.RegisterRd = ID/EX.RegisterRt) then
    ForwardB = "01" // Forward from MEM/WB

The else-if ordering matters: when the same register is written by two instructions still in flight (for example, add r1,r1,r2 followed by add r1,r1,r3), the EX/MEM result is the more recent one and must take priority over MEM/WB.
While forwarding can eliminate many data hazards, it cannot solve all problems. Some hazards require pipeline stalling, particularly when the needed data is not yet available in any pipeline stage.
Forwarding Cannot Solve:
load r1, 0(r2) # Load from memory address in r2
add r3, r1, r4 # Add r1 and r4, store in r3
Pipeline Execution:
Cycle 1: [Fetch load]
Cycle 2: [Fetch add] [Decode load]
Cycle 3: [Decode add] [Execute load]
Cycle 4: [Execute add] [Memory load] ← add needs r1 at the start of this cycle, but load has the data only at its end
Problem: even with forwarding, the load data is not available until the end of the Memory stage (cycle 4), one cycle too late for add's Execute stage
Solution: Stall add for one cycle, then forward the load data from MEM/WB
Pipeline stalling is a technique used when hazards cannot be resolved through forwarding. During a stall, the pipeline stops fetching new instructions and holds existing instructions in their current stages until the hazard is resolved.
Stalling is necessary when:
- a load is immediately followed by an instruction that uses its result (the load-use hazard above)
- a branch outcome is not yet known and no prediction is being made
- a multi-cycle functional unit or memory port is still busy
- a cache miss delays a memory access
Stall Implementation:
Normal Pipeline:
Cycle 1: [Fetch 1]
Cycle 2: [Fetch 2] [Decode 1]
Cycle 3: [Fetch 3] [Decode 2] [Execute 1]
Cycle 4: [Fetch 4] [Decode 3] [Execute 2] [Memory 1]
Stalled Pipeline (hazard detected in cycle 3):
Cycle 1: [Fetch 1]
Cycle 2: [Fetch 2] [Decode 1]
Cycle 3: [Fetch 3] [Decode 2] [Execute 1]
Cycle 4: [Fetch 3] [Decode 2] [Bubble] [Memory 1] ← Fetch and Decode hold their instructions; a bubble enters Execute
Cycle 5: [Fetch 4] [Decode 3] [Execute 2] [Bubble] [Writeback 1]
The hazard detection unit monitors the pipeline for potential hazards and generates control signals to handle them appropriately. It works in conjunction with the forwarding unit to minimize pipeline stalls.
Hazard Detection Unit:
┌─────────────────────────────────────────────────────────────────┐
│ Hazard Detection Unit │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Load/Use │ Branch │ Stall │ │
│ │ Detection │ Detection │ Generation │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Stall IF/ID │ Stall ID/EX │ Flush Pipeline │ │
│ │ Register │ Register │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
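The load-use case is the classic stall this unit must catch. A minimal sketch of the detection condition, assuming MIPS-style field names for illustration:

// Load-use stall detection: the instruction in Decode needs the result
// of a load that is still in Execute. Field names are illustrative.
typedef struct {
    int mem_read;   // 1 if the instruction now in Execute is a load
    int rt;         // that load's destination register
} idex_info_t;

typedef struct {
    int rs, rt;     // source registers of the instruction now in Decode
} ifid_info_t;

// Returns 1 when IF/ID must be held and a bubble injected into ID/EX.
int need_load_use_stall(const idex_info_t *id_ex, const ifid_info_t *if_id) {
    return id_ex->mem_read &&
           (id_ex->rt == if_id->rs || id_ex->rt == if_id->rt);
}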
Superscalar pipelines can issue and execute multiple instructions per cycle, further increasing instruction-level parallelism. This approach requires multiple functional units and sophisticated instruction scheduling.
Superscalar Pipeline Example:
Cycle 1: [Fetch 1,2]
Cycle 2: [Fetch 3,4] [Decode 1,2]
Cycle 3: [Fetch 5,6] [Decode 3,4] [Execute 1,2]
Cycle 4: [Fetch 7,8] [Decode 5,6] [Execute 3,4] [Memory 1,2]
Cycle 5: [Fetch 9,10] [Decode 7,8] [Execute 5,6] [Memory 3,4] [Writeback 1,2]
After the fill period, a dual-issue pipeline retires two instructions per cycle, provided each instruction pair is independent.
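Whether two adjacent instructions may issue together reduces to a dependence and resource check. A toy sketch with illustrative fields (real issue logic also checks register ports and instruction types):

// Toy dual-issue check: a younger instruction may issue alongside an
// older one if it does not read the older one's result and they do not
// compete for the same functional unit.
typedef struct {
    int rs, rt, rd;
    int reg_write;
} issue_slot_t;

int can_dual_issue(const issue_slot_t *older, const issue_slot_t *younger,
                   int same_functional_unit) {
    int raw = older->reg_write &&
              (older->rd == younger->rs || older->rd == younger->rt);
    return !raw && !same_functional_unit;
}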
Out-of-order execution allows instructions to execute as soon as their operands are available, regardless of their original program order. This technique can significantly improve performance by hiding memory latencies and maximizing resource utilization.
Out-of-Order Execution Example:
Original Order: load r1, 0(r2) # Load from memory
add r3, r4, r5 # Independent add
sub r6, r1, r7 # Depends on load result
Out-of-Order Execution:
Cycle 1: [Fetch all] [Decode all] (simplified: a wide front end handles all three)
Cycle 2: [Execute load] [Execute add] ← the independent add does not wait behind the load
Cycle 3: [Memory load] [Writeback add]
Cycle 4: [Writeback load] [Execute sub] ← sub issues as soon as the load data is available
Cycle 5: [Writeback sub]
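The mechanism behind the reordering is simple in principle: an instruction waits in an issue queue until all of its source operands have been produced, then executes regardless of program order. A toy readiness check:

#define NUM_REGS 64

// Set to 1 when the corresponding register value has been produced.
int operand_ready[NUM_REGS];

typedef struct { int src1, src2; } waiting_op_t;

// An instruction may leave the issue queue once both sources are ready,
// regardless of its position in program order.
int can_issue(const waiting_op_t *op) {
    return operand_ready[op->src1] && operand_ready[op->src2];
}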
Register renaming is a technique used in out-of-order pipelines to eliminate false dependencies. It allows the hardware to use different physical registers for the same architectural register, enabling more parallel execution.
Register Renaming Example:
Original Code:
add r1, r2, r3 # r1 = r2 + r3
sub r4, r1, r5 # true (RAW) dependency on r1
add r1, r6, r7 # reuses r1: WAW with the first add, WAR with sub
mul r8, r1, r9 # reads the new value of r1
Renamed Code:
add p1, r2, r3 # first write of r1 is mapped to p1
sub p4, p1, r5 # still a true dependency on p1
add p2, r6, r7 # second write of r1 gets a different physical register, p2
mul p8, p2, r9 # reads p2
Renaming removes the WAR and WAW hazards on r1, so the second add (and then mul) can execute in parallel with the first add/sub pair; only the true RAW dependencies remain, as they must.
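A toy rename table makes the mechanism concrete: sources read the current mapping, and every destination write allocates a fresh physical register, which is exactly what removes WAR and WAW hazards. A sketch (a real renamer also tracks a free list and commit state):

#define NUM_ARCH_REGS 32
#define NUM_PHYS_REGS 128

typedef struct {
    int map[NUM_ARCH_REGS]; // architectural -> physical register
    int next_free;          // naive allocator; real hardware uses a free list
} rename_table_t;

void rename_init(rename_table_t *table) {
    for (int i = 0; i < NUM_ARCH_REGS; i++) table->map[i] = i;
    table->next_free = NUM_ARCH_REGS;
}

// Rename "rd = op(rs1, rs2)": read the current mapping for the sources,
// then point rd at a brand-new physical register.
void rename_instruction(rename_table_t *table, int rd, int rs1, int rs2,
                        int *prd, int *prs1, int *prs2) {
    *prs1 = table->map[rs1];
    *prs2 = table->map[rs2];
    *prd = table->next_free++;   // unchecked here; a real renamer recycles
    table->map[rd] = *prd;
}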
Pipeline performance is measured by several key metrics: throughput (instructions completed per cycle), latency (cycles for a single instruction to traverse the pipeline), and efficiency (the fraction of cycles spent doing useful work).
Pipeline Performance:
Ideal Pipeline (no hazards):
- Throughput = 1 instruction/cycle
- Latency = N cycles (where N is number of pipeline stages)
Pipeline with Hazards:
- Throughput = Instructions / Total Cycles
- Efficiency = Useful Cycles / Total Cycles
Example:
10 instructions, 15 cycles due to hazards
- Throughput = 10/15 = 0.67 instructions/cycle
- Efficiency = 10/15 = 67%
Cycles Per Instruction (CPI) is a key metric for pipeline performance analysis. It represents the average number of cycles required to complete an instruction.
CPI Calculation:
CPI = Base CPI + Pipeline Stalls + Memory Stalls + Branch Stalls
Where:
- Base CPI = 1 for ideal pipeline
- Pipeline Stalls = Stalls due to data hazards
- Memory Stalls = Stalls due to cache misses
- Branch Stalls = Stalls due to branch mispredictions
Example:
CPI = 1 + 0.2 + 0.1 + 0.15 = 1.45 cycles per instruction
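The same arithmetic is trivial to mechanize. A back-of-the-envelope helper that mirrors the formula above (the 100 MHz clock is an assumed example value):

#include <stdio.h>

// CPI = base + data stalls + memory stalls + branch stalls,
// with each stall term averaged per instruction.
static double effective_cpi(double base, double data_stalls,
                            double mem_stalls, double branch_stalls) {
    return base + data_stalls + mem_stalls + branch_stalls;
}

int main(void) {
    double cpi = effective_cpi(1.0, 0.2, 0.1, 0.15);  // the example above
    double clock_hz = 100e6;                          // assumed 100 MHz core
    double instructions = 1e6;

    printf("CPI = %.2f\n", cpi);                      // 1.45
    printf("time = %.2f ms\n", instructions * cpi / clock_hz * 1e3); // 14.50 ms
    return 0;
}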
Embedded systems often have real-time requirements that place constraints on pipeline design. Predictable timing is often more important than maximum throughput, requiring careful consideration of worst-case execution time (WCET).
Embedded systems must balance performance with power consumption. Techniques like clock gating, pipeline stage disabling, and dynamic voltage scaling can significantly reduce power consumption while maintaining acceptable performance.
Embedded applications often have different characteristics than general-purpose applications: tight inner loops with small working sets, hard real-time deadlines, strict power and cost budgets, and interrupt-driven control code whose worst-case path matters more than its average.
These characteristics can influence pipeline design decisions and optimization strategies.
// Simplified pipeline hazard detection
typedef struct {
    int rs, rt, rd;   // source (rs, rt) and destination (rd) register numbers
    int opcode;
    int reg_write;    // 1 if this instruction writes a register
} instruction_t;

typedef struct {
    instruction_t inst;
    int rs_value, rt_value; // operand values read in Decode
    int alu_result;         // produced in Execute
    int mem_data;           // produced in Memory (loads)
} pipeline_stage_t;
// Hazard detection logic: does the instruction in EX/MEM write a
// register that the instruction in ID/EX wants to read?
int detect_data_hazard(pipeline_stage_t* ex_mem, pipeline_stage_t* id_ex) {
    if (ex_mem->inst.reg_write && ex_mem->inst.rd != 0) {
        if (ex_mem->inst.rd == id_ex->inst.rs ||
            ex_mem->inst.rd == id_ex->inst.rt) {
            return 1; // Data hazard detected
        }
    }
    return 0; // No hazard
}
// Forwarding logic: prefer the most recent value (EX/MEM) over MEM/WB
int get_forwarded_value(pipeline_stage_t* ex_mem, pipeline_stage_t* mem_wb,
                        int reg_num, int current_value) {
    if (ex_mem->inst.reg_write && ex_mem->inst.rd != 0 &&
        ex_mem->inst.rd == reg_num) {
        return ex_mem->alu_result; // Forward from EX/MEM
    }
    if (mem_wb->inst.reg_write && mem_wb->inst.rd != 0 &&
        mem_wb->inst.rd == reg_num) {
        // Loads forward the memory data; an ALU instruction would forward
        // alu_result instead (distinguished by opcode in a full design)
        return mem_wb->mem_data;   // Forward from MEM/WB
    }
    return current_value; // No forwarding needed
}
// Simple pipeline simulator: each call advances the machine one cycle.
// decode_instruction(), execute_alu(), and is_load_store() are assumed
// helper functions defined elsewhere.
typedef struct {
    int pc;
    int instruction;
    int rs, rt, rd;
    int rs_value, rt_value;
    int alu_result;
    int mem_data;
    int is_load; // set during decode; selects the writeback source
} pipeline_instruction_t;

void simulate_pipeline(pipeline_instruction_t* pipeline, int* memory, int* registers) {
    // Advance every instruction one stage; the slot index doubles as the
    // stage number (0=Fetch, 1=Decode, 2=Execute, 3=Memory, 4=Writeback)
    for (int i = 4; i > 0; i--) {
        pipeline[i] = pipeline[i-1];
    }

    // Fetch: the PC counts bytes, but the memory array holds words
    pipeline[0].instruction = memory[pipeline[0].pc / 4];
    pipeline[0].pc += 4;

    // Process each occupied stage
    for (int i = 1; i < 5; i++) {
        switch (i) {
            case 1: // Decode: extract fields and read the register file
                decode_instruction(&pipeline[i]);
                pipeline[i].rs_value = registers[pipeline[i].rs];
                pipeline[i].rt_value = registers[pipeline[i].rt];
                break;
            case 2: // Execute: ALU operation or address calculation
                pipeline[i].alu_result = execute_alu(&pipeline[i]);
                break;
            case 3: // Memory: loads read; stores would write here
                if (is_load_store(&pipeline[i])) {
                    pipeline[i].mem_data = memory[pipeline[i].alu_result / 4];
                }
                break;
            case 4: // Writeback: loads write memory data, ALU ops the result
                if (pipeline[i].rd != 0) {
                    registers[pipeline[i].rd] = pipeline[i].is_load
                        ? pipeline[i].mem_data
                        : pipeline[i].alu_result;
                }
                break;
        }
    }
}
// Simple bimodal branch predictor: a table of 2-bit saturating counters
typedef struct {
    int history[256]; // per-entry 2-bit saturating counter (0..3)
} branch_predictor_t;

int predict_branch(branch_predictor_t* predictor, int pc) {
    // Drop the two byte-offset bits of the PC before indexing the table
    int index = (pc >> 2) & 0xFF;
    // Counter values 2 and 3 predict taken; 0 and 1 predict not taken
    return predictor->history[index] >= 2;
}

void update_predictor(branch_predictor_t* predictor, int pc, int taken) {
    int index = (pc >> 2) & 0xFF;
    int counter = predictor->history[index];

    // Saturating update: nudge the counter toward the actual outcome
    if (taken) {
        if (counter < 3) counter++;
    } else {
        if (counter > 0) counter--;
    }
    predictor->history[index] = counter;
}
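A typical usage pattern, sketched below with the functions defined above: the Fetch stage consults the predictor, and once the branch resolves in Execute, the actual outcome trains the counter.

// Usage sketch: predict at fetch, verify at execute, then train.
void handle_branch(branch_predictor_t* bp, int pc, int actually_taken) {
    int guess = predict_branch(bp, pc);   // consulted in the Fetch stage

    // ... later, when the branch resolves in Execute ...
    update_predictor(bp, pc, actually_taken);

    if (guess != actually_taken) {
        // Misprediction: the wrongly fetched instructions must be
        // flushed and fetch redirected to the correct target.
    }
}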
Several tools can help visualize and analyze pipeline behavior; cycle-accurate simulators such as gem5 can show exactly where each instruction spends its time. Performance profiling tools can help identify pipeline bottlenecks: Linux perf, Intel VTune, and ARM Streamline expose hardware counters for stall cycles, cache misses, and branch mispredictions (for example, perf stat -e cycles,instructions,branch-misses ./app). Pipeline debugging then involves reading those counters, locating the hot code they point to, and inspecting the compiler's generated instructions for avoidable dependencies and branches.
Understand Your Target Architecture: Different processors have different pipeline characteristics and optimization strategies.
Profile Before Optimizing: Use profiling tools to identify actual pipeline bottlenecks rather than optimizing based on assumptions.
Consider the Full System: Pipeline performance depends on the interaction between hardware, compiler, and application software.
Balance Complexity and Performance: Complex optimizations may not always provide significant performance improvements.
Minimize Data Dependencies: Structure code to minimize dependencies between consecutive instructions.
Optimize Memory Access Patterns: Use cache-friendly access patterns to minimize memory stalls.
Reduce Branch Frequency: Minimize conditional branches and use branch prediction-friendly code structures.
Leverage Compiler Optimizations: Modern compilers can perform many pipeline optimizations automatically.
Consider Real-Time Requirements: Ensure that pipeline optimizations don’t compromise timing predictability.
Balance Power and Performance: Choose optimization strategies that provide the best power-performance trade-off for your application.
Understand Hardware Constraints: Work within the limitations of your target hardware.
Future pipeline developments may include deeper speculation, wider issue, and smarter branch prediction, along with hardware mitigations for the security side effects of speculation. New technologies such as domain-specific accelerators and chiplet-based integration may also impact pipeline design. Future systems may involve closer collaboration between software and hardware design, with compilers, runtimes, and microarchitectures co-designed for performance, power, and timing predictability.
This comprehensive guide to Pipeline Architecture provides the foundation for understanding how modern processors achieve high performance through instruction-level parallelism. The concepts covered here are essential for embedded software engineers working with performance-critical applications and understanding processor behavior.