The Embedded New Testament

The "Holy Bible" for embedded engineers


Project maintained by theEmbeddedGeorge Hosted on GitHub Pages — Theme by mattgraham

CPU Architecture

Understanding Processor Design and Instruction Sets
Comprehensive coverage of CPU architectures, instruction sets, and design principles


📋 Table of Contents


🏗️ CPU Architecture Fundamentals

What is CPU Architecture?

CPU architecture defines the design and organization of a processor, including its instruction set, register organization, memory model, and execution pipeline. Understanding CPU architecture is essential for optimizing software performance and system design.

CPU Architecture Characteristics:

CPU Architecture Philosophy

CPU architecture follows the performance and efficiency principle—design processors that maximize performance while maintaining power efficiency, cost-effectiveness, and software compatibility.

Architecture Design Goals:


🔧 ARM Architecture

Advanced RISC Machine Design

ARM architecture is a family of RISC processors designed for efficiency and scalability. ARM processors dominate mobile and embedded markets due to their power efficiency and performance characteristics.

ARM Design Philosophy

ARM follows the efficiency and scalability principle—create processors that maximize performance per watt while supporting various performance levels from simple microcontrollers to high-performance application processors.

ARM Design Goals:

ARM Architecture Features

RISC Design Principles:

; ARM RISC characteristics
; Fixed-length instructions (32-bit ARM, 16-bit Thumb)
ADD R0, R1, R2        ; R0 = R1 + R2
SUB R0, R1, #10       ; R0 = R1 - 10
LDR R0, [R1]          ; Load from memory
STR R0, [R1]          ; Store to memory

; Load-store architecture
MOV R0, #0x100        ; Immediate value to register
LDR R1, [R0]          ; Load from memory address in R0
ADD R2, R1, #5        ; Add immediate to loaded value
STR R2, [R0]          ; Store result back to memory

Register Organization:

; ARM register set
R0-R12: General Purpose Registers
R13 (SP): Stack Pointer
R14 (LR): Link Register
R15 (PC): Program Counter

; Special registers
CPSR: Current Program Status Register
SPSR: Saved Program Status Register

; Register usage conventions
R0-R3: Function parameters and return values
R4-R11: Local variables (must be preserved)
R12: Intra-procedure call scratch register

ARM Instruction Sets:

; ARM (32-bit) instructions
ADD R0, R1, R2        ; 32-bit addition
LDR R0, [R1, #4]      ; Load with offset
STR R0, [R1, #4]      ; Store with offset
BL function_name       ; Branch and link

; Thumb (16-bit) instructions
ADD R0, R1             ; 16-bit addition
LDR R0, [R1]           ; 16-bit load
STR R0, [R1]           ; 16-bit store
B label                ; 16-bit branch

; Thumb-2 (mixed 16/32-bit)
ADD.W R0, R1, R2      ; Wide instruction
ADD R0, R1             ; Narrow instruction

🖥️ x86 Architecture

Complex Instruction Set Computing

x86 architecture represents the dominant desktop and server processor family, known for its complex instruction set and backward compatibility. x86 processors excel at complex workloads and high-performance computing.

x86 Design Philosophy

x86 follows the compatibility and performance principle—maintain backward compatibility while providing high performance for complex workloads and supporting legacy software.

x86 Design Goals:

x86 Architecture Features

CISC Design Principles:

; x86 CISC characteristics
; Variable-length instructions
ADD EAX, EBX          ; Add registers
ADD EAX, [EBX]        ; Add memory to register
ADD EAX, [EBX + 4]    ; Add memory with offset
ADD EAX, 100          ; Add immediate

; Complex addressing modes
MOV EAX, [EBX + ESI*4 + 8]  ; Base + index*scale + offset
LEA EAX, [EBX + ESI*4 + 8]  ; Load effective address

; String operations
MOVSB                  ; Move string byte
STOSB                  ; Store string byte
LODSB                  ; Load string byte

Register Organization:

; x86 register set
EAX, EBX, ECX, EDX: General Purpose (32-bit)
ESI, EDI: Source/Destination Index
ESP: Stack Pointer
EBP: Base Pointer
EIP: Instruction Pointer

; Segment registers (16-bit)
CS: Code Segment
DS: Data Segment
SS: Stack Segment
ES, FS, GS: Extra Segments

; Extended registers (64-bit)
RAX, RBX, RCX, RDX: 64-bit general purpose
R8-R15: Additional 64-bit registers

x86 Instruction Sets:

; 32-bit x86 instructions
MOV EAX, EBX          ; Move register to register
ADD EAX, [EBX]        ; Add memory to register
CALL function_name     ; Call function
RET                    ; Return from function

; 64-bit x86-64 instructions
MOV RAX, RBX          ; 64-bit move
ADD RAX, [RBX]        ; 64-bit add
CALL function_name     ; 64-bit call
RET                    ; 64-bit return

; SIMD instructions
MOVDQA XMM0, [RAX]    ; Move aligned data
PADDB XMM0, XMM1       ; Packed byte addition
PADDD XMM0, XMM1       ; Packed doubleword addition

🚀 RISC-V Architecture

Open Source RISC Architecture

RISC-V is an open-source RISC architecture that provides a clean, modern design with extensibility for custom instruction sets. RISC-V is gaining popularity in embedded systems and research applications.

RISC-V Design Philosophy

RISC-V follows the openness and extensibility principle—provide a clean, open architecture that enables customization and innovation while maintaining simplicity and efficiency.

RISC-V Design Goals:

RISC-V Architecture Features

RISC Design Principles:

; RISC-V RISC characteristics
; Fixed-length instructions (32-bit)
ADD x1, x2, x3        ; x1 = x2 + x3
SUB x1, x2, x3        ; x1 = x2 - x3
LW x1, 4(x2)          ; Load word from memory
SW x1, 4(x2)          ; Store word to memory

; Load-store architecture
LI x1, 0x100          ; Load immediate
LW x2, 0(x1)          ; Load from memory
ADD x3, x2, 5         ; Add immediate
SW x3, 0(x1)          ; Store to memory

Register Organization:

; RISC-V register set
x0: Zero register (always 0)
x1: Return address
x2: Stack pointer
x3: Global pointer
x4: Thread pointer
x5-x7: Temporaries
x8: Frame pointer
x9: Saved register
x10-x11: Function arguments/return values
x12-x17: Function arguments
x18-x27: Saved registers
x28-x31: Temporaries

RISC-V Instruction Sets:

; Base integer instructions (RV32I)
ADD x1, x2, x3        ; Addition
SUB x1, x2, x3        ; Subtraction
AND x1, x2, x3        ; Bitwise AND
OR x1, x2, x3         ; Bitwise OR
XOR x1, x2, x3        ; Bitwise XOR

; Memory instructions
LW x1, offset(x2)     ; Load word
SW x1, offset(x2)     ; Store word
LH x1, offset(x2)     ; Load halfword
SH x1, offset(x2)     ; Store halfword

; Control flow
BEQ x1, x2, label     ; Branch if equal
BNE x1, x2, label     ; Branch if not equal
JAL x1, label         ; Jump and link
JALR x1, x2, offset   ; Jump and link register

📚 Instruction Set Design

Understanding Instruction Set Architecture

Instruction Set Architecture (ISA) defines the interface between software and hardware. Understanding ISA design is essential for optimizing software performance and understanding processor capabilities.

ISA Design Philosophy

ISA design follows the abstraction and efficiency principle—provide a clean abstraction of processor capabilities while enabling efficient hardware implementation and software optimization.

ISA Design Goals:

Instruction Types

Data Processing Instructions:

; Arithmetic operations
ADD R0, R1, R2        ; Addition
SUB R0, R1, R2        ; Subtraction
MUL R0, R1, R2        ; Multiplication
DIV R0, R1, R2        ; Division

; Logical operations
AND R0, R1, R2        ; Bitwise AND
OR R0, R1, R2         ; Bitwise OR
XOR R0, R1, R2        ; Bitwise XOR
NOT R0, R1            ; Bitwise NOT

; Shift operations
LSL R0, R1, #5        ; Logical shift left
LSR R0, R1, #5        ; Logical shift right
ASR R0, R1, #5        ; Arithmetic shift right
ROR R0, R1, #5        ; Rotate right

Memory Instructions:

; Load instructions
LDR R0, [R1]          ; Load word
LDRB R0, [R1]         ; Load byte
LDRH R0, [R1]         ; Load halfword
LDRD R0, R1, [R2]     ; Load double word

; Store instructions
STR R0, [R1]          ; Store word
STRB R0, [R1]         ; Store byte
STRH R0, [R1]         ; Store halfword
STRD R0, R1, [R2]     ; Store double word

; Addressing modes
LDR R0, [R1]          ; Base addressing
LDR R0, [R1, #4]      ; Base + offset
LDR R0, [R1, R2]      ; Base + index
LDR R0, [R1, R2, LSL #2] ; Base + scaled index

Control Flow Instructions:

; Branch instructions
B label                ; Unconditional branch
BEQ label              ; Branch if equal
BNE label              ; Branch if not equal
BGT label              ; Branch if greater than
BLT label              ; Branch if less than

; Function calls
BL function_name       ; Branch and link
BX LR                  ; Branch and exchange
BLX R0                 ; Branch, link, and exchange

; Conditional execution
ADDEQ R0, R1, R2      ; Add if equal
SUBNE R0, R1, R2      ; Subtract if not equal
MOVEQ R0, #5           ; Move if equal

🔄 Pipeline Architecture

Understanding Instruction Pipelining

Instruction pipelining is a technique that overlaps instruction execution to improve performance. Understanding pipeline architecture is essential for optimizing software and understanding performance characteristics.

Pipeline Design Philosophy

Pipeline design follows the parallelism and efficiency principle—overlap instruction execution stages to maximize throughput while maintaining correct execution and handling dependencies.

Pipeline Design Goals:

Pipeline Stages

Basic Pipeline Stages:

┌─────────────────────────────────────┐
│         Basic Pipeline              │
├─────────────────────────────────────┤
│  Fetch: Get instruction from memory │
├─────────────────────────────────────┤
│  Decode: Decode instruction        │
├─────────────────────────────────────┤
│  Execute: Perform operation        │
├─────────────────────────────────────┤
│  Memory: Access memory if needed   │
├─────────────────────────────────────┤
│  Writeback: Write result to register│
└─────────────────────────────────────┘

Pipeline Implementation:

// Simple pipeline simulation
typedef struct {
    uint32_t pc;           // Program counter
    uint32_t instruction;  // Current instruction
    uint32_t opcode;       // Decoded opcode
    uint32_t operand1;     // First operand
    uint32_t operand2;     // Second operand
    uint32_t result;       // Execution result
    bool valid;            // Stage valid
} pipeline_stage_t;

// Pipeline structure
typedef struct {
    pipeline_stage_t fetch;
    pipeline_stage_t decode;
    pipeline_stage_t execute;
    pipeline_stage_t memory;
    pipeline_stage_t writeback;
} pipeline_t;

// Pipeline execution
void execute_pipeline_cycle(pipeline_t *pipeline) {
    // Writeback stage
    if (pipeline->writeback.valid) {
        // Write result to register file
        write_register(pipeline->writeback.result);
        pipeline->writeback.valid = false;
    }
    
    // Memory stage
    if (pipeline->memory.valid) {
        pipeline->writeback = pipeline->memory;
        pipeline->memory.valid = false;
    }
    
    // Execute stage
    if (pipeline->execute.valid) {
        pipeline->memory = pipeline->execute;
        pipeline->memory.valid = true;
    }
    
    // Decode stage
    if (pipeline->decode.valid) {
        pipeline->execute = pipeline->decode;
        pipeline->execute.valid = true;
    }
    
    // Fetch stage
    if (pipeline->fetch.valid) {
        pipeline->decode = pipeline->fetch;
        pipeline->decode.valid = true;
    }
    
    // Fetch new instruction
    pipeline->fetch.instruction = fetch_instruction(pipeline->fetch.pc);
    pipeline->fetch.pc += 4;
    pipeline->fetch.valid = true;
}

Pipeline Hazards

Data Hazards:

// Data hazard detection
bool detect_data_hazard(pipeline_t *pipeline, uint32_t rs, uint32_t rt) {
    // Check if source registers are being written by previous instructions
    if (pipeline->execute.valid && 
        (pipeline->execute.opcode == OP_ADD || 
         pipeline->execute.opcode == OP_SUB)) {
        if (rs == pipeline->execute.operand1 || 
            rt == pipeline->execute.operand1) {
            return true;  // Data hazard detected
        }
    }
    
    if (pipeline->memory.valid && 
        (pipeline->memory.opcode == OP_ADD || 
         pipeline->memory.opcode == OP_SUB)) {
        if (rs == pipeline->memory.operand1 || 
            rt == pipeline->memory.operand1) {
            return true;  // Data hazard detected
        }
    }
    
    return false;  // No data hazard
}

// Hazard resolution with forwarding
void resolve_data_hazard(pipeline_t *pipeline, uint32_t *rs_value, uint32_t *rt_value) {
    // Forward from execute stage
    if (pipeline->execute.valid && 
        pipeline->execute.opcode == OP_ADD) {
        if (pipeline->execute.operand1 == rs_value) {
            *rs_value = pipeline->execute.result;
        }
        if (pipeline->execute.operand1 == rt_value) {
            *rt_value = pipeline->execute.result;
        }
    }
    
    // Forward from memory stage
    if (pipeline->memory.valid && 
        pipeline->memory.opcode == OP_ADD) {
        if (pipeline->memory.operand1 == rs_value) {
            *rs_value = pipeline->memory.result;
        }
        if (pipeline->memory.operand1 == rt_value) {
            *rt_value = pipeline->memory.result;
        }
    }
}

Control Hazards:

// Control hazard handling
void handle_control_hazard(pipeline_t *pipeline, uint32_t target_pc) {
    // Flush pipeline on branch
    pipeline->fetch.valid = false;
    pipeline->decode.valid = false;
    pipeline->execute.valid = false;
    
    // Set new program counter
    pipeline->fetch.pc = target_pc;
    
    // Restart pipeline
    pipeline->fetch.valid = true;
}

// Branch prediction
bool predict_branch(uint32_t pc, uint32_t target) {
    // Simple static prediction: always predict taken for backward branches
    return target < pc;
}

📊 Performance Features

Understanding CPU Performance

Modern CPUs include various performance features that can significantly improve application performance. Understanding these features is essential for optimizing software and system design.

Performance Feature Philosophy

Performance features follow the optimization and efficiency principle—provide hardware mechanisms that improve performance while maintaining power efficiency and system stability.

Performance Feature Goals:

Performance Optimization Techniques

Instruction-Level Parallelism:

// Superscalar execution simulation
typedef struct {
    pipeline_t *pipelines;     // Multiple pipelines
    uint32_t num_pipelines;    // Number of pipelines
    uint32_t current_pipeline; // Current pipeline index
} superscalar_pipeline_t;

// Execute multiple instructions per cycle
void execute_superscalar_cycle(superscalar_pipeline_t *superscalar) {
    for (uint32_t i = 0; i < superscalar->num_pipelines; i++) {
        if (can_issue_instruction(i)) {
            execute_pipeline_cycle(&superscalar->pipelines[i]);
        }
    }
}

// Check if instruction can be issued
bool can_issue_instruction(uint32_t pipeline_id) {
    // Check for resource conflicts
    if (has_resource_conflict(pipeline_id)) {
        return false;
    }
    
    // Check for data dependencies
    if (has_data_dependency(pipeline_id)) {
        return false;
    }
    
    return true;
}

Out-of-Order Execution:

// Out-of-order execution simulation
typedef struct {
    uint32_t instruction_id;   // Unique instruction identifier
    uint32_t opcode;           // Instruction opcode
    uint32_t operand1;         // First operand
    uint32_t operand2;         // Second operand
    uint32_t result;           // Execution result
    bool ready;                // Ready for execution
    bool completed;            // Execution completed
} instruction_t;

// Instruction reorder buffer
typedef struct {
    instruction_t *instructions;    // Instruction buffer
    uint32_t head;                 // Head pointer
    uint32_t tail;                 // Tail pointer
    uint32_t size;                 // Buffer size
} reorder_buffer_t;

// Execute instructions out of order
void execute_out_of_order(reorder_buffer_t *rob) {
    for (uint32_t i = 0; i < rob->size; i++) {
        if (rob->instructions[i].ready && !rob->instructions[i].completed) {
            // Execute instruction
            execute_instruction(&rob->instructions[i]);
            rob->instructions[i].completed = true;
        }
    }
}

SIMD Instructions:

// SIMD vector operations
typedef struct {
    uint32_t data[4];  // 4-element vector
} vector_t;

// Vector addition
vector_t vector_add(vector_t a, vector_t b) {
    vector_t result;
    for (int i = 0; i < 4; i++) {
        result.data[i] = a.data[i] + b.data[i];
    }
    return result;
}

// Vector multiplication
vector_t vector_mul(vector_t a, vector_t b) {
    vector_t result;
    for (int i = 0; i < 4; i++) {
        result.data[i] = a.data[i] * b.data[i];
    }
    return result;
}

// SIMD matrix multiplication
void matrix_multiply_simd(float *a, float *b, float *result, int size) {
    for (int i = 0; i < size; i += 4) {
        // Load vectors
        vector_t va = load_vector(&a[i]);
        vector_t vb = load_vector(&b[i]);
        
        // Perform vector operations
        vector_t vr = vector_add(va, vb);
        
        // Store result
        store_vector(&result[i], vr);
    }
}

🔍 Architecture Comparison

Comparing CPU Architectures

Different CPU architectures have different strengths and trade-offs. Understanding these differences is essential for selecting the right architecture for specific applications.

Architecture Comparison Philosophy

Architecture comparison follows the analysis and selection principle—understand the strengths and weaknesses of different architectures to make informed decisions for specific applications.

Comparison Goals:

Performance Comparison

Performance Metrics:

// Performance measurement structure
typedef struct {
    uint64_t instructions_executed;    // Total instructions
    uint64_t cycles_consumed;          // Total cycles
    uint64_t cache_misses;             // Cache miss count
    uint64_t branch_mispredictions;    // Branch misprediction count
    double cpi;                        // Cycles per instruction
    double cache_miss_rate;            // Cache miss rate
    double branch_misprediction_rate;  // Branch misprediction rate
} performance_metrics_t;

// Calculate performance metrics
void calculate_performance_metrics(performance_metrics_t *metrics) {
    metrics->cpi = (double)metrics->cycles_consumed / metrics->instructions_executed;
    metrics->cache_miss_rate = (double)metrics->cache_misses / metrics->instructions_executed;
    metrics->branch_misprediction_rate = (double)metrics->branch_mispredictions / metrics->instructions_executed;
}

// Compare architectures
void compare_architectures(performance_metrics_t *arm, 
                          performance_metrics_t *x86, 
                          performance_metrics_t *riscv) {
    printf("Architecture Performance Comparison:\n");
    printf("ARM:     CPI=%.2f, Cache Miss=%.2f%%, Branch Miss=%.2f%%\n",
           arm->cpi, arm->cache_miss_rate * 100, arm->branch_misprediction_rate * 100);
    printf("x86:     CPI=%.2f, Cache Miss=%.2f%%, Branch Miss=%.2f%%\n",
           x86->cpi, x86->cache_miss_rate * 100, x86->branch_misprediction_rate * 100);
    printf("RISC-V:  CPI=%.2f, Cache Miss=%.2f%%, Branch Miss=%.2f%%\n",
           riscv->cpi, riscv->cache_miss_rate * 100, riscv->branch_misprediction_rate * 100);
}

Power Efficiency:

// Power efficiency comparison
typedef struct {
    double performance;     // Performance score
    double power_consumption; // Power consumption (W)
    double efficiency;      // Performance per watt
} power_efficiency_t;

// Calculate power efficiency
void calculate_power_efficiency(power_efficiency_t *metrics) {
    metrics->efficiency = metrics->performance / metrics->power_consumption;
}

// Compare power efficiency
void compare_power_efficiency(power_efficiency_t *arm, 
                             power_efficiency_t *x86, 
                             power_efficiency_t *riscv) {
    printf("Power Efficiency Comparison:\n");
    printf("ARM:     Performance=%.2f, Power=%.2fW, Efficiency=%.2f\n",
           arm->performance, arm->power_consumption, arm->efficiency);
    printf("x86:     Performance=%.2f, Power=%.2fW, Efficiency=%.2f\n",
           x86->performance, x86->power_consumption, x86->efficiency);
    printf("RISC-V:  Performance=%.2f, Power=%.2fW, Efficiency=%.2f\n",
           riscv->performance, riscv->power_consumption, riscv->efficiency);
}

🎯 Conclusion

CPU architecture is fundamental to understanding computer systems and optimizing software performance. Understanding different architectures, instruction sets, and performance features is essential for creating efficient embedded systems.

Key Takeaways:

The Path Forward:

As CPU architectures continue to evolve, understanding their design principles and performance characteristics will become increasingly important. Modern architectures continue to push the boundaries of performance and efficiency.

Remember: CPU architecture is not just about understanding hardware—it’s about understanding how to write efficient software that takes advantage of architectural features. The skills you develop here will enable you to create high-performance, efficient embedded systems.