The Embedded New Testament

The "Holy Bible" for embedded engineers

Project maintained by theEmbeddedGeorge Hosted on GitHub Pages — Theme by mattgraham

ARM Architecture

Understanding ARM Processors for Embedded Systems
Comprehensive coverage of ARM architecture, instruction sets, and programming models

🏗️ ARM Architecture Fundamentals

What is ARM Architecture?

ARM (Advanced RISC Machine) is a family of Reduced Instruction Set Computing (RISC) architectures that has become the dominant choice for embedded systems, mobile devices, and increasingly for servers and desktop computers.

ARM Architecture Characteristics:

RISC Design: Simple, fixed-length instructions
Load-Store Architecture: Separate memory access and computation
Register-Based: Extensive use of general-purpose registers
Scalable: From simple microcontrollers to high-performance processors
Power Efficient: Optimized for low power consumption
Licensed IP: ARM licenses designs to semiconductor companies

ARM vs. Other Architectures

ARM vs. x86:

ARM: RISC, power-efficient, mobile-first
x86: CISC, performance-focused, desktop-first
ARM: Load-store architecture
x86: Complex addressing modes

ARM vs. RISC-V:

ARM: Proprietary, mature ecosystem
RISC-V: Open-source, growing ecosystem
ARM: Extensive toolchain support
RISC-V: Customizable instruction sets

┌─────────────────────────────────────┐
│         ARM Architecture            │
├─────────────────────────────────────┤
│         Application Layer           │
│      (User applications)            │
├─────────────────────────────────────┤
│         System Software             │
│      (OS, drivers, middleware)      │
├─────────────────────────────────────┤
│         ARM Architecture            │
│      (ISA, execution model)        │
├─────────────────────────────────────┤
│         Microarchitecture           │
│      (Pipeline, cache, ALU)        │
├─────────────────────────────────────┤
│         Physical Implementation     │
│      (Silicon, power, timing)      │
└─────────────────────────────────────┘

ARM Architecture Philosophy

ARM follows the efficiency and scalability principle—provide a clean, efficient architecture that can scale from simple microcontrollers to complex multi-core processors while maintaining software compatibility.

ARM Design Goals:

Efficiency: Maximize performance per watt
Scalability: Support various performance levels
Compatibility: Maintain software compatibility across generations
Simplicity: Clean, easy-to-understand architecture
Flexibility: Support various implementation styles

🔧 ARM Programmer Model

Understanding the ARM Programming Interface

The ARM programmer model defines the visible state and behavior that software can observe and control. Understanding this model is essential for writing efficient ARM assembly code and optimizing C programs.

ARM Programmer Model Philosophy

The ARM programmer model follows the simplicity and consistency principle—provide a clean, consistent interface that makes programming straightforward while enabling high performance.

Programmer Model Goals:

Simplicity: Easy to understand and use
Consistency: Uniform behavior across instructions
Efficiency: Enable high-performance implementations
Compatibility: Maintain backward compatibility
Flexibility: Support various programming styles

ARM Registers

General-Purpose Registers:

┌─────────────────────────────────────┐
│         ARM Register Set            │
├─────────────────────────────────────┤
│  R0-R12: General Purpose           │
│  R13 (SP): Stack Pointer           │
│  R14 (LR): Link Register           │
│  R15 (PC): Program Counter         │
├─────────────────────────────────────┤
│  CPSR: Current Program Status      │
│  SPSR: Saved Program Status        │
└─────────────────────────────────────┘

Register Usage Conventions:

; ARM register usage conventions
R0-R3: Function parameters and return values
R4-R11: Local variables (must be preserved)
R12: Intra-procedure call scratch register
R13: Stack pointer (SP)
R14: Link register (LR)
R15: Program counter (PC)

CPSR (Current Program Status Register):

; CPSR bit fields
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
N  Z  C  V  Q  J  RESERVED  GE[3:0]  RESERVED  E  A  I  F  T  M[4:0]

N: Negative flag (result < 0)
Z: Zero flag (result = 0)
C: Carry flag (unsigned overflow)
V: Overflow flag (signed overflow)
I: IRQ disable
F: FIQ disable
T: Thumb state
M: Mode bits

ARM Execution Modes

Processor Modes:

; ARM execution modes
User (USR): Normal program execution
FIQ: Fast interrupt processing
IRQ: Normal interrupt processing
Supervisor (SVC): Protected mode, system calls
Abort (ABT): Memory access violations
Undefined (UND): Undefined instruction
System (SYS): Privileged user mode

Mode Switching:

; Switch to supervisor mode
MSR CPSR_c, #0x13    ; Set mode bits to SVC

; Switch to user mode
MSR CPSR_c, #0x10    ; Set mode bits to USR

; Enable/disable interrupts
CPSID I               ; Disable IRQ
CPSIE I               ; Enable IRQ
CPSID F               ; Disable FIQ
CPSIE F               ; Enable FIQ

📚 Instruction Set Architecture

ARM Instruction Sets

ARM processors support multiple instruction sets optimized for different use cases. Understanding these instruction sets is crucial for writing efficient code and understanding performance characteristics.

ARM Instruction Set Philosophy

ARM instruction sets follow the efficiency and compatibility principle—provide instruction sets that maximize performance and power efficiency while maintaining software compatibility across different ARM implementations.

Instruction Set Goals:

Efficiency: Maximize performance per instruction
Density: Minimize code size
Compatibility: Maintain software compatibility
Flexibility: Support various programming needs
Performance: Enable high-performance implementations

ARM Instruction Sets

ARM (32-bit):

; ARM 32-bit instructions
ADD R0, R1, R2        ; R0 = R1 + R2
SUB R0, R1, #10       ; R0 = R1 - 10
LDR R0, [R1]          ; R0 = *R1
STR R0, [R1]          ; *R1 = R0
MOV R0, #0x100        ; R0 = 0x100
CMP R0, R1             ; Compare R0 and R1
BGT label              ; Branch if greater than

Thumb (16-bit):

; Thumb 16-bit instructions
ADD R0, R1             ; R0 = R0 + R1
SUB R0, #5             ; R0 = R0 - 5
LDR R0, [R1]          ; R0 = *R1
STR R0, [R1]          ; *R1 = R0
MOV R0, #10            ; R0 = 10
CMP R0, R1             ; Compare R0 and R1
BGT label              ; Branch if greater than

Thumb-2 (16/32-bit):

; Thumb-2 instructions (mixed 16/32-bit)
ADD.W R0, R1, R2      ; Wide instruction
ADD R0, R1             ; Narrow instruction
LDR.W R0, [R1, #4]    ; Wide load with offset
LDR R0, [R1]          ; Narrow load

AArch64 (64-bit):

; AArch64 64-bit instructions
ADD X0, X1, X2        ; X0 = X1 + X2
SUB X0, X1, #10       ; X0 = X1 - 10
LDR X0, [X1]          ; X0 = *X1
STR X0, [X1]          ; *X1 = X0
MOV X0, #0x100        ; X0 = 0x100
CMP X0, X1             ; Compare X0 and X1
B.GT label             ; Branch if greater than

ARM Instruction Categories

Data Processing:

; Arithmetic operations
ADD R0, R1, R2        ; Addition
SUB R0, R1, R2        ; Subtraction
MUL R0, R1, R2        ; Multiplication
SDIV R0, R1, R2       ; Signed division
UDIV R0, R1, R2       ; Unsigned division

; Logical operations
AND R0, R1, R2        ; Bitwise AND
ORR R0, R1, R2        ; Bitwise OR
EOR R0, R1, R2        ; Bitwise XOR
BIC R0, R1, R2        ; Bit clear

Memory Operations:

; Load operations
LDR R0, [R1]          ; Load word
LDRB R0, [R1]         ; Load byte
LDRH R0, [R1]         ; Load halfword
LDRD R0, R1, [R2]     ; Load double word

; Store operations
STR R0, [R1]          ; Store word
STRB R0, [R1]         ; Store byte
STRH R0, [R1]         ; Store halfword
STRD R0, R1, [R2]     ; Store double word

Control Flow:

; Branch instructions
B label                ; Unconditional branch
BL label               ; Branch and link (function call)
BX LR                  ; Branch and exchange (return)
BLX R0                 ; Branch, link, and exchange

; Conditional branches
BEQ label              ; Branch if equal
BNE label              ; Branch if not equal
BGT label              ; Branch if greater than
BLT label              ; Branch if less than

🚀 ARM Cortex Series

Understanding ARM Cortex Processors

ARM Cortex processors are implementations of the ARM architecture designed for specific market segments. Each series is optimized for different performance and power requirements.

ARM Cortex Philosophy

ARM Cortex follows the market-specific optimization principle—design processor families that are optimized for specific market segments while maintaining software compatibility.

Cortex Design Goals:

Performance: Maximize performance for target market
Power Efficiency: Optimize power consumption
Cost: Minimize silicon area and cost
Compatibility: Maintain software compatibility
Scalability: Support various performance levels

Cortex Processor Families

Cortex-A Series (Application):

┌─────────────────────────────────────┐
│         Cortex-A Series             │
├─────────────────────────────────────┤
│  Cortex-A7: Entry-level mobile     │
│  Cortex-A53: Mid-range mobile      │
│  Cortex-A72: High-end mobile       │
│  Cortex-A76: Premium mobile        │
│  Cortex-A78: High-performance      │
│  Cortex-X1: Extreme performance    │
└─────────────────────────────────────┘

Cortex-R Series (Real-time):

┌─────────────────────────────────────┐
│         Cortex-R Series             │
├─────────────────────────────────────┤
│  Cortex-R4: Automotive, industrial │
│  Cortex-R5: Safety-critical apps   │
│  Cortex-R7: High-performance RT    │
│  Cortex-R8: Multi-core RT          │
└─────────────────────────────────────┘

Cortex-M Series (Microcontroller):

┌─────────────────────────────────────┐
│         Cortex-M Series             │
├─────────────────────────────────────┤
│  Cortex-M0: Ultra-low power        │
│  Cortex-M3: Mid-range MCU          │
│  Cortex-M4: DSP + MCU              │
│  Cortex-M7: High-performance MCU   │
│  Cortex-M33: TrustZone security    │
│  Cortex-M55: AI acceleration       │
└─────────────────────────────────────┘

Cortex-M Microcontroller Features

Cortex-M0/M0+:

// Cortex-M0 minimal configuration
#include "arm_cm0.h"

// Simple interrupt handler
void SysTick_Handler(void) {
    // Handle system tick
    GPIOA->ODR ^= GPIO_ODR_OD5;  // Toggle LED
}

// Main function
int main(void) {
    // Enable GPIOA clock
    RCC->AHBENR |= RCC_AHBENR_GPIOAEN;
    
    // Configure PA5 as output
    GPIOA->MODER |= GPIO_MODER_MODER5_0;
    
    // Configure SysTick
    SysTick_Config(SystemCoreClock / 1000);
    
    while(1) {
        // Main loop
    }
}

Cortex-M4 with DSP:

// Cortex-M4 DSP operations
#include "arm_math.h"

// FIR filter implementation
void fir_filter(float32_t *input, float32_t *output, 
                float32_t *coeffs, uint32_t length) {
    arm_fir_instance_f32 filter;
    arm_fir_init_f32(&filter, length, coeffs, 
                     filter_state, block_size);
    arm_fir_f32(&filter, input, output, block_size);
}

// FFT computation
void compute_fft(float32_t *input, float32_t *output, 
                 uint32_t fft_length) {
    arm_cfft_instance_f32 fft_instance;
    arm_cfft_init_f32(&fft_instance, fft_length);
    arm_cfft_f32(&fft_instance, input, output, 0, 1);
}

💾 Memory Model

Understanding ARM Memory Architecture

ARM processors use a sophisticated memory model that includes multiple levels of cache, virtual memory support, and various memory ordering guarantees.

ARM Memory Model Philosophy

ARM memory model follows the performance and consistency principle—provide a memory system that maximizes performance while maintaining clear consistency guarantees for software.

Memory Model Goals:

Performance: Maximize memory access speed
Consistency: Clear memory ordering guarantees
Efficiency: Minimize power consumption
Scalability: Support various memory sizes
Compatibility: Maintain software compatibility

Memory Hierarchy

ARM Memory Levels:

┌─────────────────────────────────────┐
│         ARM Memory Hierarchy        │
├─────────────────────────────────────┤
│  CPU Registers (1 cycle)            │
├─────────────────────────────────────┤
│  L1 Cache (2-3 cycles)              │
├─────────────────────────────────────┤
│  L2 Cache (10-20 cycles)            │
├─────────────────────────────────────┤
│  L3 Cache (40-80 cycles)            │
├─────────────────────────────────────┤
│  Main Memory (100-300 cycles)       │
├─────────────────────────────────────┤
│  Storage (millions of cycles)       │
└─────────────────────────────────────┘

Cache Organization:

// Cache line size and organization
#define CACHE_LINE_SIZE 64

// Cache-aligned data structure
typedef struct {
    uint32_t data[CACHE_LINE_SIZE / sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_aligned_t;

// Prefetch data into cache
void prefetch_data(void *ptr) {
    __builtin_prefetch(ptr, 0, 3);  // Read, high locality
}

Memory Ordering

ARM Memory Model:

// Memory barriers for ordering
void memory_barriers_example(void) {
    int data = 1;
    int flag = 0;
    
    // Store data first
    data = 42;
    
    // Ensure data is written before flag
    __sync_synchronize();  // Full memory barrier
    
    // Set flag to indicate data is ready
    flag = 1;
}

// Atomic operations
int atomic_increment(int *ptr) {
    return __sync_fetch_and_add(ptr, 1);
}

int atomic_compare_exchange(int *ptr, int expected, int desired) {
    return __sync_val_compare_and_swap(ptr, expected, desired);
}

⚡ Exception Handling

Understanding ARM Exception Processing

ARM processors use a sophisticated exception handling system that supports multiple exception types, priority levels, and efficient context switching.

ARM Exception Philosophy

ARM exception handling follows the efficiency and reliability principle—provide fast, reliable exception processing that minimizes interrupt latency while maintaining system stability.

Exception Handling Goals:

Speed: Minimize interrupt latency
Reliability: Ensure stable exception processing
Flexibility: Support various exception types
Efficiency: Minimize overhead
Compatibility: Maintain software compatibility

Exception Types

ARM Exception Hierarchy:

┌─────────────────────────────────────┐
│         ARM Exception Types         │
├─────────────────────────────────────┤
│  Reset: Highest priority            │
├─────────────────────────────────────┤
│  Data Abort                         │
├─────────────────────────────────────┤
│  FIQ: Fast Interrupt               │
├─────────────────────────────────────┤
│  IRQ: Normal Interrupt             │
├─────────────────────────────────────┤
│  Prefetch Abort                     │
├─────────────────────────────────────┤
│  Software Interrupt                 │
├─────────────────────────────────────┤
│  Undefined Instruction              │
├─────────────────────────────────────┤
│  System Call                        │
└─────────────────────────────────────┘

Exception Vector Table:

// Exception vector table
typedef void (*exception_handler_t)(void);

// Vector table structure
typedef struct {
    uint32_t stack_pointer;
    exception_handler_t reset_handler;
    exception_handler_t nmi_handler;
    exception_handler_t hardfault_handler;
    exception_handler_t memmanage_handler;
    exception_handler_t busfault_handler;
    exception_handler_t usagefault_handler;
    exception_handler_t reserved[4];
    exception_handler_t svcall_handler;
    exception_handler_t debugmonitor_handler;
    exception_handler_t reserved2;
    exception_handler_t pendsv_handler;
    exception_handler_t systick_handler;
    exception_handler_t irq_handlers[16];
} vector_table_t;

// Vector table placement
__attribute__((section(".isr_vector")))
vector_table_t vector_table = {
    .stack_pointer = (uint32_t)&_estack,
    .reset_handler = reset_handler,
    .nmi_handler = nmi_handler,
    .hardfault_handler = hardfault_handler,
    // ... other handlers
};

Interrupt Handler Implementation:

// IRQ handler
void __attribute__((interrupt("IRQ"))) irq_handler(void) {
    // Get pending interrupt
    uint32_t irq_number = NVIC->IPR[0] & 0xFF;
    
    // Handle specific interrupt
    switch(irq_number) {
        case TIM2_IRQn:
            tim2_handler();
            break;
        case USART1_IRQn:
            usart1_handler();
            break;
        default:
            // Unknown interrupt
            break;
    }
}

// FIQ handler (fastest interrupt)
void __attribute__((interrupt("FIQ"))) fiq_handler(void) {
    // Fast interrupt processing
    // Minimal processing for speed
}

📊 Performance Features

Optimizing ARM Performance

ARM processors include various performance features that can significantly improve application performance when properly utilized.

ARM Performance Philosophy

ARM performance features follow the efficiency and scalability principle—provide features that maximize performance while maintaining power efficiency and scalability across different implementations.

Performance Feature Goals:

Efficiency: Maximize performance per watt
Scalability: Support various performance levels
Compatibility: Maintain software compatibility
Flexibility: Support various optimization strategies
Reliability: Ensure stable performance

Performance Optimization Techniques

Instruction-Level Parallelism:

// Optimize for ARM pipeline
void optimized_loop(int *array, int length) {
    int i;
    // Unroll loop for better pipelining
    for (i = 0; i < length - 3; i += 4) {
        array[i] = array[i] * 2;
        array[i+1] = array[i+1] * 2;
        array[i+2] = array[i+2] * 2;
        array[i+3] = array[i+3] * 2;
    }
    
    // Handle remaining elements
    for (; i < length; i++) {
        array[i] = array[i] * 2;
    }
}

SIMD Instructions:

// Use ARM NEON SIMD
#include <arm_neon.h>

void vector_add(float32_t *a, float32_t *b, float32_t *result, int length) {
    int i;
    // Process 4 elements at once
    for (i = 0; i < length - 3; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        float32x4_t vr = vaddq_f32(va, vb);
        vst1q_f32(&result[i], vr);
    }
    
    // Handle remaining elements
    for (; i < length; i++) {
        result[i] = a[i] + b[i];
    }
}

Cache Optimization:

// Cache-friendly data access
void cache_optimized_access(int *matrix, int rows, int cols) {
    int i, j;
    
    // Access data in row-major order (cache-friendly)
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            matrix[i * cols + j] = i + j;
        }
    }
}

// Prefetch data
void prefetch_optimized_access(int *array, int length) {
    int i;
    for (i = 0; i < length; i += 16) {
        __builtin_prefetch(&array[i + 16], 0, 3);
        // Process current data
        array[i] = array[i] * 2;
    }
}

🛠️ ARM Development Tools

Tools for ARM Development

ARM development requires specialized tools for compilation, debugging, and performance analysis. Understanding these tools is essential for effective ARM development.

ARM Development Tool Philosophy

ARM development tools follow the efficiency and usability principle—provide tools that make ARM development efficient and accessible while supporting advanced optimization and debugging features.

Development Tool Goals:

Efficiency: Fast compilation and debugging
Usability: Easy to use and understand
Performance: Support for optimization
Debugging: Comprehensive debugging support
Compatibility: Support for various ARM targets

Development Toolchain

ARM Compiler:

# ARM GCC compilation
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
    -mfpu=fpv4-sp-d16 -O2 -c main.c -o main.o

# Link with standard library
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
    -mfpu=fpv4-sp-d16 -T linker.ld main.o -o main.elf

# Generate binary
arm-none-eabi-objcopy -O binary main.elf main.bin

Linker Script:

/* ARM linker script */
MEMORY
{
    FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
    RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
    .text : {
        KEEP(*(.isr_vector))
        *(.text*)
        *(.rodata*)
        . = ALIGN(4);
    } > FLASH
    
    .data : {
        _sdata = .;
        *(.data*)
        . = ALIGN(4);
        _edata = .;
    } > RAM AT> FLASH
    
    .bss : {
        _sbss = .;
        *(.bss*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = .;
    } > RAM
}

Debug Configuration:

// VS Code debug configuration
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "ARM Debug",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/build/main.elf",
            "args": [],
            "stopAtEntry": false,
            "cwd": "${workspaceFolder}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "miDebuggerPath": "arm-none-eabi-gdb",
            "setupCommands": [
                {"text": "target remote localhost:3333"},
                {"text": "monitor reset halt"},
                {"text": "monitor flash write_image erase main.elf"}
            ]
        }
    ]
}

🎯 Conclusion

ARM architecture provides a powerful, efficient foundation for embedded systems. Understanding ARM’s programmer model, instruction sets, and performance features is essential for creating high-performance, power-efficient embedded applications.

Key Takeaways:

ARM architecture is RISC-based with load-store design
Multiple instruction sets support various use cases
Cortex processors are optimized for specific markets
Sophisticated memory model enables high performance
Efficient exception handling minimizes interrupt latency
Performance features enable optimization
Comprehensive toolchain supports development

The Path Forward:

As ARM processors become more prevalent in embedded systems, understanding ARM architecture will become increasingly important. Modern ARM processors continue to evolve, providing new features and performance improvements.

Remember: ARM architecture is not just about understanding instructions—it’s about understanding how to write efficient, reliable code that takes advantage of ARM’s design philosophy and performance features. The skills you develop here will enable you to create high-performance, power-efficient embedded systems.