The "Holy Bible" for embedded engineers
Understanding ARM Processors for Embedded Systems
Comprehensive coverage of ARM architecture, instruction sets, and programming models
ARM (Advanced RISC Machine) is a family of Reduced Instruction Set Computing (RISC) architectures that has become the dominant choice for embedded systems, mobile devices, and increasingly for servers and desktop computers.
ARM Architecture Characteristics:
ARM vs. x86:
ARM vs. RISC-V:
┌─────────────────────────────────────┐
│ ARM Architecture │
├─────────────────────────────────────┤
│ Application Layer │
│ (User applications) │
├─────────────────────────────────────┤
│ System Software │
│ (OS, drivers, middleware) │
├─────────────────────────────────────┤
│ ARM Architecture │
│ (ISA, execution model) │
├─────────────────────────────────────┤
│ Microarchitecture │
│ (Pipeline, cache, ALU) │
├─────────────────────────────────────┤
│ Physical Implementation │
│ (Silicon, power, timing) │
└─────────────────────────────────────┘
ARM follows the efficiency and scalability principle—provide a clean, efficient architecture that can scale from simple microcontrollers to complex multi-core processors while maintaining software compatibility.
ARM Design Goals:
The ARM programmer model defines the visible state and behavior that software can observe and control. Understanding this model is essential for writing efficient ARM assembly code and optimizing C programs.
The ARM programmer model follows the simplicity and consistency principle—provide a clean, consistent interface that makes programming straightforward while enabling high performance.
Programmer Model Goals:
General-Purpose Registers:
┌─────────────────────────────────────┐
│ ARM Register Set │
├─────────────────────────────────────┤
│ R0-R12: General Purpose │
│ R13 (SP): Stack Pointer │
│ R14 (LR): Link Register │
│ R15 (PC): Program Counter │
├─────────────────────────────────────┤
│ CPSR: Current Program Status │
│ SPSR: Saved Program Status │
└─────────────────────────────────────┘
Register Usage Conventions:
; ARM register usage conventions
R0-R3: Function parameters and return values
R4-R11: Local variables (must be preserved)
R12: Intra-procedure call scratch register
R13: Stack pointer (SP)
R14: Link register (LR)
R15: Program counter (PC)
CPSR (Current Program Status Register):
; CPSR bit fields
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16
N Z C V Q J RESERVED GE[3:0] RESERVED E A I F T M[4:0]
N: Negative flag (result < 0)
Z: Zero flag (result = 0)
C: Carry flag (unsigned overflow)
V: Overflow flag (signed overflow)
I: IRQ disable
F: FIQ disable
T: Thumb state
M: Mode bits
Processor Modes:
; ARM execution modes
User (USR): Normal program execution
FIQ: Fast interrupt processing
IRQ: Normal interrupt processing
Supervisor (SVC): Protected mode, system calls
Abort (ABT): Memory access violations
Undefined (UND): Undefined instruction
System (SYS): Privileged user mode
Mode Switching:
; Switch to supervisor mode
MSR CPSR_c, #0x13 ; Set mode bits to SVC
; Switch to user mode
MSR CPSR_c, #0x10 ; Set mode bits to USR
; Enable/disable interrupts
CPSID I ; Disable IRQ
CPSIE I ; Enable IRQ
CPSID F ; Disable FIQ
CPSIE F ; Enable FIQ
ARM processors support multiple instruction sets optimized for different use cases. Understanding these instruction sets is crucial for writing efficient code and understanding performance characteristics.
ARM instruction sets follow the efficiency and compatibility principle—provide instruction sets that maximize performance and power efficiency while maintaining software compatibility across different ARM implementations.
Instruction Set Goals:
ARM (32-bit):
; ARM 32-bit instructions
ADD R0, R1, R2 ; R0 = R1 + R2
SUB R0, R1, #10 ; R0 = R1 - 10
LDR R0, [R1] ; R0 = *R1
STR R0, [R1] ; *R1 = R0
MOV R0, #0x100 ; R0 = 0x100
CMP R0, R1 ; Compare R0 and R1
BGT label ; Branch if greater than
Thumb (16-bit):
; Thumb 16-bit instructions
ADD R0, R1 ; R0 = R0 + R1
SUB R0, #5 ; R0 = R0 - 5
LDR R0, [R1] ; R0 = *R1
STR R0, [R1] ; *R1 = R0
MOV R0, #10 ; R0 = 10
CMP R0, R1 ; Compare R0 and R1
BGT label ; Branch if greater than
Thumb-2 (16/32-bit):
; Thumb-2 instructions (mixed 16/32-bit)
ADD.W R0, R1, R2 ; Wide instruction
ADD R0, R1 ; Narrow instruction
LDR.W R0, [R1, #4] ; Wide load with offset
LDR R0, [R1] ; Narrow load
AArch64 (64-bit):
; AArch64 64-bit instructions
ADD X0, X1, X2 ; X0 = X1 + X2
SUB X0, X1, #10 ; X0 = X1 - 10
LDR X0, [X1] ; X0 = *X1
STR X0, [X1] ; *X1 = X0
MOV X0, #0x100 ; X0 = 0x100
CMP X0, X1 ; Compare X0 and X1
B.GT label ; Branch if greater than
Data Processing:
; Arithmetic operations
ADD R0, R1, R2 ; Addition
SUB R0, R1, R2 ; Subtraction
MUL R0, R1, R2 ; Multiplication
SDIV R0, R1, R2 ; Signed division
UDIV R0, R1, R2 ; Unsigned division
; Logical operations
AND R0, R1, R2 ; Bitwise AND
ORR R0, R1, R2 ; Bitwise OR
EOR R0, R1, R2 ; Bitwise XOR
BIC R0, R1, R2 ; Bit clear
Memory Operations:
; Load operations
LDR R0, [R1] ; Load word
LDRB R0, [R1] ; Load byte
LDRH R0, [R1] ; Load halfword
LDRD R0, R1, [R2] ; Load double word
; Store operations
STR R0, [R1] ; Store word
STRB R0, [R1] ; Store byte
STRH R0, [R1] ; Store halfword
STRD R0, R1, [R2] ; Store double word
Control Flow:
; Branch instructions
B label ; Unconditional branch
BL label ; Branch and link (function call)
BX LR ; Branch and exchange (return)
BLX R0 ; Branch, link, and exchange
; Conditional branches
BEQ label ; Branch if equal
BNE label ; Branch if not equal
BGT label ; Branch if greater than
BLT label ; Branch if less than
ARM Cortex processors are implementations of the ARM architecture designed for specific market segments. Each series is optimized for different performance and power requirements.
ARM Cortex follows the market-specific optimization principle—design processor families that are optimized for specific market segments while maintaining software compatibility.
Cortex Design Goals:
Cortex-A Series (Application):
┌─────────────────────────────────────┐
│ Cortex-A Series │
├─────────────────────────────────────┤
│ Cortex-A7: Entry-level mobile │
│ Cortex-A53: Mid-range mobile │
│ Cortex-A72: High-end mobile │
│ Cortex-A76: Premium mobile │
│ Cortex-A78: High-performance │
│ Cortex-X1: Extreme performance │
└─────────────────────────────────────┘
Cortex-R Series (Real-time):
┌─────────────────────────────────────┐
│ Cortex-R Series │
├─────────────────────────────────────┤
│ Cortex-R4: Automotive, industrial │
│ Cortex-R5: Safety-critical apps │
│ Cortex-R7: High-performance RT │
│ Cortex-R8: Multi-core RT │
└─────────────────────────────────────┘
Cortex-M Series (Microcontroller):
┌─────────────────────────────────────┐
│ Cortex-M Series │
├─────────────────────────────────────┤
│ Cortex-M0: Ultra-low power │
│ Cortex-M3: Mid-range MCU │
│ Cortex-M4: DSP + MCU │
│ Cortex-M7: High-performance MCU │
│ Cortex-M33: TrustZone security │
│ Cortex-M55: AI acceleration │
└─────────────────────────────────────┘
Cortex-M0/M0+:
// Cortex-M0 minimal configuration
#include "arm_cm0.h"
// Simple interrupt handler
void SysTick_Handler(void) {
// Handle system tick
GPIOA->ODR ^= GPIO_ODR_OD5; // Toggle LED
}
// Main function
int main(void) {
// Enable GPIOA clock
RCC->AHBENR |= RCC_AHBENR_GPIOAEN;
// Configure PA5 as output
GPIOA->MODER |= GPIO_MODER_MODER5_0;
// Configure SysTick
SysTick_Config(SystemCoreClock / 1000);
while(1) {
// Main loop
}
}
Cortex-M4 with DSP:
// Cortex-M4 DSP operations
#include "arm_math.h"
// FIR filter implementation
void fir_filter(float32_t *input, float32_t *output,
float32_t *coeffs, uint32_t length) {
arm_fir_instance_f32 filter;
arm_fir_init_f32(&filter, length, coeffs,
filter_state, block_size);
arm_fir_f32(&filter, input, output, block_size);
}
// FFT computation
void compute_fft(float32_t *input, float32_t *output,
uint32_t fft_length) {
arm_cfft_instance_f32 fft_instance;
arm_cfft_init_f32(&fft_instance, fft_length);
arm_cfft_f32(&fft_instance, input, output, 0, 1);
}
ARM processors use a sophisticated memory model that includes multiple levels of cache, virtual memory support, and various memory ordering guarantees.
ARM memory model follows the performance and consistency principle—provide a memory system that maximizes performance while maintaining clear consistency guarantees for software.
Memory Model Goals:
ARM Memory Levels:
┌─────────────────────────────────────┐
│ ARM Memory Hierarchy │
├─────────────────────────────────────┤
│ CPU Registers (1 cycle) │
├─────────────────────────────────────┤
│ L1 Cache (2-3 cycles) │
├─────────────────────────────────────┤
│ L2 Cache (10-20 cycles) │
├─────────────────────────────────────┤
│ L3 Cache (40-80 cycles) │
├─────────────────────────────────────┤
│ Main Memory (100-300 cycles) │
├─────────────────────────────────────┤
│ Storage (millions of cycles) │
└─────────────────────────────────────┘
Cache Organization:
// Cache line size and organization
#define CACHE_LINE_SIZE 64
// Cache-aligned data structure
typedef struct {
uint32_t data[CACHE_LINE_SIZE / sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_aligned_t;
// Prefetch data into cache
void prefetch_data(void *ptr) {
__builtin_prefetch(ptr, 0, 3); // Read, high locality
}
ARM Memory Model:
// Memory barriers for ordering
void memory_barriers_example(void) {
int data = 1;
int flag = 0;
// Store data first
data = 42;
// Ensure data is written before flag
__sync_synchronize(); // Full memory barrier
// Set flag to indicate data is ready
flag = 1;
}
// Atomic operations
int atomic_increment(int *ptr) {
return __sync_fetch_and_add(ptr, 1);
}
int atomic_compare_exchange(int *ptr, int expected, int desired) {
return __sync_val_compare_and_swap(ptr, expected, desired);
}
ARM processors use a sophisticated exception handling system that supports multiple exception types, priority levels, and efficient context switching.
ARM exception handling follows the efficiency and reliability principle—provide fast, reliable exception processing that minimizes interrupt latency while maintaining system stability.
Exception Handling Goals:
ARM Exception Hierarchy:
┌─────────────────────────────────────┐
│ ARM Exception Types │
├─────────────────────────────────────┤
│ Reset: Highest priority │
├─────────────────────────────────────┤
│ Data Abort │
├─────────────────────────────────────┤
│ FIQ: Fast Interrupt │
├─────────────────────────────────────┤
│ IRQ: Normal Interrupt │
├─────────────────────────────────────┤
│ Prefetch Abort │
├─────────────────────────────────────┤
│ Software Interrupt │
├─────────────────────────────────────┤
│ Undefined Instruction │
├─────────────────────────────────────┤
│ System Call │
└─────────────────────────────────────┘
Exception Vector Table:
// Exception vector table
typedef void (*exception_handler_t)(void);
// Vector table structure
typedef struct {
uint32_t stack_pointer;
exception_handler_t reset_handler;
exception_handler_t nmi_handler;
exception_handler_t hardfault_handler;
exception_handler_t memmanage_handler;
exception_handler_t busfault_handler;
exception_handler_t usagefault_handler;
exception_handler_t reserved[4];
exception_handler_t svcall_handler;
exception_handler_t debugmonitor_handler;
exception_handler_t reserved2;
exception_handler_t pendsv_handler;
exception_handler_t systick_handler;
exception_handler_t irq_handlers[16];
} vector_table_t;
// Vector table placement
__attribute__((section(".isr_vector")))
vector_table_t vector_table = {
.stack_pointer = (uint32_t)&_estack,
.reset_handler = reset_handler,
.nmi_handler = nmi_handler,
.hardfault_handler = hardfault_handler,
// ... other handlers
};
Interrupt Handler Implementation:
// IRQ handler
void __attribute__((interrupt("IRQ"))) irq_handler(void) {
// Get pending interrupt
uint32_t irq_number = NVIC->IPR[0] & 0xFF;
// Handle specific interrupt
switch(irq_number) {
case TIM2_IRQn:
tim2_handler();
break;
case USART1_IRQn:
usart1_handler();
break;
default:
// Unknown interrupt
break;
}
}
// FIQ handler (fastest interrupt)
void __attribute__((interrupt("FIQ"))) fiq_handler(void) {
// Fast interrupt processing
// Minimal processing for speed
}
ARM processors include various performance features that can significantly improve application performance when properly utilized.
ARM performance features follow the efficiency and scalability principle—provide features that maximize performance while maintaining power efficiency and scalability across different implementations.
Performance Feature Goals:
Instruction-Level Parallelism:
// Optimize for ARM pipeline
void optimized_loop(int *array, int length) {
int i;
// Unroll loop for better pipelining
for (i = 0; i < length - 3; i += 4) {
array[i] = array[i] * 2;
array[i+1] = array[i+1] * 2;
array[i+2] = array[i+2] * 2;
array[i+3] = array[i+3] * 2;
}
// Handle remaining elements
for (; i < length; i++) {
array[i] = array[i] * 2;
}
}
SIMD Instructions:
// Use ARM NEON SIMD
#include <arm_neon.h>
void vector_add(float32_t *a, float32_t *b, float32_t *result, int length) {
int i;
// Process 4 elements at once
for (i = 0; i < length - 3; i += 4) {
float32x4_t va = vld1q_f32(&a[i]);
float32x4_t vb = vld1q_f32(&b[i]);
float32x4_t vr = vaddq_f32(va, vb);
vst1q_f32(&result[i], vr);
}
// Handle remaining elements
for (; i < length; i++) {
result[i] = a[i] + b[i];
}
}
Cache Optimization:
// Cache-friendly data access
void cache_optimized_access(int *matrix, int rows, int cols) {
int i, j;
// Access data in row-major order (cache-friendly)
for (i = 0; i < rows; i++) {
for (j = 0; j < cols; j++) {
matrix[i * cols + j] = i + j;
}
}
}
// Prefetch data
void prefetch_optimized_access(int *array, int length) {
int i;
for (i = 0; i < length; i += 16) {
__builtin_prefetch(&array[i + 16], 0, 3);
// Process current data
array[i] = array[i] * 2;
}
}
ARM development requires specialized tools for compilation, debugging, and performance analysis. Understanding these tools is essential for effective ARM development.
ARM development tools follow the efficiency and usability principle—provide tools that make ARM development efficient and accessible while supporting advanced optimization and debugging features.
Development Tool Goals:
ARM Compiler:
# ARM GCC compilation
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
-mfpu=fpv4-sp-d16 -O2 -c main.c -o main.o
# Link with standard library
arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -mfloat-abi=hard \
-mfpu=fpv4-sp-d16 -T linker.ld main.o -o main.elf
# Generate binary
arm-none-eabi-objcopy -O binary main.elf main.bin
Linker Script:
/* ARM linker script */
MEMORY
{
FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}
SECTIONS
{
.text : {
KEEP(*(.isr_vector))
*(.text*)
*(.rodata*)
. = ALIGN(4);
} > FLASH
.data : {
_sdata = .;
*(.data*)
. = ALIGN(4);
_edata = .;
} > RAM AT> FLASH
.bss : {
_sbss = .;
*(.bss*)
*(COMMON)
. = ALIGN(4);
_ebss = .;
} > RAM
}
Debug Configuration:
// VS Code debug configuration
{
"version": "0.2.0",
"configurations": [
{
"name": "ARM Debug",
"type": "cppdbg",
"request": "launch",
"program": "${workspaceFolder}/build/main.elf",
"args": [],
"stopAtEntry": false,
"cwd": "${workspaceFolder}",
"environment": [],
"externalConsole": false,
"MIMode": "gdb",
"miDebuggerPath": "arm-none-eabi-gdb",
"setupCommands": [
{"text": "target remote localhost:3333"},
{"text": "monitor reset halt"},
{"text": "monitor flash write_image erase main.elf"}
]
}
]
}
ARM architecture provides a powerful, efficient foundation for embedded systems. Understanding ARM’s programmer model, instruction sets, and performance features is essential for creating high-performance, power-efficient embedded applications.
Key Takeaways:
The Path Forward:
As ARM processors become more prevalent in embedded systems, understanding ARM architecture will become increasingly important. Modern ARM processors continue to evolve, providing new features and performance improvements.
Remember: ARM architecture is not just about understanding instructions—it’s about understanding how to write efficient, reliable code that takes advantage of ARM’s design philosophy and performance features. The skills you develop here will enable you to create high-performance, power-efficient embedded systems.