The "Holy Bible" for embedded engineers
Cache-aware programming is the practice of structuring code and data to work efficiently with the CPU's cache memory hierarchy: understanding how caches behave, then designing algorithms and data structures to minimize cache misses and maximize cache hits. In embedded systems, understanding cache behavior is crucial for real-time performance and power efficiency.
Modern CPUs are much faster than memory. The cache acts as a high-speed buffer between the CPU and main memory, significantly reducing memory access latency.
Memory Hierarchy Performance:
┌─────────────────┬─────────────┬─────────────────┐
│ Level           │ Latency     │ Size            │
├─────────────────┼─────────────┼─────────────────┤
│ L1 Cache        │ 1-3 ns      │ 32-64 KB        │
│ L2 Cache        │ 10-20 ns    │ 256-512 KB      │
│ L3 Cache        │ 40-80 ns    │ 4-32 MB         │
│ Main Memory     │ 100-300 ns  │ 4-32 GB         │
│ Disk/Flash      │ 10-100 μs   │ 100+ GB         │
└─────────────────┴─────────────┴─────────────────┘
High Impact Scenarios: tight loops over large arrays or buffers (signal and image processing, packet handling), data shared between cores or with DMA engines, and any code whose working set exceeds the cache.
Low Impact Scenarios: working sets that already fit in L1, I/O-bound or peripheral-bound code, and rarely executed paths.
A cache is a small, fast memory that stores frequently accessed data. When the CPU needs data, it first checks the cache. If the data is found (cache hit), it’s retrieved quickly. If not found (cache miss), the data must be fetched from slower memory.
Cache Structure:
┌─────────────────────────────────────────────────────────────┐
│                        Cache Memory                         │
├─────────────────┬─────────────────┬─────────────────┬───────┤
│  Cache Line 0   │  Cache Line 1   │  Cache Line 2   │  ...  │
│ ┌─────────────┐ │ ┌─────────────┐ │ ┌─────────────┐ │       │
│ │ Tag         │ │ │ Tag         │ │ │ Tag         │ │       │
│ │ Data        │ │ │ Data        │ │ │ Data        │ │       │
│ │ Valid       │ │ │ Valid       │ │ │ Valid       │ │       │
│ └─────────────┘ │ └─────────────┘ │ └─────────────┘ │       │
└─────────────────┴─────────────────┴─────────────────┴───────┘
A cache line is the smallest unit of data that can be transferred between cache and memory. Typical cache line sizes are 32, 64, or 128 bytes.
Cache Line Structure:
┌─────────────────────────────────────────────────────────────┐
│                    Cache Line (64 bytes)                    │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Byte 0  │ Byte 1  │ Byte 2  │   ...   │ Byte 62 │  Byte 63  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
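Because transfers happen in whole lines, the line a given address belongs to is found by masking off the low bits. A minimal sketch (assuming 64-byte lines; the address is just an example):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u // assumed cache line size in bytes

int main(void) {
    uintptr_t addr = 0x20001234u;                              // example address
    uintptr_t line_base = addr & ~(uintptr_t)(LINE_SIZE - 1);  // start of its line
    uintptr_t offset = addr & (LINE_SIZE - 1);                 // byte within the line
    printf("0x%lx -> line base 0x%lx, offset %lu\n",
           (unsigned long)addr, (unsigned long)line_base, (unsigned long)offset);
    return 0; // prints: 0x20001234 -> line base 0x20001200, offset 52
}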
Cache Hit: the requested data is already in the cache and is returned within a few cycles.
Cache Miss: the data must be fetched from the next level of the memory hierarchy, stalling the CPU for tens to hundreds of cycles.
When a cache miss occurs and the cache is full, some data must be evicted to make room for new data.
Common Replacement Policies:
- LRU (Least Recently Used): evict the line unused for the longest time; see the sketch below
- Pseudo-LRU: a cheaper approximation of LRU, common in real hardware
- FIFO: evict the oldest line regardless of recent use
- Random: evict an arbitrary line; simple to implement in hardware
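To make eviction concrete, here is a minimal, hardware-agnostic sketch of LRU bookkeeping for a single 4-way set (the names and the age-counter scheme are illustrative, not any specific CPU's implementation):

#include <stdint.h>
#include <stdbool.h>

#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];   // which memory block each way holds
    bool     valid[WAYS];
    uint32_t age[WAYS];   // 0 = most recently used
} cache_set_t;

// Return the way to evict: the first invalid way, else the oldest one
static int pick_victim(const cache_set_t* s) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) return w;
        if (s->age[w] > s->age[victim]) victim = w;
    }
    return victim;
}

// Look up `tag`; on a miss, replace the LRU way. Returns the way used.
static int set_access(cache_set_t* s, uint32_t tag) {
    for (int w = 0; w < WAYS; w++) {
        if (s->valid[w] && s->tag[w] == tag) {    // hit
            for (int v = 0; v < WAYS; v++) s->age[v]++;
            s->age[w] = 0;                        // promote to MRU
            return w;
        }
    }
    int w = pick_victim(s);                       // miss: evict LRU
    s->tag[w] = tag;
    s->valid[w] = true;
    for (int v = 0; v < WAYS; v++) s->age[v]++;
    s->age[w] = 0;
    return w;
}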
Modern processors have multiple levels of cache, each with different characteristics:
Cache Hierarchy:
┌─────────────────────────────────────────────────────────────┐
│                          CPU Core                           │
│        ┌─────────────────┬─────────────────┐                │
│        │    L1 Data      │ L1 Instruction  │                │
│        │     Cache       │     Cache       │                │
│        │   (32-64 KB)    │   (32-64 KB)    │                │
│        └─────────────────┴─────────────────┘                │
├─────────────────────────────────────────────────────────────┤
│                          L2 Cache                           │
│                        (256-512 KB)                         │
├─────────────────────────────────────────────────────────────┤
│                          L3 Cache                           │
│                          (4-32 MB)                          │
├─────────────────────────────────────────────────────────────┤
│                         Main Memory                         │
│                          (4-32 GB)                          │
└─────────────────────────────────────────────────────────────┘
// Cache configuration for ARM Cortex-M7 (ARMv7-M). On M-profile cores the
// cache ID registers are memory-mapped in the System Control Block (CMSIS
// exposes them as SCB->CTR, SCB->CSSELR and SCB->CCSIDR); CP15 MRC access
// is an A/R-profile mechanism.
#define SCB_CTR    (*(volatile uint32_t*)0xE000ED7CUL) // Cache Type Register
#define SCB_CCSIDR (*(volatile uint32_t*)0xE000ED80UL) // Cache Size ID Register
#define SCB_CSSELR (*(volatile uint32_t*)0xE000ED84UL) // Cache Size Selection Register
typedef struct {
    uint32_t l1_data_size;        // L1 data cache size in bytes
    uint32_t l1_instruction_size; // L1 instruction cache size in bytes
    uint32_t cache_line_size;     // Cache line size in bytes
    uint32_t associativity;       // L1 data cache associativity (ways)
} cache_config_t;
// Decode a CCSIDR value (LineSize, Associativity, NumSets fields) into bytes
static uint32_t ccsidr_to_bytes(uint32_t ccsidr) {
    uint32_t words_per_line = 1u << ((ccsidr & 0x7u) + 2);
    uint32_t ways = ((ccsidr >> 3) & 0x3FFu) + 1;
    uint32_t sets = ((ccsidr >> 13) & 0x7FFFu) + 1;
    return sets * ways * words_per_line * 4u;
}
cache_config_t get_cache_config(void) {
    cache_config_t config = {0};
    // CTR[19:16] (DminLine) = log2 of words in the smallest D-cache line
    config.cache_line_size = 4u << ((SCB_CTR >> 16) & 0xFu);
    SCB_CSSELR = 0;                              // select the L1 data cache
    __asm volatile("dsb" ::: "memory");
    uint32_t ccsidr = SCB_CCSIDR;
    config.l1_data_size = ccsidr_to_bytes(ccsidr);
    config.associativity = ((ccsidr >> 3) & 0x3FFu) + 1;
    SCB_CSSELR = 1;                              // select the L1 instruction cache
    __asm volatile("dsb" ::: "memory");
    config.l1_instruction_size = ccsidr_to_bytes(SCB_CCSIDR);
    return config;
}
// Cache line aligned data structure
#define CACHE_LINE_SIZE 64
typedef struct {
    uint8_t data[CACHE_LINE_SIZE];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_line_t;

// Array of cache-line aligned data
typedef struct {
    cache_line_t lines[100];
} cache_aligned_array_t;

// Give hot fields their own cache line
typedef struct {
    uint32_t value1;
    uint32_t value2;
    char padding[CACHE_LINE_SIZE - 8]; // pad so the structure fills exactly one line
} __attribute__((aligned(CACHE_LINE_SIZE))) separated_data_t;
Spatial Locality: The tendency to access data that is close together in memory.
Example of Spatial Locality:
┌─────────────────────────────────────────────────────────────┐
│                        Memory Array                         │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0] │ Data[1] │ Data[2] │ Data[3] │ Data[4] │  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
     →         →         →         →         →          →
            Sequential access pattern (good spatial locality)
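In code, good spatial locality is simply a sequential walk; a minimal sketch:

// One miss per 64-byte line, then hits for the remaining uint32_t
// elements on that line (assuming a 64-byte cache line)
uint32_t sum_sequential(const uint32_t* data, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}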
Temporal Locality: The tendency to access the same data repeatedly over time.
Example of Temporal Locality:
// The same 16 lut[] entries are re-read on every pass, so after the first
// pass they are served entirely from the cache (temporal locality)
for (int pass = 0; pass < 100; pass++) {
    for (int i = 0; i < 16; i++)
        sum += lut[i];
}
Hit Rate: Percentage of memory accesses that result in cache hits
Hit Rate = (Cache Hits) / (Cache Hits + Cache Misses) × 100%
Miss Rate: Percentage of memory accesses that result in cache misses
Miss Rate = (Cache Misses) / (Cache Hits + Cache Misses) × 100%
Average Memory Access Time (AMAT):
AMAT = Hit Time + Miss Rate × Miss Penalty
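Worked example (illustrative numbers): with a 2 ns hit time, a 5% miss rate, and a 100 ns miss penalty,
AMAT = 2 ns + 0.05 × 100 ns = 7 ns
so a miss rate of just 5% more than triples the average access time compared with the 2 ns hit case.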
Sequential access patterns are cache-friendly because they exhibit good spatial locality.
Sequential Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│                        Memory Layout                        │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0] │ Data[1] │ Data[2] │ Data[3] │ Data[4] │  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
     ↑         ↑         ↑         ↑         ↑          ↑
 Access 1  Access 2  Access 3  Access 4  Access 5   Access 6
Benefits:
- Every byte of a fetched cache line is eventually used
- Hardware prefetchers recognize the pattern and fetch lines ahead of use
- Only the first access to each cache line misses; the rest are hits
Random access patterns can cause cache misses and poor performance.
Random Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│                        Memory Layout                        │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0] │ Data[1] │ Data[2] │ Data[3] │ Data[4] │  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
     ↑         ↑         ↑         ↑                    ↑
 Access 3  Access 1  Access 5  Access 2             Access 4
Challenges:
- Each access can land on a different cache line, so many accesses miss
- Hardware prefetchers cannot predict the next address
- Only a few bytes of each fetched line are actually used (see the sketch below)
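For comparison with sum_sequential() above, here is a sketch of the same reduction driven through a shuffled index table (index is assumed to hold a random permutation of 0..n-1):

// Each access can land on a different cache line, so the prefetcher
// gets no usable pattern and most of every fetched line is wasted
uint32_t sum_random(const uint32_t* data, const uint32_t* index, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[index[i]];
    }
    return sum;
}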
Strided access patterns access data with a fixed stride (step size).
Strided Access Pattern (stride = 2):
┌─────────────────────────────────────────────────────────────┐
│                        Memory Layout                        │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│ Data[0] │ Data[1] │ Data[2] │ Data[3] │ Data[4] │  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
     ↑                   ↑                   ↑
 Access 1            Access 2            Access 3
Optimization Strategies:
- Keep the stride small relative to the cache line size
- Reorganize data (e.g., structure of arrays) so accessed fields are contiguous
- Use loop blocking so each fetched line is reused before it is evicted
- Prefetch upcoming elements when the stride is known in advance
Align data structures to cache line boundaries to avoid cache line splits.
// Optimize data structures for cache
typedef struct {
    uint32_t frequently_accessed;       // Hot data
    uint32_t rarely_accessed;           // Cold data
    char padding[CACHE_LINE_SIZE - 8];  // Push the next object onto a new line
} __attribute__((aligned(CACHE_LINE_SIZE))) hot_cold_separated_t;

// Array of structures (AoS) vs Structure of arrays (SoA)
// AoS - fields of one point are adjacent, so a loop over just x
// wastes two thirds of every fetched cache line
typedef struct {
    uint32_t x, y, z;
} point_aos_t;
point_aos_t points_aos[1000]; // Array of structures

// SoA - all x values are contiguous, so per-field loops are cache-friendly
typedef struct {
    uint32_t x[1000];
    uint32_t y[1000];
    uint32_t z[1000];
} points_soa_t;
points_soa_t points_soa; // Structure of arrays
Use padding to separate frequently accessed data and prevent false sharing.
// Cache line padding to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;
// Array of padded counters for multi-threaded access
padded_counter_t counters[NUM_THREADS];
// Optimized sequential access
void optimized_sequential_access(uint32_t* data, size_t size) {
    // Process data in cache-line sized blocks
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    for (size_t i = 0; i < size; i += block_size) {
        size_t end = (i + block_size < size) ? i + block_size : size;
        // Process one cache line's worth of elements
        for (size_t j = i; j < end; j++) {
            data[j] = process_data(data[j]);
        }
    }
}
// Optimized strided access
void optimized_strided_access(uint32_t* data, size_t size, size_t stride) {
    // Loop blocking: the cache lines touched by one block are reused for
    // every offset before moving on to the next block
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    for (size_t block_start = 0; block_start < size; block_start += block_size * stride) {
        size_t block_end = block_start + block_size * stride;
        if (block_end > size) block_end = size;
        for (size_t offset = 0; offset < stride; offset++) {
            for (size_t i = block_start + offset; i < block_end; i += stride) {
                data[i] = process_data(data[i]);
            }
        }
    }
}
False sharing occurs when two or more threads access different variables that happen to be on the same cache line, causing unnecessary cache invalidations.
False Sharing Example:
┌─────────────────────────────────────────────────────────────┐
│                         Cache Line                          │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│  Thread 1   │  Thread 2   │  Thread 3   │      Padding      │
│  Counter    │  Counter    │  Counter    │                   │
└─────────────┴─────────────┴─────────────┴───────────────────┘
// Prevent false sharing with padding: each counter gets its own line
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;

// Array of padded counters
padded_counter_t counters[NUM_THREADS];
Prefetching is a technique that loads data into cache before it’s needed, reducing cache misses.
// Software prefetching example
void prefetch_example(uint32_t* data, size_t size) {
    const size_t ahead = CACHE_LINE_SIZE / sizeof(uint32_t); // one line ahead
    for (size_t i = 0; i < size; i++) {
        // Hint the next cache line into cache (read access, high locality)
        if (i + ahead < size) {
            __builtin_prefetch(&data[i + ahead], 0, 3);
        }
        // Process current data
        data[i] = process_data(data[i]);
    }
}
// Cache clean ("flush") and invalidate helpers for Cortex-M7, using the
// memory-mapped cache maintenance registers defined by ARMv7-M (CMSIS
// wraps the same operations as SCB_CleanDCache_by_Addr() and
// SCB_InvalidateDCache_by_Addr())
#define SCB_DCCMVAC (*(volatile uint32_t*)0xE000EF68UL) // Clean D-line by address
#define SCB_DCIMVAC (*(volatile uint32_t*)0xE000EF5CUL) // Invalidate D-line by address

void cache_flush(void* addr, size_t size) {
    // Clean (write back) every cache line overlapping [addr, addr + size)
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    __asm volatile("dsb" ::: "memory");
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        SCB_DCCMVAC = (uint32_t)line;
    }
    __asm volatile("dsb" ::: "memory");
}

void cache_invalidate(void* addr, size_t size) {
    // Discard cached copies so the next read fetches from memory (e.g.,
    // after a DMA write). The buffer should be cache-line aligned and a
    // multiple of the line size, or unrelated data sharing the edge lines
    // will be lost.
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    __asm volatile("dsb" ::: "memory");
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        SCB_DCIMVAC = (uint32_t)line;
    }
    __asm volatile("dsb" ::: "memory");
}
In multi-core systems, each core has its own cache, and cache coherency protocols ensure data consistency.
// Multi-core cache-aware data structure: each core gets exactly one
// cache line, so cores never contend for the same line
typedef struct {
    uint32_t data[NUM_CORES][CACHE_LINE_SIZE / sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_aligned_data_t;
// Cache performance profiling
void profile_cache_performance(void) {
    // Start profiling
    uint64_t start_cycles = __builtin_readcyclecounter();
    // Perform cache-intensive operation
    cache_intensive_operation();
    // End profiling
    uint64_t end_cycles = __builtin_readcyclecounter();
    uint64_t cycles = end_cycles - start_cycles;
    printf("Operation took %llu cycles\n", (unsigned long long)cycles);
}
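Note that __builtin_readcyclecounter() is a compiler builtin that is not wired up on every embedded target. On Cortex-M7 the same measurement can be taken from the DWT cycle counter; a minimal sketch using the architecturally defined register addresses:

#define DEMCR      (*(volatile uint32_t*)0xE000EDFCUL) // Debug Exception & Monitor Control
#define DWT_CTRL   (*(volatile uint32_t*)0xE0001000UL)
#define DWT_CYCCNT (*(volatile uint32_t*)0xE0001004UL)

static void cycle_counter_init(void) {
    DEMCR |= (1u << 24);   // TRCENA: enable the DWT unit
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;        // CYCCNTENA: start counting cycles
}

static inline uint32_t cycles_now(void) {
    return DWT_CYCCNT;     // wraps every 2^32 cycles
}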
Problem: Data structures are not aligned to cache line boundaries.
Solution: Use cache line alignment and padding.
// Incorrect: not aligned; may straddle two cache lines
typedef struct {
    uint32_t a, b, c;
} unaligned_struct_t;

// Correct: aligned to a cache line
typedef struct {
    uint32_t a, b, c;
    char padding[CACHE_LINE_SIZE - 12];
} __attribute__((aligned(CACHE_LINE_SIZE))) aligned_struct_t;
Problem: Multiple threads access different variables on the same cache line.
Solution: Use padding and alignment.
// Incorrect: potential false sharing
uint32_t counters[NUM_THREADS];

// Correct: padded and aligned to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;
padded_counter_t counters[NUM_THREADS];
Problem: Random or strided access patterns.
Solution: Optimize data layout and access patterns.
// Incorrect: poor access pattern
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[j][i] = process(data[j][i]); // column-wise walk through row-major storage
    }
}

// Correct: better access pattern
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[i][j] = process(data[i][j]); // row-major access, contiguous in memory
    }
}
Problem: Assuming the cache is larger than it actually is.
Solution: Query the cache configuration and design accordingly.
// Query cache configuration
cache_config_t config = get_cache_config();
printf("L1 Data Cache: %u bytes\n", (unsigned)config.l1_data_size);
printf("Cache Line Size: %u bytes\n", (unsigned)config.cache_line_size);
Next Steps: Explore Memory Management to understand memory allocation strategies, or dive into Performance Optimization for broader performance techniques.