Cache-Aware Programming

🎯 Overview

Cache-aware programming optimizes code to work efficiently with CPU cache memory, improving performance by reducing cache misses and maximizing cache utilization. In embedded systems, understanding cache behavior is crucial for real-time performance and power efficiency.

Key Concepts for Embedded Development

🤔 What is Cache-Aware Programming?

Cache-aware programming is a technique that optimizes code to work efficiently with the CPU’s cache memory hierarchy. It involves understanding how caches work and designing algorithms and data structures to minimize cache misses and maximize cache hits.

Core Principles

  1. Spatial Locality: Accessing data that is close together in memory
  2. Temporal Locality: Reusing recently accessed data
  3. Cache Line Awareness: Understanding cache line boundaries and alignment
  4. Memory Access Patterns: Optimizing sequential vs. random access patterns
  5. Data Structure Design: Designing structures for cache-friendly access

Why Cache Performance Matters

Modern CPUs are much faster than memory. The cache acts as a high-speed buffer between the CPU and main memory, significantly reducing memory access latency.

Memory Hierarchy Performance:
┌─────────────────┬─────────────┬─────────────────┐
│     Level       │   Latency   │     Size        │
├─────────────────┼─────────────┼─────────────────┤
│   L1 Cache      │   1-3 ns    │   32-64 KB      │
│   L2 Cache      │   10-20 ns  │   256-512 KB    │
│   L3 Cache      │   40-80 ns  │   4-32 MB       │
│   Main Memory   │   100-300 ns│   4-32 GB       │
│   Disk/Flash    │   10-100 μs │   100+ GB       │
└─────────────────┴─────────────┴─────────────────┘
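
A quick way to feel this difference is to time the same number of element accesses performed with good and with poor locality. Below is a minimal host-side sketch; the buffer size, the stride, and the use of clock() are illustrative assumptions (an embedded target would use a hardware cycle counter instead):

// Minimal sketch: time the same number of element accesses with good
// (sequential) and poor (large-stride) locality. Buffer size, stride, and
// clock() are illustrative assumptions.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_WORDS (4U * 1024U * 1024U)   // 16 MB of uint32_t, larger than the caches
#define STRIDE    16U                     // 16 * 4 bytes = one 64-byte line per access

int main(void) {
    uint32_t* buf = calloc(BUF_WORDS, sizeof(uint32_t));
    volatile uint32_t sum = 0;

    clock_t t0 = clock();
    for (size_t i = 0; i < BUF_WORDS; i++)           // sequential: ~1 miss per line
        sum += buf[i];

    clock_t t1 = clock();
    for (size_t s = 0; s < STRIDE; s++)              // strided passes: ~1 miss per access
        for (size_t i = s; i < BUF_WORDS; i += STRIDE)
            sum += buf[i];

    clock_t t2 = clock();
    printf("sequential: %ld ticks, strided: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));
    free(buf);
    return 0;
}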

🎯 Why is Cache Performance Important?

Performance Impact

  1. Speed: Cache hits are 10-100x faster than memory accesses
  2. Power Efficiency: Cache accesses consume less power than memory accesses
  3. Real-time Performance: Predictable cache behavior improves real-time performance
  4. Scalability: Cache-efficient code scales better with larger datasets

Real-world Impact

When Cache Optimization Matters

High Impact Scenarios:

    • Tight loops over large arrays or buffers (DSP, image processing, matrix math)
    • Working sets larger than the last-level cache
    • Multi-threaded code sharing data structures (risk of false sharing)
    • Hard real-time paths where worst-case memory latency matters

Low Impact Scenarios:

    • Working sets that already fit comfortably in L1/L2 cache
    • Code dominated by I/O, peripheral, or network latency
    • Rarely executed initialization or configuration code

🧠 How CPU Caches Work

Cache Basics

A cache is a small, fast memory that stores frequently accessed data. When the CPU needs data, it first checks the cache. If the data is found (cache hit), it’s retrieved quickly. If not found (cache miss), the data must be fetched from slower memory.

Cache Organization

Cache Structure:
┌─────────────────────────────────────────────────────────────┐
│                    Cache Memory                             │
├─────────────────┬─────────────────┬─────────────────┬───────┤
│   Cache Line 0  │   Cache Line 1  │   Cache Line 2  │  ...  │
│  ┌─────────────┐│  ┌─────────────┐│  ┌─────────────┐│       │
│  │   Tag       ││  │   Tag       ││  │   Tag       ││       │
│  │   Data      ││  │   Data      ││  │   Data      ││       │
│  │   Valid     ││  │   Valid     ││  │   Valid     ││       │
│  └─────────────┘│  └─────────────┘│  └─────────────┘│       │
└─────────────────┴─────────────────┴─────────────────┴───────┘
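
The tag in the diagram above comes from splitting each memory address into an offset, an index, and a tag. A minimal sketch, assuming a 32 KB direct-mapped cache with 64-byte lines (both sizes are assumptions):

#include <stdint.h>

#define LINE_SIZE  64U
#define NUM_SETS   ((32U * 1024U) / LINE_SIZE)   // 512 sets for a direct-mapped cache

typedef struct {
    uint32_t offset;   // byte position within the cache line
    uint32_t index;    // which set the line is placed in
    uint32_t tag;      // remaining high bits, stored and compared on lookup
} addr_fields_t;

addr_fields_t split_address(uint32_t addr) {
    addr_fields_t f;
    f.offset = addr % LINE_SIZE;
    f.index  = (addr / LINE_SIZE) % NUM_SETS;
    f.tag    = addr / (LINE_SIZE * NUM_SETS);
    return f;
}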

Cache Line Concept

A cache line is the smallest unit of data that can be transferred between cache and memory. Typical cache line sizes are 32, 64, or 128 bytes.

Cache Line Structure:
┌─────────────────────────────────────────────────────────────┐
│                    Cache Line (64 bytes)                   │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│  Byte 0 │  Byte 1 │  Byte 2 │  ...    │  Byte 62│  Byte 63  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘

Cache Hit vs Cache Miss

Cache Hit:

  1. CPU requests data at address X
  2. Cache controller checks if data is in cache
  3. Data found in cache (hit)
  4. Data returned to CPU quickly (1-3 cycles)

Cache Miss:

  1. CPU requests data at address X
  2. Cache controller checks if data is in cache
  3. Data not found in cache (miss)
  4. Cache line containing address X is fetched from memory
  5. Data is stored in cache and returned to CPU (100+ cycles)

Cache Replacement Policies

When a cache miss occurs and the cache is full, some data must be evicted to make room for new data.

Common Replacement Policies:

  1. LRU (Least Recently Used): Evict the least recently accessed data (see the sketch after this list)
  2. FIFO (First In, First Out): Evict oldest data
  3. Random: Randomly select data to evict
  4. LFU (Least Frequently Used): Evict least frequently accessed data
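
As a concrete illustration of the first policy, here is a minimal sketch of LRU bookkeeping for a single 4-way cache set. The structure names and age counters are illustrative; real hardware typically uses pseudo-LRU bits instead.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    uint32_t tag;
    bool     valid;
    uint32_t age;     // 0 = most recently used
} cache_way_t;

typedef struct {
    cache_way_t way[WAYS];
} cache_set_t;

// Returns true on a hit; on a miss the least recently used way is replaced.
bool set_access(cache_set_t* set, uint32_t tag) {
    int victim = 0;

    for (int i = 0; i < WAYS; i++) {
        if (set->way[i].valid && set->way[i].tag == tag) {
            // Hit: this way becomes most recently used, everything else ages
            for (int j = 0; j < WAYS; j++)
                if (set->way[j].valid) set->way[j].age++;
            set->way[i].age = 0;
            return true;
        }
        // Track the first invalid way, or the oldest valid way, as the victim
        if (!set->way[i].valid ||
            (set->way[victim].valid && set->way[i].age > set->way[victim].age))
            victim = i;
    }

    // Miss: evict the LRU (or an invalid) way and install the new line
    for (int j = 0; j < WAYS; j++)
        if (set->way[j].valid) set->way[j].age++;
    set->way[victim].tag   = tag;
    set->way[victim].valid = true;
    set->way[victim].age   = 0;
    return false;
}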

🏗️ Cache Architecture

Cache Hierarchy

Modern processors have multiple levels of cache, each with different characteristics:

Cache Hierarchy:
┌─────────────────────────────────────────────────────────────┐
│                    CPU Core                                 │
│  ┌──────────────────────────┬──────────────────────────┐   │
│  │   L1 Data Cache          │   L1 Instruction Cache   │   │
│  │   (32-64 KB)             │   (32-64 KB)             │   │
│  └──────────────────────────┴──────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│                    L2 Cache                                │
│                  (256-512 KB)                              │
├─────────────────────────────────────────────────────────────┤
│                    L3 Cache                                │
│                   (4-32 MB)                                │
├─────────────────────────────────────────────────────────────┤
│                    Main Memory                             │
│                   (4-32 GB)                                │
└─────────────────────────────────────────────────────────────┘

Cache Configuration

// Cache configuration query for ARM Cortex-M7 (ARMv7-M memory-mapped SCB registers)
typedef struct {
    uint32_t l1_data_size;         // L1 data cache size in bytes
    uint32_t l1_instruction_size;  // L1 instruction cache size in bytes
    uint32_t l2_size;              // L2 cache size (0: Cortex-M7 has no integrated L2)
    uint32_t cache_line_size;      // Cache line size in bytes
    uint32_t associativity;        // L1 data cache associativity (ways)
} cache_config_t;

// Decode the size in bytes of the cache currently selected via CSSELR
static uint32_t ccsidr_size_bytes(uint32_t ccsidr) {
    uint32_t line_words = 1U << (((ccsidr >> 0) & 0x7U) + 2U);   // words per line
    uint32_t ways       = ((ccsidr >> 3)  & 0x3FFU)  + 1U;
    uint32_t sets       = ((ccsidr >> 13) & 0x7FFFU) + 1U;
    return sets * ways * line_words * 4U;
}

cache_config_t get_cache_config(void) {
    volatile uint32_t* const CTR    = (volatile uint32_t*)0xE000ED7CU;  // Cache Type Register
    volatile uint32_t* const CCSIDR = (volatile uint32_t*)0xE000ED80U;  // Cache Size ID Register
    volatile uint32_t* const CSSELR = (volatile uint32_t*)0xE000ED84U;  // Cache Size Selection Register
    cache_config_t config = {0};

    // CTR.DminLine (bits 19:16) holds log2(words) of the smallest data cache line
    config.cache_line_size = 4U << ((*CTR >> 16) & 0xFU);

    *CSSELR = 0U;                               // select L1 data cache
    __asm volatile ("dsb" ::: "memory");
    config.l1_data_size  = ccsidr_size_bytes(*CCSIDR);
    config.associativity = ((*CCSIDR >> 3) & 0x3FFU) + 1U;

    *CSSELR = 1U;                               // select L1 instruction cache
    __asm volatile ("dsb" ::: "memory");
    config.l1_instruction_size = ccsidr_size_bytes(*CCSIDR);

    return config;
}

Cache Line Structure

// Cache line aligned data structure
#define CACHE_LINE_SIZE 64

typedef struct {
    uint8_t data[CACHE_LINE_SIZE];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_line_t;

// Array of cache-line aligned data
typedef struct {
    cache_line_t lines[100];
} cache_aligned_array_t;

// Ensure data fits in cache lines
typedef struct {
    uint32_t value1;
    uint32_t value2;
    char padding[CACHE_LINE_SIZE - 8];  // Padding to next cache line
} __attribute__((aligned(CACHE_LINE_SIZE))) separated_data_t;

📊 Cache Performance Concepts

Locality Principles

Spatial Locality: The tendency to access data that is close together in memory.

Example of Spatial Locality:
┌─────────────────────────────────────────────────────────────┐
│                    Memory Array                             │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│  Data[0]│  Data[1]│  Data[2]│  Data[3]│  Data[4]│  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
         ↑
    Sequential access pattern
    (Good spatial locality)
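
A minimal code illustration, assuming data is an array of N uint32_t elements:

// Sequential traversal: after the miss that loads a 64-byte line, the next
// 15 uint32_t accesses in that line are cache hits.
uint32_t sum = 0;
for (size_t i = 0; i < N; i++) {
    sum += data[i];
}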

Temporal Locality: The tendency to access the same data repeatedly over time.

Example of Temporal Locality:
// The same small lookup table is reused on every pass, so after the first
// pass its cache lines stay resident and later accesses hit.
for (int pass = 0; pass < 1000; pass++) {
    for (int i = 0; i < 64; i++) {
        sum += table[i];  // table[i] is accessed again on every pass
    }
}

Cache Miss Types

  1. Compulsory Misses: First access to data (unavoidable)
  2. Capacity Misses: Cache is too small to hold all needed data
  3. Conflict Misses: Multiple data items map to the same cache location (see the sketch after this list)
  4. Coherence Misses: Cache invalidation in multi-core systems
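
A hedged illustration of a conflict miss, assuming a direct-mapped 32 KB data cache and that the linker places the two arrays back to back (both are assumptions):

#include <stdint.h>
#include <stddef.h>

#define ASSUMED_CACHE_BYTES (32U * 1024U)   // assumed direct-mapped cache size

static uint8_t a[ASSUMED_CACHE_BYTES];
static uint8_t b[ASSUMED_CACHE_BYTES];      // assumed to sit 32 KB after 'a' in RAM

uint32_t conflict_demo(void) {
    uint32_t sum = 0;
    for (size_t i = 0; i < ASSUMED_CACHE_BYTES; i++) {
        // a[i] and b[i] are 32 KB apart, so they map to the same cache set and
        // evict each other: roughly one miss per access instead of one per line
        sum += a[i] + b[i];
    }
    return sum;
}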

Cache Performance Metrics

Hit Rate: Percentage of memory accesses that result in cache hits

Hit Rate = (Cache Hits) / (Cache Hits + Cache Misses) × 100%

Miss Rate: Percentage of memory accesses that result in cache misses

Miss Rate = (Cache Misses) / (Cache Hits + Cache Misses) × 100%

Average Memory Access Time (AMAT):

AMAT = Hit Time + Miss Rate × Miss Penalty
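
For example, assuming a 2 ns hit time, a 5% miss rate, and a 100 ns miss penalty:

AMAT = 2 ns + 0.05 × 100 ns = 7 ns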

🔄 Memory Access Patterns

Sequential Access

Sequential access patterns are cache-friendly because they exhibit good spatial locality.

Sequential Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│                    Memory Layout                            │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│  Data[0]│  Data[1]│  Data[2]│  Data[3]│  Data[4]│  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
    ↑         ↑         ↑         ↑         ↑         ↑
   Access 1  Access 2  Access 3  Access 4  Access 5  Access 6

Benefits:

    • One memory fetch brings in a whole cache line, so the following accesses hit
    • Hardware prefetchers detect the pattern and fetch ahead of the loop
    • Bandwidth is used efficiently and access timing stays predictable

Random Access

Random access patterns can cause cache misses and poor performance.

Random Access Pattern:
┌─────────────────────────────────────────────────────────────┐
│                    Memory Layout                            │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│  Data[0]│  Data[1]│  Data[2]│  Data[3]│  Data[4]│  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
    ↑               ↑         ↑               ↑         ↑
   Access 1       Access 2   Access 3       Access 4   Access 5

Challenges:

    • Little spatial locality, so a large fraction of accesses miss the cache
    • Hardware prefetchers cannot predict the next address
    • Each miss fetches a full line of which only a few bytes are used

Strided Access

Strided access patterns access data with a fixed stride (step size).

Strided Access Pattern (stride = 2):
┌─────────────────────────────────────────────────────────────┐
│                    Memory Layout                            │
├─────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│  Data[0]│  Data[1]│  Data[2]│  Data[3]│  Data[4]│  Data[5]  │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┘
    ↑               ↑               ↑
   Access 1       Access 2       Access 3

Optimization Strategies:

    • Reorganize data to shrink the stride (e.g., structure of arrays)
    • Process the data in cache-sized blocks (see the strided access code below)
    • Prefetch upcoming elements when the stride is known

🔧 Cache Optimization Techniques

Data Structure Alignment

Align data structures to cache line boundaries to avoid cache line splits; the hot/cold separation and AoS vs. SoA examples in the Implementation section below show this in code.

Cache Line Padding

Use padding to separate frequently accessed data and prevent false sharing; see the padded counter example in the Implementation section below.

Memory Access Optimization

  1. Loop Optimization: Optimize loops for cache-friendly access patterns
  2. Data Layout: Arrange data for sequential access
  3. Prefetching: Prefetch data before it’s needed
  4. Blocking: Process data in cache-sized blocks

🔧 Implementation

Cache Line Optimization

Data Structure Alignment

// Optimize data structures for cache
typedef struct {
    uint32_t frequently_accessed;  // Hot data
    uint32_t rarely_accessed;      // Cold data
    char padding[CACHE_LINE_SIZE - 8];  // Fill out the line so the next object starts on a new cache line
} __attribute__((aligned(CACHE_LINE_SIZE))) hot_cold_separated_t;

// Array of structures (AoS) vs Structure of arrays (SoA)
// AoS - may cause cache misses
typedef struct {
    uint32_t x, y, z;
} point_aos_t;

point_aos_t points_aos[1000];  // Array of structures

// SoA - better cache locality
typedef struct {
    uint32_t x[1000];
    uint32_t y[1000];
    uint32_t z[1000];
} points_soa_t;

points_soa_t points_soa;  // Structure of arrays

Cache Line Padding

// Cache line padding to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;

// Array of padded counters for multi-threaded access
padded_counter_t counters[NUM_THREADS];

Memory Access Patterns

Sequential Access Optimization

// Optimized sequential access
void optimized_sequential_access(uint32_t* data, size_t size) {
    // Process data in cache-line sized blocks
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    
    for (size_t i = 0; i < size; i += block_size) {
        size_t end = (i + block_size < size) ? i + block_size : size;
        
        // Process block
        for (size_t j = i; j < end; j++) {
            data[j] = process_data(data[j]);
        }
    }
}

Strided Access Optimization

// Optimized strided access using blocking
void optimized_strided_access(uint32_t* data, size_t size, size_t stride) {
    // Process the array block by block so each block stays cache-resident
    const size_t block_size = CACHE_LINE_SIZE / sizeof(uint32_t);
    const size_t block_span = block_size * stride;   // elements covered by one block

    for (size_t block_start = 0; block_start < size; block_start += block_span) {
        size_t block_end = (block_start + block_span < size) ? block_start + block_span : size;

        // Visit every stride offset within the block before moving to the next block
        for (size_t offset = 0; offset < stride; offset++) {
            for (size_t i = block_start + offset; i < block_end; i += stride) {
                data[i] = process_data(data[i]);
            }
        }
    }
}

🔒 False Sharing Prevention

What is False Sharing?

False sharing occurs when two or more threads access different variables that happen to be on the same cache line, causing unnecessary cache invalidations.

False Sharing Example:
┌─────────────────────────────────────────────────────────────┐
│                    Cache Line                               │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│  Thread 1   │  Thread 2   │  Thread 3   │     Padding      │
│   Counter   │   Counter   │   Counter   │                  │
└─────────────┴─────────────┴─────────────┴───────────────────┘

Prevention Techniques

  1. Cache Line Padding: Add padding to separate variables
  2. Cache Line Alignment: Align structures to cache line boundaries
  3. Data Layout: Arrange data to minimize false sharing
  4. Thread-Local Storage: Use thread-local variables

// Prevent false sharing with padding
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) padded_counter_t;

// Array of padded counters
padded_counter_t counters[NUM_THREADS];
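
A minimal usage sketch of the padded counters above, assuming POSIX threads (the iteration count is arbitrary):

// Each thread increments only its own padded counter, so no two threads
// ever write to the same cache line.
#include <pthread.h>

void* worker(void* arg) {
    padded_counter_t* my_counter = (padded_counter_t*)arg;
    for (int i = 0; i < 1000000; i++) {
        my_counter->counter++;   // private cache line: no false sharing
    }
    return NULL;
}

void run_counters(void) {
    pthread_t threads[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_create(&threads[t], NULL, worker, &counters[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
}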

🔄 Cache Prefetching

What is Prefetching?

Prefetching is a technique that loads data into cache before it’s needed, reducing cache misses.

Types of Prefetching

  1. Hardware Prefetching: Automatic prefetching by the CPU
  2. Software Prefetching: Explicit prefetching by the programmer
  3. Compiler Prefetching: Automatic prefetching by the compiler

Software Prefetching

// Software prefetching example
void prefetch_example(uint32_t* data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        // Prefetch next cache line
        if (i + CACHE_LINE_SIZE/sizeof(uint32_t) < size) {
            __builtin_prefetch(&data[i + CACHE_LINE_SIZE/sizeof(uint32_t)], 0, 3);
        }
        
        // Process current data
        data[i] = process_data(data[i]);
    }
}

🔄 Cache Flushing and Invalidation

When to Flush/Invalidate

  1. DMA Operations: Before/after DMA transfers
  2. Multi-core Systems: When sharing data between cores
  3. I/O Operations: Before/after I/O operations
  4. Security: When clearing sensitive data

Implementation

// Cache maintenance by address for ARM Cortex-M7 (ARMv7-M memory-mapped registers)
#define SCB_DCCMVAC  (*(volatile uint32_t*)0xE000EF68U)  // Clean D-cache line by address
#define SCB_DCIMVAC  (*(volatile uint32_t*)0xE000EF5CU)  // Invalidate D-cache line by address

void cache_flush(void* addr, size_t size) {
    // Clean (write back) every cache line that overlaps the address range
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end   = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);

    __asm volatile ("dsb" ::: "memory");
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        SCB_DCCMVAC = (uint32_t)line;
    }
    __asm volatile ("dsb" ::: "memory");
    __asm volatile ("isb" ::: "memory");
}

void cache_invalidate(void* addr, size_t size) {
    // Invalidate (discard) every cache line that overlaps the address range.
    // Only use on buffers that start and end on cache line boundaries, or
    // unrelated data sharing the first/last line is discarded as well.
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end   = ((uintptr_t)addr + size + CACHE_LINE_SIZE - 1) & ~(uintptr_t)(CACHE_LINE_SIZE - 1);

    __asm volatile ("dsb" ::: "memory");
    for (uintptr_t line = start; line < end; line += CACHE_LINE_SIZE) {
        SCB_DCIMVAC = (uint32_t)line;
    }
    __asm volatile ("dsb" ::: "memory");
    __asm volatile ("isb" ::: "memory");
}
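
A hedged usage sketch around a DMA receive into a cached buffer; dma_start_rx(), dma_wait_done(), and process_packet() are placeholder names for whatever the real driver and application provide:

#define DMA_BUF_SIZE 512

static uint8_t dma_rx_buf[DMA_BUF_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)));

void receive_via_dma(void) {
    // Clean the buffer first so no dirty line is written back over the
    // incoming DMA data while the transfer is in progress
    cache_flush(dma_rx_buf, DMA_BUF_SIZE);

    dma_start_rx(dma_rx_buf, DMA_BUF_SIZE);   // placeholder driver call
    dma_wait_done();                          // placeholder driver call

    // Discard stale cached copies so the CPU reads the fresh DMA data
    cache_invalidate(dma_rx_buf, DMA_BUF_SIZE);

    process_packet(dma_rx_buf, DMA_BUF_SIZE); // placeholder consumer
}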

🔄 Multi-core Cache Considerations

Cache Coherency

In multi-core systems, each core has its own cache, and cache coherency protocols ensure data consistency.

Cache Coherency Protocols

  1. MESI Protocol: Modified, Exclusive, Shared, Invalid
  2. MOESI Protocol: Modified, Owned, Exclusive, Shared, Invalid
  3. MSI Protocol: Modified, Shared, Invalid

Multi-core Optimization

  1. False Sharing Prevention: Use padding and alignment
  2. Cache-Aware Data Layout: Arrange data for minimal contention
  3. Thread Affinity: Bind threads to specific cores (see the affinity sketch after the code below)
  4. NUMA Awareness: Consider NUMA architecture and memory placement

// Multi-core cache-aware data structure
typedef struct {
    uint32_t data[NUM_CORES][CACHE_LINE_SIZE/sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_SIZE))) cache_aligned_data_t;
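
A minimal, Linux-specific sketch for item 3 above (thread affinity), using the glibc pthread_setaffinity_np() extension:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core so its working set stays in that core's
// cache. Returns 0 on success, an error number otherwise.
int pin_current_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}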

📊 Performance Profiling

Cache Performance Metrics

  1. Cache Hit Rate: Percentage of cache hits
  2. Cache Miss Rate: Percentage of cache misses
  3. Cache Miss Types: Compulsory, capacity, conflict misses
  4. Memory Bandwidth: Data transfer rate

Profiling Tools

  1. Hardware Counters: CPU performance counters
  2. Cachegrind: Cache simulation tool
  3. perf: Linux performance analysis tool
  4. Intel VTune: Intel performance profiler

Profiling Example

// Cache performance profiling
void profile_cache_performance(void) {
    // Start profiling
    uint64_t start_cycles = __builtin_readcyclecounter();
    
    // Perform cache-intensive operation
    cache_intensive_operation();
    
    // End profiling
    uint64_t end_cycles = __builtin_readcyclecounter();
    uint64_t cycles = end_cycles - start_cycles;
    
    printf("Operation took %llu cycles\n", cycles);
}

⚠️ Common Pitfalls

1. Ignoring Cache Line Boundaries

Problem: Data structures not aligned to cache lines
Solution: Use cache line alignment and padding

// Incorrect: Not aligned
typedef struct {
    uint32_t a, b, c;
} unaligned_struct_t;

// Correct: Aligned to cache line
typedef struct {
    uint32_t a, b, c;
    char padding[CACHE_LINE_SIZE - 12];
} __attribute__((aligned(CACHE_LINE_SIZE))) aligned_struct_t;

2. False Sharing

Problem: Multiple threads accessing different variables on the same cache line
Solution: Use padding and alignment

// Incorrect: Potential false sharing
uint32_t counters[NUM_THREADS];

// Correct: Padded to prevent false sharing
typedef struct {
    uint32_t counter;
    char padding[CACHE_LINE_SIZE - sizeof(uint32_t)];
} padded_counter_t;

padded_counter_t counters[NUM_THREADS];

3. Poor Memory Access Patterns

Problem: Random or strided access patterns
Solution: Optimize data layout and access patterns

// Incorrect: Poor access pattern
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[j][i] = process(data[j][i]);  // Column-major access
    }
}

// Correct: Better access pattern
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[i][j] = process(data[i][j]);  // Row-major access
    }
}

4. Ignoring Cache Size

Problem: Assuming the cache is larger than it actually is
Solution: Query the cache configuration and design accordingly

// Query cache configuration
cache_config_t config = get_cache_config();
printf("L1 Data Cache: %u KB\n", config.l1_data_size);
printf("Cache Line Size: %u bytes\n", config.cache_line_size);

✅ Best Practices

1. Data Structure Design

    • Align hot structures to cache line boundaries; keep hot and cold fields apart
    • Prefer structure-of-arrays layouts when processing one field across many items
    • Pad per-thread or per-core data so each piece owns its own cache line

2. Memory Access Patterns

    • Favor sequential, row-major traversal and process data in cache-sized blocks
    • Prefetch upcoming data when the access pattern is known in advance
    • Avoid large strides and pointer-chasing in hot loops

3. Multi-core Considerations

    • Prevent false sharing with padding and alignment
    • Flush or invalidate caches around DMA transfers and inter-core shared buffers
    • Consider thread affinity so a thread's working set stays in one core's cache

4. Performance Monitoring

    • Measure hit/miss rates with hardware counters or tools such as perf and Cachegrind
    • Profile before and after each change; keep only optimizations that measurably help

5. Code Organization

    • Keep hot loops compact so they fit in the instruction cache
    • Group frequently used data and functions together
    • Document cache assumptions (line size, cache sizes) next to the code that relies on them

🎯 Interview Questions

Basic Questions

  1. What is cache-aware programming?
    • Technique to optimize code for CPU cache efficiency
    • Focuses on spatial and temporal locality
    • Reduces cache misses and improves performance
  2. What are the different types of cache misses?
    • Compulsory misses: First access to data
    • Capacity misses: Cache too small
    • Conflict misses: Multiple data items map to same location
    • Coherence misses: Cache invalidation in multi-core systems
  3. What is false sharing and how do you prevent it?
    • Multiple threads accessing different variables on same cache line
    • Causes unnecessary cache invalidations
    • Prevent with padding and alignment

Advanced Questions

  1. How would you optimize a matrix multiplication for cache performance?
    • Use blocking/tiling techniques (see the blocked multiplication sketch after this list)
    • Optimize memory access patterns
    • Consider cache line alignment
    • Use cache-aware data structures
  2. How would you design a cache-efficient hash table?
    • Use cache-line aligned buckets
    • Optimize for sequential access
    • Consider cache-friendly collision resolution
    • Use appropriate data structures
  3. How would you profile cache performance in an embedded system?
    • Use hardware performance counters
    • Implement cache simulation
    • Monitor cache hit/miss rates
    • Use profiling tools
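
A hedged sketch for question 1 above: cache-blocked (tiled) matrix multiplication. The matrix size N and block size B are illustrative assumptions; B is chosen so three B × B float tiles (about 12 KB) fit in a typical L1 data cache, and c must be zeroed by the caller.

#define N 256   // matrix dimension (assumed, must be a multiple of B)
#define B 32    // tile size (assumed)

void matmul_blocked(const float a[N][N], const float b[N][N], float c[N][N]) {
    // Caller is expected to zero c[][] before calling
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                // Multiply one B x B tile: the a-, b-, and c-tiles stay cache-resident
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        float aik = a[i][k];
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}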

Implementation Questions

  1. Write a cache-efficient matrix multiplication algorithm
  2. Implement a cache-aware hash table
  3. Design a cache-friendly data structure for multi-threaded access
  4. Write code to detect and prevent false sharing


Next Steps: Explore Memory Management to understand memory allocation strategies, or dive into Performance Optimization for broader performance techniques.