The Embedded New Testament

The "Holy Bible" for embedded engineers


Project maintained by theEmbeddedGeorge Hosted on GitHub Pages — Theme by mattgraham

Memory Ordering

Understanding Memory Barriers and Atomic Operations
Comprehensive coverage of memory ordering models, synchronization primitives, and concurrent programming


📋 Table of Contents


🏗️ Memory Ordering Fundamentals

What is Memory Ordering?

Memory ordering refers to the rules that govern how memory operations from different threads or cores appear to execute relative to each other. In modern multi-core systems, memory operations can be reordered by hardware and software optimizations, making the actual execution order different from the program order.

Understanding memory ordering is crucial for writing correct concurrent programs, as the behavior of multi-threaded applications depends on the memory ordering guarantees provided by the underlying hardware and programming language.

Memory Ordering Philosophy

Memory ordering embodies the principle of relaxed consistency, where the hardware and compiler are allowed to reorder operations for performance optimization, as long as the reordering doesn’t violate the specified memory ordering constraints.

This approach provides several benefits:

  1. Performance: Allows hardware and compiler optimizations
  2. Scalability: Enables better performance in multi-core systems
  3. Flexibility: Provides different ordering guarantees for different use cases
  4. Efficiency: Minimizes unnecessary synchronization overhead

Why Memory Ordering Matters

In single-threaded programs, memory operations appear to execute in program order. However, in multi-threaded programs, this assumption can lead to subtle bugs:

Memory Ordering Problem Example:
┌─────────────────────────────────────────────────────────────────┐
│  Thread 1                    Thread 2                          │
│  ┌─────────────────────────┐ ┌─────────────────────────────┐   │
│  │ x = 1;                  │ │ while (flag == 0) { }       │   │
│  │ flag = 1;               │ │ print(x);                   │   │
│  └─────────────────────────┘ └─────────────────────────────┘   │
│                                                                 │
│  Problem: Without proper memory ordering, Thread 2 might      │
│  see flag = 1 before x = 1, printing an uninitialized value   │
└─────────────────────────────────────────────────────────────────┘

Memory Reordering Types

Memory operations can be reordered in several ways:

  1. Compiler Reordering: Compiler optimizations that reorder operations
  2. Hardware Reordering: CPU out-of-order execution and memory subsystem reordering
  3. Cache Coherency: Different cores seeing operations in different orders
  4. Memory Controller Reordering: Memory controller optimizing access patterns

📊 Memory Ordering Models

Sequential Consistency

Sequential consistency is the strongest memory ordering model, requiring that all memory operations appear to execute in a single sequential order that respects the program order of each thread.

Sequential Consistency Example:
┌─────────────────────────────────────────────────────────────────┐
│  Thread 1:        Thread 2:                                    │
│  x = 1;           y = 1;                                       │
│  r1 = y;          r2 = x;                                       │
│                                                                 │
│  Possible outcomes under sequential consistency:                │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ x    │ y    │ r1   │ r2   │  Description                    │ │
│  │ 1    │ 1    │ 0    │ 0    │  Both reads see initial values  │ │
│  │ 1    │ 1    │ 1    │ 0    │  Thread 1 sees Thread 2's write │ │
│  │ 1    │ 1    │ 0    │ 1    │  Thread 2 sees Thread 1's write │ │
│  │ 1    │ 1    │ 1    │ 1    │  Both see each other's writes   │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Impossible outcome:                                            │
│  x=1, y=1, r1=0, r2=0 (if both reads happen after both writes)│
└─────────────────────────────────────────────────────────────────┘

Total Store Ordering (TSO)

Total Store Ordering allows store-load reordering but maintains other ordering constraints. This model is used by x86 processors and provides a good balance between performance and programmer expectations.

TSO Memory Ordering:
┌─────────────────────────────────────────────────────────────────┐
│  TSO Constraints:                                              │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Load-Load   │ Store-Store │  Store-Load                     │ │
│  │ Ordering    │ Ordering    │  Ordering                       │ │
│  │ Maintained  │ Maintained  │  May be reordered               │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Example of allowed reordering:                                │ │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │ Original:   store A, load B                                │ │
│  │ Reordered:  load B, store A                                │ │
│  │ (Store-Load reordering allowed)                            │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Partial Store Ordering (PSO)

Partial Store Ordering allows both store-load and store-store reordering, providing more flexibility for hardware optimization but requiring more careful programming.

Weak Ordering

Weak ordering allows most reorderings but requires explicit synchronization operations to establish ordering constraints. This model provides maximum flexibility for hardware optimization.

Release Consistency

Release consistency provides synchronization at specific points (acquire and release operations) while allowing reordering between these points.


🚧 Memory Barriers and Fences

Memory Barrier Fundamentals

Memory barriers (also called fences) are instructions that enforce ordering constraints on memory operations. They prevent certain types of reordering and ensure that memory operations are visible to other threads in the expected order.

Types of Memory Barriers

Load-Load Barrier (LL)

A load-load barrier ensures that all loads before the barrier complete before any loads after the barrier begin.

Load-Load Barrier Example:
┌─────────────────────────────────────────────────────────────────┐
│  Without Barrier:                                              │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ load A      │ load B      │  load C                         │ │
│  │             │             │                                 │ │
│  │ Possible reordering: load B, load A, load C                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  With Load-Load Barrier:                                       │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ load A      │ LL Barrier  │  load B                         │ │
│  │             │             │                                 │ │
│  │ load C      │             │  load D                         │ │
│  │             │             │                                 │ │
│  │ A and C must complete before B and D begin                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Store-Store Barrier (SS)

A store-store barrier ensures that all stores before the barrier complete before any stores after the barrier begin.

Store-Store Barrier Example:
┌─────────────────────────────────────────────────────────────────┐
│  Without Barrier:                                              │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ store A     │ store B     │  store C                        │ │
│  │             │             │                                 │ │
│  │ Possible reordering: store B, store A, store C              │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  With Store-Store Barrier:                                     │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ store A     │ SS Barrier  │  store B                        │ │
│  │             │             │                                 │ │
│  │ store C     │             │  store D                        │ │
│  │             │             │                                 │ │
│  │ A and C must complete before B and D begin                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Load-Store Barrier (LS)

A load-store barrier ensures that all loads before the barrier complete before any stores after the barrier begin.

Store-Load Barrier (SL)

A store-load barrier ensures that all stores before the barrier complete before any loads after the barrier begin.

Full Memory Barrier (MF)

A full memory barrier ensures that all memory operations before the barrier complete before any memory operations after the barrier begin.

Memory Barrier Implementation

Memory barriers are implemented differently on different architectures:

x86 Memory Barriers

// x86 memory barriers
#include <immintrin.h>

void x86_memory_barriers() {
    // Compiler barrier (prevents compiler reordering)
    __asm__ __volatile__("" ::: "memory");
    
    // Full memory barrier
    _mm_mfence();
    
    // Store barrier
    _mm_sfence();
    
    // Load barrier
    _mm_lfence();
}

ARM Memory Barriers

// ARM memory barriers
void arm_memory_barriers() {
    // Full memory barrier
    __asm__ __volatile__("dmb ish" ::: "memory");
    
    // Store barrier
    __asm__ __volatile__("dmb ishst" ::: "memory");
    
    // Load barrier
    __asm__ __volatile__("dmb ishld" ::: "memory");
}

⚛️ Atomic Operations

Atomic Operation Fundamentals

Atomic operations are operations that appear to execute as a single, indivisible unit. They are essential for implementing synchronization primitives and ensuring correct behavior in concurrent programs.

Types of Atomic Operations

Read-Modify-Write Operations

Read-modify-write operations atomically read a value, modify it, and write it back:

  1. Compare-and-Swap (CAS): Atomically compare and conditionally swap
  2. Fetch-and-Add: Atomically add a value and return the old value
  3. Test-and-Set: Atomically set a value and return the old value
Atomic Compare-and-Swap:
┌─────────────────────────────────────────────────────────────────┐
│  CAS Operation:                                                │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ bool CAS(T* ptr, T expected, T desired)                    │ │
│  │ {                                                           │ │
│  │     if (*ptr == expected) {                                 │ │
│  │         *ptr = desired;                                     │ │
│  │         return true;                                        │ │
│  │     }                                                       │ │
│  │     return false;                                           │ │
│  │ }                                                           │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Usage Example:                                                │ │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │ while (!CAS(&lock, 0, 1)) {                                │ │
│  │     // Spin until lock is acquired                          │ │
│  │ }                                                           │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Load and Store Operations

Atomic load and store operations ensure that the operation is performed atomically:

// C11 atomic operations
#include <stdatomic.h>

void atomic_operations_example() {
    atomic_int counter = ATOMIC_VAR_INIT(0);
    
    // Atomic load
    int value = atomic_load(&counter);
    
    // Atomic store
    atomic_store(&counter, 42);
    
    // Atomic fetch-and-add
    int old_value = atomic_fetch_add(&counter, 10);
    
    // Atomic compare-and-swap
    int expected = 42;
    int desired = 43;
    bool success = atomic_compare_exchange_weak(&counter, &expected, desired);
}

Memory Ordering in Atomic Operations

Atomic operations can specify memory ordering constraints:

// Atomic operations with memory ordering
void atomic_with_ordering() {
    atomic_int flag = ATOMIC_VAR_INIT(0);
    atomic_int data = ATOMIC_VAR_INIT(0);
    
    // Release store: ensures all previous operations are visible
    atomic_store_explicit(&data, 42, memory_order_release);
    
    // Acquire load: ensures all subsequent operations are visible
    int value = atomic_load_explicit(&data, memory_order_acquire);
    
    // Relaxed operation: no ordering guarantees
    atomic_fetch_add_explicit(&flag, 1, memory_order_relaxed);
}

🔒 Synchronization Primitives

Mutex Implementation with Memory Ordering

Mutexes can be implemented using atomic operations and memory barriers:

// Simple spinlock mutex implementation
typedef struct {
    atomic_int locked;
} spinlock_t;

void spinlock_init(spinlock_t* lock) {
    atomic_store(&lock->locked, 0);
}

void spinlock_acquire(spinlock_t* lock) {
    while (atomic_exchange_explicit(&lock->locked, 1, 
                                   memory_order_acquire)) {
        // Spin until lock is acquired
        while (atomic_load_explicit(&lock->locked, 
                                   memory_order_relaxed)) {
            // Optional: yield or pause
        }
    }
}

void spinlock_release(spinlock_t* lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

Semaphore Implementation

Semaphores can also be implemented using atomic operations:

// Binary semaphore implementation
typedef struct {
    atomic_int count;
} semaphore_t;

void semaphore_init(semaphore_t* sem, int initial_count) {
    atomic_store(&sem->count, initial_count);
}

void semaphore_wait(semaphore_t* sem) {
    int expected;
    do {
        expected = atomic_load(&sem->count);
        if (expected <= 0) {
            // Wait for signal
            continue;
        }
    } while (!atomic_compare_exchange_weak(&sem->count, &expected, expected - 1));
}

void semaphore_signal(semaphore_t* sem) {
    atomic_fetch_add(&sem->count, 1);
}

🔄 Memory Ordering in Multi-core Systems

Cache Coherency and Memory Ordering

Cache coherency protocols ensure that all cores see a consistent view of memory, but they don’t guarantee memory ordering. Memory barriers are still needed to establish ordering constraints.

Cache Coherency vs. Memory Ordering:
┌─────────────────────────────────────────────────────────────────┐
│  Cache Coherency:                                              │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Ensures all cores see the same value for a memory location │ │
│  │ Does NOT guarantee the order of operations                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Memory Ordering:                                               │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Establishes the order of operations                         │ │
│  │ Requires explicit memory barriers                           │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Memory Ordering in Different Architectures

Different processor architectures provide different memory ordering guarantees:

x86/x64 Architecture

ARM Architecture

PowerPC Architecture


Performance Implications

Memory Barrier Performance Cost

Memory barriers have performance costs that vary by architecture:

  1. Pipeline Stalls: Barriers can cause pipeline stalls
  2. Memory Access Serialization: Some barriers serialize memory access
  3. Cache Effects: Barriers can affect cache behavior
  4. Compiler Limitations: Barriers limit compiler optimizations

Optimization Strategies

Several strategies can minimize the performance impact of memory barriers:

Barrier Placement Optimization

// Minimize barrier usage
void optimized_synchronization() {
    // Do work that doesn't need ordering
    int local_result = compute_something();
    
    // Place barrier as late as possible
    atomic_thread_fence(memory_order_release);
    
    // Only the final result needs ordering
    atomic_store(&shared_result, local_result);
}

Batching Operations

// Batch operations to reduce barrier overhead
void batched_operations() {
    // Collect multiple updates
    int updates[10];
    for (int i = 0; i < 10; i++) {
        updates[i] = compute_update(i);
    }
    
    // Single barrier for all updates
    atomic_thread_fence(memory_order_release);
    
    // Apply all updates atomically
    for (int i = 0; i < 10; i++) {
        atomic_store(&shared_data[i], updates[i]);
    }
}

🎯 Best Practices and Guidelines

General Memory Ordering Guidelines

  1. Use the Weakest Ordering Possible: Choose the weakest memory ordering that provides the required guarantees
  2. Understand Your Target Architecture: Different architectures have different default behaviors
  3. Test Thoroughly: Memory ordering bugs can be subtle and architecture-dependent
  4. Document Assumptions: Clearly document memory ordering requirements

Common Pitfalls to Avoid

  1. Assuming Sequential Consistency: Don’t assume operations execute in program order
  2. Ignoring Compiler Reordering: Compiler optimizations can reorder operations
  3. Mixing Memory Models: Be consistent with memory ordering within a program
  4. Over-synchronization: Don’t use stronger ordering than necessary

Debugging Memory Ordering Issues

Memory ordering issues can be difficult to debug:

  1. Use Memory Ordering Tools: Tools like ThreadSanitizer can detect some issues
  2. Stress Testing: Run concurrent tests under various system loads
  3. Architecture-Specific Testing: Test on different processor architectures
  4. Formal Verification: Use formal methods for critical concurrent code

📚 Further Reading and Resources


This comprehensive guide to Memory Ordering provides the foundation for understanding how modern multi-core systems handle concurrent memory access. The concepts covered here are essential for embedded software engineers working with concurrent programming and understanding the behavior of multi-threaded applications.