The "Holy Bible" for embedded engineers
Understanding Memory Barriers and Atomic Operations
Comprehensive coverage of memory ordering models, synchronization primitives, and concurrent programming
Memory ordering refers to the rules that govern how memory operations from different threads or cores appear to execute relative to each other. In modern multi-core systems, memory operations can be reordered by hardware and software optimizations, making the actual execution order different from the program order.
Understanding memory ordering is crucial for writing correct concurrent programs, as the behavior of multi-threaded applications depends on the memory ordering guarantees provided by the underlying hardware and programming language.
Memory ordering embodies the principle of relaxed consistency, where the hardware and compiler are allowed to reorder operations for performance optimization, as long as the reordering doesn’t violate the specified memory ordering constraints.
This approach provides several benefits: the compiler can schedule instructions freely, the CPU can execute out of order and buffer stores, and caches can defer making writes visible, all of which improve throughput on the common, uncontended path.
In single-threaded programs, memory operations appear to execute in program order. However, in multi-threaded programs, this assumption can lead to subtle bugs:
Memory Ordering Problem Example:
    Thread 1                 Thread 2
    ---------------          ------------------------
    x = 1;                   while (flag == 0) { }
    flag = 1;                print(x);

    Problem: Without proper memory ordering, Thread 2 might observe
    flag == 1 before the write to x becomes visible, and print a
    stale (uninitialized) value of x.
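The same pattern can be written as a compilable C11 sketch, assuming a toolchain that provides the optional C11 <threads.h>. Everything here is illustrative: with plain int variables the code contains a data race, the compiler may hoist the load of flag out of the spin loop, and the hardware may reorder the two writes, which is exactly the bug the diagram describes. The fix, using release/acquire atomics, appears later in this section.

// Deliberately broken message-passing sketch; do not use as-is
#include <stdio.h>
#include <threads.h>

int x = 0;     // payload
int flag = 0;  // "data is ready" signal

int writer(void* arg) {
    (void)arg;
    x = 1;      // (1) publish the payload
    flag = 1;   // (2) may become visible before (1)!
    return 0;
}

int reader(void* arg) {
    (void)arg;
    while (flag == 0) { }   // racy spin; may also be optimized away
    printf("%d\n", x);      // may print 0 on weakly ordered hardware
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, writer, NULL);
    thrd_create(&t2, reader, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}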
Memory operations can be reordered in several ways: the compiler may reorder instructions at build time, the CPU may execute them out of order, store buffers may delay when writes become visible, and the cache hierarchy may make writes visible to different cores at different times.
Sequential consistency is the strongest memory ordering model, requiring that all memory operations appear to execute in a single sequential order that respects the program order of each thread.
Sequential Consistency Example:
    Thread 1:            Thread 2:
    x = 1;               y = 1;
    r1 = y;              r2 = x;

    Possible outcomes under sequential consistency
    (x and y are both 1 at the end in every case):

    r1   r2   Description
    -----------------------------------------------
    1    0    Thread 1 sees Thread 2's write
    0    1    Thread 2 sees Thread 1's write
    1    1    Both reads see the other thread's write

    Impossible outcome:
    r1 == 0 && r2 == 0 -- this would require a cycle: r1 = y would
    have to run before y = 1, which runs before r2 = x, which would
    have to run before x = 1, which runs before r1 = y. No single
    sequential interleaving permits this.
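As an illustration, the same litmus test can be written with C11 atomics, which default to memory_order_seq_cst; under that ordering the r1 == 0 && r2 == 0 outcome is forbidden. A minimal sketch, again assuming the optional C11 <threads.h>:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x, y;   // static storage: zero-initialized

int thread1(void* arg) {
    (void)arg;
    atomic_store(&x, 1);     // seq_cst store by default
    return atomic_load(&y);  // seq_cst load -> r1
}

int thread2(void* arg) {
    (void)arg;
    atomic_store(&y, 1);
    return atomic_load(&x);  // -> r2
}

int main(void) {
    thrd_t a, b;
    int r1, r2;
    thrd_create(&a, thread1, NULL);
    thrd_create(&b, thread2, NULL);
    thrd_join(a, &r1);
    thrd_join(b, &r2);
    // Sequential consistency forbids r1 == 0 && r2 == 0
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}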
Total Store Ordering allows store-load reordering but maintains other ordering constraints. This model is used by x86 processors and provides a good balance between performance and programmer expectations.
TSO Memory Ordering:
    Ordering constraints under TSO:

    Load-Load      maintained
    Load-Store     maintained
    Store-Store    maintained
    Store-Load     may be reordered

    Example of allowed reordering:

    Original:   store A; load B
    Reordered:  load B; store A    (Store-Load reordering allowed)
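Because TSO lets a later load pass an earlier store, code that depends on a store-load order (such as Dekker-style mutual exclusion) needs a full barrier even on x86. A minimal sketch using C11 atomics; the names want0/want1 are illustrative, and the seq_cst fence is what defeats the store-buffer reordering:

#include <stdatomic.h>
#include <stdbool.h>

atomic_int want0, want1;   // intent flags for threads 0 and 1

// Dekker-style entry attempt for thread 0 (thread 1 is symmetric)
bool try_enter_thread0(void) {
    atomic_store_explicit(&want0, 1, memory_order_relaxed);
    // Full (store-load) barrier: the store to want0 must be
    // globally visible before we load want1 below
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&want1, memory_order_relaxed) == 0;
}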
Partial Store Ordering allows both store-load and store-store reordering, providing more flexibility for hardware optimization but requiring more careful programming.
Weak ordering allows most reorderings but requires explicit synchronization operations to establish ordering constraints. This model provides maximum flexibility for hardware optimization.
Release consistency provides synchronization at specific points (acquire and release operations) while allowing reordering between these points.
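Release consistency maps directly onto C11's acquire and release operations. A tiny sketch (guard, a, and b are illustrative names) showing that ordinary operations between the two synchronization points may still be reordered with each other:

#include <stdatomic.h>

extern atomic_int guard;   // assumed shared lock word
extern int a, b;           // assumed shared data

void release_consistency_example(void) {
    // Acquire point: operations below cannot move above it
    while (atomic_exchange_explicit(&guard, 1, memory_order_acquire)) { }

    a = a + 1;   // these two plain accesses may be freely
    b = b * 2;   // reordered with each other by the hardware

    // Release point: operations above cannot move below it
    atomic_store_explicit(&guard, 0, memory_order_release);
}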
Memory barriers (also called fences) are instructions that enforce ordering constraints on memory operations. They prevent certain types of reordering and ensure that memory operations are visible to other threads in the expected order.
A load-load barrier ensures that all loads before the barrier complete before any loads after the barrier begin.
Load-Load Barrier Example:
    Without barrier:
        load A; load B; load C
        Possible reordering: load B; load A; load C

    With a Load-Load barrier:
        load A
        load C
        ---- Load-Load barrier ----
        load B
        load D

    Loads A and C must complete before loads B and D begin.
A store-store barrier ensures that all stores before the barrier complete before any stores after the barrier begin.
Store-Store Barrier Example:
    Without barrier:
        store A; store B; store C
        Possible reordering: store B; store A; store C

    With a Store-Store barrier:
        store A
        store C
        ---- Store-Store barrier ----
        store B
        store D

    Stores A and C must complete before stores B and D begin.
A load-store barrier ensures that all loads before the barrier complete before any stores after the barrier begin.
A store-load barrier ensures that all stores before the barrier complete before any loads after the barrier begin. This is typically the most expensive barrier type, because it forces the store buffer to drain; it is also the only ordering that x86's TSO model does not provide by default.
A full memory barrier ensures that all memory operations before the barrier complete before any memory operations after the barrier begin.
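In portable C11, these barrier flavors are approximated by atomic_thread_fence with different orderings. A rough correspondence, as a sketch; the exact instructions emitted are compiler- and target-specific:

#include <stdatomic.h>

void fence_flavors(void) {
    // Acquire fence: earlier loads complete before later loads
    // and stores (load-load + load-store ordering)
    atomic_thread_fence(memory_order_acquire);

    // Release fence: earlier loads and stores complete before
    // later stores (load-store + store-store ordering)
    atomic_thread_fence(memory_order_release);

    // Sequentially consistent fence: full barrier, including
    // the expensive store-load ordering
    atomic_thread_fence(memory_order_seq_cst);
}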
Memory barriers are implemented differently on different architectures:
// x86 memory barriers
#include <immintrin.h>

void x86_memory_barriers(void) {
    // Compiler barrier only: prevents compiler reordering,
    // emits no instruction
    __asm__ __volatile__("" ::: "memory");

    // Full memory barrier (orders all loads and stores)
    _mm_mfence();

    // Store barrier (under TSO, mainly needed for
    // non-temporal/write-combining stores)
    _mm_sfence();

    // Load barrier (also serializes instruction execution)
    _mm_lfence();
}
// ARM memory barriers (inner-shareable domain)
void arm_memory_barriers(void) {
    // Full memory barrier
    __asm__ __volatile__("dmb ish" ::: "memory");

    // Store barrier: orders earlier stores against later stores
    __asm__ __volatile__("dmb ishst" ::: "memory");

    // Load barrier: orders earlier loads against later loads
    // and stores (ishld is ARMv8 only)
    __asm__ __volatile__("dmb ishld" ::: "memory");
}
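Projects that must run on several architectures usually hide these instructions behind macros. The names below (smp_mb, smp_wmb, smp_rmb) echo the Linux kernel's convention and are used purely as an illustration; on x86, TSO already orders ordinary loads and stores, so the read and write barriers reduce to compiler barriers:

// Minimal portability sketch (illustrative names)
#if defined(__x86_64__) || defined(__i386__)
  // TSO: hardware already orders load-load and store-store;
  // only the compiler must be restrained
  #define smp_mb()  __asm__ __volatile__("mfence" ::: "memory")
  #define smp_wmb() __asm__ __volatile__("" ::: "memory")
  #define smp_rmb() __asm__ __volatile__("" ::: "memory")
#elif defined(__aarch64__)
  #define smp_mb()  __asm__ __volatile__("dmb ish"   ::: "memory")
  #define smp_wmb() __asm__ __volatile__("dmb ishst" ::: "memory")
  #define smp_rmb() __asm__ __volatile__("dmb ishld" ::: "memory")
#endif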
Atomic operations are operations that appear to execute as a single, indivisible unit. They are essential for implementing synchronization primitives and ensuring correct behavior in concurrent programs.
Read-modify-write operations atomically read a value, modify it, and write it back:
Atomic Compare-and-Swap:
    CAS semantics (the whole body executes atomically):

    bool CAS(T* ptr, T expected, T desired) {
        if (*ptr == expected) {
            *ptr = desired;
            return true;
        }
        return false;
    }

    Usage example (naive spinlock acquire):

    while (!CAS(&lock, 0, 1)) {
        // Spin until the lock is acquired
    }
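CAS is the workhorse of lock-free data structures. A minimal sketch of a lock-free stack push in C11; the node type and the release/relaxed ordering choices here are illustrative:

#include <stdatomic.h>

typedef struct node {
    struct node* next;
    int value;
} node_t;

void stack_push(_Atomic(node_t*)* head, node_t* n) {
    n->next = atomic_load_explicit(head, memory_order_relaxed);
    // On failure, CAS reloads the current head into n->next,
    // so the loop simply retries with fresh state
    while (!atomic_compare_exchange_weak_explicit(
               head, &n->next, n,
               memory_order_release,      // success: publish the node
               memory_order_relaxed)) {   // failure: just retry
    }
}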
Atomic loads and stores guarantee that a value is read or written in a single indivisible step, so other threads can never observe a torn (half-written) value:
// C11 atomic operations
#include <stdatomic.h>
#include <stdbool.h>

void atomic_operations_example(void) {
    atomic_int counter = ATOMIC_VAR_INIT(0);

    // Atomic load
    int value = atomic_load(&counter);

    // Atomic store
    atomic_store(&counter, 42);

    // Atomic fetch-and-add: returns the value before the addition
    int old_value = atomic_fetch_add(&counter, 10);   // counter is now 52

    // Atomic compare-and-swap; on failure, 'expected' is updated
    // with the value actually found in 'counter'
    int expected = 52;
    int desired = 53;
    bool success = atomic_compare_exchange_weak(&counter, &expected, desired);

    (void)value; (void)old_value; (void)success;
}
Atomic operations can specify memory ordering constraints:
// Atomic operations with memory ordering: the classic
// release/acquire publication pattern
atomic_int flag = ATOMIC_VAR_INIT(0);
atomic_int data = ATOMIC_VAR_INIT(0);

void producer(void) {
    // Relaxed store: atomic, but no ordering of its own
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    // Release store: every write before it (including data = 42)
    // becomes visible to a thread that acquire-loads flag
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void) {
    // Acquire load: once it observes flag == 1, it also observes
    // every write the producer made before the release store
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) { }
    int value = atomic_load_explicit(&data, memory_order_relaxed);
    (void)value;   // guaranteed to be 42 here
}
Mutexes can be implemented using atomic operations and memory barriers:
// Simple spinlock (test-and-test-and-set) implementation
typedef struct {
    atomic_int locked;
} spinlock_t;

void spinlock_init(spinlock_t* lock) {
    atomic_store(&lock->locked, 0);
}

void spinlock_acquire(spinlock_t* lock) {
    // Attempt to take the lock with an acquire exchange
    while (atomic_exchange_explicit(&lock->locked, 1,
                                    memory_order_acquire)) {
        // Lock was held: spin on a cheap relaxed load until it
        // looks free, then retry the exchange
        while (atomic_load_explicit(&lock->locked,
                                    memory_order_relaxed)) {
            // Optional: architecture-specific pause/yield hint
        }
    }
}

void spinlock_release(spinlock_t* lock) {
    // Release store: makes all writes in the critical section
    // visible before the lock appears free to other threads
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
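For example (an illustrative snippet; the lock and counter are assumed to be shared between threads), the acquire/release pair is what makes the plain counter++ inside the critical section safe:

spinlock_t lock;    // initialize once with spinlock_init(&lock)
long counter = 0;   // protected by 'lock'

void increment(void) {
    spinlock_acquire(&lock);
    counter++;      // plain access is safe inside the section
    spinlock_release(&lock);
}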
Semaphores can also be implemented using atomic operations:
// Counting semaphore implementation (busy-waiting)
typedef struct {
    atomic_int count;
} semaphore_t;

void semaphore_init(semaphore_t* sem, int initial_count) {
    atomic_store(&sem->count, initial_count);
}

void semaphore_wait(semaphore_t* sem) {
    int expected = atomic_load(&sem->count);
    for (;;) {
        if (expected <= 0) {
            // No permits available: reload and keep spinning
            expected = atomic_load(&sem->count);
            continue;
        }
        // Try to take one permit; on failure, 'expected' is
        // refreshed with the current count and we retry
        if (atomic_compare_exchange_weak(&sem->count, &expected,
                                         expected - 1)) {
            return;
        }
    }
}

void semaphore_signal(semaphore_t* sem) {
    atomic_fetch_add(&sem->count, 1);
}
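A typical use is a producer/consumer handoff. The helpers below (produce_item, consume_item) are hypothetical placeholders:

semaphore_t items;  // counts queued items; init to 0 elsewhere

extern void produce_item(void);   // hypothetical: create some work
extern void consume_item(void);   // hypothetical: process the work

void producer_thread(void) {
    produce_item();
    semaphore_signal(&items);   // publish one available item
}

void consumer_thread(void) {
    semaphore_wait(&items);     // spins until an item is available
    consume_item();
}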
Cache coherency protocols ensure that all cores see a consistent view of memory, but they don’t guarantee memory ordering. Memory barriers are still needed to establish ordering constraints.
Cache Coherency vs. Memory Ordering:
    Cache coherency:
        - guarantees that all cores eventually agree on the value
          of each individual memory location
        - does NOT constrain the order in which writes to
          different locations become visible

    Memory ordering:
        - establishes the order in which operations become
          visible across locations
        - requires explicit memory barriers or ordered atomics
Different processor architectures provide different memory ordering guarantees:
- x86: mfence, sfence, lfence instructions
- ARM: dmb, dsb, isb instructions
- PowerPC: sync, lwsync, ptesync instructions

Memory barriers have performance costs that vary by architecture.
Several strategies can minimize the performance impact of memory barriers:
#include <stdatomic.h>

// Assumed shared state and helpers (illustrative declarations)
extern atomic_int shared_result;
extern atomic_int shared_data[10];
extern int compute_something(void);
extern int compute_update(int i);

// Minimize barrier usage: keep unordered work out of the
// synchronized region and fence as late as possible
void optimized_synchronization(void) {
    // Work that needs no ordering
    int local_result = compute_something();

    // A release fence followed by a relaxed store acts as a
    // release store: everything above becomes visible once the
    // store is observed
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&shared_result, local_result,
                          memory_order_relaxed);
}

// Batch operations to amortize barrier overhead
void batched_operations(void) {
    // Collect multiple updates without any ordering cost
    int updates[10];
    for (int i = 0; i < 10; i++) {
        updates[i] = compute_update(i);
    }

    // Single release fence covers the whole batch
    atomic_thread_fence(memory_order_release);

    // Relaxed stores after the fence: one barrier for all ten
    // (each store is individually atomic, not the group as a whole)
    for (int i = 0; i < 10; i++) {
        atomic_store_explicit(&shared_data[i], updates[i],
                              memory_order_relaxed);
    }
}
Memory ordering issues can be difficult to debug: failures are timing-dependent, often vanish under a debugger or added logging, and may surface only on weakly ordered hardware. Stress testing on the target architecture and race detectors such as ThreadSanitizer help expose them.
This comprehensive guide to memory ordering provides the foundation for understanding how modern multi-core systems handle concurrent memory access. The concepts covered here are essential for embedded software engineers writing concurrent code and reasoning about the behavior of multi-threaded applications.