
Vector Processing

Understanding SIMD Instructions and Vectorization
Comprehensive coverage of SIMD instructions, vectorization techniques, and vector processing optimization



🏗️ Vector Processing Fundamentals

What is Vector Processing?

Vector processing is a computing paradigm that performs the same operation on multiple data elements simultaneously. Unlike scalar processing, which operates on one data element at a time, vector processing operates on arrays or vectors of data, dramatically improving performance for data-parallel applications.

The fundamental concept behind vector processing is Single Instruction, Multiple Data (SIMD), where a single instruction controls the processing of multiple data elements. This approach is particularly effective for applications that exhibit data parallelism, such as image processing, signal processing, scientific computing, and multimedia applications.

Vector Processing Philosophy

Vector processing embodies the principle of data parallelism, where the same computational pattern is applied to multiple data elements. This approach leverages the natural parallelism present in many algorithms and provides several key benefits:

  1. Increased Throughput: Multiple data elements processed per instruction
  2. Better Resource Utilization: Hardware resources used more efficiently
  3. Reduced Instruction Overhead: Fewer instructions needed for the same computation
  4. Scalable Performance: Performance scales with vector width

Vector vs. Scalar Processing

Scalar Processing:
┌─────────────────────────────────────────┐
│  Scalar Addition Example                │
│                                         │
│   a[0] + b[0] = c[0]   (instruction 1)  │
│   a[1] + b[1] = c[1]   (instruction 2)  │
│   a[2] + b[2] = c[2]   (instruction 3)  │
│   a[3] + b[3] = c[3]   (instruction 4)  │
│                                         │
│  Requires 4 separate instructions       │
└─────────────────────────────────────────┘

Vector Processing:
┌─────────────────────────────────────────┐
│  Vector Addition Example                │
│                                         │
│   [ a[0] a[1] a[2] a[3] ]               │
│ + [ b[0] b[1] b[2] b[3] ]               │
│ = [ c[0] c[1] c[2] c[3] ]               │
│                                         │
│  Requires 1 vector instruction          │
└─────────────────────────────────────────┘

Vector Processing Applications

Vector processing is particularly effective for applications that exhibit:

  1. Data Parallelism: Same operation applied to multiple data elements
  2. Regular Memory Access Patterns: Predictable data access patterns
  3. High Computational Density: Significant computation per data element
  4. Large Data Sets: Operations on substantial amounts of data

Common applications include:

  1. Image and Video Processing: Per-pixel filters, color conversion, codecs
  2. Digital Signal Processing: FIR/IIR filtering, FFTs, audio processing
  3. Scientific Computing: Linear algebra, simulations, numerical kernels
  4. Multimedia: Graphics, compression, and other bulk data transformations


🚀 SIMD Instruction Sets

SIMD Fundamentals

SIMD (Single Instruction, Multiple Data) instruction sets are the hardware foundation for vector processing. These instruction sets extend traditional scalar instruction sets with instructions that operate on multiple data elements simultaneously.

SIMD instructions typically operate on:

  1. Packed Integers: 8-, 16-, 32-, or 64-bit integer elements
  2. Packed Floating Point: Single- or double-precision elements
  3. Fixed-Width Vector Registers: 64, 128, 256, or 512 bits wide

Common SIMD Instruction Sets

SIMD Instruction Set Evolution:

x86 SIMD Evolution:
┌──────────┬───────────┬───────────┬─────────────┐
│ MMX      │ SSE       │ AVX       │ AVX-512     │
│ 64-bit   │ 128-bit   │ 256-bit   │ 512-bit     │
│ 1997     │ 1999      │ 2011      │ 2016        │
└──────────┴───────────┴───────────┴─────────────┘

ARM SIMD Evolution:
┌───────────┬────────────┬────────────┬────────────────────┐
│ NEON      │ SVE        │ SVE2       │ Future extensions  │
│ 128-bit   │ variable   │ variable   │                    │
│ 2005      │ 2016       │ 2019      │                    │
└───────────┴────────────┴────────────┴────────────────────┘

x86 SIMD Instructions

The x86 architecture has evolved through several SIMD instruction set extensions:

MMX (Multimedia Extensions)

MMX (1997) introduced eight 64-bit registers (MM0-MM7, aliased onto the x87 floating-point stack) and packed integer operations on 8-, 16-, and 32-bit elements.

SSE (Streaming SIMD Extensions)

SSE (1999) added dedicated 128-bit XMM registers and packed single-precision floating-point operations; SSE2 extended them with double-precision and packed integer support.

AVX (Advanced Vector Extensions)

AVX (2011) widened the registers to 256 bits (YMM) and introduced a three-operand, non-destructive instruction encoding; AVX2 added 256-bit integer operations.

AVX-512

AVX-512 (2016) provides 512-bit ZMM registers, 32 vector registers, and dedicated opmask registers (k0-k7) for per-lane predicated execution.

ARM NEON Instructions

ARM NEON provides SIMD capabilities for ARM processors: 128-bit vector registers (thirty-two in AArch64), packed 8/16/32/64-bit integer operations, and packed single-precision floating point, with double-precision SIMD added in ARMv8.

SIMD Data Types

SIMD instructions support various packed data types:

SIMD Data Type Examples (one 128-bit register):
┌─────────────────┬───────┬──────────────────────┐
│ Element Type    │ Lanes │ Layout               │
├─────────────────┼───────┼──────────────────────┤
│ 8-bit integer   │  16   │ 16 x 8  bits = 128   │
│ 16-bit integer  │   8   │  8 x 16 bits = 128   │
│ 32-bit integer  │   4   │  4 x 32 bits = 128   │
│ 64-bit integer  │   2   │  2 x 64 bits = 128   │
│ 32-bit float    │   4   │  4 x 32 bits = 128   │
└─────────────────┴───────┴──────────────────────┘
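
The table above maps directly onto the intrinsic types. A minimal sketch (SSE2, function name hypothetical) showing that the same 128-bit register type, __m128i, is reinterpreted as 16, 8, or 4 lanes simply by choosing a different add instruction:

#include <immintrin.h>

void packed_type_demo(__m128i a, __m128i b,
                      __m128i* out8, __m128i* out16, __m128i* out32) {
    *out8  = _mm_add_epi8(a, b);   // 16 parallel 8-bit additions
    *out16 = _mm_add_epi16(a, b);  // 8 parallel 16-bit additions
    *out32 = _mm_add_epi32(a, b);  // 4 parallel 32-bit additions
}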

🔄 Vectorization Techniques

Vectorization Fundamentals

Vectorization is the process of converting scalar code to use vector instructions. This transformation can be performed automatically by compilers or manually by programmers to improve performance.

Automatic Vectorization

Modern compilers can automatically vectorize loops that meet certain criteria:

  1. Loop Independence: Loop iterations must not depend on one another
  2. Regular Memory Access: Predictable memory access patterns
  3. Data Alignment: Aligned data enables the most efficient vector loads/stores
  4. Countable Trip Count: The number of iterations must be determinable when the loop is entered

Automatic Vectorization Example:

// Original scalar code
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Compiler-generated vector code (conceptually, for 4-wide SIMD;
// the compiler also emits a scalar tail loop for the n % 4 leftovers)
for (int i = 0; i + 4 <= n; i += 4) {
    __m128 va = _mm_load_ps(&a[i]);   // load 4 elements from a
    __m128 vb = _mm_load_ps(&b[i]);   // load 4 elements from b
    __m128 vc = _mm_add_ps(va, vb);   // add all 4 lanes at once
    _mm_store_ps(&c[i], vc);          // store 4 results
}
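
Whether the compiler actually vectorized a loop should be verified rather than assumed. As a sketch (source file name hypothetical), GCC and Clang can report their vectorization decisions:

# GCC: auto-vectorization is on by default at -O3;
# this flag reports which loops were vectorized
gcc -O3 -fopt-info-vec-optimized -c vector_add.c

# Clang: emit a remark for each vectorized loop
clang -O3 -Rpass=loop-vectorize -c vector_add.c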

Manual Vectorization

Manual vectorization gives programmers control over the vectorization process and can achieve better performance than automatic vectorization in many cases.

Intrinsic Functions

Intrinsic functions provide a C/C++ interface to SIMD instructions:

// x86 SSE intrinsic example
#include <immintrin.h>

// Assumes a, b, and c are 16-byte aligned, as required by _mm_load_ps /
// _mm_store_ps; use _mm_loadu_ps / _mm_storeu_ps for unaligned data.
void vector_add_sse(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);  // load 4 floats from a
        __m128 vb = _mm_load_ps(&b[i]);  // load 4 floats from b
        __m128 vc = _mm_add_ps(va, vb);  // 4 additions in one instruction
        _mm_store_ps(&c[i], vc);         // store 4 results
    }
    for (; i < n; i++) {                 // scalar tail for n % 4 leftovers
        c[i] = a[i] + b[i];
    }
}

Assembly Language

Direct assembly language programming provides maximum control:

; x86 (32-bit, cdecl) SSE assembly example
; Assumes n is a nonzero multiple of 4 and all pointers are 16-byte
; aligned (movaps faults on unaligned addresses)
vector_add_sse:
    push ebx            ; ebx is callee-saved under cdecl
    mov eax, [esp+8]    ; a
    mov ebx, [esp+12]   ; b
    mov ecx, [esp+16]   ; c
    mov edx, [esp+20]   ; n

loop_start:
    movaps xmm0, [eax]  ; Load 4 floats from a
    movaps xmm1, [ebx]  ; Load 4 floats from b
    addps xmm0, xmm1    ; Add 4 lanes in parallel
    movaps [ecx], xmm0  ; Store 4 results to c

    add eax, 16         ; Advance pointers by 4 floats (16 bytes)
    add ebx, 16
    add ecx, 16
    sub edx, 4          ; Decrement element counter
    jnz loop_start

    pop ebx
    ret

Vectorization Strategies

Loop Unrolling

Loop unrolling reduces loop overhead and improves instruction-level parallelism:

// Unrolled loop for better vectorization
void vector_add_unrolled(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 4 elements at a time
    for (; i <= n - 4; i += 4) {
        c[i] = a[i] + b[i];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
    // Handle remaining elements
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Data Alignment

Proper data alignment is crucial for optimal vector performance:

// Aligned memory allocation for vector operations
#include <immintrin.h>
#include <assert.h>
#include <stdint.h>

float* allocate_aligned_float(int n) {
    // _mm_malloc ships with the intrinsics headers on GCC, Clang, and
    // MSVC; free the result with _mm_free, not free()
    return (float*)_mm_malloc(n * sizeof(float), 16);
}

void vector_add_aligned(const float* a, const float* b, float* c, int n) {
    // Ensure pointers are 16-byte aligned
    assert(((uintptr_t)a & 0xF) == 0);
    assert(((uintptr_t)b & 0xF) == 0);
    assert(((uintptr_t)c & 0xF) == 0);

    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_store_ps(&c[i], vc);
    }
    for (; i < n; i++) {   // scalar tail
        c[i] = a[i] + b[i];
    }
}
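
A minimal usage sketch for the helpers above (continuing the same file; buffers from _mm_malloc must be released with _mm_free):

int main(void) {
    enum { N = 1024 };
    float* a = allocate_aligned_float(N);
    float* b = allocate_aligned_float(N);
    float* c = allocate_aligned_float(N);

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    vector_add_aligned(a, b, c, N);   // c[i] = 3.0f * i

    _mm_free(a); _mm_free(b); _mm_free(c);
    return 0;
}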

🏗️ Vector Processing Architectures

Vector Register Architecture

Vector processing architectures typically include:

  1. Vector Registers: Wide registers for storing vector data
  2. Vector Functional Units: Specialized execution units for vector operations
  3. Memory Interface: High-bandwidth memory access for vector loads/stores
  4. Control Logic: Hardware for managing vector operations

Vector Register Architecture:
┌───────────────────────────────────────────────────────┐
│  Vector Processing Unit                               │
│                                                       │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐   │
│  │ Vector      │  │ Vector      │  │ Vector       │   │
│  │ Registers   │  │ Functional  │  │ Memory       │   │
│  │ (V0-V31)    │  │ Units       │  │ Interface    │   │
│  └──────┬──────┘  └──────┬──────┘  └──────┬───────┘   │
│         ▼                ▼                ▼           │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐   │
│  │ Vector      │  │ Vector      │  │ Vector       │   │
│  │ Load/Store  │  │ Arithmetic  │  │ Control      │   │
│  │ Unit        │  │ Unit        │  │ Logic        │   │
│  └─────────────┘  └─────────────┘  └──────────────┘   │
└───────────────────────────────────────────────────────┘

Vector Memory Access Patterns

Vector processors support various memory access patterns:

  1. Unit Stride: Consecutive memory locations
  2. Strided: Fixed stride between elements
  3. Gather/Scatter: Non-contiguous memory access (see the gather sketch after the diagrams below)
  4. Interleaved: Alternating data from different arrays

Memory Access Patterns:
┌──────────────────────────────────────────────────┐
│  Unit Stride Access (4-byte float elements)      │
│  ┌─────────┬─────────┬───────────────────────┐   │
│  │ Address │ Data    │ Pattern               │   │
│  │ 0x1000  │ a[0]    │ Consecutive addresses │   │
│  │ 0x1004  │ a[1]    │ (stride = 4 bytes)    │   │
│  │ 0x1008  │ a[2]    │                       │   │
│  │ 0x100C  │ a[3]    │                       │   │
│  └─────────┴─────────┴───────────────────────┘   │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│  Strided Access                                  │
│  ┌─────────┬─────────┬───────────────────────┐   │
│  │ Address │ Data    │ Pattern               │   │
│  │ 0x1000  │ a[0]    │ Fixed stride of       │   │
│  │ 0x1008  │ a[1]    │ 8 bytes (every        │   │
│  │ 0x1010  │ a[2]    │ other element)        │   │
│  │ 0x1018  │ a[3]    │                       │   │
│  └─────────┴─────────┴───────────────────────┘   │
└──────────────────────────────────────────────────┘
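
Unit-stride access is the fast path; gather is how SIMD code reaches non-contiguous data. A minimal sketch assuming AVX2 is available (function name hypothetical; idx[0..7] must be valid element offsets into base):

#include <immintrin.h>

void gather_f32(const float* base, const int* idx, float* out) {
    __m256i vindex = _mm256_loadu_si256((const __m256i*)idx); // 8 x 32-bit indices
    __m256  v = _mm256_i32gather_ps(base, vindex, 4);         // scale = sizeof(float)
    _mm256_storeu_ps(out, v);                                 // 8 gathered floats
}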

Vector Length and Masking

Modern vector architectures support variable-length vectors and predicated execution:

  1. Variable Length: Vector length can be adjusted at runtime
  2. Predication: Conditional execution of vector elements
  3. Masking: Control which elements participate in operations
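
AVX-512 makes predication explicit in the instruction set: nearly every operation can take an opmask register. A minimal sketch (AVX-512F, function name hypothetical):

#include <immintrin.h>

// Lane i of the result is a[i] + b[i] when bit i of k is set;
// masked-off lanes pass through the corresponding value from src
__m512 masked_add(__m512 src, __mmask16 k, __m512 a, __m512 b) {
    return _mm512_mask_add_ps(src, k, a, b);
}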

⚡ Performance Optimization

Vectorization Efficiency

Vectorization efficiency measures how much of the scalar work is actually carried out by vector instructions:

Vectorization Efficiency = (Operations executed in vector form / Total operations) × 100%

Example:
- Original scalar code: 1000 operations
- Vectorized code: 250 vector instructions, each covering 4 operations (4-wide SIMD)
- Efficiency: (250 × 4 / 1000) × 100% = 100%; the loop is fully vectorized, and the instruction count drops to one quarter of the scalar version

Memory Bandwidth Optimization

Vector operations are often memory-bound, making memory bandwidth optimization crucial:

  1. Cache Optimization: Maximize cache utilization
  2. Memory Access Patterns: Use cache-friendly access patterns
  3. Data Locality: Minimize cache misses
  4. Prefetching: Use hardware and software prefetching
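
As a sketch of software prefetching (SSE; the distance of 64 floats ahead is an assumption that must be tuned per platform, and the function name is hypothetical):

#include <immintrin.h>

void vector_add_prefetch(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        // Hint upcoming cache lines into L1; prefetches are advisory
        // and never fault, even past the end of the array
        _mm_prefetch((const char*)&a[i + 64], _MM_HINT_T0);
        _mm_prefetch((const char*)&b[i + 64], _MM_HINT_T0);
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
    for (; i < n; i++) c[i] = a[i] + b[i];  // scalar tail
}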

Instruction-Level Parallelism

Vector operations can be combined with other optimization techniques:

  1. Loop Unrolling: Reduce loop overhead
  2. Software Pipelining: Overlap loop iterations
  3. Register Blocking: Maximize register utilization
  4. Instruction Scheduling: Optimize instruction ordering
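
Combining the first two ideas, unrolling the SSE loop by two creates independent dependency chains that an out-of-order core can execute in parallel. A minimal sketch (function name hypothetical):

#include <immintrin.h>

void vector_add_ilp(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        // v0 and v1 do not depend on each other, so their loads and
        // adds can be in flight simultaneously
        __m128 v0 = _mm_add_ps(_mm_loadu_ps(&a[i]),     _mm_loadu_ps(&b[i]));
        __m128 v1 = _mm_add_ps(_mm_loadu_ps(&a[i + 4]), _mm_loadu_ps(&b[i + 4]));
        _mm_storeu_ps(&c[i],     v0);
        _mm_storeu_ps(&c[i + 4], v1);
    }
    for (; i < n; i++) c[i] = a[i] + b[i];  // scalar tail
}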

🔧 Vector Processing in Embedded Systems

Embedded Vector Processing

Embedded systems often have specific requirements for vector processing:

  1. Power Efficiency: Minimize power consumption
  2. Real-time Performance: Meet timing constraints
  3. Memory Constraints: Work within limited memory
  4. Cost Sensitivity: Balance performance and cost

ARM NEON in Embedded Systems

ARM NEON is widely used in embedded systems due to its power efficiency and performance:

// ARM NEON example for embedded systems
#include <arm_neon.h>

void vector_add_neon(const float32_t* a, const float32_t* b,
                     float32_t* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);  // load 4 floats (vld1q has no
        float32x4_t vb = vld1q_f32(&b[i]);  // alignment requirement)
        float32x4_t vc = vaddq_f32(va, vb); // 4 additions in one instruction
        vst1q_f32(&c[i], vc);               // store 4 results
    }
    for (; i < n; i++) {                    // scalar tail
        c[i] = a[i] + b[i];
    }
}

DSP Vector Processing

Digital Signal Processors (DSPs) often include specialized vector instructions:

  1. MAC Operations: Multiply-accumulate operations (see the dot-product sketch after this list)
  2. Circular Buffering: Efficient buffer management
  3. Bit Manipulation: Specialized bit operations
  4. Fixed-point Arithmetic: Optimized for fixed-point operations
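
The MAC pattern maps naturally onto NEON. A minimal dot-product sketch (function name hypothetical; vaddvq_f32 is AArch64-only, so AArch32 builds would need a pairwise-add sequence instead):

#include <arm_neon.h>

float dot_product_neon(const float* a, const float* b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);     // 4 running partial sums
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        // acc[j] += a[i+j] * b[i+j] for j = 0..3 in one instruction
        acc = vmlaq_f32(acc, vld1q_f32(&a[i]), vld1q_f32(&b[i]));
    }
    float sum = vaddvq_f32(acc);             // horizontal add of the 4 lanes
    for (; i < n; i++) sum += a[i] * b[i];   // scalar tail
    return sum;
}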

💻 Programming Models

Vector Programming Models

Several programming models support vector processing:

  1. Intrinsic Functions: C/C++ extensions for SIMD instructions
  2. Auto-vectorization: Compiler-based vectorization
  3. Vector Libraries: High-level vector operations
  4. Domain-Specific Languages: Specialized languages for vector processing

OpenMP Vectorization

OpenMP provides directives for vectorization:

#include <omp.h>

void vector_add_openmp(float* a, float* b, float* c, int n) {
    // The simd directive asserts that iterations are independent and
    // asks the compiler to generate vector (and scalar tail) code
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
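
With GCC and Clang, the simd directive can be honored without linking the full OpenMP runtime by compiling with -fopenmp-simd. Note that the pragma is a promise from the programmer that iterations are independent; the compiler does not verify it.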

SIMD Libraries

Various libraries provide high-level SIMD operations:

  1. Intel MKL: Math Kernel Library
  2. ARM Compute Library: ARM-optimized operations
  3. OpenBLAS: Open Basic Linear Algebra Subprograms
  4. Eigen: C++ template library for linear algebra
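
As one illustration of the library approach, a BLAS routine such as saxpy (y = alpha*x + y) hands vectorization to a pre-tuned kernel. A minimal sketch assuming OpenBLAS (or any CBLAS provider) is installed and linked, e.g. with -lopenblas; the wrapper name is hypothetical:

#include <cblas.h>

void scaled_add(const float* x, float* y, int n) {
    // y[i] = 2.0f * x[i] + y[i]; the library dispatches to the best
    // SIMD kernel (SSE/AVX/NEON) for the host CPU
    cblas_saxpy(n, 2.0f, x, 1, y, 1);
}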

Future Directions in Vector Processing

Future vector processing developments may include:

  1. Variable-Length Vectors: Dynamic vector length adjustment
  2. Advanced Predication: Sophisticated conditional execution
  3. Domain-Specific Instructions: Specialized instructions for specific applications
  4. AI/ML Acceleration: Vector instructions optimized for machine learning

Heterogeneous Vector Processing

Future systems may combine different types of vector processors:

  1. CPU + GPU Integration: Unified vector processing
  2. Specialized Accelerators: Domain-specific vector units
  3. Neuromorphic Computing: Brain-inspired vector processing
  4. Quantum-Classical Hybrid: Quantum-inspired classical vector operations


This comprehensive guide to Vector Processing provides the foundation for understanding how modern processors achieve high performance through data parallelism. The concepts covered here are essential for embedded software engineers working on performance-critical, data-parallel applications.