The "Holy Bible" for embedded engineers
Understanding SIMD Instructions and Vectorization
Comprehensive coverage of SIMD instructions, vectorization techniques, and vector processing optimization
Vector processing is a computing paradigm that performs the same operation on multiple data elements simultaneously. Unlike scalar processing, which operates on one data element at a time, vector processing operates on arrays or vectors of data, dramatically improving performance for data-parallel applications.
The fundamental concept behind vector processing is Single Instruction, Multiple Data (SIMD), where a single instruction controls the processing of multiple data elements. This approach is particularly effective for applications that exhibit data parallelism, such as image processing, signal processing, scientific computing, and multimedia applications.
Vector processing embodies the principle of data parallelism, where the same computational pattern is applied to multiple data elements. This approach leverages the natural parallelism present in many algorithms and provides several key benefits: fewer instructions to fetch and decode, higher throughput per clock cycle, and lower energy per operation.
Scalar Processing:
┌─────────────────────────────────────────────────┐
│             Scalar Addition Example             │
│                                                 │
│   c[0] = a[0] + b[0]    ; instruction 1         │
│   c[1] = a[1] + b[1]    ; instruction 2         │
│   c[2] = a[2] + b[2]    ; instruction 3         │
│   c[3] = a[3] + b[3]    ; instruction 4         │
│                                                 │
│   Requires 4 separate instructions,             │
│   executed one after another                    │
└─────────────────────────────────────────────────┘
Vector Processing:
┌─────────────────────────────────────────────────┐
│             Vector Addition Example             │
│   ┌──────┬──────┬──────┬──────┐                 │
│   │ a[0] │ a[1] │ a[2] │ a[3] │                 │
│   │  +   │  +   │  +   │  +   │                 │
│   │ b[0] │ b[1] │ b[2] │ b[3] │                 │
│   │  =   │  =   │  =   │  =   │                 │
│   │ c[0] │ c[1] │ c[2] │ c[3] │                 │
│   └──────┴──────┴──────┴──────┘                 │
│                                                 │
│   Requires 1 vector instruction that adds       │
│   all 4 element pairs in parallel               │
└─────────────────────────────────────────────────┘
Vector processing is particularly effective for applications that exhibit regular, element-wise data parallelism. Common applications include image processing, digital signal processing, scientific computing, and multimedia workloads.
SIMD (Single Instruction, Multiple Data) instruction sets are the hardware foundation for vector processing. These instruction sets extend traditional scalar instruction sets with instructions that operate on multiple data elements simultaneously.
SIMD instructions typically operate on packed data: multiple 8-, 16-, 32-, or 64-bit integer or floating-point elements stored side by side in a single wide register.
SIMD Instruction Set Evolution:
┌─────────────────────────────────────────────────────────────────┐
│                       x86 SIMD Evolution                         │
│ ┌──────────┬──────────┬──────────┬────────────────────────────┐ │
│ │   MMX    │   SSE    │   AVX    │         AVX-512            │ │
│ │ (64-bit) │(128-bit) │(256-bit) │        (512-bit)           │ │
│ │   1997   │   1999   │   2011   │          2016              │ │
│ └──────────┴──────────┴──────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                       ARM SIMD Evolution                         │
│ ┌──────────┬──────────┬──────────┬────────────────────────────┐ │
│ │   NEON   │   SVE    │   SVE2   │     Future extensions      │ │
│ │ (128-bit)│(variable)│(variable)│                            │ │
│ │   2005   │   2016   │   2019   │                            │ │
│ └──────────┴──────────┴──────────┴────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The x86 architecture has evolved through several SIMD instruction set extensions: MMX (64-bit packed integers), SSE through SSE4 (128-bit integer and floating-point), AVX/AVX2 (256-bit), and AVX-512 (512-bit, with per-lane predication via mask registers).
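Because not every x86 CPU implements every extension, production code usually selects a code path at run time. Here is a minimal sketch using the GCC/Clang __builtin_cpu_supports builtin (the dispatched function names in the comments are illustrative):
// Run-time SIMD feature detection (GCC/Clang builtin; a minimal sketch)
#include <stdio.h>
int main(void) {
    if (__builtin_cpu_supports("avx2")) {
        printf("Using the AVX2 code path\n");   // e.g., call vector_add_avx2()
    } else if (__builtin_cpu_supports("sse2")) {
        printf("Using the SSE2 code path\n");   // e.g., call vector_add_sse()
    } else {
        printf("Falling back to scalar code\n");
    }
    return 0;
}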
ARM NEON provides fixed-width 128-bit SIMD for ARM processors, while the newer Scalable Vector Extension (SVE/SVE2) supports implementation-chosen vector lengths from 128 to 2048 bits, allowing the same binary to run efficiently on different hardware.
SIMD instructions support various packed data types:
SIMD Data Type Examples:
┌─────────────────────────────────────────────────────────────────┐
│                     128-bit SIMD Register                        │
│ ┌──────────────────────┬──────────────────────────────────────┐ │
│ │ Element type         │ Elements per 128-bit register        │ │
│ ├──────────────────────┼──────────────────────────────────────┤ │
│ │ 8-bit integer        │ 16                                   │ │
│ │ 16-bit integer       │ 8                                    │ │
│ │ 32-bit integer       │ 4                                    │ │
│ │ 64-bit integer       │ 2                                    │ │
│ │ 32-bit float         │ 4                                    │ │
│ └──────────────────────┴──────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
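The same 128-bit register can be interpreted as any of these layouts; the instruction, not the register, determines the element width. A minimal sketch using SSE2 intrinsics (the function name is illustrative):
// One 128-bit XMM register type, three interpretations (SSE2 sketch)
#include <emmintrin.h>
void packed_type_examples(void) {
    __m128i x = _mm_set1_epi8(1);        // 16 x 8-bit integers, all set to 1
    __m128i y = _mm_set1_epi32(7);       // 4 x 32-bit integers, all set to 7
    __m128i sum8  = _mm_add_epi8(x, x);  // 16 lane-wise 8-bit additions
    __m128i sum32 = _mm_add_epi32(y, y); // 4 lane-wise 32-bit additions
    __m128  f = _mm_set1_ps(1.5f);       // 4 x 32-bit floats
    __m128  sumf = _mm_add_ps(f, f);     // 4 lane-wise float additions
    (void)sum8; (void)sum32; (void)sumf; // Silence unused-variable warnings
}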
Vectorization is the process of converting scalar code to use vector instructions. This transformation can be performed automatically by compilers or manually by programmers to improve performance.
Modern compilers can automatically vectorize loops that meet certain criteria: a computable trip count, no loop-carried data dependencies, no possible aliasing between pointers, and a loop body free of complex control flow.
Automatic Vectorization Example:
┌─────────────────────────────────────────────────────────────────┐
│ Original Scalar Code │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ for (int i = 0; i < n; i++) { │
│ │ c[i] = a[i] + b[i]; │
│ │ } │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Compiler-Generated Vector Code │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ // Vectorized loop (assuming 4-wide SIMD) │
│ │ for (int i = 0; i < n; i += 4) { │
│ │ // Load 4 elements from a and b │
│ │ __m128 va = _mm_load_ps(&a[i]); │
│ │ __m128 vb = _mm_load_ps(&b[i]); │
│ │ // Add vectors │
│ │ __m128 vc = _mm_add_ps(va, vb); │
│ │ // Store result │
│ │ _mm_store_ps(&c[i], vc); │
│ │ } │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
In practice the compiler also emits a scalar epilogue for the final n % 4 elements and either proves alignment or uses unaligned loads; the intrinsics above stand in for the machine code the compiler actually generates.
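Aliasing is the most common blocker in C: if the compiler cannot prove that c never overlaps a or b, it must assume it might. The restrict qualifier removes that obstacle, as in this sketch (the function name is illustrative):
// 'restrict' promises the compiler the arrays never overlap, which
// removes the aliasing checks that otherwise block auto-vectorization.
void vector_add_restrict(float* restrict a, float* restrict b,
                         float* restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];   // Typically vectorized at -O2/-O3
    }
}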
Manual vectorization gives programmers control over the vectorization process and can achieve better performance than automatic vectorization in many cases.
Intrinsic functions provide a C/C++ interface to SIMD instructions:
// x86 SSE intrinsics example.
// Assumes a, b, and c are 16-byte aligned (required by _mm_load_ps/_mm_store_ps).
#include <immintrin.h>
void vector_add_sse(float* a, float* b, float* c, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);   // Load 4 floats from a
        __m128 vb = _mm_load_ps(&b[i]);   // Load 4 floats from b
        __m128 vc = _mm_add_ps(va, vb);   // Add 4 element pairs in one instruction
        _mm_store_ps(&c[i], vc);          // Store 4 results to c
    }
    for (; i < n; i++) {                  // Scalar epilogue for remaining elements
        c[i] = a[i] + b[i];
    }
}
Direct assembly language programming provides maximum control:
; x86 (32-bit cdecl) SSE assembly example.
; Assumes n is a positive multiple of 4 and all pointers are 16-byte aligned.
vector_add_sse:
        push ebx                 ; ebx is callee-saved, so preserve it
        mov  eax, [esp+8]        ; a (offsets shifted by 4 after the push)
        mov  ebx, [esp+12]       ; b
        mov  ecx, [esp+16]       ; c
        mov  edx, [esp+20]       ; n
loop_start:
        movaps xmm0, [eax]       ; Load 4 floats from a
        movaps xmm1, [ebx]       ; Load 4 floats from b
        addps  xmm0, xmm1        ; Add vectors
        movaps [ecx], xmm0       ; Store result to c
        add  eax, 16             ; Advance pointers by 16 bytes (4 floats)
        add  ebx, 16
        add  ecx, 16
        sub  edx, 4              ; 4 elements processed
        jnz  loop_start
        pop  ebx
        ret
Loop unrolling reduces loop overhead and improves instruction-level parallelism:
// Unrolled loop for better vectorization
void vector_add_unrolled(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 4 elements at a time
    for (; i <= n - 4; i += 4) {
        c[i]   = a[i]   + b[i];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
    // Handle remaining elements
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
Proper data alignment is crucial for optimal vector performance:
// Aligned memory allocation for vector operations.
// _aligned_malloc/_aligned_free are MSVC-specific; on POSIX systems use
// aligned_alloc or posix_memalign instead.
#include <immintrin.h>
#include <malloc.h>     // _aligned_malloc, _aligned_free (MSVC)
#include <assert.h>
#include <stdint.h>
float* allocate_aligned_float(int n) {
    // 16-byte alignment matches the width of an SSE (XMM) register;
    // release the buffer later with _aligned_free()
    return (float*)_aligned_malloc(n * sizeof(float), 16);
}
void vector_add_aligned(const float* a, const float* b, float* c, int n) {
    // Aligned loads/stores fault on misaligned addresses, so verify first
    assert(((uintptr_t)a & 0xF) == 0);
    assert(((uintptr_t)b & 0xF) == 0);
    assert(((uintptr_t)c & 0xF) == 0);
    for (int i = 0; i < n; i += 4) {     // Assumes n is a multiple of 4
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_store_ps(&c[i], vc);
    }
}
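When alignment cannot be guaranteed (data arriving in an external buffer, for example), the unaligned load/store variants accept any address at a small cost on modern cores. A sketch (the function name is illustrative):
// Unaligned variant: _mm_loadu_ps/_mm_storeu_ps work on any address
#include <immintrin.h>
void vector_add_unaligned(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          // Unaligned load
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); // Unaligned store
    }
    for (; i < n; i++) c[i] = a[i] + b[i];        // Scalar tail
}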
Vector processing architectures typically include a vector register file, vector functional units, and a vector memory interface, organized as shown below:
Vector Register Architecture:
┌─────────────────────────────────────────────────────────────────┐
│                     Vector Processing Unit                       │
│ ┌───────────────────┬───────────────────┬─────────────────────┐ │
│ │      Vector       │      Vector       │       Vector        │ │
│ │     Registers     │    Functional     │       Memory        │ │
│ │     (V0-V31)      │       Units       │      Interface      │ │
│ └─────────┬─────────┴─────────┬─────────┴──────────┬──────────┘ │
│           ▼                   ▼                    ▼            │
│ ┌───────────────────┬───────────────────┬─────────────────────┐ │
│ │      Vector       │      Vector       │       Vector        │ │
│ │    Load/Store     │    Arithmetic     │       Control       │ │
│ │       Unit        │       Unit        │        Logic        │ │
│ └───────────────────┴───────────────────┴─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Vector processors support various memory access patterns; the two most common are unit-stride and strided access:
Memory Access Patterns:
┌─────────────────────────────────────────────────────────────────┐
│                       Unit-Stride Access                         │
│ ┌───────────┬─────────┬───────────────────────────────────────┐ │
│ │ Address   │ Element │ Pattern                               │ │
│ │ 0x1000    │ a[0]    │ Consecutive addresses, 4-byte step    │ │
│ │ 0x1004    │ a[1]    │                                       │ │
│ │ 0x1008    │ a[2]    │                                       │ │
│ │ 0x100C    │ a[3]    │                                       │ │
│ └───────────┴─────────┴───────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                         Strided Access                           │
│ ┌───────────┬─────────┬───────────────────────────────────────┐ │
│ │ Address   │ Element │ Pattern                               │ │
│ │ 0x1000    │ a[0]    │ Fixed stride of 8 bytes               │ │
│ │ 0x1008    │ a[1]    │                                       │ │
│ │ 0x1010    │ a[2]    │                                       │ │
│ │ 0x1018    │ a[3]    │                                       │ │
│ └───────────┴─────────┴───────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
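Small fixed strides are common enough that NEON handles them directly with structure loads. A sketch that de-interleaves a stride-2 stream, such as interleaved stereo samples (function and parameter names are illustrative; assumes n_frames is a multiple of 4):
// De-interleaving a stride-2 stream with a NEON structure load
#include <arm_neon.h>
void split_stereo(const float32_t* interleaved, float32_t* left,
                  float32_t* right, int n_frames) {
    for (int i = 0; i <= n_frames - 4; i += 4) {
        float32x4x2_t lr = vld2q_f32(&interleaved[2 * i]); // Stride-2 load
        vst1q_f32(&left[i],  lr.val[0]);  // Even elements -> left channel
        vst1q_f32(&right[i], lr.val[1]);  // Odd elements  -> right channel
    }
}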
Modern vector architectures such as ARM SVE and x86 AVX-512 also support variable-length vectors and predicated execution: a per-lane mask enables or disables individual elements, so loops with conditionals and arbitrary trip counts can be vectorized without a scalar tail loop.
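A minimal sketch using the ARM SVE ACLE intrinsics (assumes an SVE-capable compiler and target; the function name is illustrative):
// Predicated, vector-length-agnostic loop: the same binary runs
// correctly on any SVE implementation, whatever its vector width.
#include <arm_sve.h>
void vector_add_sve(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += (int)svcntw()) {       // svcntw(): 32-bit lanes per vector
        svbool_t pg = svwhilelt_b32_s32(i, n);         // Mask off lanes where i + lane >= n
        svfloat32_t va = svld1_f32(pg, &a[i]);         // Predicated loads: inactive
        svfloat32_t vb = svld1_f32(pg, &b[i]);         // lanes are never accessed
        svst1_f32(pg, &c[i], svadd_f32_m(pg, va, vb)); // Predicated add and store
    }
}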
Vectorization efficiency measures how much of a program's work is actually performed by vector operations:
Vectorization Efficiency = (Element Operations Executed in Vector Form / Total Element Operations) × 100%
Example:
- Original scalar code: 1000 element operations
- Vectorized code: 250 vector instructions on 4-wide SIMD, covering 250 × 4 = 1000 element operations
- Efficiency: (1000 / 1000) × 100% = 100%, with the dynamic instruction count reduced to a quarter of the original (an ideal 4x speedup)
Vector operations are often memory-bound, making memory bandwidth optimization crucial: prefer aligned, unit-stride access, block data to fit in cache, and fuse passes so each element is read from memory only once (see the sketch below).
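A minimal sketch of kernel fusion (the function name is illustrative): computing the sum and the scaling in one pass avoids writing an intermediate array and reading it back in a second pass.
// Fused add-and-scale: 2 reads + 1 write per element, instead of the
// 3 reads + 2 writes needed by two separate passes through memory.
void fused_add_scale(const float* a, const float* b, float* d,
                     float s, int n) {
    for (int i = 0; i < n; i++) {
        d[i] = (a[i] + b[i]) * s;
    }
}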
Vector operations can be combined with other optimization techniques such as loop unrolling, software pipelining, and thread-level parallelism; the sketch below pairs SIMD with multithreading.
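A minimal sketch combining thread-level and SIMD parallelism with OpenMP (compile with -fopenmp; the function name is illustrative):
// 'parallel for simd' divides iterations among threads, and each
// thread's chunk is additionally vectorized.
#include <omp.h>
void vector_add_parallel_simd(float* a, float* b, float* c, int n) {
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}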
Embedded systems often have specific requirements for vector processing: tight power and thermal budgets, deterministic execution times, and limited memory and cache capacity.
ARM NEON is widely used in embedded systems due to its power efficiency and performance:
// ARM NEON example for embedded systems
#include <arm_neon.h>
void vector_add_neon(const float32_t* a, const float32_t* b, float32_t* c, int n) {
    int i = 0;
    for (; i <= n - 4; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);  // Load 4 floats (no alignment requirement)
        float32x4_t vb = vld1q_f32(&b[i]);
        float32x4_t vc = vaddq_f32(va, vb); // Add 4 element pairs
        vst1q_f32(&c[i], vc);               // Store 4 results
    }
    for (; i < n; i++) {                    // Scalar tail for remaining elements
        c[i] = a[i] + b[i];
    }
}
Digital Signal Processors (DSPs) often include specialized vector instructions such as multiply-accumulate (MAC), saturating arithmetic, and fixed-point operations; general-purpose SIMD sets cover much of the same ground, as the sketch below shows.
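A dot product, the core of FIR filters and correlations, maps directly onto NEON's multiply-accumulate intrinsic. A minimal sketch (the function name is illustrative):
// Dot product using NEON multiply-accumulate
#include <arm_neon.h>
float dot_product_neon(const float32_t* a, const float32_t* b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);        // Four running partial sums
    int i = 0;
    for (; i <= n - 4; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(&a[i]),  // acc += a[i..i+3] * b[i..i+3]
                             vld1q_f32(&b[i]));
    }
    // Reduce the four lanes to a single scalar sum
    float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    float result = vget_lane_f32(vpadd_f32(sum2, sum2), 0);
    for (; i < n; i++) result += a[i] * b[i];   // Scalar tail
    return result;
}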
Several programming models support vector processing, ranging from compiler directives to intrinsics and dedicated libraries:
OpenMP provides directives for vectorization:
#include <omp.h>
void vector_add_openmp(float* a, float* b, float* c, int n) {
    // 'omp simd' instructs the compiler to vectorize this loop
    // (compile with -fopenmp or -fopenmp-simd)
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
Various libraries provide high-level SIMD operations, such as Intel IPP, Eigen, and C++'s std::experimental::simd, letting portable code use vector hardware without hand-written intrinsics.
Future vector processing developments may include wider registers, longer scalable vector lengths, and richer predication and gather/scatter support. Future systems may also combine different types of vector processors, pairing general-purpose SIMD cores with dedicated DSP or matrix accelerators.
This comprehensive guide to vector processing provides the foundation for understanding how modern processors achieve high performance through data parallelism. The concepts covered here are essential for embedded software engineers working on performance-critical applications.