Code Optimization Techniques

Quick Reference: Key Facts

  - Algorithmic and data-structure choices can change performance by orders of magnitude; most other techniques yield fractional gains.
  - Cache behavior often dominates asymptotic complexity for small to medium datasets.
  - Clear, predictable code optimizes better than "clever" code: modern compilers do the heavy lifting.
  - Access data in the order it is stored in memory (row-major for C arrays).
  - Higher optimization levels trade compile time, code size, and debuggability for speed.

The Foundation of Performance Optimization

Code optimization represents the most fundamental level of performance improvement in embedded systems, where the choice of algorithms, data structures, and compiler configurations can have orders-of-magnitude impact on system performance. Unlike other optimization techniques that might provide 10-20% improvements, algorithmic optimization can transform an unusable system into a highly efficient one. This makes it the first and most important consideration in any optimization effort.

The optimization process begins with understanding that performance is not a single metric but a complex interplay of multiple factors: execution speed, memory usage, power consumption, and real-time responsiveness. Each of these factors can become a bottleneck depending on the specific requirements of the application. A system optimized for speed might consume excessive power, while a system optimized for power might fail to meet real-time deadlines. The art of optimization lies in finding the right balance for each specific use case.

Core Concepts

Concept: Algorithmic Complexity vs. Real Performance

Why it matters: Big-O notation provides theoretical guidance, but constant factors, cache behavior, and data characteristics determine real-world performance.

Minimal example:

// O(n²) but cache-friendly vs O(n log n) but cache-unfriendly
void cache_friendly_sort(int arr[], int n) {
    // Bubble sort - O(n²) but excellent cache locality
    for (int i = 0; i < n-1; i++) {
        for (int j = 0; j < n-i-1; j++) {
            if (arr[j] > arr[j+1]) {
                int temp = arr[j];
                arr[j] = arr[j+1];
                arr[j+1] = temp;
            }
        }
    }
}
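
For the O(n log n) side of the comparison, a minimal stand-in using the C standard library's qsort; its actual cache behavior depends on the libc implementation:

#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   // avoids overflow from x - y
}

void library_sort(int arr[], int n) {
    // O(n log n) on average, but with indirect calls and less sequential access
    qsort(arr, (size_t)n, sizeof(int), cmp_int);
}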

Try it: Profile both algorithms with different data sizes and cache configurations.

Takeaways: Cache behavior often dominates algorithmic complexity for small to medium datasets.

Concept: Compiler Optimization Leverage

Why it matters: Modern compilers can transform naive code into highly efficient machine code, often outperforming hand-optimized assembly.

Minimal example:

// Let the compiler optimize this
int sum_array(int arr[], int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}

// Compiler can vectorize, unroll, and optimize this automatically

Try it: Compare assembly output with different optimization levels (-O0, -O2, -O3).

Takeaways: Write clear, predictable code and let the compiler do the heavy lifting.

Concept: Memory Access Patterns

Why it matters: Memory access patterns often impact performance more than algorithmic complexity due to cache behavior.

Minimal example:

// Good: Row-major access (cache-friendly)
int sum_matrix_good(int matrix[][100], int rows) {
    int sum = 0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < 100; j++) {
            sum += matrix[i][j];  // Sequential memory access
        }
    }
    return sum;
}

// Bad: Column-major access (cache-unfriendly)
int sum_matrix_bad(int matrix[][100], int rows) {
    int sum = 0;
    for (int j = 0; j < 100; j++) {
        for (int i = 0; i < rows; i++) {
            sum += matrix[i][j];  // Strided memory access
        }
    }
    return sum;
}

Try it: Benchmark both functions with different matrix sizes.

Takeaways: Access data in the order it’s stored in memory.

Algorithmic Optimization: The Foundation of Performance

The preceding concepts introduced algorithmic choice as the foundation of optimization; this section examines it in more depth. The algorithm sets the performance ceiling for the entire system: no amount of compiler tuning or instruction-level work can rescue a design that does fundamentally more work than necessary, which is why algorithm and data-structure selection must come first in any optimization effort.

The choice of algorithms begins with understanding the computational complexity of different approaches. Big-O notation provides a theoretical framework for comparing algorithms, but practical performance depends on many factors beyond just the asymptotic complexity. Constant factors, cache behavior, and the specific characteristics of the input data all play crucial roles in determining real-world performance.

For embedded systems, the choice of algorithms must consider not just computational complexity but also memory usage, power consumption, and real-time characteristics. A sorting algorithm that runs in O(n log n) time but requires O(n) additional memory may be the right choice when RAM is plentiful, while a memory-constrained system may be better served by an in-place algorithm, even at O(n²) time. Similarly, an algorithm with consistent performance might be preferable to one that is faster on average but has unpredictable worst-case behavior, since real-time systems must meet deadlines in the worst case, not the average one.

Compiler Optimization: Leveraging Automated Intelligence

Compiler optimization represents one of the most powerful tools available to embedded system developers, providing automated performance improvements that would be extremely difficult or impossible to achieve through manual optimization. Modern compilers incorporate sophisticated optimization techniques that can transform naive code into highly efficient machine code, often outperforming hand-optimized assembly language.

The effectiveness of compiler optimization depends on the quality of the source code and the compiler’s ability to understand the programmer’s intent. Compilers work best when the code is written clearly and follows predictable patterns. Complex, convoluted code that tries to be clever often prevents the compiler from applying its optimizations effectively. This creates a paradox where the most “optimized” hand-written code often performs worse than simple, clear code that allows the compiler to do its job.

Compiler optimization operates at multiple levels, from basic local optimizations to sophisticated global transformations. Local optimizations include constant folding, dead code elimination, and basic block optimization. These optimizations are applied to small sections of code and are generally safe and predictable. Global optimizations include function inlining, loop optimization, and inter-procedural optimization. These optimizations can have dramatic effects on performance but may also introduce unexpected behavior if not carefully controlled.
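
A minimal sketch of local optimizations in action (names are illustrative). At -O1 and above, a typical compiler folds the constants, removes the dead branch, and reduces the whole function to a single return of 7200:

int seconds_in_two_hours(void) {
    int per_hour = 60 * 60;   // constant folding: evaluated to 3600 at compile time
    int debug = 0;
    if (debug) {              // dead code elimination: branch is provably never taken
        per_hour += 1;
    }
    return per_hour * 2;      // typically compiles to "return 7200"
}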

Compiler Flags and Optimization Levels

The choice of compiler flags and optimization levels represents a critical decision in the optimization process. Higher optimization levels generally provide better performance but may also increase compilation time, memory usage during compilation, and the risk of introducing bugs. For embedded systems, the choice of optimization level must balance performance requirements with development time and system reliability.

Common optimization flags include:

  - -O0: no optimization; fastest compilation and the easiest debugging
  - -O1: basic optimizations with modest compile-time cost
  - -O2: the usual production default; aggressive but generally safe optimizations
  - -O3: adds aggressive loop transformations and auto-vectorization; can grow code size
  - -Os: optimizes for size, often the right choice for flash-constrained targets
  - -Og: optimizations that preserve the debugging experience (GCC)

Additional optimization flags can target specific areas:

  - -funroll-loops: unrolls loops where the compiler judges it profitable
  - -flto: link-time optimization, enabling inlining across translation units
  - -ffast-math: relaxes IEEE 754 rules for faster floating-point code (changes semantics; use with care)
  - -march=<target>: generates code for a specific CPU, enabling its instruction-set extensions

Loop Optimization Techniques

Loops are often the most performance-critical sections of embedded code, and optimizing them can provide significant performance improvements. The compiler can perform many loop optimizations automatically, but understanding these techniques helps write code that the compiler can optimize effectively.

Loop unrolling is a technique where the compiler replicates loop body code to reduce loop overhead. This is most effective for loops with a small, known number of iterations. The compiler can often determine the optimal unroll factor based on the target architecture and loop characteristics.
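
A hand-unrolled sketch for illustration; in practice it is usually better to keep the simple loop and let the compiler unroll it (names are illustrative):

int sum_unrolled(const int *arr, int n) {
    int sum = 0;
    int i = 0;
    // Loop body replicated four times per iteration to amortize the loop overhead
    for (; i + 3 < n; i += 4) {
        sum += arr[i];
        sum += arr[i + 1];
        sum += arr[i + 2];
        sum += arr[i + 3];
    }
    // Remainder loop handles the final n % 4 elements
    for (; i < n; i++) {
        sum += arr[i];
    }
    return sum;
}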

Loop vectorization transforms scalar operations into vector operations that can process multiple data elements simultaneously. This is particularly effective for loops that perform the same operation on arrays of data. Modern compilers can automatically vectorize many loops, but the code must be written in a way that allows the compiler to recognize vectorization opportunities.
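
A loop written to be vectorizable, shown as a sketch with illustrative names. The C99 restrict qualifier promises the compiler that the pointers do not alias, which is often the missing piece that unlocks auto-vectorization (for example at -O3, or -O2 with -ftree-vectorize on GCC):

// Without restrict, the compiler must assume dst and src might overlap
void scale_array(float * restrict dst, const float * restrict src, float k, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i] * k;   // same operation on every element: ideal for SIMD
    }
}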

Function Optimization

Function calls introduce overhead in embedded systems, and optimizing function usage can significantly improve performance. Inlining is one of the most effective function optimizations, where the compiler replaces function calls with the actual function code. This eliminates call overhead and allows the compiler to optimize the function code in the context of the calling code.

The compiler can automatically inline small functions, but the inline keyword can suggest inlining for larger functions when appropriate. However, excessive inlining can increase code size, so the trade-off between performance and memory usage must be considered.
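
A typical pattern, sketched with illustrative names: a small, frequently called helper defined static inline in a header, so every translation unit can inline it without multiple-definition errors:

// Small enough that inlining removes all call overhead
static inline int clamp(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}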

Function specialization is another optimization technique where the compiler creates specialized versions of functions for specific argument types or values. This can eliminate runtime type checking and enable more aggressive optimizations. Template functions in C++ are a common way to enable function specialization.
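
C has no templates, but a similar effect falls out of inlining combined with constant propagation. In this illustrative sketch, a compiler that inlines ipow into cube can specialize the loop for exp == 3 and reduce it to two multiplications:

static int ipow(int base, int exp) {
    int result = 1;
    while (exp-- > 0) {
        result *= base;
    }
    return result;
}

int cube(int x) {
    return ipow(x, 3);   // candidate for specialization: exp is a compile-time constant
}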

Memory Access Optimization

Memory access patterns significantly impact performance in modern embedded systems due to the memory hierarchy and cache behavior. Optimizing memory access can provide substantial performance improvements; as noted earlier, for small to medium datasets cache effects can outweigh differences in asymptotic complexity.

Data locality is a key principle in memory access optimization. Programs should access data that is close together in memory, taking advantage of spatial locality. Similarly, programs should reuse data that has been recently accessed, taking advantage of temporal locality. These optimizations help maintain data in cache and reduce expensive memory accesses.
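
A classic spatial-locality decision is array-of-structures versus structure-of-arrays, sketched here with illustrative names. Iterating over one field of the AoS layout strides past unused bytes on every element, while the same pass over the SoA layout is purely sequential:

// Array-of-structures: a pass over only 'x' also loads unwanted y and z bytes
struct particle_aos {
    float x, y, z;
};

// Structure-of-arrays: all 'x' values are contiguous, so a pass over them
// uses every byte of every cache line it touches
struct particles_soa {
    float x[1024];
    float y[1024];
    float z[1024];
};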

Array access patterns are particularly important for performance. Row-major access patterns are optimal for C/C++ arrays, while column-major access patterns are optimal for Fortran arrays. Understanding the memory layout of data structures and accessing them accordingly can provide significant performance improvements.

Branch Prediction Optimization

Modern processors use sophisticated branch prediction to maintain high performance, but poor branch patterns can significantly reduce performance. Optimizing branch patterns helps the processor make better predictions and maintain high instruction throughput.

The most important branch optimization is to arrange conditionals so that the most common case is the fall-through path. This keeps the hot path sequential in memory and helps static branch predictors, which commonly assume forward branches are not taken. For branches the hardware predicts dynamically, compiler hints can improve code layout and prediction accuracy, though their effectiveness varies by processor architecture; a sketch using a GCC/Clang built-in follows.
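
A minimal sketch using __builtin_expect, a GCC/Clang extension rather than standard C; the unlikely macro is a widespread convention, not a library API, and the function is illustrative:

// GCC/Clang extension: tells the compiler this condition is almost always false
#define unlikely(x) __builtin_expect(!!(x), 0)

int process_sample(int status) {
    if (unlikely(status != 0)) {
        return -1;   // cold error path, moved out of the hot instruction stream
    }
    return 0;        // common case stays on the fall-through path
}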

Eliminating unnecessary branches can also improve performance. This can be done by using conditional moves instead of conditional branches, or by restructuring code to avoid branches entirely. The compiler can often perform these optimizations automatically, but understanding the techniques helps write code that the compiler can optimize effectively.
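
As a concrete sketch of branch elimination (function names are illustrative), counting elements above a threshold can be written so the comparison result is accumulated directly instead of branched on:

// Branchy version: mispredicts heavily when the data is random
int count_above_branchy(const int *arr, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] > threshold) {
            count++;
        }
    }
    return count;
}

// Branchless version: the comparison yields 0 or 1, added unconditionally
int count_above_branchless(const int *arr, int n, int threshold) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        count += (arr[i] > threshold);  // no data-dependent branch
    }
    return count;
}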

Instruction-Level Optimizations

Modern processors can execute multiple instructions per cycle through various techniques such as superscalar execution, out-of-order execution, and instruction-level parallelism. Optimizing code to take advantage of these features can provide significant performance improvements.

Instruction scheduling is critical for performance. The compiler can often reorder instructions to improve pipeline utilization and reduce stalls. However, the programmer can help by writing code that provides clear dependencies and avoids unnecessary serialization.
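
One way to help is to break long dependency chains. In this illustrative sketch, two independent accumulators let the processor overlap additions instead of serializing them behind a single register; compilers may already do this for integer sums, and for floating-point sums they cannot reassociate without flags such as -ffast-math, which makes the manual form more relevant there:

// A single accumulator serializes every addition behind the previous one;
// two independent accumulators expose instruction-level parallelism
int sum_two_accumulators(const int *arr, int n) {
    int s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += arr[i];        // this chain...
        s1 += arr[i + 1];    // ...is independent of this one
    }
    if (i < n) {
        s0 += arr[i];        // odd-length remainder
    }
    return s0 + s1;
}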

SIMD (Single Instruction, Multiple Data) instructions can process multiple data elements simultaneously, providing significant performance improvements for data-parallel operations. Modern compilers can automatically vectorize many loops to use SIMD instructions, but the code must be written in a way that allows the compiler to recognize vectorization opportunities.
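
When auto-vectorization falls short, intrinsics give direct access to SIMD instructions. This sketch assumes an AArch64 target with ARM NEON; other architectures expose analogous intrinsics through their own headers:

#include <arm_neon.h>
#include <stdint.h>

int32_t sum_neon(const int32_t *arr, int n) {
    int32x4_t acc = vdupq_n_s32(0);                 // four 32-bit lanes, all zero
    int i = 0;
    for (; i + 3 < n; i += 4) {
        acc = vaddq_s32(acc, vld1q_s32(&arr[i]));   // load four ints, add lane-wise
    }
    int32_t sum = vaddvq_s32(acc);                  // horizontal add across lanes (AArch64)
    for (; i < n; i++) {
        sum += arr[i];                              // scalar remainder
    }
    return sum;
}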

Visual Representations

Optimization Impact Hierarchy

Performance Impact
    │
    ├── Algorithmic (10x - 100x)
    ├── Memory Access (2x - 10x)
    ├── Compiler Optimization (1.5x - 3x)
    ├── Loop Optimization (1.2x - 2x)
    └── Instruction-Level (1.1x - 1.5x)

Compiler Optimization Flow

Source Code → Parse → Optimize → Generate Assembly
    │           │        │           │
    │           │        ├── Local (Safe)
    │           │        ├── Global (Risky)
    │           │        └── Target-Specific
    │           └── AST
    └── Compiler Flags

Memory Access Pattern Comparison

Row-Major (Good):     Column-Major (Bad):
[1][2][3][4]         [1][5][9][13]
[5][6][7][8]         [2][6][10][14]
[9][10][11][12]      [3][7][11][15]
[13][14][15][16]     [4][8][12][16]

Cache hits: ████████   Cache hits: ██
Cache misses: ██       Cache misses: ████████

Guided Labs

Lab 1: Compiler Optimization Analysis

  1. Setup: Create a simple function with loops and function calls
  2. Compile: Use different optimization levels (-O0, -O1, -O2, -O3)
  3. Analyze: Compare assembly output and execution time
  4. Document: Note which optimizations the compiler applied

Lab 2: Memory Access Pattern Impact

  1. Implement: Both row-major and column-major matrix operations
  2. Profile: Use cache-profiling tools (e.g., perf, Valgrind's cachegrind)
  3. Measure: Execution time with different data sizes
  4. Analyze: When does cache behavior dominate?

Lab 3: Algorithmic vs. Implementation Trade-offs

  1. Compare: Simple O(n²) algorithm vs. complex O(n log n) algorithm
  2. Profile: Memory usage, cache misses, execution time
  3. Vary: Data sizes from small (fits in cache) to large (exceeds cache)
  4. Conclude: When does each approach win?

Check Yourself

Understanding Check

  - Why can an O(n²) algorithm outperform an O(n log n) algorithm on small to medium datasets?
  - What is the difference between spatial and temporal locality, and why does each matter for cache behavior?

Application Check

  - Which optimization level would you choose for a flash-constrained target, and what would you trade away?
  - How would you restructure a column-major matrix traversal to make it cache-friendly?

Analysis Check

  - Given a profile showing a high cache-miss rate, which techniques from this chapter would you try first, and why?
  - When is manual loop unrolling worth the code-size cost, given that the compiler can unroll automatically?

Conclusion

Code optimization techniques provide the foundation for high-performance embedded systems. Algorithmic optimization can provide orders-of-magnitude improvements, while compiler optimization can provide significant additional improvements with minimal effort. The key is to understand the optimization techniques available and apply them systematically based on the specific requirements and constraints of the target system.

The most effective optimization approach combines algorithmic improvements with compiler optimization. Algorithmic improvements provide the foundation for good performance, while compiler optimization can extract additional performance from the resulting code. This combination often provides the best results with the least development effort.

As embedded systems become more complex and demanding, the importance of effective code optimization will only increase. The continued development of optimization tools and techniques will provide new opportunities for improving system performance, but the fundamental principles of algorithmic optimization and compiler optimization will remain the foundation of effective optimization.