The Embedded New Testament

The "Holy Bible" for embedded engineers


Project maintained by theEmbeddedGeorge Hosted on GitHub Pages — Theme by mattgraham

Performance Counters

Understanding CPU Profiling and Performance Monitoring
Comprehensive coverage of performance counters, profiling techniques, and performance analysis tools


📋 Table of Contents


🏗️ Performance Counters Fundamentals

What are Performance Counters?

Performance counters are hardware registers that count specific events occurring in the processor, such as cache misses, branch mispredictions, instruction executions, and memory accesses. These counters provide detailed insights into the performance characteristics of programs and help identify bottlenecks and optimization opportunities.

Performance counters are essential tools for performance analysis, as they provide low-level, accurate measurements of hardware behavior that cannot be easily obtained through other means. They enable developers to understand how their code interacts with the underlying hardware and make informed optimization decisions.

Performance Counter Philosophy

Performance counters embody the principle of measurement-driven optimization, where performance improvements are based on actual measurements rather than assumptions or intuition. This approach provides several key benefits:

  1. Objective Measurement: Quantifiable performance data
  2. Bottleneck Identification: Precise location of performance issues
  3. Optimization Validation: Proof that optimizations actually improve performance
  4. Hardware Understanding: Deep insight into hardware behavior

Performance Analysis Workflow

Performance Analysis Workflow:
┌─────────────────────────────────────────────────────────────────┐
│  Performance Analysis Process                                   │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1. Profile │ 2. Analyze  │  3. Optimize                     │ │
│  │    Code     │   Results   │    Code                          │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Collect     │ Identify    │  Implement                      │ │
│  │ Performance │ Bottlenecks │  Improvements                    │ │
│  │ Data        │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Measure     │ Validate    │  Iterate                        │ │
│  │ Improvement │ Results     │  Process                        │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Types of Performance Counters

Performance counters can be categorized into several types:

  1. Fixed Counters: Always available, count fundamental events
  2. Programmable Counters: Can be configured to count specific events
  3. General-Purpose Counters: Flexible counters for various event types
  4. Specialized Counters: Counters for specific domains (e.g., memory, cache)

⚙️ Hardware Performance Counters

Performance Counter Architecture

Modern processors include dedicated hardware for performance monitoring:

Performance Counter Architecture:
┌─────────────────────────────────────────────────────────────────┐
│  Performance Monitoring Unit (PMU)                             │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Event       │ Counter     │  Control                        │ │
│  │ Select      │ Registers   │  Registers                      │ │
│  │ Logic       │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Event       │ Counter     │  Interrupt                      │ │
│  │ Detection   │ Overflow    │  Generation                     │ │
│  │ Logic       │ Logic       │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Performance Counter Registers

Performance counters typically include several types of registers:

  1. Event Select Registers: Configure which events to count
  2. Counter Registers: Store the actual count values
  3. Control Registers: Enable/disable counters and configure behavior
  4. Status Registers: Monitor counter state and overflow conditions

Performance Counter Modes

Performance counters can operate in several modes:

  1. Counting Mode: Simple event counting
  2. Sampling Mode: Generate interrupts at specific count thresholds
  3. Trace Mode: Detailed event tracing
  4. Profiling Mode: Statistical sampling for profiling

📊 Performance Monitoring Events

CPU Performance Events

CPU performance counters monitor various processor events:

Instruction-Level Events

  1. Instructions Retired: Total instructions completed
  2. Cycles: CPU cycles consumed
  3. IPC (Instructions Per Cycle): Performance efficiency metric
  4. Branch Instructions: Total branch instructions executed

Cache Performance Events

  1. Cache References: Memory accesses that hit cache
  2. Cache Misses: Memory accesses that miss cache
  3. Cache Miss Rate: Percentage of cache misses
  4. Cache Line Evictions: Cache lines removed from cache

Memory Performance Events

  1. Memory Accesses: Total memory operations
  2. Memory Stalls: Cycles waiting for memory
  3. TLB Misses: Translation lookaside buffer misses
  4. Page Faults: Virtual memory page faults

Branch Prediction Events

  1. Branch Predictions: Total branch predictions made
  2. Branch Mispredictions: Incorrect branch predictions
  3. Branch Prediction Accuracy: Percentage of correct predictions
  4. Branch Target Buffer Misses: BTB lookup failures

Event Categories by Architecture

Different processor architectures provide different sets of performance events:

x86 Performance Events

// Common x86 performance events
#define X86_EVENT_CPU_CYCLES          0x003C
#define X86_EVENT_INSTRUCTIONS        0x00C0
#define X86_EVENT_CACHE_REFERENCES    0x4F2E
#define X86_EVENT_CACHE_MISSES        0x2E41
#define X86_EVENT_BRANCH_INSTRUCTIONS 0x00C4
#define X86_EVENT_BRANCH_MISSES       0x00C5
#define X86_EVENT_PAGE_FAULTS         0x0005
#define X86_EVENT_CONTEXT_SWITCHES    0x0006

ARM Performance Events

// Common ARM performance events
#define ARM_EVENT_CPU_CYCLES          0x11
#define ARM_EVENT_INSTRUCTIONS        0x08
#define ARM_EVENT_CACHE_REFERENCES    0x04
#define ARM_EVENT_CACHE_MISSES        0x03
#define ARM_EVENT_BRANCH_INSTRUCTIONS 0x07
#define ARM_EVENT_BRANCH_MISSES       0x06
#define ARM_EVENT_MEMORY_ACCESSES     0x05
#define ARM_EVENT_MEMORY_STALLS       0x0A

Performance Event Configuration

Performance events can be configured with various parameters:

  1. Event Selection: Choose which specific event to count
  2. Counter Masking: Filter events based on specific conditions
  3. User/Kernel Mode: Count events in user space, kernel space, or both
  4. Interrupt Generation: Generate interrupts at specific count thresholds

🔍 Profiling Techniques

Profiling Fundamentals

Profiling is the process of collecting performance data to understand program behavior and identify optimization opportunities. Different profiling techniques provide different levels of detail and overhead.

Statistical Profiling

Statistical profiling samples program execution at regular intervals to create a statistical profile of performance:

Statistical Profiling:
┌─────────────────────────────────────────────────────────────────┐
│  Sampling Process                                              │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Timer       │ Sample      │  Profile                       │ │
│  │ Interrupt   │ Collection  │  Generation                     │ │
│  │ (e.g., 1ms) │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Record      │ Analyze     │  Generate                       │ │
│  │ Program     │ Samples     │  Report                         │ │
│  │ Counter     │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Event-Based Profiling

Event-based profiling uses performance counters to trigger profiling at specific event occurrences:

  1. Cache Miss Profiling: Profile when cache misses occur
  2. Branch Misprediction Profiling: Profile when branches are mispredicted
  3. Memory Access Profiling: Profile memory access patterns
  4. Exception Profiling: Profile when exceptions occur

Call Graph Profiling

Call graph profiling tracks function call relationships and execution time:

Call Graph Example:
┌─────────────────────────────────────────────────────────────────┐
│  Function Call Graph                                           │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ main()      │ 100ms       │  Total execution time           │ │
│  │ ├─func1()   │ 60ms        │  ├─func1: 60% of time           │ │
│  │ │ ├─func2() │ 30ms        │  │ ├─func2: 30% of time         │ │
│  │ │ └─func3() │ 30ms        │  │ └─func3: 30% of time         │ │
│  │ └─func4()   │ 40ms        │  └─func4: 40% of time           │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Memory Profiling

Memory profiling focuses on memory usage patterns and performance:

  1. Allocation Profiling: Track memory allocation patterns
  2. Access Pattern Profiling: Analyze memory access locality
  3. Cache Behavior Profiling: Understand cache utilization
  4. Memory Bandwidth Profiling: Measure memory throughput

🛠️ Performance Analysis Tools

Linux Performance Tools

Linux provides several built-in performance analysis tools:

perf

The perf tool is a comprehensive performance analysis framework:

# Basic performance profiling
perf stat ./program

# Event-based sampling
perf record -e cache-misses ./program

# Call graph profiling
perf record -g ./program

# Performance report
perf report

# Real-time monitoring
perf top

SystemTap

SystemTap provides dynamic tracing capabilities:

# Trace system calls
stap -e 'probe syscall.* { printf("%s\n", name) }'

# Profile function calls
stap -e 'probe kernel.function("sys_open") { printf("open called\n") }'

Intel Performance Tools

Intel provides specialized performance analysis tools:

Intel VTune Profiler

Intel VTune provides comprehensive performance analysis:

# Command-line profiling
vtune -collect hotspots ./program

# Memory access analysis
vtune -collect memory-access ./program

# Cache analysis
vtune -collect cache-misses ./program

Intel SDE (Software Development Emulator)

Intel SDE provides detailed instruction-level analysis:

# Instruction counting
sde -icount ./program

# Memory access tracing
sde -mix ./program

ARM Performance Tools

ARM provides performance analysis tools for ARM processors:

ARM Streamline

ARM Streamline provides system-wide performance analysis:

# System performance analysis
streamline -c config.xml

# Application profiling
streamline -a app_name

ARM Performance Monitor Unit (PMU)

ARM PMU provides low-level performance monitoring:

# PMU event counting
perf stat -e armv8_pmuv3_0/cycles/ ./program

# PMU sampling
perf record -e armv8_pmuv3_0/cycles/ ./program

💻 Performance Counter Programming

Direct Performance Counter Access

Programs can directly access performance counters for custom profiling:

x86 Performance Counter Programming

#include <stdint.h>
#include <x86intrin.h>

// Performance counter structure
typedef struct {
    uint64_t start_cycles;
    uint64_t start_instructions;
} perf_counter_t;

// Start performance measurement
void perf_start(perf_counter_t* counter) {
    counter->start_cycles = __rdtsc();
    counter->start_instructions = __rdtsc(); // Simplified for example
}

// Stop performance measurement
void perf_stop(perf_counter_t* counter) {
    uint64_t end_cycles = __rdtsc();
    uint64_t end_instructions = __rdtsc(); // Simplified for example
    
    uint64_t cycles = end_cycles - counter->start_cycles;
    uint64_t instructions = end_instructions - counter->start_instructions;
    
    printf("Cycles: %lu, Instructions: %lu, IPC: %.2f\n",
           cycles, instructions, (double)instructions / cycles);
}

ARM Performance Counter Programming

#include <stdint.h>

// ARM PMU access (requires kernel support)
#define ARM_PMU_PMUSERENR_EL0  "p15, 0, %0, c9, c14, 0"
#define ARM_PMU_PMCR_EL0       "p15, 0, %0, c9, c12, 0"
#define ARM_PMU_PMCNTENSET_EL0 "p15, 0, %0, c9, c12, 1"
#define ARM_PMU_PMCCNTR_EL0    "p15, 0, %0, c9, c13, 0"

// Read ARM PMU register
static inline uint32_t arm_pmu_read(const char* reg) {
    uint32_t value;
    __asm__ __volatile__("mrc " reg : "=r" (value));
    return value;
}

// Write ARM PMU register
static inline void arm_pmu_write(const char* reg, uint32_t value) {
    __asm__ __volatile__("mcr " reg : : "r" (value));
}

// Enable ARM PMU
void arm_pmu_enable() {
    // Enable user mode access
    arm_pmu_write(ARM_PMU_PMUSERENR_EL0, 1);
    
    // Enable cycle counter
    arm_pmu_write(ARM_PMU_PMCNTENSET_EL0, 1);
    
    // Reset cycle counter
    arm_pmu_write(ARM_PMU_PMCR_EL0, 1);
}

// Read cycle counter
uint64_t arm_pmu_read_cycles() {
    return arm_pmu_read(ARM_PMU_PMCCNTR_EL0);
}

Performance Counter Libraries

Several libraries provide portable performance counter access:

PAPI (Performance Application Programming Interface)

PAPI provides a portable interface to hardware performance counters:

#include <papi.h>

void papi_example() {
    int events[2] = {PAPI_TOT_CYC, PAPI_TOT_INS};
    long long values[2];
    
    // Start counting
    PAPI_start_counters(events, 2);
    
    // Your code here
    perform_work();
    
    // Stop counting
    PAPI_stop_counters(values, 2);
    
    printf("Cycles: %lld, Instructions: %lld\n", values[0], values[1]);
}

Linux perf_event_open

Linux provides a system call interface to performance counters:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>

int perf_event_open_example() {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    
    int fd = syscall(__NR_perf_event_open, &pe, -1, 0, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        return -1;
    }
    
    // Start counting
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    
    // Your code here
    perform_work();
    
    // Stop counting
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    
    // Read result
    long long count;
    read(fd, &count, sizeof(count));
    printf("Cycles: %lld\n", count);
    
    close(fd);
    return 0;
}

Performance Optimization

Performance Analysis Workflow

Effective performance optimization follows a systematic approach:

Performance Optimization Process:
┌─────────────────────────────────────────────────────────────────┐
│  Optimization Workflow                                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1. Baseline │ 2. Profile  │  3. Identify                    │ │
│  │    Measure  │   Code      │    Bottlenecks                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 4. Optimize │ 5. Measure  │  6. Validate                    │ │
│  │    Code     │   Results   │    Improvement                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│           │             │             │                         │
│           ▼             ▼             ▼                         │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 7. Iterate  │ 8. Document │  9. Monitor                     │ │
│  │    Process  │   Changes   │    Performance                  │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Common Performance Bottlenecks

Performance counters help identify several common bottlenecks:

Cache Performance Issues

// Cache-friendly matrix multiplication
void cache_friendly_multiply(float* A, float* B, float* C, int n) {
    const int BLOCK_SIZE = 32; // Optimized for L1 cache
    
    for (int i = 0; i < n; i += BLOCK_SIZE) {
        for (int j = 0; j < n; j += BLOCK_SIZE) {
            for (int k = 0; k < n; k += BLOCK_SIZE) {
                // Process blocks to maximize cache utilization
                for (int ii = i; ii < min(i + BLOCK_SIZE, n); ii++) {
                    for (int jj = j; jj < min(j + BLOCK_SIZE, n); jj++) {
                        float sum = C[ii * n + jj];
                        for (int kk = k; kk < min(k + BLOCK_SIZE, n); kk++) {
                            sum += A[ii * n + kk] * B[kk * n + jj];
                        }
                        C[ii * n + jj] = sum;
                    }
                }
            }
        }
    }
}

Branch Prediction Issues

// Branch-friendly code organization
void branch_friendly_sort(int* array, int n) {
    // Separate positive and negative numbers
    int* positive = malloc(n * sizeof(int));
    int* negative = malloc(n * sizeof(int));
    int pos_count = 0, neg_count = 0;
    
    // First pass: separate numbers (fewer branches)
    for (int i = 0; i < n; i++) {
        if (array[i] >= 0) {
            positive[pos_count++] = array[i];
        } else {
            negative[neg_count++] = array[i];
        }
    }
    
    // Sort positive and negative separately
    sort_positive(positive, pos_count);
    sort_negative(negative, neg_count);
    
    // Combine results
    memcpy(array, negative, neg_count * sizeof(int));
    memcpy(array + neg_count, positive, pos_count * sizeof(int));
    
    free(positive);
    free(negative);
}

Performance Monitoring in Production

Performance monitoring should continue in production environments:

  1. Continuous Monitoring: Track performance metrics over time
  2. Alerting: Set thresholds for performance degradation
  3. Trend Analysis: Identify performance trends and patterns
  4. Capacity Planning: Use performance data for capacity planning

🔧 Embedded System Considerations

Embedded Performance Constraints

Embedded systems have specific performance monitoring considerations:

  1. Resource Limitations: Limited memory and processing power for monitoring
  2. Real-time Requirements: Monitoring overhead must not affect timing
  3. Power Constraints: Performance monitoring consumes power
  4. Cost Sensitivity: Monitoring hardware adds cost

Lightweight Performance Monitoring

Embedded systems require lightweight monitoring approaches:

// Lightweight performance counter for embedded systems
typedef struct {
    uint32_t start_time;
    uint32_t start_cycles;
    uint32_t start_instructions;
} lightweight_perf_t;

// Start lightweight measurement
void lightweight_perf_start(lightweight_perf_t* perf) {
    perf->start_time = get_system_time();
    perf->start_cycles = get_cycle_count();
    perf->start_instructions = get_instruction_count();
}

// Stop lightweight measurement
void lightweight_perf_stop(lightweight_perf_t* perf) {
    uint32_t end_time = get_system_time();
    uint32_t end_cycles = get_cycle_count();
    uint32_t end_instructions = get_instruction_count();
    
    uint32_t time_us = end_time - perf->start_time;
    uint32_t cycles = end_cycles - perf->start_cycles;
    uint32_t instructions = end_instructions - perf->start_instructions;
    
    // Log minimal performance data
    log_performance(time_us, cycles, instructions);
}

Real-time Performance Monitoring

Real-time systems require non-intrusive performance monitoring:

  1. Sampling-Based Monitoring: Use periodic sampling to minimize overhead
  2. Event-Driven Monitoring: Monitor only when specific events occur
  3. Background Monitoring: Run monitoring in low-priority background tasks
  4. Selective Monitoring: Monitor only critical performance aspects

📚 Further Reading and Resources


This comprehensive guide to Performance Counters provides the foundation for understanding how to measure and analyze program performance. The concepts covered here are essential for embedded software engineers working on performance-critical applications and understanding hardware behavior.