The "Holy Bible" for embedded engineers
Understanding CPU Profiling and Performance Monitoring
Comprehensive coverage of performance counters, profiling techniques, and performance analysis tools
Performance counters are hardware registers that count specific events occurring in the processor, such as cache misses, branch mispredictions, retired instructions, and memory accesses. These counters provide detailed insight into the performance characteristics of a program and help identify bottlenecks and optimization opportunities.
Performance counters are essential tools for performance analysis, as they provide low-level, accurate measurements of hardware behavior that cannot be easily obtained through other means. They enable developers to understand how their code interacts with the underlying hardware and make informed optimization decisions.
Performance counters embody the principle of measurement-driven optimization: performance improvements are based on actual measurements rather than assumptions or intuition. Measurements establish a baseline, show where time is really spent, and verify that each change actually helped.
Performance Analysis Workflow:
┌─────────────────────────────────────────────────────────────────┐
│ Performance Analysis Process │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1. Profile │ 2. Analyze │ 3. Optimize │ │
│ │ Code │ Results │ Code │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Collect │ Identify │ Implement │ │
│ │ Performance │ Bottlenecks │ Improvements │ │
│ │ Data │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Measure │ Validate │ Iterate │ │
│ │ Improvement │ Results │ Process │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Performance counters fall into several categories: hardware counters implemented in the processor's PMU (cycles, cache misses), software counters maintained by the operating system (page faults, context switches), and derived metrics computed from them (instructions per cycle, cache miss rate).
Modern processors include dedicated hardware for performance monitoring:
Performance Counter Architecture:
┌─────────────────────────────────────────────────────────────────┐
│ Performance Monitoring Unit (PMU) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Event │ Counter │ Control │ │
│ │ Select │ Registers │ Registers │ │
│ │ Logic │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Event │ Counter │ Interrupt │ │
│ │ Detection │ Overflow │ Generation │ │
│ │ Logic │ Logic │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Performance counters typically include three kinds of registers: event select registers, which choose the event each counter tracks; counter registers, which hold the running counts; and control registers, which enable or disable counters, set privilege filtering, and configure overflow interrupts.
Performance counters can operate in two basic modes: counting mode, where totals are read before and after a region of interest, and sampling mode, where an interrupt fires every N events and records the program counter. Most PMUs can additionally restrict counting to user mode, kernel mode, or both.
CPU performance counters monitor various processor events:
Different processor architectures provide different sets of performance events:
// Common x86 performance events (encoded as umask << 8 | event number)
#define X86_EVENT_CPU_CYCLES 0x003C
#define X86_EVENT_INSTRUCTIONS 0x00C0
#define X86_EVENT_CACHE_REFERENCES 0x4F2E // LLC references
#define X86_EVENT_CACHE_MISSES 0x412E // LLC misses
#define X86_EVENT_BRANCH_INSTRUCTIONS 0x00C4
#define X86_EVENT_BRANCH_MISSES 0x00C5
// Note: page faults and context switches are software events counted by
// the kernel (PERF_TYPE_SOFTWARE in Linux perf), not hardware PMU events
#define SW_EVENT_PAGE_FAULTS PERF_COUNT_SW_PAGE_FAULTS
#define SW_EVENT_CONTEXT_SWITCHES PERF_COUNT_SW_CONTEXT_SWITCHES
// Common ARM performance events (ARMv8 PMU common event numbers)
#define ARM_EVENT_CPU_CYCLES 0x11
#define ARM_EVENT_INSTRUCTIONS 0x08 // INST_RETIRED
#define ARM_EVENT_CACHE_REFERENCES 0x04 // L1D_CACHE accesses
#define ARM_EVENT_CACHE_MISSES 0x03 // L1D_CACHE_REFILL
#define ARM_EVENT_BRANCH_INSTRUCTIONS 0x12 // BR_PRED (predictable branches)
#define ARM_EVENT_BRANCH_MISSES 0x10 // BR_MIS_PRED
#define ARM_EVENT_MEMORY_ACCESSES 0x13 // MEM_ACCESS
#define ARM_EVENT_MEMORY_STALLS 0x24 // STALL_BACKEND
Performance events can be configured with various parameters:
Profiling is the process of collecting performance data to understand program behavior and identify optimization opportunities. Different profiling techniques provide different levels of detail and overhead.
Statistical profiling samples program execution at regular intervals to create a statistical profile of performance:
Statistical Profiling:
┌─────────────────────────────────────────────────────────────────┐
│ Sampling Process │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Timer │ Sample │ Profile │ │
│ │ Interrupt │ Collection │ Generation │ │
│ │ (e.g., 1ms) │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Record │ Analyze │ Generate │ │
│ │ Program │ Samples │ Report │ │
│ │ Counter │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Event-based profiling uses performance counters to trigger profiling at specific event occurrences:
Call graph profiling tracks function call relationships and execution time:
Call Graph Example:
┌─────────────────────────────────────────────────────────────────┐
│ Function Call Graph │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ main() │ 100ms │ Total execution time │ │
│ │ ├─func1() │ 60ms │ ├─func1: 60% of time │ │
│ │ │ ├─func2() │ 30ms │ │ ├─func2: 30% of time │ │
│ │ │ └─func3() │ 30ms │ │ └─func3: 30% of time │ │
│ │ └─func4() │ 40ms │ └─func4: 40% of time │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Memory profiling focuses on memory usage patterns and performance:
Linux provides several built-in performance analysis tools:
The perf tool is a comprehensive performance analysis framework:
# Basic performance profiling
perf stat ./program
# Event-based sampling
perf record -e cache-misses ./program
# Call graph profiling
perf record -g ./program
# Performance report
perf report
# Real-time monitoring
perf top
SystemTap provides dynamic tracing capabilities:
# Trace system calls
stap -e 'probe syscall.* { printf("%s\n", name) }'
# Profile function calls
stap -e 'probe kernel.function("sys_open") { printf("open called\n") }'
Intel provides specialized performance analysis tools:
Intel VTune provides comprehensive performance analysis:
# Command-line profiling
vtune -collect hotspots ./program
# Memory access analysis
vtune -collect memory-access ./program
# Microarchitecture analysis (cache and pipeline behavior)
vtune -collect uarch-exploration ./program
Intel SDE provides detailed instruction-level analysis:
# Instruction counting
sde -icount -- ./program
# Instruction mix histogram
sde -mix -- ./program
ARM provides performance analysis tools for ARM processors:
ARM Streamline provides system-wide performance analysis:
# System performance analysis
streamline -c config.xml
# Application profiling
streamline -a app_name
ARM PMU provides low-level performance monitoring:
# PMU event counting
perf stat -e armv8_pmuv3_0/cycles/ ./program
# PMU sampling
perf record -e armv8_pmuv3_0/cycles/ ./program
Programs can directly access performance counters for custom profiling:
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>
// Performance counter structure
typedef struct {
    uint64_t start_cycles;
} perf_counter_t;
// Start performance measurement
void perf_start(perf_counter_t* counter) {
    counter->start_cycles = __rdtsc(); // read the time-stamp counter
}
// Stop performance measurement
void perf_stop(perf_counter_t* counter) {
    uint64_t ticks = __rdtsc() - counter->start_cycles;
    // Note: RDTSC counts time-stamp-counter ticks at a fixed frequency,
    // not core clock cycles, and retired-instruction counts require a
    // real PMU counter (e.g., via perf_event_open), so no IPC is
    // computed here.
    printf("TSC ticks: %llu\n", (unsigned long long)ticks);
}
#include <stdint.h>
// ARMv7 (AArch32) PMU access via CP15 registers. Inline-assembly
// register specifiers must be compile-time string literals, so each
// register gets its own accessor. Note that enabling user-mode access
// (PMUSERENR) must itself be done from a privileged mode, typically by
// a small kernel module or boot code.
static inline void arm_pmu_write_pmuserenr(uint32_t value) {
    __asm__ __volatile__("mcr p15, 0, %0, c9, c14, 0" : : "r"(value));
}
static inline void arm_pmu_write_pmcr(uint32_t value) {
    __asm__ __volatile__("mcr p15, 0, %0, c9, c12, 0" : : "r"(value));
}
static inline void arm_pmu_write_pmcntenset(uint32_t value) {
    __asm__ __volatile__("mcr p15, 0, %0, c9, c12, 1" : : "r"(value));
}
static inline uint32_t arm_pmu_read_pmccntr(void) {
    uint32_t value;
    __asm__ __volatile__("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
// Enable the PMU cycle counter
void arm_pmu_enable(void) {
    arm_pmu_write_pmuserenr(1);          // allow user-mode access (privileged)
    arm_pmu_write_pmcr(0x5);             // bit 0: enable PMU, bit 2: reset cycle counter
    arm_pmu_write_pmcntenset(1u << 31);  // bit 31 enables the cycle counter
}
// Read cycle counter
uint32_t arm_pmu_read_cycles(void) {
    return arm_pmu_read_pmccntr();
}
Several libraries provide portable performance counter access:
PAPI provides a portable interface to hardware performance counters:
#include <papi.h>
#include <stdio.h>
void perform_work(void); // the code being measured
void papi_example(void) {
    int events[2] = {PAPI_TOT_CYC, PAPI_TOT_INS};
    long long values[2];
    // Start counting (classic high-level API; PAPI 6.x replaces it with
    // PAPI_hl_region_begin/PAPI_hl_region_end)
    PAPI_start_counters(events, 2);
    // Your code here
    perform_work();
    // Stop counting
    PAPI_stop_counters(values, 2);
    printf("Cycles: %lld, Instructions: %lld\n", values[0], values[1]);
}
Linux provides a system call interface to performance counters:
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
int perf_event_open_example() {
struct perf_event_attr pe;
memset(&pe, 0, sizeof(pe));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(pe);
pe.config = PERF_COUNT_HW_CPU_CYCLES;
pe.disabled = 1;
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
int fd = syscall(__NR_perf_event_open, &pe, -1, 0, -1, 0);
if (fd == -1) {
perror("perf_event_open");
return -1;
}
// Start counting
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
// Your code here
perform_work();
// Stop counting
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
// Read result
long long count = 0;
if (read(fd, &count, sizeof(count)) != sizeof(count))
    perror("read");
printf("Cycles: %lld\n", count);
close(fd);
return 0;
}
Effective performance optimization follows a systematic approach:
Performance Optimization Process:
┌─────────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1. Baseline │ 2. Profile │ 3. Identify │ │
│ │ Measure │ Code │ Bottlenecks │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 4. Optimize │ 5. Measure │ 6. Validate │ │
│ │ Code │ Results │ Improvement │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 7. Iterate │ 8. Document │ 9. Monitor │ │
│ │ Process │ Changes │ Performance │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Performance counters help identify several common bottlenecks: high cache-miss rates, frequent branch mispredictions, memory and pipeline stalls, and TLB pressure. The examples below restructure code to address the first two.
// Cache-friendly (blocked) matrix multiplication
#define MIN(a, b) ((a) < (b) ? (a) : (b))
void cache_friendly_multiply(float* A, float* B, float* C, int n) {
    const int BLOCK_SIZE = 32; // tile size chosen to fit the working set in L1
    for (int i = 0; i < n; i += BLOCK_SIZE) {
        for (int j = 0; j < n; j += BLOCK_SIZE) {
            for (int k = 0; k < n; k += BLOCK_SIZE) {
                // Process one tile at a time to maximize cache reuse
                for (int ii = i; ii < MIN(i + BLOCK_SIZE, n); ii++) {
                    for (int jj = j; jj < MIN(j + BLOCK_SIZE, n); jj++) {
                        float sum = C[ii * n + jj];
                        for (int kk = k; kk < MIN(k + BLOCK_SIZE, n); kk++) {
                            sum += A[ii * n + kk] * B[kk * n + jj];
                        }
                        C[ii * n + jj] = sum;
                    }
                }
            }
        }
    }
}
#include <stdlib.h>
#include <string.h>
// Branch-friendly code organization
// (sort_positive/sort_negative stand in for any sort routine)
void branch_friendly_sort(int* array, int n) {
    // Separate positive and negative numbers
    int* positive = malloc(n * sizeof(int));
    int* negative = malloc(n * sizeof(int));
    if (!positive || !negative) { free(positive); free(negative); return; }
    int pos_count = 0, neg_count = 0;
    // First pass: separate numbers by sign
    for (int i = 0; i < n; i++) {
        if (array[i] >= 0) {
            positive[pos_count++] = array[i];
        } else {
            negative[neg_count++] = array[i];
        }
    }
    // Sort the positive and negative halves separately, so each sort
    // sees uniform data and its comparison branches predict well
    sort_positive(positive, pos_count);
    sort_negative(negative, neg_count);
    // Combine results
    memcpy(array, negative, neg_count * sizeof(int));
    memcpy(array + neg_count, positive, pos_count * sizeof(int));
    free(positive);
    free(negative);
}
Performance monitoring should not stop when development ends; it should continue in production, where real workloads expose behavior that benchmarks and test inputs miss.
Embedded systems add their own constraints: limited memory, CPU headroom, and I/O bandwidth make heavyweight profilers impractical, so the monitoring itself must be lightweight:
// Lightweight performance counter for embedded systems.
// get_system_time(), get_cycle_count(), get_instruction_count() and
// log_performance() are platform-provided hooks (BSP/HAL).
typedef struct {
    uint32_t start_time;
    uint32_t start_cycles;
    uint32_t start_instructions;
} lightweight_perf_t;
// Start lightweight measurement
void lightweight_perf_start(lightweight_perf_t* perf) {
perf->start_time = get_system_time();
perf->start_cycles = get_cycle_count();
perf->start_instructions = get_instruction_count();
}
// Stop lightweight measurement
void lightweight_perf_stop(lightweight_perf_t* perf) {
uint32_t end_time = get_system_time();
uint32_t end_cycles = get_cycle_count();
uint32_t end_instructions = get_instruction_count();
uint32_t time_us = end_time - perf->start_time;
uint32_t cycles = end_cycles - perf->start_cycles;
uint32_t instructions = end_instructions - perf->start_instructions;
// Log minimal performance data
log_performance(time_us, cycles, instructions);
}
Real-time systems additionally require non-intrusive performance monitoring: any instrumentation must have bounded, deterministic overhead so that measurement itself cannot cause a missed deadline.
This guide to performance counters provides the foundation for measuring and analyzing program performance. The concepts covered here are essential for embedded software engineers working on performance-critical applications and for understanding how software interacts with the underlying hardware.