
Floating Point

Understanding IEEE 754 and Floating-Point Arithmetic
Comprehensive coverage of floating-point representation, arithmetic operations, and precision considerations




🏗️ Floating Point Fundamentals

What is Floating Point?

Floating point is a method of representing real numbers in computers that allows for a wide range of values with varying precision. Unlike fixed-point representation, which uses a fixed number of digits before and after the decimal point, floating-point representation uses a scientific notation approach that can represent both very large and very small numbers efficiently.

The term “floating point” refers to the fact that the decimal point can “float” to different positions depending on the magnitude of the number being represented. This flexibility makes floating-point representation ideal for scientific computing, engineering applications, and any domain requiring a wide dynamic range of numerical values.

Floating Point Philosophy

Floating-point representation embodies the principle of relative precision, where the accuracy of a number is relative to its magnitude rather than absolute. This approach provides several key benefits:

  1. Wide Dynamic Range: Can represent numbers from very small to very large
  2. Efficient Storage: Uses storage space efficiently across different magnitude ranges
  3. Scientific Notation: Natural representation for scientific and engineering calculations
  4. Hardware Efficiency: Optimized for arithmetic operations in modern processors

Floating Point vs. Fixed Point

Fixed Point Representation:
┌─────────────────────────────────────────────────────────────────┐
│  16-bit Fixed Point (8.8 format)                               │
│  ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│  │ Integer │ Fraction│  Range: -128.000 to +127.996            │ │
│  │  Part   │  Part   │  Precision: 1/256 ≈ 0.004               │ │
│  │ (8 bits)│ (8 bits)│                                         │ │
│  └─────────┴─────────┴─────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Floating Point Representation:
┌─────────────────────────────────────────────────────────────────┐
│  32-bit IEEE 754 Single Precision                              │
│  ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│  │ Sign    │ Exponent│  Mantissa                               │ │
│  │ (1 bit) │ (8 bits)│  (23 bits)                              │ │
│  └─────────┴─────────┴─────────────────────────────────────────┘ │
│  Range: ±1.18 × 10^-38 to ±3.4 × 10^38                        │
│  Precision: Variable (relative to magnitude)                    │
└─────────────────────────────────────────────────────────────────┘

Scientific Notation Basis

Floating-point representation is based on scientific notation, where numbers are expressed as:

Number = Sign × Mantissa × Base^Exponent

Examples:
- 123.456 = +1.23456 × 10^2
- 0.00123 = +1.23 × 10^-3
- -456.789 = -4.56789 × 10^2

This representation allows the same number of significant digits to cover both very large and very small numbers. IEEE 754 uses base 2 rather than base 10, so the same idea looks like 6.25 = 1.5625 × 2^2 = 1.1001₂ × 2^2.


📊 IEEE 754 Standard

IEEE 754 Fundamentals

The IEEE 754 standard defines the representation and behavior of floating-point numbers in computers. This standard ensures consistency across different hardware platforms and programming languages, making floating-point arithmetic predictable and portable.

The standard defines several floating-point formats:

  1. Single Precision (32-bit): 1 sign bit, 8 exponent bits, 23 mantissa bits
  2. Double Precision (64-bit): 1 sign bit, 11 exponent bits, 52 mantissa bits
  3. Extended Precision (80-bit): 1 sign bit, 15 exponent bits, 64 mantissa bits
  4. Half Precision (16-bit): 1 sign bit, 5 exponent bits, 10 mantissa bits

IEEE 754 Single Precision Format

IEEE 754 Single Precision (32-bit):
┌─────────────────────────────────────────────────────────────────┐
│  Bit Layout                                                     │
│  ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│  │ 31      │ 30-23   │  22-0                                    │ │
│  │ Sign    │ Exponent│  Mantissa                                │ │
│  │ (S)     │ (E)     │  (M)                                     │ │
│  └─────────┴─────────┴─────────────────────────────────────────┘ │
│                                                                 │
│  Value Calculation:                                             │ │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Normalized: (-1)^S × 1.M × 2^(E-127)                      │ │
│  │  Denormalized: (-1)^S × 0.M × 2^(-126)                     │ │
│  │  Special: ±∞, NaN                                           │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
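
As a quick illustration, here is a minimal C sketch (the helper name decode_float is illustrative) that extracts the three fields by copying the float's bits into a uint32_t:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: print the sign, exponent, and mantissa fields of a float */
void decode_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the 32 bits safely */

    unsigned sign     = (bits >> 31) & 0x1;   /* bit 31 */
    unsigned exponent = (bits >> 23) & 0xFF;  /* bits 30-23, biased by 127 */
    unsigned mantissa = bits & 0x7FFFFF;      /* bits 22-0 */

    printf("%g: sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           (double)f, sign, exponent, (int)exponent - 127, mantissa);
}

int main(void) {
    decode_float(1.0f);    /* sign=0, exponent=127 (unbiased 0), mantissa=0 */
    decode_float(-6.25f);  /* sign=1, exponent=129 (unbiased 2), mantissa=0x480000 */
    return 0;
}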

IEEE 754 Double Precision Format

IEEE 754 Double Precision (64-bit):
┌─────────────────────────────────────────────────────────────────┐
│  Bit Layout                                                     │
│  ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│  │ 63      │ 62-52   │  51-0                                    │ │
│  │ Sign    │ Exponent│  Mantissa                                │ │
│  │ (S)     │ (E)     │  (M)                                     │ │
│  └─────────┴─────────┴─────────────────────────────────────────┘ │
│                                                                 │
│  Value Calculation:                                             │ │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │  Normalized: (-1)^S × 1.M × 2^(E-1023)                     │ │
│  │  Denormalized: (-1)^S × 0.M × 2^(-1022)                    │ │
│  │  Special: ±∞, NaN                                           │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Special Values

IEEE 754 defines several special values:

  1. Zero: Both +0 and -0 are represented
  2. Infinity: ±∞ for overflow conditions
  3. NaN (Not a Number): For invalid operations
  4. Denormalized (Subnormal) Numbers: Very small numbers below the normal range

Special Value Representations (single precision, so E = 255 means all exponent bits set):
┌─────────────────────────────────────────────────────────────────┐
│  IEEE 754 Special Values                                       │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Value       │ Exponent    │  Mantissa                       │ │
│  │             │ (E)         │  (M)                            │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ±0          │ E = 0       │  M = 0                          │ │
│  │ ±∞          │ E = 255     │  M = 0                          │ │
│  │ NaN         │ E = 255     │  M ≠ 0                          │ │
│  │ Denormal    │ E = 0       │  M ≠ 0                          │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
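
These categories can be detected portably from C using fpclassify() from <math.h>; the classify() wrapper below is an illustrative sketch:

#include <stdio.h>
#include <math.h>
#include <float.h>

/* Map a value onto the IEEE 754 categories from the table above */
static const char *classify(float x) {
    switch (fpclassify(x)) {
        case FP_ZERO:      return "zero";
        case FP_INFINITE:  return "infinity";
        case FP_NAN:       return "NaN";
        case FP_SUBNORMAL: return "denormal (subnormal)";
        default:           return "normal";
    }
}

int main(void) {
    float zero = 0.0f;
    printf("%s\n", classify(0.0f));            /* zero */
    printf("%s\n", classify(1.0f / zero));     /* infinity (division by zero) */
    printf("%s\n", classify(zero / zero));     /* NaN (invalid operation) */
    printf("%s\n", classify(FLT_MIN / 4.0f));  /* denormal */
    printf("%s\n", classify(1.5f));            /* normal */
    return 0;
}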

🔢 Floating Point Arithmetic

Floating Point Addition

Floating-point addition is more complex than integer addition because the exponents must first be aligned; the mantissa of the operand with the smaller exponent is shifted right until both operands share the same exponent before the mantissas can be added:

Floating Point Addition Process:
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Align Decimal Points                                  │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1.234 × 10^3 = 1234.0                                      │ │
│  │ 5.678 × 10^1 = 56.78                                        │ │
│  │ Align: 1234.0 + 56.78                                       │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Step 2: Add Mantissas                                         │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1234.0 + 56.78 = 1290.78                                   │ │
│  │ Result: 1.29078 × 10^3                                      │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Step 3: Normalize Result                                       │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1.29078 × 10^3 (already normalized)                         │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
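
The alignment step is where small addends can disappear entirely: if the exponents differ by more than the mantissa width, the smaller operand is shifted out completely. A small C sketch of that effect:

#include <stdio.h>

int main(void) {
    /* 2^24 = 16777216 is where float stops representing consecutive integers:
     * aligning exponents shifts the 1.0 completely out of the 23-bit mantissa,
     * so it is lost in the rounding step. */
    float big = 16777216.0f;   /* 2^24 */
    float sum = big + 1.0f;

    printf("%.1f\n", (double)sum);   /* prints 16777216.0 */
    printf("%d\n", sum == big);      /* prints 1: the 1.0 vanished */
    return 0;
}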

Floating Point Multiplication

Floating-point multiplication involves multiplying mantissas and adding exponents:

Floating Point Multiplication Process:
┌─────────────────────────────────────────────────────────────────┐
│  Step 1: Multiply Mantissas                                    │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1.234 × 5.678 = 7.006652                                   │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Step 2: Add Exponents                                         │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 10^3 × 10^1 = 10^4                                         │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Step 3: Normalize Result                                       │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 7.006652 × 10^4 (already normalized)                        │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
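
The decompose / multiply-mantissas / add-exponents / recombine steps can be mimicked in C with frexp() and ldexp() from <math.h>; this is only an illustration of the process, not how the hardware implements it:

#include <stdio.h>
#include <math.h>

int main(void) {
    double a = 1234.0, b = 56.78;

    /* Decompose each operand into a mantissa in [0.5, 1) and a base-2 exponent */
    int ea, eb;
    double ma = frexp(a, &ea);   /* a = ma * 2^ea */
    double mb = frexp(b, &eb);   /* b = mb * 2^eb */

    /* Multiply mantissas, add exponents, then recombine (ldexp renormalizes) */
    double product = ldexp(ma * mb, ea + eb);

    printf("%g * %g = %g (direct: %g)\n", a, b, product, a * b);
    return 0;
}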

Rounding Modes

IEEE 754 defines several rounding modes:

  1. Round to Nearest, Ties to Even: Default rounding mode
  2. Round Toward Positive: Always round up (toward +∞)
  3. Round Toward Negative: Always round down (toward -∞)
  4. Round Toward Zero: Truncate toward zero

Rounding Examples:
┌─────────────────────────────────────────────────────────────────┐
│  Round to Nearest, Ties to Even                                │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1.5 → 2.0   │ 2.5 → 2.0   │  3.5 → 4.0                     │ │
│  │ (ties to even)              │  (ties to even)                │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Round Toward Positive                                          │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ 1.1 → 2.0   │ 1.9 → 2.0   │  2.0 → 2.0                     │ │
│  │ (always up)                 │  (always up)                   │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
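
On hosted C99 implementations the rounding mode can be changed at run time through <fenv.h>; a short sketch (support for FENV_ACCESS and the non-default modes varies by compiler and target, and rint() honors the current mode):

#include <stdio.h>
#include <fenv.h>
#include <math.h>

/* ISO C requires "#pragma STDC FENV_ACCESS ON" before changing the rounding
 * mode; some compilers do not implement the pragma and only warn about it. */

static void show(double x) {
    fesetround(FE_TONEAREST);  printf("nearest(%4.1f)  = %g\n", x, rint(x));
    fesetround(FE_UPWARD);     printf("upward(%4.1f)   = %g\n", x, rint(x));
    fesetround(FE_DOWNWARD);   printf("downward(%4.1f) = %g\n", x, rint(x));
    fesetround(FE_TOWARDZERO); printf("to-zero(%4.1f)  = %g\n", x, rint(x));
    fesetround(FE_TONEAREST);  /* restore the default mode */
}

int main(void) {
    show(2.5);   /* nearest: 2 (tie to even), up: 3, down: 2, to-zero: 2 */
    show(-1.5);  /* nearest: -2 (tie to even), up: -1, down: -2, to-zero: -1 */
    return 0;
}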

⚙️ FPU Architecture and Programming

FPU Fundamentals

The Floating Point Unit (FPU) is a specialized coprocessor that handles floating-point arithmetic operations. Modern processors integrate the FPU into the main CPU, but the architectural concepts remain the same. A classic example is the x86 x87 FPU, which organizes its eight 80-bit registers as a stack, ST(0) through ST(7), as shown below.

FPU Register Organization

FPU Register Organization:
┌─────────────────────────────────────────────────────────────────┐
│  FPU Register Stack                                             │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ST(0)       │ ST(1)       │  ST(2)                         │ │
│  │ (Top)       │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ST(3)       │ ST(4)       │  ST(5)                         │ │
│  │             │             │                                 │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ST(6)       │ ST(7)       │  Control/Status                │ │
│  │             │             │  Registers                      │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

FPU Instruction Categories

FPU instructions fall into several categories:

  1. Data Transfer: Load/store operations
  2. Arithmetic: Addition, subtraction, multiplication, division
  3. Comparison: Compare operations
  4. Transcendental: Trigonometric, logarithmic, exponential functions
  5. Control: Status and control register operations

FPU Programming Examples

x86 FPU Assembly

; x87 FPU example: calculate a*b + c
fld dword ptr [a]      ; Load a onto FPU stack
fmul dword ptr [b]     ; Multiply by b
fadd dword ptr [c]     ; Add c
fstp dword ptr [result]; Store result and pop stack

ARM VFP Assembly

; ARM VFP example: calculate a*b + c
vldr s0, [r0]          ; Load a into s0
vldr s1, [r1]          ; Load b into s1
vmul.f32 s0, s0, s1    ; s0 = a * b
vldr s1, [r2]          ; Load c into s1
vadd.f32 s0, s0, s1    ; s0 = s0 + c
vstr s0, [r3]          ; Store result

C/C++ with Intrinsics

// x86 SSE intrinsic example
// Assumes: n is a multiple of 4 and a/b are 16-byte aligned
// (required by _mm_load_ps; use _mm_loadu_ps for unaligned data)
#include <immintrin.h>

float vector_dot_product_sse(float* a, float* b, int n) {
    __m128 sum = _mm_setzero_ps();
    
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);       // Load 4 floats from a
        __m128 vb = _mm_load_ps(&b[i]);       // Load 4 floats from b
        __m128 product = _mm_mul_ps(va, vb);  // Element-wise multiply
        sum = _mm_add_ps(sum, product);       // Accumulate 4 partial sums
    }
    
    // Horizontal sum of the 4 partial sums
    float result[4];
    _mm_storeu_ps(result, sum);   // Unaligned store: a stack array is not guaranteed 16-byte aligned
    return result[0] + result[1] + result[2] + result[3];
}

🎯 Precision and Accuracy

Precision vs. Accuracy

Precision and accuracy are related but distinct concepts in floating-point arithmetic. Precision is how finely values can be represented: the number of significant bits the format carries (23 mantissa bits for single precision, 52 for double). Accuracy is how close a computed result is to the true mathematical value; rounding, cancellation, and error accumulation can make a result far less accurate than the format's precision would suggest.

Machine Epsilon

Machine epsilon is the gap between 1.0 and the next larger representable floating-point number; it bounds the relative error introduced by a single correctly rounded operation:

Machine Epsilon Calculation:
┌─────────────────────────────────────────────────────────────────┐
│  Single Precision (32-bit)                                     │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ε = 2^(-23) ≈ 1.19 × 10^-7                                 │ │
│  │ (23 mantissa bits)                                          │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Double Precision (64-bit)                                      │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ ε = 2^(-52) ≈ 2.22 × 10^-16                                │ │
│  │ (52 mantissa bits)                                          │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
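
The standard C headers expose these values as FLT_EPSILON and DBL_EPSILON; the sketch below also shows that adding anything smaller than half an epsilon to 1.0 is rounded away:

#include <stdio.h>
#include <float.h>

int main(void) {
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);   /* 2^-23 ~ 1.19e-07 */
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);   /* 2^-52 ~ 2.22e-16 */

    /* Anything smaller than epsilon/2 added to 1.0 is lost to rounding */
    float f = 1.0f + FLT_EPSILON / 2.0f;
    printf("1.0f + FLT_EPSILON/2 == 1.0f ? %d\n", f == 1.0f);  /* prints 1 */

    float g = 1.0f + FLT_EPSILON;
    printf("1.0f + FLT_EPSILON   == 1.0f ? %d\n", g == 1.0f);  /* prints 0 */
    return 0;
}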

Loss of Precision

Several factors can lead to loss of precision in floating-point arithmetic:

  1. Subtraction of Nearly Equal Numbers: Can result in catastrophic cancellation
  2. Accumulation of Rounding Errors: Errors accumulate in iterative calculations
  3. Order of Operations: Different evaluation orders can produce different results

Catastrophic Cancellation Example:
┌─────────────────────────────────────────────────────────────────┐
│  Problem: Calculate √(1 + x) - 1 for small x                   │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ Direct calculation:                                        │ │
│  │ In single precision, 1 + 10^-8 rounds to exactly 1,        │ │
│  │ so the result is 0; in double precision about half of      │ │
│  │ the 16 significant digits are lost                          │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
│                                                                 │
│  Better approach: Use Taylor series                            │ │
│  ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│  │ √(1 + x) - 1 ≈ x/2 - x²/8 + x³/16 - ...                   │ │
│  │ For small x: x/2 is a good approximation                   │ │
│  │ Result: x/2 = 5.0 × 10^-9 (relative error only ~2.5×10^-9) │ │
│  └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
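
Another standard fix is to rewrite the expression in an algebraically equivalent form that never subtracts nearly equal values, such as x / (√(1+x) + 1). A small C sketch comparing the two forms in single precision:

#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 1e-8f;

    /* Naive form: 1.0f + x rounds to 1.0f, so the subtraction gives exactly 0 */
    float naive = sqrtf(1.0f + x) - 1.0f;

    /* Rearranged form is algebraically identical but never subtracts
     * nearly equal numbers, so the precision of x is preserved */
    float stable = x / (sqrtf(1.0f + x) + 1.0f);

    printf("naive  = %g\n", naive);    /* 0 */
    printf("stable = %g\n", stable);   /* ~5e-09 */
    return 0;
}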

🔧 Floating Point in Embedded Systems

Embedded System Considerations

Embedded systems often have specific requirements for floating-point operations:

  1. Performance: Floating-point operations can be slower than integer operations
  2. Power Consumption: FPU operations consume more power
  3. Memory Usage: Floating-point data types use more memory
  4. Real-time Constraints: Predictable execution time requirements

Fixed-Point Alternatives

For embedded systems with strict performance or power requirements, fixed-point arithmetic may be preferable:

// Fixed-point arithmetic example (Q16.16 format)
#include <stdint.h>

typedef int32_t fixed_point_t;
#define FIXED_POINT_FRACTIONAL_BITS 16
#define FIXED_POINT_SCALE (1 << FIXED_POINT_FRACTIONAL_BITS)

// Convert float to fixed-point (truncates toward zero; add +/-0.5 before
// the cast if round-to-nearest is required)
fixed_point_t float_to_fixed(float f) {
    return (fixed_point_t)(f * FIXED_POINT_SCALE);
}

// Convert fixed-point to float
float fixed_to_float(fixed_point_t f) {
    return (float)f / FIXED_POINT_SCALE;
}

// Fixed-point multiplication: widen to 64 bits so the intermediate product
// does not overflow, then shift back to Q16.16
fixed_point_t fixed_multiply(fixed_point_t a, fixed_point_t b) {
    int64_t result = (int64_t)a * b;
    return (fixed_point_t)(result >> FIXED_POINT_FRACTIONAL_BITS);
}
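
A quick usage sketch of the helpers above (the operands are chosen so the Q16.16 values and the product are exact):

// Example: 3.5 * 2.25 in Q16.16
void fixed_point_demo(void) {
    fixed_point_t a = float_to_fixed(3.5f);   // 3.5  * 65536 = 229376
    fixed_point_t b = float_to_fixed(2.25f);  // 2.25 * 65536 = 147456
    fixed_point_t p = fixed_multiply(a, b);   // 7.875 * 65536 = 516096
    float back = fixed_to_float(p);           // back == 7.875f
    (void)back;
}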

ARM NEON Floating Point

ARM NEON provides efficient floating-point operations for embedded systems:

#include <arm_neon.h>

// Assumes n is a multiple of 4; process any remainder with a scalar tail loop
void vector_float_multiply_neon(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        float32x4_t vc = vmulq_f32(va, vb);
        vst1q_f32(&c[i], vc);
    }
}

⚠️ Common Pitfalls and Solutions

Equality Testing

Floating-point numbers should rarely be tested for exact equality due to rounding errors:

// Wrong way to test floating-point equality
if (a == b) { /* ... */ }

// Better approach: test with tolerance
#define EPSILON 1e-6
if (fabs(a - b) < EPSILON) { /* ... */ }

// Best approach: use relative tolerance
#define RELATIVE_TOLERANCE 1e-6
if (fabs(a - b) <= RELATIVE_TOLERANCE * fmax(fabs(a), fabs(b))) {
    /* ... */
}
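
In practice the absolute and relative checks are often combined, because a pure relative tolerance rejects values that are both very close to zero. A sketch (the name nearly_equal and its parameters are illustrative):

#include <math.h>
#include <stdbool.h>

// Combined check: absolute tolerance near zero, relative tolerance elsewhere
bool nearly_equal(float a, float b, float abs_tol, float rel_tol) {
    float diff = fabsf(a - b);
    if (diff <= abs_tol) {
        return true;   // handles values that are both near zero
    }
    return diff <= rel_tol * fmaxf(fabsf(a), fabsf(b));
}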

Accumulation Errors

Accumulating many floating-point numbers can lead to significant errors:

// Problematic: accumulating many small numbers
float sum = 0.0f;
for (int i = 0; i < 1000000; i++) {
    sum += 0.1f;  // Accumulates rounding errors
}
// Result may not be exactly 100000.0

// Better: use compensated summation (Kahan algorithm)
// Note: compile without -ffast-math, which may reassociate the expressions
// below and optimize the compensation away
float kahan_sum(float* values, int n) {
    float sum = 0.0f;
    float c = 0.0f;  // Running compensation for lost low-order bits
    
    for (int i = 0; i < n; i++) {
        float y = values[i] - c;
        float t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    
    return sum;
}

Overflow and Underflow

Floating-point operations can overflow to infinity or underflow to zero:

// Check for overflow and underflow before multiplying
// (a conservative heuristic: it rejects some products that would be safe)
#include <math.h>
#include <float.h>

float safe_multiply(float a, float b) {
    if (fabsf(a) > sqrtf(FLT_MAX) || fabsf(b) > sqrtf(FLT_MAX)) {
        // Potential overflow: return infinity with the sign of the product
        return (a < 0.0f) != (b < 0.0f) ? -INFINITY : INFINITY;
    }
    
    if (fabsf(a) < sqrtf(FLT_MIN) && fabsf(b) < sqrtf(FLT_MIN)) {
        // Potential underflow: the product would be subnormal or zero
        return 0.0f;
    }
    
    return a * b;
}
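
On targets with full IEEE 754 semantics an alternative is simply to perform the multiplication and inspect the result afterwards; a sketch (the name checked_multiply is illustrative):

#include <math.h>

// Let IEEE 754 saturate to infinity or flush toward zero, then report it
float checked_multiply(float a, float b, int *overflow, int *underflow) {
    float r = a * b;
    *overflow  = isinf(r) && !isinf(a) && !isinf(b);         // result overflowed
    *underflow = (r == 0.0f) && (a != 0.0f) && (b != 0.0f);  // result underflowed
    return r;
}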

Performance Considerations

FPU Performance Characteristics

Floating-point operations have different performance characteristics than integer operations:

  1. Latency: FPU operations typically have higher latency
  2. Throughput: Modern processors can often overlap FPU operations
  3. Memory Bandwidth: Floating-point data requires more memory bandwidth
  4. Cache Efficiency: Floating-point data may have different cache behavior

Optimization Strategies

Several strategies can improve floating-point performance:

Loop Vectorization

// Request explicit vectorization (compile with -fopenmp-simd or -fopenmp so
// the pragma takes effect; most compilers also auto-vectorize at -O2/-O3)
void vectorized_operation(float* a, float* b, float* c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i] + 1.0f;
    }
}

Memory Access Optimization

// Optimize memory access patterns: the i-k-j loop order walks both B and C
// row by row, which is far more cache friendly than the naive i-j-k order
// (which strides through B column by column)
void optimized_matrix_multiply(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i * n + j] = 0.0f;
        }
        for (int k = 0; k < n; k++) {
            float a_ik = A[i * n + k];
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a_ik * B[k * n + j];
            }
        }
    }
}

Fused Multiply-Add

Modern processors support fused multiply-add (FMA) operations that compute a * b + c in a single instruction with a single rounding step, improving both speed and accuracy:

// Use fused multiply-add when available
#ifdef __FMA__
    result = fma(a, b, c);  // Single instruction, rounded once: a * b + c
#else
    result = a * b + c;     // Separate multiply and add, rounded twice
#endif


This comprehensive guide to Floating Point provides the foundation for understanding how computers represent and manipulate real numbers. The concepts covered here are essential for embedded software engineers working with numerical computations and understanding precision and accuracy in floating-point arithmetic.