The "Holy Bible" for embedded engineers
Understanding IEEE 754 and Floating-Point Arithmetic
Comprehensive coverage of floating-point representation, arithmetic operations, and precision considerations
Floating point is a method of representing real numbers in computers that allows for a wide range of values with varying precision. Unlike fixed-point representation, which uses a fixed number of digits before and after the decimal point, floating-point representation uses a scientific notation approach that can represent both very large and very small numbers efficiently.
The term “floating point” refers to the fact that the decimal point can “float” to different positions depending on the magnitude of the number being represented. This flexibility makes floating-point representation ideal for scientific computing, engineering applications, and any domain requiring a wide dynamic range of numerical values.
Floating-point representation embodies the principle of relative precision: the representable detail scales with a number's magnitude rather than being absolute. The comparison below shows how this differs from fixed-point representation:
Fixed Point Representation:
┌─────────────────────────────────────────────────────────────────┐
│ 16-bit Fixed Point (8.8 format) │
│ ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│ │ Integer │ Fraction│ Range: -128.000 to 127.996 │ │
│ │ Part │ Part │ Precision: 1/256 ≈ 0.004 │ │
│ │ (8 bits)│ (8 bits)│ │ │
│ └─────────┴─────────┴─────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Floating Point Representation:
┌─────────────────────────────────────────────────────────────────┐
│ 32-bit IEEE 754 Single Precision │
│ ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│ │ Sign │ Exponent│ Mantissa │ │
│ │ (1 bit) │ (8 bits)│ (23 bits) │ │
│ └─────────┴─────────┴─────────────────────────────────────────┘ │
│ Range: ±1.18 × 10^-38 to ±3.4 × 10^38 │
│ Precision: Variable (relative to magnitude) │
└─────────────────────────────────────────────────────────────────┘
Floating-point representation is based on scientific notation, where numbers are expressed as:
Number = Sign × Mantissa × Base^Exponent
Examples:
- 123.456 = +1.23456 × 10^2
- 0.00123 = +1.23 × 10^-3
- -456.789 = -4.56789 × 10^2
This representation allows the same number of significant digits to represent both very large and very small numbers efficiently.
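The same decomposition applies in base 2, which is what the hardware actually uses. As a rough illustration (the helper name show_decomposition is mine, not from any standard), the C functions frexpf and ldexpf expose a value's binary mantissa and exponent:

#include <math.h>
#include <stdio.h>

// Decompose a value into mantissa × 2^exponent and rebuild it
void show_decomposition(float value) {
    int exponent;
    float mantissa = frexpf(value, &exponent);    // mantissa in [0.5, 1.0)
    printf("%g = %g x 2^%d\n", value, mantissa, exponent);
    printf("rebuilt: %g\n", ldexpf(mantissa, exponent));
}

// Example: show_decomposition(123.456f) prints 123.456 = 0.9645 x 2^7 (approximately)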
The IEEE 754 standard defines the representation and behavior of floating-point numbers in computers. This standard ensures consistency across different hardware platforms and programming languages, making floating-point arithmetic predictable and portable.
The standard defines several floating-point formats:
IEEE 754 Single Precision (32-bit):
┌─────────────────────────────────────────────────────────────────┐
│ Bit Layout │
│ ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│ │ 31 │ 30-23 │ 22-0 │ │
│ │ Sign │ Exponent│ Mantissa │ │
│ │ (S) │ (E) │ (M) │ │
│ └─────────┴─────────┴─────────────────────────────────────────┘ │
│ │
│ Value Calculation: │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Normalized: (-1)^S × 1.M × 2^(E-127) │ │
│ │ Denormalized: (-1)^S × 0.M × 2^(-126) │ │
│ │ Special: ±∞, NaN │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
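These fields can be examined directly by copying a float's bits into an unsigned integer. The following is a minimal sketch of my own, not part of the standard:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

void dump_float_fields(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);               // reinterpret the 32 bits
    uint32_t sign     = bits >> 31;               // bit 31
    uint32_t exponent = (bits >> 23) & 0xFF;      // bits 30-23 (biased by 127)
    uint32_t mantissa = bits & 0x7FFFFF;          // bits 22-0
    printf("sign=%u exponent=%u mantissa=0x%06X\n",
           (unsigned)sign, (unsigned)exponent, (unsigned)mantissa);
}

// Example: dump_float_fields(1.0f) prints sign=0 exponent=127 mantissa=0x000000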
IEEE 754 Double Precision (64-bit):
┌─────────────────────────────────────────────────────────────────┐
│ Bit Layout │
│ ┌─────────┬─────────┬─────────────────────────────────────────┐ │
│ │ 63 │ 62-52 │ 51-0 │ │
│ │ Sign │ Exponent│ Mantissa │ │
│ │ (S) │ (E) │ (M) │ │
│ └─────────┴─────────┴─────────────────────────────────────────┘ │
│ │
│ Value Calculation: │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Normalized: (-1)^S × 1.M × 2^(E-1023) │ │
│ │ Denormalized: (-1)^S × 0.M × 2^(-1022) │ │
│ │ Special: ±∞, NaN │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
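The layout works the same way as single precision, only with wider fields and a bias of 1023. As a quick sanity check (my own example, not from the standard text), the bit pattern of 1.0 shows the bias directly:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    double d = 1.0;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    // Expected: 0x3FF0000000000000 (sign 0, exponent 0x3FF = 1023, mantissa 0)
    printf("0x%016llX\n", (unsigned long long)bits);
    return 0;
}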
IEEE 754 defines several special values:
Special Value Representations:
┌─────────────────────────────────────────────────────────────────┐
│ IEEE 754 Special Values │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Value │ Exponent │ Mantissa │ │
│ │ │ (E) │ (M) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ±0 │ E = 0 │ M = 0 │ │
│ │ ±∞ │ E = 255 │ M = 0 │ │
│ │ NaN │ E = 255 │ M ≠ 0 │ │
│ │ Denormal │ E = 0 │ M ≠ 0 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
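The <math.h> classification macros provide a portable way to detect these encodings without inspecting raw bits. A minimal sketch, with the helper name classify chosen for illustration:

#include <math.h>
#include <stdio.h>

void classify(float x) {
    if (isnan(x))                            printf("NaN\n");
    else if (isinf(x))                       printf("infinity\n");
    else if (x == 0.0f)                      printf("zero\n");
    else if (fpclassify(x) == FP_SUBNORMAL)  printf("denormal\n");
    else                                     printf("normal\n");
}

// Examples: classify(0.0f / 0.0f) -> NaN, classify(1.0f / 0.0f) -> infinity,
//           classify(1e-45f)      -> denormal (near the smallest single-precision value)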
Floating-point addition is more complex than integer addition due to the need to align decimal points:
Floating Point Addition Process:
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Align Decimal Points │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1.234 × 10^3 = 1234.0 │ │
│ │ 5.678 × 10^1 = 56.78 │ │
│ │ Align: 1234.0 + 56.78 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Step 2: Add Mantissas │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1234.0 + 56.78 = 1290.78 │ │
│ │ Result: 1.29078 × 10^3 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Step 3: Normalize Result │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1.29078 × 10^3 (already normalized) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
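The alignment step is where information can be lost: when the exponents differ by more than the mantissa width, the smaller operand's bits are shifted out entirely. A small demonstration with example values of my own choosing:

#include <stdio.h>

int main(void) {
    float big   = 16777216.0f;        // 2^24: the mantissa has no room for +1
    float small = 1.0f;
    float sum   = big + small;        // 1.0 is shifted out during alignment
    printf("%.1f\n", sum);            // prints 16777216.0, not 16777217.0
    return 0;
}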
Floating-point multiplication involves multiplying mantissas and adding exponents:
Floating Point Multiplication Process:
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Multiply Mantissas │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1.234 × 5.678 = 7.006652 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Step 2: Add Exponents │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 10^3 × 10^1 = 10^4 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Step 3: Normalize Result │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 7.006652 × 10^4 (already normalized) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
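The same multiply-mantissas/add-exponents structure can be mimicked with the standard frexpf and ldexpf functions; the example values below are chosen only for illustration:

#include <math.h>
#include <stdio.h>

int main(void) {
    float a = 6.0f, b = 20.0f;
    int ea, eb;
    float ma = frexpf(a, &ea);        // 0.75  × 2^3
    float mb = frexpf(b, &eb);        // 0.625 × 2^5
    // Multiply mantissas, add exponents, then renormalize via ldexpf
    float product = ldexpf(ma * mb, ea + eb);
    printf("%g\n", product);          // prints 120
    return 0;
}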
IEEE 754 defines several rounding modes:
Rounding Examples:
┌─────────────────────────────────────────────────────────────────┐
│ Round to Nearest, Ties to Even │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1.5 → 2.0 │ 2.5 → 2.0 │ 3.5 → 4.0 │ │
│ │ (ties to even) │ (ties to even) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Round Toward Positive │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ 1.1 → 2.0 │ 1.9 → 2.0 │ 2.0 → 2.0 │ │
│ │ (always up) │ (always up) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
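The active rounding mode can be changed at run time through <fenv.h>, and rint() rounds according to whichever mode is current. A minimal sketch (strictly, some compilers also want #pragma STDC FENV_ACCESS ON, and the program must be linked against the math library):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    // volatile keeps the compiler from folding rint() at compile time
    volatile double a = 2.5, b = 3.5, c = 1.1, d = 1.9;

    fesetround(FE_TONEAREST);                       // round to nearest, ties to even
    printf("%.1f %.1f\n", rint(a), rint(b));        // 2.0 4.0

    fesetround(FE_UPWARD);                          // round toward positive infinity
    printf("%.1f %.1f\n", rint(c), rint(d));        // 2.0 2.0

    fesetround(FE_TONEAREST);                       // restore the default mode
    return 0;
}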
The Floating Point Unit (FPU) is a specialized coprocessor that handles floating-point arithmetic operations. Modern processors often integrate the FPU into the main CPU, but the architectural concepts remain the same.
FPU Register Organization:
┌─────────────────────────────────────────────────────────────────┐
│ FPU Register Stack │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ST(0) │ ST(1) │ ST(2) │ │
│ │ (Top) │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ST(3) │ ST(4) │ ST(5) │ │
│ │ │ │ │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ST(6) │ ST(7) │ Control/Status │ │
│ │ │ │ Registers │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
FPU instructions fall into several categories: data transfer, arithmetic, comparison, and control. The fragments below show a typical arithmetic sequence on the x87 FPU and on the ARM VFP:
; x87 FPU example: calculate a*b + c
fld dword ptr [a] ; Load a onto FPU stack
fmul dword ptr [b] ; Multiply by b
fadd dword ptr [c] ; Add c
fstp dword ptr [result] ; Store result and pop stack
; ARM VFP example: calculate a*b + c
vldr s0, [r0] ; Load a into s0
vldr s1, [r1] ; Load b into s1
vmul.f32 s0, s0, s1 ; s0 = a * b
vldr s1, [r2] ; Load c into s1
vadd.f32 s0, s0, s1 ; s0 = s0 + c
vstr s0, [r3] ; Store result
// x86 SSE intrinsic example: dot product of two float arrays
// (assumes n is a multiple of 4)
#include <immintrin.h>

float vector_dot_product_sse(const float* a, const float* b, int n) {
    __m128 sum = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);      // unaligned loads avoid faults on
        __m128 vb = _mm_loadu_ps(&b[i]);      // data that is not 16-byte aligned
        __m128 product = _mm_mul_ps(va, vb);
        sum = _mm_add_ps(sum, product);       // four partial sums in parallel
    }
    // Horizontal sum of the four partial sums
    float result[4];
    _mm_storeu_ps(result, sum);
    return result[0] + result[1] + result[2] + result[3];
}
Precision and accuracy are related but distinct concepts in floating-point arithmetic: precision is the number of significant digits a format can hold, while accuracy is how close a computed result is to the true mathematical value. A result can be stored precisely and still be inaccurate if rounding errors have accumulated along the way.
Machine epsilon is the difference between 1.0 and the next larger representable floating-point number; it bounds the relative error introduced by rounding a single operation:
Machine Epsilon Calculation:
┌─────────────────────────────────────────────────────────────────┐
│ Single Precision (32-bit) │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ε = 2^(-23) ≈ 1.19 × 10^-7 │ │
│ │ (23 mantissa bits) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Double Precision (64-bit) │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ ε = 2^(-52) ≈ 2.22 × 10^-16 │ │
│ │ (52 mantissa bits) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
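C exposes these values as FLT_EPSILON and DBL_EPSILON in <float.h>. A quick check of the single-precision case:

#include <float.h>
#include <stdio.h>

int main(void) {
    printf("FLT_EPSILON = %g\n", FLT_EPSILON);    // ~1.19209e-07 (2^-23)
    printf("DBL_EPSILON = %g\n", DBL_EPSILON);    // ~2.22045e-16 (2^-52)

    // Half an epsilon added to 1.0f is lost to rounding; a full epsilon is not
    float one_plus_half = 1.0f + FLT_EPSILON / 2;
    float one_plus_eps  = 1.0f + FLT_EPSILON;
    printf("%d %d\n", one_plus_half == 1.0f, one_plus_eps == 1.0f);   // prints 1 0
    return 0;
}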
Several factors can lead to loss of precision in floating-point arithmetic:
Catastrophic Cancellation Example:
┌─────────────────────────────────────────────────────────────────┐
│ Problem: Calculate √(1 + x) - 1 for small x │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ Direct calculation: │ │
│ │ √(1 + 10^-8) - 1 ≈ 0.000000005000000000 │ │
│ │ (loses precision) │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
│ │
│ Better approach: Use Taylor series │ │
│ ┌─────────────┬─────────────┬─────────────────────────────────┐ │
│ │ √(1 + x) - 1 ≈ x/2 - x²/8 + x³/16 - ... │ │
│ │ For small x: x/2 is a good approximation │ │
│ │ Result: 5.000000000000000 × 10^-9 │ │
│ └─────────────┴─────────────┴─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
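In single precision the effect is dramatic: for x = 10^-8, the sum 1 + x rounds back to exactly 1, so the naive formula returns zero, while an algebraically equivalent rearrangement (an alternative to the Taylor-series fix above) keeps the leading digits:

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = 1e-8f;

    // Naive: 1.0f + x rounds to 1.0f, so the subtraction cancels everything
    float naive  = sqrtf(1.0f + x) - 1.0f;

    // Rewritten to avoid subtracting nearly equal values
    float stable = x / (sqrtf(1.0f + x) + 1.0f);

    printf("naive  = %g\n", naive);    // prints 0
    printf("stable = %g\n", stable);   // prints ~5e-09
    return 0;
}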
Embedded systems often have specific requirements for floating-point operations. On targets that lack a hardware FPU, or that have strict performance or power budgets, fixed-point arithmetic may be preferable to software-emulated floating point:
// Fixed-point arithmetic example (Q16.16 format)
#include <stdint.h>

typedef int32_t fixed_point_t;

#define FIXED_POINT_FRACTIONAL_BITS 16
#define FIXED_POINT_SCALE (1 << FIXED_POINT_FRACTIONAL_BITS)

// Convert float to fixed-point (truncates toward zero)
fixed_point_t float_to_fixed(float f) {
    return (fixed_point_t)(f * FIXED_POINT_SCALE);
}

// Convert fixed-point to float
float fixed_to_float(fixed_point_t f) {
    return (float)f / FIXED_POINT_SCALE;
}

// Fixed-point multiplication: widen to 64 bits to avoid overflow,
// then shift the extra fractional bits back out
fixed_point_t fixed_multiply(fixed_point_t a, fixed_point_t b) {
    int64_t result = (int64_t)a * b;
    return (fixed_point_t)(result >> FIXED_POINT_FRACTIONAL_BITS);
}
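A short usage sketch of the helpers above, with example values chosen for illustration:

// 3.5 * 2.25 = 7.875 in Q16.16
fixed_point_t a = float_to_fixed(3.5f);
fixed_point_t b = float_to_fixed(2.25f);
fixed_point_t p = fixed_multiply(a, b);
// fixed_to_float(p) == 7.875f (exact, since all values fit the format)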
ARM NEON provides efficient floating-point operations for embedded systems:
// ARM NEON example: element-wise multiply of two float arrays
// (assumes n is a multiple of 4)
#include <arm_neon.h>

void vector_float_multiply_neon(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);    // load 4 floats from a
        float32x4_t vb = vld1q_f32(&b[i]);    // load 4 floats from b
        float32x4_t vc = vmulq_f32(va, vb);   // 4 multiplications in one instruction
        vst1q_f32(&c[i], vc);                 // store 4 results
    }
}
Floating-point numbers should rarely be tested for exact equality due to rounding errors:
#include <math.h>   // fabs, fmax

// Wrong way to test floating-point equality
if (a == b) { /* ... */ }

// Better approach: test with an absolute tolerance
#define EPSILON 1e-6
if (fabs(a - b) < EPSILON) { /* ... */ }

// Best approach: use a relative tolerance that scales with the operands
#define RELATIVE_TOLERANCE 1e-6
if (fabs(a - b) <= RELATIVE_TOLERANCE * fmax(fabs(a), fabs(b))) {
    /* ... */
}
Accumulating many floating-point numbers can lead to significant errors:
// Problematic: accumulating many small numbers
float sum = 0.0f;
for (int i = 0; i < 1000000; i++) {
    sum += 0.1f;   // each addition rounds; the errors accumulate
}
// Result drifts noticeably away from 100000.0

// Better: use compensated summation (Kahan algorithm)
// Note: aggressive optimizations such as -ffast-math may reorder these
// operations and defeat the compensation
float kahan_sum(const float* values, int n) {
    float sum = 0.0f;
    float c = 0.0f;                  // running compensation for lost low-order bits
    for (int i = 0; i < n; i++) {
        float y = values[i] - c;     // apply the correction from the previous step
        float t = sum + y;           // add; low-order bits of y may be lost here
        c = (t - sum) - y;           // recover what was lost
        sum = t;
    }
    return sum;
}
Floating-point operations can overflow to infinity or underflow to zero:
// Check for overflow and underflow after the operation
#include <math.h>
#include <float.h>

float safe_multiply(float a, float b) {
    float result = a * b;

    // Overflow: finite inputs, but the product rounded to infinity
    if (isinf(result) && isfinite(a) && isfinite(b)) {
        // handle overflow here (saturate, set an error flag, ...)
    }

    // Underflow: nonzero inputs, but the product rounded all the way to zero
    if (result == 0.0f && a != 0.0f && b != 0.0f) {
        // handle underflow here
    }

    return result;
}
Floating-point operations have different performance characteristics from integer operations: addition and multiplication are typically pipelined and fast, while division and square root can take many times longer. Several strategies can improve floating-point performance:
// Enable automatic vectorization
// (the OpenMP simd pragma needs e.g. -fopenmp or -fopenmp-simd on GCC/Clang)
void vectorized_operation(const float* a, const float* b, float* c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i] + 1.0f;
    }
}

// Optimize memory access patterns: the i-k-j loop order walks B and C
// row by row, so the inner loop stays sequential and cache-friendly
void optimized_matrix_multiply(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            C[i * n + j] = 0.0f;
        }
        for (int k = 0; k < n; k++) {
            float a_ik = A[i * n + k];
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a_ik * B[k * n + j];
            }
        }
    }
}
Modern processors support fused multiply-add (FMA) operations that compute a × b + c in a single instruction with a single rounding, improving both speed and accuracy:
// Use fused multiply-add when available (fmaf is declared in <math.h>)
#ifdef __FMA__
    result = fmaf(a, b, c);   // fused: a * b + c with a single rounding
#else
    result = a * b + c;       // separate multiply and add (two roundings)
#endif
This comprehensive guide to floating point provides the foundation for understanding how computers represent and manipulate real numbers. The concepts covered here are essential for embedded software engineers working with numerical computations, where precision, accuracy, and performance must all be balanced.