The Embedded New Testament

The "Holy Bible" for embedded engineers


Project maintained by theEmbeddedGeorge

🐕 Watchdog Timers

System Monitoring and Recovery Mechanisms for Reliable Embedded Systems
Learn to implement watchdog timers for system health monitoring, fault detection, and automatic recovery



🎯 Overview

Watchdog timers are essential safety mechanisms that monitor system health and automatically reset the system if it becomes unresponsive or enters an error state. They are critical for reliable embedded systems, especially in safety-critical applications.


🚀 Quick Reference: Key Facts


🔍 Visual Understanding

Watchdog Timer Operation Flow

System Running → Health Check → Feed Watchdog → Continue Operation
     ↓              ↓              ↓              ↓
  Normal      Monitor Tasks    Reset Timer    System OK
Operation     Check Memory    Extend Life    Continue

     ↓              ↓              ↓              ↓
  Fault        Health Check     No Feed      Timeout
Detected        Fails          Watchdog      Reset
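
The flow above can be tried out on a host machine before touching hardware. The following is a minimal sketch of a down-counter model, not a real driver: the `sim_watchdog_*` names and the tick-based timing are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical host-side model of a watchdog down-counter (not a real driver).
typedef struct {
    uint32_t reload;     // Value loaded into the counter on every feed
    uint32_t counter;    // Remaining ticks before a reset is triggered
    bool reset_fired;    // Latched once the counter reaches zero
} sim_watchdog_t;

// Start the model with a full counter.
void sim_watchdog_init(sim_watchdog_t *wd, uint32_t reload) {
    wd->reload = reload;
    wd->counter = reload;
    wd->reset_fired = false;
}

// Feeding restores the full timeout, unless the reset already happened.
void sim_watchdog_feed(sim_watchdog_t *wd) {
    if (!wd->reset_fired) {
        wd->counter = wd->reload;
    }
}

// Advance time by one tick; returns true once the watchdog has fired.
bool sim_watchdog_tick(sim_watchdog_t *wd) {
    if (wd->reset_fired) {
        return true;
    }
    if (wd->counter > 0U) {
        wd->counter--;
    }
    if (wd->counter == 0U) {
        wd->reset_fired = true;
    }
    return wd->reset_fired;
}
```

Calling `sim_watchdog_feed` from a simulated health check and `sim_watchdog_tick` from a simulated clock reproduces both branches of the diagram: regular feeds keep the system running, and a missed feed ends in a latched reset.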

Watchdog Types Comparison

Hardware Watchdog:    Dedicated Clock → Independent Timer → Reset Circuit
                            ↓              ↓              ↓
                      Independent    Always Running    Reliable Reset
                      of CPU        Even if CPU      Mechanism
                      State         Crashes

Software Watchdog:    System Task → Monitor Other → Trigger Recovery
                            ↓              ↓              ↓
                      Part of OS    Can Fail with    Software-Based
                      Software      System           Recovery

Recovery Strategy Hierarchy

Fault Detected → Assess Severity → Choose Recovery → Execute Recovery
     ↓              ↓              ↓              ↓
  System Error   Minor Fault    Restart Task    Continue
  Major Fault    Major Fault    Restart App     Operation
  Critical Fault Critical Fault System Reset    Full Recovery

🧠 Conceptual Foundation

The Watchdog as a Safety Net

Watchdog timers embody a fundamental safety principle in embedded systems: fail-safe operation. Rather than letting a system fail silently or hang, a watchdog detects the failure and triggers recovery automatically.

Why Watchdog Timers Matter

Watchdog timers are critical because embedded systems operate in unpredictable environments where failures are inevitable. A properly implemented watchdog lets the system recover from those failures without manual intervention.

The Watchdog Design Challenge

Designing watchdog systems involves balancing several competing concerns:


🎯 Core Concepts

Concept: Hardware Watchdog Configuration and Operation

Why it matters: Hardware watchdogs provide the most reliable system monitoring because they operate independently of the main CPU. They can detect and recover from system failures even when the CPU is completely unresponsive.

Minimal example

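// Basic hardware watchdog configuration (STM32F1-style IWDG assumed)
#include <stdint.h>
#include <stdbool.h>
#include "stm32f1xx.h"            // Assumed CMSIS device header; adjust for your part

#define LSI_FREQ_KHZ 40U          // Nominal LSI frequency; the real value varies by part

typedef struct {
    uint32_t timeout_ms;          // Timeout period in milliseconds
    uint32_t prescaler;           // IWDG_PR code 0..6 (divider = 4 << code)
    bool window_mode;             // Windowed mode (unused here; not every IWDG supports it)
} hw_watchdog_config_t;

// Initialize hardware watchdog
void init_hardware_watchdog(const hw_watchdog_config_t *config) {
    // Enable the LSI oscillator that clocks the IWDG (nominally 40 kHz)
    RCC->CSR |= RCC_CSR_LSION;

    // Wait for LSI to be ready
    while (!(RCC->CSR & RCC_CSR_LSIRDY));

    // Timeout = (RLR + 1) * divider / f_LSI, so solve for the 12-bit reload value
    uint32_t divider = 4U << config->prescaler;
    uint32_t ticks = (config->timeout_ms * LSI_FREQ_KHZ) / divider;
    uint32_t reload = (ticks > 0U) ? (ticks - 1U) : 0U;
    if (reload > 0x0FFFU) {
        reload = 0x0FFFU;         // Clamp to the register width
    }

    IWDG->KR = 0xCCCC;            // Start the watchdog (it cannot be stopped afterwards)
    IWDG->KR = 0x5555;            // Unlock write access to PR and RLR
    IWDG->PR = config->prescaler;
    IWDG->RLR = reload;

    // Wait until the new prescaler and reload values have been taken into account
    while (IWDG->SR & (IWDG_SR_PVU | IWDG_SR_RVU));

    IWDG->KR = 0xAAAA;            // Reload the counter with the new RLR value
}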

Try it: Configure a hardware watchdog with a 1-second timeout and test system recovery.
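
Choosing a prescaler and reload value for a target timeout is plain arithmetic, so it can be checked on a host before flashing. The sketch below assumes the STM32-style relationship timeout = (reload + 1) × divider / f_LSI, with divider = 4 << prescaler code and a 12-bit reload register. The helper name `iwdg_compute_settings` and the `iwdg_settings_t` type are hypothetical, and the register limits should be confirmed against the reference manual for your device.

```c
#include <stdbool.h>
#include <stdint.h>

#define IWDG_MAX_PRESCALER_CODE 6U      // Divider 256 = 4 << 6 (assumed limit)
#define IWDG_MAX_RELOAD         0x0FFFU // 12-bit reload register (assumed width)

// Hypothetical result type: the two register values to program.
typedef struct {
    uint32_t prescaler_code;  // Value for the prescaler register (0..6)
    uint32_t reload;          // Value for the reload register (0..0xFFF)
} iwdg_settings_t;

// Find the smallest divider that can represent timeout_ms at lsi_hz.
// Returns false if the timeout is zero or too long for the largest divider.
bool iwdg_compute_settings(uint32_t timeout_ms, uint32_t lsi_hz,
                           iwdg_settings_t *out) {
    if (timeout_ms == 0U || lsi_hz == 0U) {
        return false;
    }
    for (uint32_t code = 0U; code <= IWDG_MAX_PRESCALER_CODE; code++) {
        uint64_t divider = 4ULL << code;
        // ticks = timeout * f_LSI / divider, in 64 bits to avoid overflow
        uint64_t ticks = ((uint64_t)timeout_ms * lsi_hz) / (1000ULL * divider);
        if (ticks >= 1U && (ticks - 1U) <= IWDG_MAX_RELOAD) {
            out->prescaler_code = code;
            out->reload = (uint32_t)(ticks - 1U);
            return true;
        }
    }
    return false;
}
```

For a 1-second timeout with a nominal 40 kHz LSI, dividers 4 and 8 would need more than 4096 ticks, so the search settles on prescaler code 2 (divider 16) with a reload value of 2499. Because the LSI oscillator is not precise, it is worth leaving margin between the computed timeout and the feed interval.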

Takeaways

Concept: System Health Monitoring and Watchdog Feeding Strategy

Why it matters: Effective watchdog operation requires intelligent health monitoring. Simply feeding the watchdog on a timer doesn't guarantee system health; the watchdog should be fed only after critical system functions have been verified to be working correctly.

Minimal example

// System health monitoring structure
#include <stdbool.h>

typedef struct {
    bool tasks_running;           // Critical tasks are alive
    bool memory_ok;               // Memory integrity check passed
    bool communication_ok;        // Communication systems working
    bool sensors_responding;      // Sensor data is valid
} system_health_t;

// Application-provided helpers (implemented elsewhere)
system_health_t check_system_health(void);
void log_system_health(const system_health_t *health);

// Feed watchdog only if system is healthy
void feed_watchdog_if_healthy(void) {
    system_health_t health = check_system_health();

    if (health.tasks_running &&
        health.memory_ok &&
        health.communication_ok &&
        health.sensors_responding) {

        // System is healthy: reload the watchdog counter
        IWDG->KR = 0xAAAA;
    } else {
        // System has issues: skip the feed so the watchdog times out and resets.
        // Log the health status first so the cause is available for debugging.
        log_system_health(&health);
    }
}

Try it: Implement a health monitoring system that checks multiple system components before feeding the watchdog.
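
One common way to make `tasks_running` meaningful is a check-in scheme: every critical task sets its own flag, and a supervisor feeds the watchdog only when all flags are set. The sketch below is one possible shape under stated assumptions: the task names, `task_check_in`, `supervisor_poll`, and the `feed_count` stand-in for the hardware feed are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical task identifiers; each task owns one bit.
enum {
    TASK_SENSOR  = 1U << 0,
    TASK_COMMS   = 1U << 1,
    TASK_CONTROL = 1U << 2,
    ALL_TASKS    = TASK_SENSOR | TASK_COMMS | TASK_CONTROL
};

static uint32_t checked_in;   // Bits set by tasks since the last feed
static uint32_t feed_count;   // Stand-in for the hardware feed; counts feeds

// Called by each task once per loop iteration to prove it is alive.
void task_check_in(uint32_t task_bit) {
    checked_in |= task_bit;
}

// Called periodically by the supervisor. Feeds only if every task has
// checked in since the previous feed, then clears the flags.
bool supervisor_poll(void) {
    if ((checked_in & ALL_TASKS) != ALL_TASKS) {
        return false;             // At least one task is silent: do not feed
    }
    checked_in = 0U;
    feed_count++;                 // On target: IWDG->KR = 0xAAAA;
    return true;
}
```

Because the flags are cleared on every feed, a task that hangs after checking in once is caught on the next cycle. On a preemptive RTOS or when tasks check in from interrupts, the read-modify-write on `checked_in` would need to be atomic or protected by a critical section.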

Takeaways

Concept: Recovery Strategy Selection and Implementation

Why it matters: Different types of system failures require different recovery approaches. Implementing multiple recovery levels enables graceful degradation and prevents unnecessary system resets for minor issues.

Minimal example

// Recovery strategy levels
typedef enum {
    RECOVERY_NONE,           // No recovery needed
    RECOVERY_RESTART_TASK,   // Restart failed task
    RECOVERY_RESTART_APP,    // Restart application
    RECOVERY_SYSTEM_RESET    // Full system reset
} recovery_level_t;

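// Determine recovery strategy based on fault type.
// Checks run from most to least severe so that the strongest required
// recovery wins when several faults are present at once.
recovery_level_t determine_recovery_strategy(const system_health_t *health) {
    if (!health->communication_ok && !health->sensors_responding) {
        return RECOVERY_SYSTEM_RESET;    // Multiple subsystems are down
    } else if (!health->memory_ok) {
        return RECOVERY_RESTART_APP;
    } else if (!health->tasks_running ||
               !health->communication_ok ||
               !health->sensors_responding) {
        // Assumption: a single failed subsystem is handled by restarting its task
        return RECOVERY_RESTART_TASK;
    }

    return RECOVERY_NONE;
}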

// Execute recovery strategy
void execute_recovery(recovery_level_t strategy) {
    switch (strategy) {
        case RECOVERY_RESTART_TASK:
            restart_failed_tasks();
            break;
        case RECOVERY_RESTART_APP:
            restart_application();
            break;
        case RECOVERY_SYSTEM_RESET:
            system_reset();
            break;
        default:
            break;
    }
}

Try it: Implement a multi-level recovery system that handles different types of faults appropriately.
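
A multi-level scheme also needs a rule for what happens when a mild recovery keeps failing. The sketch below shows one possible escalation policy, assuming a fixed retry budget per level. `recovery_tracker_t`, `escalate_recovery`, and `MAX_ATTEMPTS_PER_LEVEL` are hypothetical names, and the enum is repeated only so the block compiles on its own.

```c
#include <stdint.h>

// Same levels as the enum above, repeated so this sketch compiles alone.
typedef enum {
    RECOVERY_NONE,
    RECOVERY_RESTART_TASK,
    RECOVERY_RESTART_APP,
    RECOVERY_SYSTEM_RESET
} recovery_level_t;

#define MAX_ATTEMPTS_PER_LEVEL 3U   // Hypothetical threshold; tune per system

// Hypothetical tracker for repeated recovery attempts.
typedef struct {
    recovery_level_t level;   // Level used for the most recent attempt
    uint32_t attempts;        // Consecutive attempts made at that level
} recovery_tracker_t;

// Return the level to use for this fault, escalating one step when the
// current level has already been tried MAX_ATTEMPTS_PER_LEVEL times.
recovery_level_t escalate_recovery(recovery_tracker_t *t,
                                   recovery_level_t requested) {
    if (requested == RECOVERY_NONE) {
        t->level = RECOVERY_NONE;     // Healthy again: clear the history
        t->attempts = 0U;
        return RECOVERY_NONE;
    }
    if (requested > t->level) {
        t->level = requested;         // A more severe fault restarts the count
        t->attempts = 0U;
    }
    if (t->attempts >= MAX_ATTEMPTS_PER_LEVEL &&
        t->level < RECOVERY_SYSTEM_RESET) {
        t->level = (recovery_level_t)(t->level + 1);
        t->attempts = 0U;
    }
    t->attempts++;
    return t->level;
}
```

The returned level can be passed straight to `execute_recovery`. Since a system reset clears RAM, a tracker meant to survive resets would have to live in retained memory or non-volatile storage, which this sketch does not cover.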

Takeaways


🧪 Guided Labs

Lab 1: Hardware Watchdog Configuration and Testing

Objective: Configure a hardware watchdog and test system recovery behavior.

Steps:

  1. Configure hardware watchdog with appropriate timeout
  2. Implement basic health monitoring
  3. Test system recovery by intentionally causing faults
  4. Measure recovery time and verify system behavior

Expected Outcome: Understanding of hardware watchdog operation and configuration.
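
To verify in step 4 that a reset really came from the watchdog, firmware can inspect the reset flags at startup. The sketch below decodes a raw flag word and assumes the STM32F1 bit layout of `RCC->CSR`. The `CSR_*` macros, `reset_cause_t`, and `decode_reset_cause` are hypothetical names, and the bit positions should be checked against the reference manual for your device.

```c
#include <stdbool.h>
#include <stdint.h>

// Assumed flag positions in RCC_CSR (STM32F1 layout); verify against the
// reference manual for your device before relying on them.
#define CSR_PINRSTF   (1UL << 26)   // Reset pin
#define CSR_PORRSTF   (1UL << 27)   // Power-on reset
#define CSR_SFTRSTF   (1UL << 28)   // Software reset
#define CSR_IWDGRSTF  (1UL << 29)   // Independent watchdog reset
#define CSR_WWDGRSTF  (1UL << 30)   // Window watchdog reset

typedef enum {
    RESET_CAUSE_UNKNOWN,
    RESET_CAUSE_POWER_ON,
    RESET_CAUSE_PIN,
    RESET_CAUSE_SOFTWARE,
    RESET_CAUSE_WATCHDOG
} reset_cause_t;

// Decode the most specific reset cause from a raw flag word.
// The pin flag is checked last because other reset sources tend to set it too.
reset_cause_t decode_reset_cause(uint32_t csr) {
    if (csr & (CSR_IWDGRSTF | CSR_WWDGRSTF)) {
        return RESET_CAUSE_WATCHDOG;
    }
    if (csr & CSR_SFTRSTF) {
        return RESET_CAUSE_SOFTWARE;
    }
    if (csr & CSR_PORRSTF) {
        return RESET_CAUSE_POWER_ON;
    }
    if (csr & CSR_PINRSTF) {
        return RESET_CAUSE_PIN;
    }
    return RESET_CAUSE_UNKNOWN;
}
```

On target, the function would be called once early in `main` with the value of `RCC->CSR`, after which the flags should be cleared so the next boot reports only its own cause.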

Lab 2: System Health Monitoring Implementation

Objective: Implement comprehensive system health monitoring for watchdog feeding.

Steps:

  1. Define critical system health metrics
  2. Implement health checking functions
  3. Integrate health monitoring with watchdog feeding
  4. Test health monitoring accuracy and reliability

Expected Outcome: Practical experience with system health monitoring and watchdog integration.
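
As a starting point for steps 1 and 2, a task-liveness metric can be built from timestamps: each task records when it last ran, and the health check flags any task that has been silent for too long. This is a minimal sketch under stated assumptions: `heartbeat_t` and the three functions are hypothetical, and the time source is assumed to be a free-running 32-bit millisecond tick.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-task heartbeat record.
typedef struct {
    uint32_t last_beat_ms;   // Tick count when the task last reported in
    uint32_t max_gap_ms;     // Longest silence tolerated for this task
} heartbeat_t;

// Record that the task is alive at time now_ms.
void heartbeat_mark(heartbeat_t *hb, uint32_t now_ms) {
    hb->last_beat_ms = now_ms;
}

// True if the task reported within its allowed gap. Unsigned subtraction
// keeps the comparison correct across a 32-bit tick counter wraparound.
bool heartbeat_is_fresh(const heartbeat_t *hb, uint32_t now_ms) {
    return (uint32_t)(now_ms - hb->last_beat_ms) <= hb->max_gap_ms;
}

// True only if every task in the array is fresh.
bool all_tasks_running(const heartbeat_t *tasks, uint32_t count,
                       uint32_t now_ms) {
    for (uint32_t i = 0U; i < count; i++) {
        if (!heartbeat_is_fresh(&tasks[i], now_ms)) {
            return false;
        }
    }
    return true;
}
```

The result of `all_tasks_running` could populate the `tasks_running` field of `system_health_t`. Giving each task its own `max_gap_ms` lets slow background work coexist with fast control loops without loosening the check for either.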

Lab 3: Multi-Level Recovery Strategy Implementation

Objective: Implement different recovery strategies based on fault severity.

Steps:

  1. Define fault categories and recovery levels
  2. Implement recovery strategy selection logic
  3. Create recovery execution functions
  4. Test recovery behavior under various fault conditions

Expected Outcome: Understanding of recovery strategy design and implementation.


Check Yourself

Basic Understanding

Practical Application

Advanced Concepts



🎯 Practical Considerations

System-Level Design Decisions

Development and Debugging

Safety and Reliability


📚 Additional Resources

Books

Online Resources


Next Topic: Interrupts and Exceptions | Power Management