System Monitoring and Recovery Mechanisms for Reliable Embedded Systems
Learn to implement watchdog timers for system health monitoring, fault detection, and automatic recovery
Watchdog timers are essential safety mechanisms that monitor system health and automatically reset the system if it becomes unresponsive or enters an error state. They are critical for reliable embedded systems, especially in safety-critical applications.
Normal path: System Running (normal operation) → Health Check (monitor tasks, check memory) → Feed Watchdog (reset timer, extend life) → Continue Operation (system OK)
Fault path: Fault Detected → Health Check Fails → Watchdog Not Fed → Timeout Reset
Hardware Watchdog: CPU Clock → Independent Timer → Reset Circuit
Properties: independent of CPU state; always running, even if the CPU crashes; reliable reset mechanism
Software Watchdog: System Task → Monitor Other Tasks → Trigger Recovery
Properties: part of the OS software; can fail with the system; software-based recovery
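The software watchdog flow above can be sketched as a check-in bitmask: each monitored task sets its bit once per period, and a supervisor task polls for missing bits. This is a minimal sketch, not a definitive implementation. The names `sw_watchdog_t`, `swdg_register`, `swdg_check_in`, and `swdg_poll` are hypothetical, and the code assumes check-ins and polls are not preempted mid-update (on an RTOS the bit operations would need to be atomic or protected by a critical section).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SWDG_MAX_TASKS 32u  /* one bit per monitored task (assumed limit) */

typedef struct {
    uint32_t registered;  /* bit i set: task i must check in every period */
    uint32_t checked_in;  /* bit i set: task i has checked in this period */
} sw_watchdog_t;

void swdg_init(sw_watchdog_t *wd)
{
    assert(wd != NULL);
    wd->registered = 0u;
    wd->checked_in = 0u;
}

/* Add a task to the monitored set. Returns false for an invalid id. */
bool swdg_register(sw_watchdog_t *wd, uint32_t task_id)
{
    if (task_id >= SWDG_MAX_TASKS) {
        return false;
    }
    wd->registered |= (1u << task_id);
    return true;
}

/* Called by each monitored task once per loop iteration. */
void swdg_check_in(sw_watchdog_t *wd, uint32_t task_id)
{
    if (task_id < SWDG_MAX_TASKS) {
        wd->checked_in |= (1u << task_id);
    }
}

/* Called by the supervisor once per monitoring period.
 * Returns a bitmask of registered tasks that did not check in
 * (0 means every task is alive) and starts a new period. */
uint32_t swdg_poll(sw_watchdog_t *wd)
{
    uint32_t missing = wd->registered & ~wd->checked_in;
    wd->checked_in = 0u;
    return missing;
}
```

A supervisor would typically feed the hardware watchdog only when `swdg_poll` returns 0, so that a hang in the supervisor itself is still caught by the hardware timer.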
Fault Detected → Assess Severity → Choose Recovery → Execute Recovery
Severity to recovery: Minor Fault → Restart Task; Major Fault → Restart App; Critical Fault → System Reset
Outcome: Continue Operation or Full Recovery
Watchdog timers represent a fundamental safety principle in embedded systems: fail-safe operation. Instead of allowing a system to fail silently or become unresponsive, watchdog timers provide automatic detection and recovery mechanisms. This philosophy enables:
Watchdog timers are critical because embedded systems operate in unpredictable environments where failures are inevitable. Proper watchdog implementation enables:
Designing watchdog systems involves balancing several competing concerns:
Why it matters: Hardware watchdogs provide the most reliable system monitoring because they operate independently of the main CPU. They can detect and recover from system failures even when the CPU is completely unresponsive.
Minimal example
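// Basic hardware watchdog configuration (STM32 IWDG, nominal LSI = 40 kHz)
typedef struct {
    uint32_t timeout_ms;  // Timeout period in milliseconds
    uint32_t prescaler;   // IWDG_PR value 0..6 (divider = 4 << prescaler)
    bool window_mode;     // Enable windowed mode (unused in this minimal example)
} hw_watchdog_config_t;

// Initialize hardware watchdog
void init_hardware_watchdog(hw_watchdog_config_t *config) {
    // Enable the LSI oscillator that clocks the watchdog (nominally 40 kHz)
    RCC->CSR |= RCC_CSR_LSION;
    // Wait for LSI to be ready
    while (!(RCC->CSR & RCC_CSR_LSIRDY));

    // Unlock write access to the PR and RLR registers
    IWDG->KR = 0x5555;

    // Configure prescaler and reload value.
    // The counter runs at 40 kHz / divider, i.e. 40 / divider ticks per millisecond.
    uint32_t divider = 4u << config->prescaler;
    uint32_t reload = (config->timeout_ms * 40u) / divider;
    IWDG->PR = config->prescaler;
    IWDG->RLR = (reload > 0x0FFFu) ? 0x0FFFu : reload;  // RLR is 12 bits wide

    IWDG->KR = 0xAAAA;  // Reload the counter from RLR
    IWDG->KR = 0xCCCC;  // Start the watchdog (it cannot be stopped until reset)
}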
Try it: Configure a hardware watchdog with a 1-second timeout and test system recovery.
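As a starting point for the 1-second exercise, the register values can be worked out on the host before touching hardware. The sketch below assumes the STM32 IWDG model used in the example (nominal 40 kHz LSI, dividers 4 to 256 selected by PR = 0..6, 12-bit reload, timeout of `reload + 1` counter ticks). The helper `iwdg_compute_timing` and the type `iwdg_timing_t` are hypothetical names, and the real LSI frequency varies considerably between parts, so the result is approximate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LSI_HZ        40000u   /* nominal LSI frequency (assumption) */
#define IWDG_PR_MAX   6u       /* PR = 0..6 selects dividers 4..256 */
#define IWDG_RLR_MAX  0x0FFFu  /* reload register is 12 bits wide */

typedef struct {
    uint32_t pr;      /* value for IWDG->PR */
    uint32_t reload;  /* value for IWDG->RLR */
} iwdg_timing_t;

/* Pick the smallest divider (finest resolution) that can cover timeout_ms.
 * Returns false if the timeout is zero or too long for the counter. */
bool iwdg_compute_timing(uint32_t timeout_ms, iwdg_timing_t *out)
{
    assert(out != NULL);
    for (uint32_t pr = 0u; pr <= IWDG_PR_MAX; pr++) {
        uint32_t divider = 4u << pr;
        /* Counter ticks needed: timeout_ms * (LSI_HZ / divider) / 1000 */
        uint64_t ticks = ((uint64_t)timeout_ms * LSI_HZ) / (1000u * divider);
        if (ticks >= 1u && ticks <= (uint64_t)IWDG_RLR_MAX + 1u) {
            out->pr = pr;
            out->reload = (uint32_t)(ticks - 1u);  /* timeout = reload + 1 ticks */
            return true;
        }
    }
    return false;
}
```

For a 1000 ms timeout this selects PR = 2 (divider 16) and a reload of 2499, since 2500 ticks at 40 kHz / 16 last one second. Under these assumptions the longest reachable timeout is about 26 seconds.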
Takeaways
Why it matters: Effective watchdog operation requires intelligent health monitoring. Feeding the watchdog from a periodic timer alone doesn't guarantee system health, because a timer interrupt can keep firing while the main application is hung. Feed the watchdog only after critical system functions have been verified to be working correctly.
Minimal example
// System health monitoring structure
typedef struct {
    bool tasks_running;       // Critical tasks are alive
    bool memory_ok;           // Memory integrity check passed
    bool communication_ok;    // Communication systems working
    bool sensors_responding;  // Sensor data is valid
} system_health_t;

// Feed watchdog only if system is healthy
void feed_watchdog_if_healthy(void) {
    system_health_t health = check_system_health();
    if (health.tasks_running &&
        health.memory_ok &&
        health.communication_ok &&
        health.sensors_responding) {
        // System is healthy, feed watchdog
        IWDG->KR = 0xAAAA;
    } else {
        // System has issues, let watchdog reset
        // Log health status for debugging
        log_system_health(&health);
    }
}
Try it: Implement a health monitoring system that checks multiple system components before feeding the watchdog.
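The example above calls `check_system_health()` without showing it. One possible shape, offered as a sketch under stated assumptions, derives each flag from timestamps and a stack canary that the application maintains. The `health_inputs_t` structure, the timeout constants, the canary value, and the function name `evaluate_system_health` are all hypothetical; a real system would choose its own indicators and thresholds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool tasks_running;
    bool memory_ok;
    bool communication_ok;
    bool sensors_responding;
} system_health_t;

/* Hypothetical thresholds and canary value, chosen for illustration only. */
#define TASK_TIMEOUT_MS    100u
#define COMM_TIMEOUT_MS    500u
#define SENSOR_TIMEOUT_MS  200u
#define STACK_CANARY       0xDEADBEEFu

/* Hypothetical snapshot of raw indicators gathered by the application. */
typedef struct {
    uint32_t now_ms;                  /* current tick count */
    uint32_t last_task_heartbeat_ms;  /* last heartbeat from critical tasks */
    uint32_t last_comm_rx_ms;         /* last valid message received */
    uint32_t last_sensor_sample_ms;   /* last valid sensor sample */
    uint32_t stack_canary;            /* word placed at the end of the stack */
} health_inputs_t;

/* Unsigned subtraction stays correct when the tick counter wraps around. */
static bool is_fresh(uint32_t now_ms, uint32_t last_ms, uint32_t max_age_ms)
{
    return (uint32_t)(now_ms - last_ms) <= max_age_ms;
}

system_health_t evaluate_system_health(const health_inputs_t *in)
{
    assert(in != NULL);
    system_health_t h;
    h.tasks_running = is_fresh(in->now_ms, in->last_task_heartbeat_ms, TASK_TIMEOUT_MS);
    h.memory_ok = (in->stack_canary == STACK_CANARY);
    h.communication_ok = is_fresh(in->now_ms, in->last_comm_rx_ms, COMM_TIMEOUT_MS);
    h.sensors_responding = is_fresh(in->now_ms, in->last_sensor_sample_ms, SENSOR_TIMEOUT_MS);
    return h;
}
```

Keeping the evaluation a pure function of a snapshot makes it easy to unit test on a host machine before wiring it to real tick counters and peripherals.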
Takeaways
Why it matters: Different types of system failures require different recovery approaches. Implementing multiple recovery levels enables graceful degradation and prevents unnecessary system resets for minor issues.
Minimal example
// Recovery strategy levels, ordered from least to most disruptive
typedef enum {
    RECOVERY_NONE,          // No recovery needed
    RECOVERY_RESTART_TASK,  // Restart failed task
    RECOVERY_RESTART_APP,   // Restart application
    RECOVERY_SYSTEM_RESET   // Full system reset
} recovery_level_t;

// Determine recovery strategy based on fault type.
// The most severe condition is checked first so that a minor fault
// cannot mask a major one when several checks fail at once.
recovery_level_t determine_recovery_strategy(system_health_t *health) {
    if (!health->communication_ok && !health->sensors_responding) {
        return RECOVERY_SYSTEM_RESET;
    } else if (!health->memory_ok) {
        return RECOVERY_RESTART_APP;
    } else if (!health->tasks_running) {
        return RECOVERY_RESTART_TASK;
    }
    return RECOVERY_NONE;
}

// Execute recovery strategy
void execute_recovery(recovery_level_t strategy) {
    switch (strategy) {
        case RECOVERY_RESTART_TASK:
            restart_failed_tasks();
            break;
        case RECOVERY_RESTART_APP:
            restart_application();
            break;
        case RECOVERY_SYSTEM_RESET:
            system_reset();
            break;
        default:
            // RECOVERY_NONE: nothing to do
            break;
    }
}
Try it: Implement a multi-level recovery system that handles different types of faults appropriately.
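One way to approach this exercise is to escalate when a recovery level keeps failing, so that a task that will not stay up eventually triggers an application restart and then a reset. The sketch below is one possible policy, not the only reasonable one. The names `recovery_state_t`, `escalate_recovery`, and the limit `MAX_ATTEMPTS_PER_LEVEL` are hypothetical, and the enum is repeated only to keep the block self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    RECOVERY_NONE,
    RECOVERY_RESTART_TASK,
    RECOVERY_RESTART_APP,
    RECOVERY_SYSTEM_RESET
} recovery_level_t;

#define MAX_ATTEMPTS_PER_LEVEL 3u  /* hypothetical retry budget per level */

typedef struct {
    recovery_level_t level;  /* level currently being attempted */
    uint32_t attempts;       /* attempts made at that level */
} recovery_state_t;

/* Map the requested level to the level that should actually be executed.
 * Repeated requests at (or below) the current level use up the retry
 * budget and then escalate one step. A RECOVERY_NONE request means the
 * system is healthy again and clears the history. */
recovery_level_t escalate_recovery(recovery_state_t *st, recovery_level_t requested)
{
    assert(st != NULL);
    if (requested == RECOVERY_NONE) {
        st->level = RECOVERY_NONE;
        st->attempts = 0u;
        return RECOVERY_NONE;
    }
    if (requested > st->level) {
        /* A more severe fault overrides the current level immediately. */
        st->level = requested;
        st->attempts = 1u;
        return st->level;
    }
    if (st->attempts >= MAX_ATTEMPTS_PER_LEVEL && st->level < RECOVERY_SYSTEM_RESET) {
        st->level = (recovery_level_t)(st->level + 1);
        st->attempts = 1u;
    } else {
        st->attempts++;
    }
    return st->level;
}
```

With a budget of three, the fourth consecutive task-level request returns `RECOVERY_RESTART_APP`. For the application and reset levels, the state would have to survive the restart (for example in memory that is not cleared at startup) for escalation to carry over; that storage is outside the scope of this sketch.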
Takeaways
Objective: Configure a hardware watchdog and test system recovery behavior.
Steps:
Expected Outcome: Understanding of hardware watchdog operation and configuration.
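When testing recovery behavior, it helps to confirm after reboot that the reset really came from the watchdog. On several STM32 families the reset cause is latched in `RCC->CSR`. The sketch below assumes the STM32F1/F4 bit layout (independent watchdog reset flag at bit 29, remove-reset-flags bit at bit 24); other families place these bits differently, so check the reference manual for the target part. The function name `read_and_clear_reset_cause` and the local bit macros are hypothetical, and the register is passed as a pointer so the logic can be exercised on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed STM32F1/F4 RCC_CSR bit positions; verify for the target part. */
#define CSR_RMVF_BIT      (1u << 24)  /* write 1 to clear the reset flags */
#define CSR_IWDGRSTF_BIT  (1u << 29)  /* set by an independent watchdog reset */

typedef enum {
    RESET_CAUSE_OTHER,
    RESET_CAUSE_WATCHDOG
} reset_cause_t;

/* Read the latched reset cause, then request that the flags be cleared
 * so the next boot reports only its own cause. On hardware, pass
 * &RCC->CSR; the flags are cleared by the peripheral, not by this code. */
reset_cause_t read_and_clear_reset_cause(volatile uint32_t *rcc_csr)
{
    assert(rcc_csr != NULL);
    reset_cause_t cause = (*rcc_csr & CSR_IWDGRSTF_BIT)
                              ? RESET_CAUSE_WATCHDOG
                              : RESET_CAUSE_OTHER;
    *rcc_csr |= CSR_RMVF_BIT;
    return cause;
}
```

Calling this early in startup and logging the result gives a simple pass/fail signal for the exercise: stop feeding the watchdog deliberately, then check that the next boot reports `RESET_CAUSE_WATCHDOG`.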
Objective: Implement comprehensive system health monitoring for watchdog feeding.
Steps:
Expected Outcome: Practical experience with system health monitoring and watchdog integration.
Objective: Implement different recovery strategies based on fault severity.
Steps:
Expected Outcome: Understanding of recovery strategy design and implementation.