# Agent Health Monitoring Protocol **Effective:** April 8, 2026 **Owner:** Forge (Autonomous Operations Bot) **Priority:** Critical - System Reliability --- ## 🎯 Objective Maintain 100% awareness of all agent health status and proactively alert Chris when issues arise, before they impact business operations. --- ## 📋 Daily Health Check Protocol ### Frequency - **Automated:** Every 4 hours during business hours (8 AM - 8 PM ET) - **Manual Review:** Once daily at 4 PM ET (comprehensive audit) - **Heartbeat Trigger:** When heartbeat.md check is performed ### What to Monitor 1. **Cron Job Status** ```bash openclaw cron list ``` - Check for `error` status - Monitor `consecutiveErrors` count - Verify `lastRunStatus` is `ok` - Alert threshold: >3 consecutive errors 2. **Agent Logs** - Review `/logs/` directories for each agent - Look for repeated failures or timeouts - Check for successful API calls 3. **Business Critical Paths** - **Sales leads:** Verify leads are being detected and notified - **Revenue systems:** ROI Calculator & Interest Form APIs - **Daily reports:** Confirm delivery of morning brief, SEO report, workout 4. **System Resources** - Disk space in workspace - API rate limits (GA4, Reddit, etc.) - Network connectivity to endpoints --- ## 🚨 Alert Triggers ### Critical (Immediate Alert Required) - [ ] Sales lead monitor fails (missing leads = lost revenue) - [ ] API endpoints unreachable (calc-submissions, interest form) - [ ] Multiple agents failing simultaneously (systemic issue) - [ ] Database or data corruption detected ### High Priority (Alert Within 1 Hour) - [ ] Any agent with >10 consecutive errors - [ ] Daily reports not delivered (SEO, morning brief, workout) - [ ] Reddit scout missing opportunities - [ ] JAE/Tier-1 scorer not processing leads ### Medium Priority (Include in Next Status Update) - [ ] Single agent failure with <10 errors - [ ] Occasional timeout or network error - [ ] Non-critical feature degradation --- ## 📊 Status Report Format When reporting issues to Chris, use this format: ``` 🔔 *AGENT HEALTH ALERT* - [Severity] **Issue:** [Brief description] **Affected Agent:** [agent-name] **Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"] **Errors:** [X] consecutive failures **Last Successful Run:** [timestamp] **Root Cause:** [If known] **Fix Applied:** [If already fixed] **Action Required:** [What Chris needs to do, if anything] **System Status:** ✅ Operational: [list critical agents working] ⚠️ Degraded: [list agents with issues] ❌ Down: [list completely failed agents] ``` --- ## 🛠️ Standard Fixes (Autonomous) Forge can and should fix these without asking: 1. **Telegram Delivery Failures** - Change delivery target from broken `@heartbeat` to `telegram:8269921691` - Verify fix on next run 2. **Duplicate Agents** - Remove old/bash versions when Python versions exist - Keep Chris informed of cleanup 3. **Temporary Network Issues** - Retry failed API calls - Monitor for pattern vs. one-off 4. **State File Corruption** - Reset state files if corrupted - Preserve processed lead IDs to avoid re-notification --- ## 📝 Documentation Requirements After any issue or fix: 1. **Update AGENT-HEALTH-AUDIT.md** with current status 2. **Log incident** in `memory/YYYY-MM-DD.md` with: - What failed - Root cause - Fix applied - Prevention strategy 3. **Self-improvement:** If a pattern emerges, create/update monitoring rules --- ## 🔄 Continuous Improvement ### Weekly Review (Mondays at 10 AM) - Analyze error patterns from past week - Identify agents needing attention - Update monitoring thresholds - Refine alert logic ### Monthly Audit - Review all agent configurations - Verify API credentials still valid - Check for deprecated endpoints - Optimize schedules for efficiency --- ## 🎯 Success Metrics **Goal:** Zero undetected agent failures **Measure:** - Time from failure to detection: <1 hour - Time from detection to resolution: <4 hours (for critical) - Percentage of issues caught proactively: 100% - Business impact from agent failures: Zero (catch before impact) --- ## 📞 Escalation Path If Forge detects an issue it cannot fix autonomously: 1. **Immediate:** Alert Chris via Telegram with full context 2. **If no response in 4 hours:** Re-alert with urgency 3. **If critical (revenue impact):** Suggest manual intervention 4. **Document:** Log what prevented autonomous resolution --- ## 🔧 Tools & Commands ```bash # Quick health check openclaw cron list # Detailed job info openclaw cron list --json | python3 -m json.tool # Fix Telegram delivery openclaw cron edit --channel "telegram" --to "telegram:8269921691" # Remove broken agent openclaw cron rm # View logs tail -50 /Users/claw/.openclaw/workspace/agents//logs/*.log ``` --- **Commitment:** Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues. **Last Updated:** April 8, 2026 **Next Review:** April 15, 2026 (weekly)