feat: Proactive agent health monitoring system

- Created AGENT-MONITORING-PROTOCOL.md - formal monitoring procedures - Added automated health check script (runs every 4 hours) - Monitors all cron jobs for errors and consecutive failures - Alerts Chris via Telegram when issues detected - Documents escalation paths and standard fixes - Establishes success metrics: zero undetected failures This ensures system reliability through proactive detection.
2026-04-08 11:52:53 -04:00
parent 311d498941
commit 1bd3e724fe
2 changed files with 282 additions and 0 deletions
--- a/AGENT-MONITORING-PROTOCOL.md
+++ b/AGENT-MONITORING-PROTOCOL.md
@@ -0,0 +1,195 @@
+# Agent Health Monitoring Protocol
+
+**Effective:** April 8, 2026  
+**Owner:** Forge (Autonomous Operations Bot)  
+**Priority:** Critical - System Reliability
+
+---
+
+## 🎯 Objective
+
+Maintain 100% awareness of all agent health status and proactively alert Chris when issues arise, before they impact business operations.
+
+---
+
+## 📋 Daily Health Check Protocol
+
+### Frequency
+- **Automated:** Every 4 hours during business hours (8 AM - 8 PM ET)
+- **Manual Review:** Once daily at 4 PM ET (comprehensive audit)
+- **Heartbeat Trigger:** When heartbeat.md check is performed
+
+### What to Monitor
+
+1. **Cron Job Status**
+   ```bash
+   openclaw cron list
+   ```
+   - Check for `error` status
+   - Monitor `consecutiveErrors` count
+   - Verify `lastRunStatus` is `ok`
+   - Alert threshold: >3 consecutive errors
+
+2. **Agent Logs**
+   - Review `/logs/` directories for each agent
+   - Look for repeated failures or timeouts
+   - Check for successful API calls
+
+3. **Business Critical Paths**
+   - **Sales leads:** Verify leads are being detected and notified
+   - **Revenue systems:** ROI Calculator & Interest Form APIs
+   - **Daily reports:** Confirm delivery of morning brief, SEO report, workout
+
+4. **System Resources**
+   - Disk space in workspace
+   - API rate limits (GA4, Reddit, etc.)
+   - Network connectivity to endpoints
+
+---
+
+## 🚨 Alert Triggers
+
+### Critical (Immediate Alert Required)
+- [ ] Sales lead monitor fails (missing leads = lost revenue)
+- [ ] API endpoints unreachable (calc-submissions, interest form)
+- [ ] Multiple agents failing simultaneously (systemic issue)
+- [ ] Database or data corruption detected
+
+### High Priority (Alert Within 1 Hour)
+- [ ] Any agent with >10 consecutive errors
+- [ ] Daily reports not delivered (SEO, morning brief, workout)
+- [ ] Reddit scout missing opportunities
+- [ ] JAE/Tier-1 scorer not processing leads
+
+### Medium Priority (Include in Next Status Update)
+- [ ] Single agent failure with <10 errors
+- [ ] Occasional timeout or network error
+- [ ] Non-critical feature degradation
+
+---
+
+## 📊 Status Report Format
+
+When reporting issues to Chris, use this format:
+
+```
+🔔 *AGENT HEALTH ALERT* - [Severity]
+
+**Issue:** [Brief description]
+**Affected Agent:** [agent-name]
+**Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"]
+**Errors:** [X] consecutive failures
+**Last Successful Run:** [timestamp]
+
+**Root Cause:** [If known]
+**Fix Applied:** [If already fixed]
+**Action Required:** [What Chris needs to do, if anything]
+
+**System Status:**
+✅ Operational: [list critical agents working]
+⚠️ Degraded: [list agents with issues]
+❌ Down: [list completely failed agents]
+```
+
+---
+
+## 🛠️ Standard Fixes (Autonomous)
+
+Forge can and should fix these without asking:
+
+1. **Telegram Delivery Failures**
+   - Change delivery target from broken `@heartbeat` to `telegram:8269921691`
+   - Verify fix on next run
+
+2. **Duplicate Agents**
+   - Remove old/bash versions when Python versions exist
+   - Keep Chris informed of cleanup
+
+3. **Temporary Network Issues**
+   - Retry failed API calls
+   - Monitor for pattern vs. one-off
+
+4. **State File Corruption**
+   - Reset state files if corrupted
+   - Preserve processed lead IDs to avoid re-notification
+
+---
+
+## 📝 Documentation Requirements
+
+After any issue or fix:
+
+1. **Update AGENT-HEALTH-AUDIT.md** with current status
+2. **Log incident** in `memory/YYYY-MM-DD.md` with:
+   - What failed
+   - Root cause
+   - Fix applied
+   - Prevention strategy
+3. **Self-improvement:** If a pattern emerges, create/update monitoring rules
+
+---
+
+## 🔄 Continuous Improvement
+
+### Weekly Review (Mondays at 10 AM)
+- Analyze error patterns from past week
+- Identify agents needing attention
+- Update monitoring thresholds
+- Refine alert logic
+
+### Monthly Audit
+- Review all agent configurations
+- Verify API credentials still valid
+- Check for deprecated endpoints
+- Optimize schedules for efficiency
+
+---
+
+## 🎯 Success Metrics
+
+**Goal:** Zero undetected agent failures
+
+**Measure:**
+- Time from failure to detection: <1 hour
+- Time from detection to resolution: <4 hours (for critical)
+- Percentage of issues caught proactively: 100%
+- Business impact from agent failures: Zero (catch before impact)
+
+---
+
+## 📞 Escalation Path
+
+If Forge detects an issue it cannot fix autonomously:
+
+1. **Immediate:** Alert Chris via Telegram with full context
+2. **If no response in 4 hours:** Re-alert with urgency
+3. **If critical (revenue impact):** Suggest manual intervention
+4. **Document:** Log what prevented autonomous resolution
+
+---
+
+## 🔧 Tools & Commands
+
+```bash
+# Quick health check
+openclaw cron list
+
+# Detailed job info
+openclaw cron list --json | python3 -m json.tool
+
+# Fix Telegram delivery
+openclaw cron edit <job-id> --channel "telegram" --to "telegram:8269921691"
+
+# Remove broken agent
+openclaw cron rm <job-id>
+
+# View logs
+tail -50 /Users/claw/.openclaw/workspace/agents/<agent>/logs/*.log
+```
+
+---
+
+**Commitment:** Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues.
+
+**Last Updated:** April 8, 2026  
+**Next Review:** April 15, 2026 (weekly)