- Created AGENT-MONITORING-PROTOCOL.md - formal monitoring procedures - Added automated health check script (runs every 4 hours) - Monitors all cron jobs for errors and consecutive failures - Alerts Chris via Telegram when issues detected - Documents escalation paths and standard fixes - Establishes success metrics: zero undetected failures This ensures system reliability through proactive detection.
5.1 KiB
Agent Health Monitoring Protocol
Effective: April 8, 2026
Owner: Forge (Autonomous Operations Bot)
Priority: Critical - System Reliability
🎯 Objective
Maintain 100% awareness of all agent health status and proactively alert Chris when issues arise, before they impact business operations.
📋 Daily Health Check Protocol
Frequency
- Automated: Every 4 hours during business hours (8 AM - 8 PM ET)
- Manual Review: Once daily at 4 PM ET (comprehensive audit)
- Heartbeat Trigger: When heartbeat.md check is performed
What to Monitor
-
Cron Job Status
openclaw cron list- Check for
errorstatus - Monitor
consecutiveErrorscount - Verify
lastRunStatusisok - Alert threshold: >3 consecutive errors
- Check for
-
Agent Logs
- Review
/logs/directories for each agent - Look for repeated failures or timeouts
- Check for successful API calls
- Review
-
Business Critical Paths
- Sales leads: Verify leads are being detected and notified
- Revenue systems: ROI Calculator & Interest Form APIs
- Daily reports: Confirm delivery of morning brief, SEO report, workout
-
System Resources
- Disk space in workspace
- API rate limits (GA4, Reddit, etc.)
- Network connectivity to endpoints
🚨 Alert Triggers
Critical (Immediate Alert Required)
- Sales lead monitor fails (missing leads = lost revenue)
- API endpoints unreachable (calc-submissions, interest form)
- Multiple agents failing simultaneously (systemic issue)
- Database or data corruption detected
High Priority (Alert Within 1 Hour)
- Any agent with >10 consecutive errors
- Daily reports not delivered (SEO, morning brief, workout)
- Reddit scout missing opportunities
- JAE/Tier-1 scorer not processing leads
Medium Priority (Include in Next Status Update)
- Single agent failure with <10 errors
- Occasional timeout or network error
- Non-critical feature degradation
📊 Status Report Format
When reporting issues to Chris, use this format:
🔔 *AGENT HEALTH ALERT* - [Severity]
**Issue:** [Brief description]
**Affected Agent:** [agent-name]
**Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"]
**Errors:** [X] consecutive failures
**Last Successful Run:** [timestamp]
**Root Cause:** [If known]
**Fix Applied:** [If already fixed]
**Action Required:** [What Chris needs to do, if anything]
**System Status:**
✅ Operational: [list critical agents working]
⚠️ Degraded: [list agents with issues]
❌ Down: [list completely failed agents]
🛠️ Standard Fixes (Autonomous)
Forge can and should fix these without asking:
-
Telegram Delivery Failures
- Change delivery target from broken
@heartbeattotelegram:8269921691 - Verify fix on next run
- Change delivery target from broken
-
Duplicate Agents
- Remove old/bash versions when Python versions exist
- Keep Chris informed of cleanup
-
Temporary Network Issues
- Retry failed API calls
- Monitor for pattern vs. one-off
-
State File Corruption
- Reset state files if corrupted
- Preserve processed lead IDs to avoid re-notification
📝 Documentation Requirements
After any issue or fix:
- Update AGENT-HEALTH-AUDIT.md with current status
- Log incident in
memory/YYYY-MM-DD.mdwith:- What failed
- Root cause
- Fix applied
- Prevention strategy
- Self-improvement: If a pattern emerges, create/update monitoring rules
🔄 Continuous Improvement
Weekly Review (Mondays at 10 AM)
- Analyze error patterns from past week
- Identify agents needing attention
- Update monitoring thresholds
- Refine alert logic
Monthly Audit
- Review all agent configurations
- Verify API credentials still valid
- Check for deprecated endpoints
- Optimize schedules for efficiency
🎯 Success Metrics
Goal: Zero undetected agent failures
Measure:
- Time from failure to detection: <1 hour
- Time from detection to resolution: <4 hours (for critical)
- Percentage of issues caught proactively: 100%
- Business impact from agent failures: Zero (catch before impact)
📞 Escalation Path
If Forge detects an issue it cannot fix autonomously:
- Immediate: Alert Chris via Telegram with full context
- If no response in 4 hours: Re-alert with urgency
- If critical (revenue impact): Suggest manual intervention
- Document: Log what prevented autonomous resolution
🔧 Tools & Commands
# Quick health check
openclaw cron list
# Detailed job info
openclaw cron list --json | python3 -m json.tool
# Fix Telegram delivery
openclaw cron edit <job-id> --channel "telegram" --to "telegram:8269921691"
# Remove broken agent
openclaw cron rm <job-id>
# View logs
tail -50 /Users/claw/.openclaw/workspace/agents/<agent>/logs/*.log
Commitment: Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues.
Last Updated: April 8, 2026
Next Review: April 15, 2026 (weekly)