Files

olsch01 1bd3e724fe feat: Proactive agent health monitoring system

- Created AGENT-MONITORING-PROTOCOL.md - formal monitoring procedures
- Added automated health check script (runs every 4 hours)
- Monitors all cron jobs for errors and consecutive failures
- Alerts Chris via Telegram when issues detected
- Documents escalation paths and standard fixes
- Establishes success metrics: zero undetected failures

This ensures system reliability through proactive detection.

2026-04-08 11:52:53 -04:00

5.1 KiB

Raw Permalink Blame History

Agent Health Monitoring Protocol

Effective: April 8, 2026
Owner: Forge (Autonomous Operations Bot)
Priority: Critical - System Reliability

🎯 Objective

Maintain 100% awareness of all agent health status and proactively alert Chris when issues arise, before they impact business operations.

📋 Daily Health Check Protocol

Frequency

Automated: Every 4 hours during business hours (8 AM - 8 PM ET)
Manual Review: Once daily at 4 PM ET (comprehensive audit)
Heartbeat Trigger: When heartbeat.md check is performed

What to Monitor

Cron Job Status
```
openclaw cron list
```
- Check for error status
- Monitor consecutiveErrors count
- Verify lastRunStatus is ok
- Alert threshold: >3 consecutive errors
Agent Logs
- Review /logs/ directories for each agent
- Look for repeated failures or timeouts
- Check for successful API calls
Business Critical Paths
- Sales leads: Verify leads are being detected and notified
- Revenue systems: ROI Calculator & Interest Form APIs
- Daily reports: Confirm delivery of morning brief, SEO report, workout
System Resources
- Disk space in workspace
- API rate limits (GA4, Reddit, etc.)
- Network connectivity to endpoints

🚨 Alert Triggers

Critical (Immediate Alert Required)

Sales lead monitor fails (missing leads = lost revenue)
API endpoints unreachable (calc-submissions, interest form)
Multiple agents failing simultaneously (systemic issue)
Database or data corruption detected

High Priority (Alert Within 1 Hour)

Any agent with >10 consecutive errors
Daily reports not delivered (SEO, morning brief, workout)
Reddit scout missing opportunities
JAE/Tier-1 scorer not processing leads

Medium Priority (Include in Next Status Update)

Single agent failure with <10 errors
Occasional timeout or network error
Non-critical feature degradation

📊 Status Report Format

When reporting issues to Chris, use this format:

🔔 *AGENT HEALTH ALERT* - [Severity]

**Issue:** [Brief description]
**Affected Agent:** [agent-name]
**Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"]
**Errors:** [X] consecutive failures
**Last Successful Run:** [timestamp]

**Root Cause:** [If known]
**Fix Applied:** [If already fixed]
**Action Required:** [What Chris needs to do, if anything]

**System Status:**
✅ Operational: [list critical agents working]
⚠️ Degraded: [list agents with issues]
❌ Down: [list completely failed agents]

🛠️ Standard Fixes (Autonomous)

Forge can and should fix these without asking:

Telegram Delivery Failures
- Change delivery target from broken @heartbeat to telegram:8269921691
- Verify fix on next run
Duplicate Agents
- Remove old/bash versions when Python versions exist
- Keep Chris informed of cleanup
Temporary Network Issues
- Retry failed API calls
- Monitor for pattern vs. one-off
State File Corruption
- Reset state files if corrupted
- Preserve processed lead IDs to avoid re-notification

📝 Documentation Requirements

After any issue or fix:

Update AGENT-HEALTH-AUDIT.md with current status
Log incident in memory/YYYY-MM-DD.md with:
- What failed
- Root cause
- Fix applied
- Prevention strategy
Self-improvement: If a pattern emerges, create/update monitoring rules

🔄 Continuous Improvement

Weekly Review (Mondays at 10 AM)

Analyze error patterns from past week
Identify agents needing attention
Update monitoring thresholds
Refine alert logic

Monthly Audit

Review all agent configurations
Verify API credentials still valid
Check for deprecated endpoints
Optimize schedules for efficiency

🎯 Success Metrics

Goal: Zero undetected agent failures

Measure:

Time from failure to detection: <1 hour
Time from detection to resolution: <4 hours (for critical)
Percentage of issues caught proactively: 100%
Business impact from agent failures: Zero (catch before impact)

📞 Escalation Path

If Forge detects an issue it cannot fix autonomously:

Immediate: Alert Chris via Telegram with full context
If no response in 4 hours: Re-alert with urgency
If critical (revenue impact): Suggest manual intervention
Document: Log what prevented autonomous resolution

🔧 Tools & Commands

# Quick health check
openclaw cron list

# Detailed job info
openclaw cron list --json | python3 -m json.tool

# Fix Telegram delivery
openclaw cron edit <job-id> --channel "telegram" --to "telegram:8269921691"

# Remove broken agent
openclaw cron rm <job-id>

# View logs
tail -50 /Users/claw/.openclaw/workspace/agents/<agent>/logs/*.log

Commitment: Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues.

Last Updated: April 8, 2026
Next Review: April 15, 2026 (weekly)

5.1 KiB Raw Permalink Blame History