HOALedgerIQ_Website/AGENT-MONITORING-PROTOCOL.md
olsch01 1bd3e724fe feat: Proactive agent health monitoring system
- Created AGENT-MONITORING-PROTOCOL.md - formal monitoring procedures
- Added automated health check script (runs every 4 hours)
- Monitors all cron jobs for errors and consecutive failures
- Alerts Chris via Telegram when issues detected
- Documents escalation paths and standard fixes
- Establishes success metrics: zero undetected failures

This ensures system reliability through proactive detection.
2026-04-08 11:52:53 -04:00


Agent Health Monitoring Protocol

Effective: April 8, 2026
Owner: Forge (Autonomous Operations Bot)
Priority: Critical - System Reliability


🎯 Objective

Maintain continuous awareness of every agent's health status and alert Chris proactively when issues arise, before they affect business operations.


📋 Daily Health Check Protocol

Frequency

  • Automated: Every 4 hours during business hours (8 AM - 8 PM ET)
  • Manual Review: Once daily at 4 PM ET (comprehensive audit)
  • Heartbeat Trigger: Whenever a heartbeat.md check is performed

What to Monitor

  1. Cron Job Status

    openclaw cron list
    
    • Check for error status
    • Monitor consecutiveErrors count
    • Verify lastRunStatus is ok
    • Alert threshold: >3 consecutive errors
  2. Agent Logs

    • Review /logs/ directories for each agent
    • Look for repeated failures or timeouts
    • Check for successful API calls
  3. Business Critical Paths

    • Sales leads: Verify leads are being detected and notified
    • Revenue systems: ROI Calculator & Interest Form APIs
    • Daily reports: Confirm delivery of morning brief, SEO report, workout
  4. System Resources

    • Disk space in workspace
    • API rate limits (GA4, Reddit, etc.)
    • Network connectivity to endpoints
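
The cron status checks in item 1 can be sketched in Python. This is a minimal sketch under assumptions: it treats the output of `openclaw cron list --json` as a flat list of job objects, which the protocol does not confirm. The field names `consecutiveErrors` and `lastRunStatus` come from the checklist above; the `name` field and the sample jobs are hypothetical.

```python
from typing import Any

ERROR_THRESHOLD = 3  # from the protocol: alert when consecutive errors exceed 3


def find_unhealthy_jobs(jobs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return jobs whose last run was not ok or whose error streak exceeds the threshold."""
    unhealthy = []
    for job in jobs:
        errors = int(job.get("consecutiveErrors", 0))
        last_ok = job.get("lastRunStatus") == "ok"
        if not last_ok or errors > ERROR_THRESHOLD:
            unhealthy.append(job)
    return unhealthy


# Hypothetical sample data; the real JSON shape is an assumption.
sample_jobs = [
    {"name": "morning-brief", "lastRunStatus": "ok", "consecutiveErrors": 0},
    {"name": "reddit-scout", "lastRunStatus": "error", "consecutiveErrors": 1},
    {"name": "seo-report", "lastRunStatus": "ok", "consecutiveErrors": 4},
]
flagged = [job["name"] for job in find_unhealthy_jobs(sample_jobs)]
```

Keeping the check as a pure function over parsed data means it can be exercised without calling the CLI.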

🚨 Alert Triggers

Critical (Immediate Alert Required)

  • Sales lead monitor fails (missing leads = lost revenue)
  • API endpoints unreachable (calc-submissions, interest form)
  • Multiple agents failing simultaneously (systemic issue)
  • Database or data corruption detected

High Priority (Alert Within 1 Hour)

  • Any agent with >10 consecutive errors
  • Daily reports not delivered (SEO, morning brief, workout)
  • Reddit scout missing opportunities
  • JAE/Tier-1 scorer not processing leads

Medium Priority (Include in Next Status Update)

  • Single agent failure with 10 or fewer consecutive errors
  • Occasional timeout or network error
  • Non-critical feature degradation
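
One possible way to encode the three tiers above is a small classifier. This is a sketch, not Forge's actual logic: the `AgentFailure` fields and the `failing_agents` parameter are hypothetical, and which agents count as business critical must be decided by the caller.

```python
from dataclasses import dataclass


@dataclass
class AgentFailure:
    agent: str
    consecutive_errors: int
    business_critical: bool = False  # e.g. sales lead monitor or revenue APIs (caller decides)
    report_missed: bool = False      # a daily report was not delivered


def classify(failure: AgentFailure, failing_agents: int = 1) -> str:
    """Map a failure to 'critical', 'high', or 'medium' per the tiers above."""
    # Critical: business-critical path down, or several agents failing at once.
    if failure.business_critical or failing_agents > 1:
        return "critical"
    # High: more than 10 consecutive errors, or a missed daily report.
    if failure.consecutive_errors > 10 or failure.report_missed:
        return "high"
    # Everything else waits for the next status update.
    return "medium"
```

Checking the critical conditions first ensures that a revenue-affecting failure is never downgraded because its error count is still low.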

📊 Status Report Format

When reporting issues to Chris, use this format:

🔔 *AGENT HEALTH ALERT* - [Severity]

**Issue:** [Brief description]
**Affected Agent:** [agent-name]
**Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"]
**Errors:** [X] consecutive failures
**Last Successful Run:** [timestamp]

**Root Cause:** [If known]
**Fix Applied:** [If already fixed]
**Action Required:** [What Chris needs to do, if anything]

**System Status:**
✅ Operational: [list critical agents working]
⚠️ Degraded: [list agents with issues]
❌ Down: [list completely failed agents]
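
The template above could be filled in programmatically. The following is a minimal sketch: the function name, parameter names, and default values are hypothetical, and only the layout of the message is taken from the format shown.

```python
def format_alert(severity: str, issue: str, agent: str, impact: str,
                 errors: int, last_success: str,
                 root_cause: str = "Unknown", fix_applied: str = "None",
                 action_required: str = "None",
                 operational: tuple[str, ...] = (),
                 degraded: tuple[str, ...] = (),
                 down: tuple[str, ...] = ()) -> str:
    """Render the alert template as a single message string."""
    def names(agents: tuple[str, ...]) -> str:
        return ", ".join(agents) if agents else "none"

    return "\n".join([
        f"🔔 *AGENT HEALTH ALERT* - {severity}",
        "",
        f"**Issue:** {issue}",
        f"**Affected Agent:** {agent}",
        f"**Impact:** {impact}",
        f"**Errors:** {errors} consecutive failures",
        f"**Last Successful Run:** {last_success}",
        "",
        f"**Root Cause:** {root_cause}",
        f"**Fix Applied:** {fix_applied}",
        f"**Action Required:** {action_required}",
        "",
        "**System Status:**",
        f"✅ Operational: {names(operational)}",
        f"⚠️ Degraded: {names(degraded)}",
        f"❌ Down: {names(down)}",
    ])
```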

🛠️ Standard Fixes (Autonomous)

Forge can and should fix these without asking:

  1. Telegram Delivery Failures

    • Change delivery target from broken @heartbeat to telegram:8269921691
    • Verify fix on next run
  2. Duplicate Agents

    • Remove old/bash versions when Python versions exist
    • Keep Chris informed of cleanup
  3. Temporary Network Issues

    • Retry failed API calls
    • Monitor for pattern vs. one-off
  4. State File Corruption

    • Reset state files if corrupted
    • Preserve processed lead IDs to avoid re-notification
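
Fix 4 (resetting a corrupted state file while preserving processed lead IDs) might look like the sketch below. Everything about the file layout is an assumption: the protocol does not describe the state format, so the JSON structure, the `processed_lead_ids` key, and the existence of a backup file are all hypothetical.

```python
import json
from pathlib import Path


def load_ids(path: Path) -> set[str]:
    """Read processed lead IDs from a state file; raise ValueError if unusable."""
    try:
        data = json.loads(path.read_text())
        return set(data["processed_lead_ids"])  # hypothetical key
    except (OSError, json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unusable state file: {path}") from exc


def reset_state(state_path: Path, backup_path: Path) -> set[str]:
    """Rewrite a corrupted state file, keeping IDs recoverable from a backup."""
    try:
        return load_ids(state_path)  # state is healthy, nothing to reset
    except ValueError:
        pass
    try:
        ids = load_ids(backup_path)
    except ValueError:
        ids = set()  # nothing recoverable; leads may be re-notified
    state_path.write_text(json.dumps({"processed_lead_ids": sorted(ids)}))
    return ids
```

If no IDs can be recovered, the reset still succeeds but re-notification becomes possible, which is worth mentioning when informing Chris of the fix.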

📝 Documentation Requirements

After any issue or fix:

  1. Update AGENT-HEALTH-AUDIT.md with current status
  2. Log incident in memory/YYYY-MM-DD.md with:
    • What failed
    • Root cause
    • Fix applied
    • Prevention strategy
  3. Self-improvement: If a pattern emerges, create/update monitoring rules
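
The incident log in step 2 can be sketched as an append to the dated memory file. Only the `memory/YYYY-MM-DD.md` naming and the four fields come from the list above; the entry layout and the function name are hypothetical.

```python
from datetime import date
from pathlib import Path


def log_incident(memory_dir: Path, day: date, what_failed: str,
                 root_cause: str, fix_applied: str, prevention: str) -> Path:
    """Append one incident entry to memory/YYYY-MM-DD.md and return the file path."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{day.isoformat()}.md"
    entry = (
        "\n## Incident\n"  # hypothetical entry layout
        f"- What failed: {what_failed}\n"
        f"- Root cause: {root_cause}\n"
        f"- Fix applied: {fix_applied}\n"
        f"- Prevention strategy: {prevention}\n"
    )
    # Append so that several incidents on the same day share one file.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return path
```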

🔄 Continuous Improvement

Weekly Review (Mondays at 10 AM)

  • Analyze error patterns from past week
  • Identify agents needing attention
  • Update monitoring thresholds
  • Refine alert logic

Monthly Audit

  • Review all agent configurations
  • Verify API credentials still valid
  • Check for deprecated endpoints
  • Optimize schedules for efficiency

🎯 Success Metrics

Goal: Zero undetected agent failures

Measure:

  • Time from failure to detection: <1 hour
  • Time from detection to resolution: <4 hours (for critical)
  • Percentage of issues caught proactively: 100%
  • Business impact from agent failures: Zero (catch before impact)
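
The two time-based targets above can be checked per incident. This is a minimal sketch assuming the failure, detection, and resolution timestamps are recorded somewhere; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

DETECTION_TARGET = timedelta(hours=1)   # failure to detection: under 1 hour
RESOLUTION_TARGET = timedelta(hours=4)  # detection to resolution: under 4 hours (critical)


def meets_targets(failed_at: datetime, detected_at: datetime,
                  resolved_at: datetime) -> dict[str, bool]:
    """Report whether one incident met the detection and resolution targets."""
    return {
        "detection": detected_at - failed_at < DETECTION_TARGET,
        "resolution": resolved_at - detected_at < RESOLUTION_TARGET,
    }
```

Aggregating these per-incident results over a week would give the figures needed for the weekly review.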

📞 Escalation Path

If Forge detects an issue it cannot fix autonomously:

  1. Immediate: Alert Chris via Telegram with full context
  2. If no response in 4 hours: Re-alert with urgency
  3. If critical (revenue impact): Suggest manual intervention
  4. Document: Log what prevented autonomous resolution
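
The escalation timing can be sketched as a small decision function. It rests on one interpretation of the steps above, namely that the manual-intervention suggestion accompanies the re-alert for critical issues; the function name, parameters, and return strings are hypothetical.

```python
from datetime import datetime, timedelta

REALERT_AFTER = timedelta(hours=4)  # from step 2: re-alert if no response in 4 hours


def next_escalation_step(alerted_at: datetime, now: datetime,
                         acknowledged: bool, critical: bool) -> str:
    """Decide what to do about an alert that Forge could not resolve on its own."""
    if acknowledged:
        return "none"
    if now - alerted_at < REALERT_AFTER:
        return "wait"
    # Assumption: critical issues get the manual-intervention suggestion with the re-alert.
    if critical:
        return "re-alert and suggest manual intervention"
    return "re-alert"
```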

🔧 Tools & Commands

# Quick health check
openclaw cron list

# Detailed job info
openclaw cron list --json | python3 -m json.tool

# Fix Telegram delivery
openclaw cron edit <job-id> --channel "telegram" --to "telegram:8269921691"

# Remove broken agent
openclaw cron rm <job-id>

# View logs
tail -n 50 /Users/claw/.openclaw/workspace/agents/<agent>/logs/*.log

Commitment: Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues.

Last Updated: April 8, 2026
Next Review: April 15, 2026 (weekly)