HOALedgerIQ_Website/AGENT-MONITORING-PROTOCOL.md

# Agent Health Monitoring Protocol

**Effective:** April 8, 2026
**Owner:** Forge (Autonomous Operations Bot)
**Priority:** Critical - System Reliability

---

## 🎯 Objective

Maintain 100% awareness of all agent health status and proactively alert Chris when issues arise, before they impact business operations.

---

## 📋 Daily Health Check Protocol

### Frequency
- **Automated:** Every 4 hours during business hours (8 AM - 8 PM ET)
- **Manual Review:** Once daily at 4 PM ET (comprehensive audit)
- **Heartbeat Trigger:** When heartbeat.md check is performed

### What to Monitor

1. **Cron Job Status**
   ```bash
   openclaw cron list
   ```
   - Check for `error` status
   - Monitor `consecutiveErrors` count
   - Verify `lastRunStatus` is `ok`
   - Alert threshold: >3 consecutive errors

2. **Agent Logs**
   - Review `/logs/` directories for each agent
   - Look for repeated failures or timeouts
   - Check for successful API calls

3. **Business Critical Paths**
   - **Sales leads:** Verify leads are being detected and notified
   - **Revenue systems:** ROI Calculator & Interest Form APIs
   - **Daily reports:** Confirm delivery of morning brief, SEO report, workout

4. **System Resources**
   - Disk space in workspace
   - API rate limits (GA4, Reddit, etc.)
   - Network connectivity to endpoints

---

## 🚨 Alert Triggers

### Critical (Immediate Alert Required)
- [ ] Sales lead monitor fails (missing leads = lost revenue)
- [ ] API endpoints unreachable (calc-submissions, interest form)
- [ ] Multiple agents failing simultaneously (systemic issue)
- [ ] Database or data corruption detected

### High Priority (Alert Within 1 Hour)
- [ ] Any agent with >10 consecutive errors
- [ ] Daily reports not delivered (SEO, morning brief, workout)
- [ ] Reddit scout missing opportunities
- [ ] JAE/Tier-1 scorer not processing leads

### Medium Priority (Include in Next Status Update)
- [ ] Single agent failure with <10 errors
- [ ] Occasional timeout or network error
- [ ] Non-critical feature degradation

---

## 📊 Status Report Format

When reporting issues to Chris, use this format:

```
🔔 *AGENT HEALTH ALERT* - [Severity]

**Issue:** [Brief description]
**Affected Agent:** [agent-name]
**Impact:** [Business impact - e.g., "Not detecting leads", "Missing daily report"]
**Errors:** [X] consecutive failures
**Last Successful Run:** [timestamp]

**Root Cause:** [If known]
**Fix Applied:** [If already fixed]
**Action Required:** [What Chris needs to do, if anything]

**System Status:**
✅ Operational: [list critical agents working]
⚠️ Degraded: [list agents with issues]
❌ Down: [list completely failed agents]
```

---

## 🛠️ Standard Fixes (Autonomous)

Forge can and should fix these without asking:

1. **Telegram Delivery Failures**
   - Change delivery target from broken `@heartbeat` to `telegram:8269921691`
   - Verify fix on next run

2. **Duplicate Agents**
   - Remove old/bash versions when Python versions exist
   - Keep Chris informed of cleanup

3. **Temporary Network Issues**
   - Retry failed API calls
   - Monitor for pattern vs. one-off

4. **State File Corruption**
   - Reset state files if corrupted
   - Preserve processed lead IDs to avoid re-notification

---

## 📝 Documentation Requirements

After any issue or fix:

1. **Update AGENT-HEALTH-AUDIT.md** with current status
2. **Log incident** in `memory/YYYY-MM-DD.md` with:
   - What failed
   - Root cause
   - Fix applied
   - Prevention strategy
3. **Self-improvement:** If a pattern emerges, create/update monitoring rules

---

## 🔄 Continuous Improvement

### Weekly Review (Mondays at 10 AM)
- Analyze error patterns from past week
- Identify agents needing attention
- Update monitoring thresholds
- Refine alert logic

### Monthly Audit
- Review all agent configurations
- Verify API credentials still valid
- Check for deprecated endpoints
- Optimize schedules for efficiency

---

## 🎯 Success Metrics

**Goal:** Zero undetected agent failures

**Measure:**
- Time from failure to detection: <1 hour
- Time from detection to resolution: <4 hours (for critical)
- Percentage of issues caught proactively: 100%
- Business impact from agent failures: Zero (catch before impact)

---

## 📞 Escalation Path

If Forge detects an issue it cannot fix autonomously:

1. **Immediate:** Alert Chris via Telegram with full context
2. **If no response in 4 hours:** Re-alert with urgency
3. **If critical (revenue impact):** Suggest manual intervention
4. **Document:** Log what prevented autonomous resolution

---

## 🔧 Tools & Commands

```bash
# Quick health check
openclaw cron list

# Detailed job info
openclaw cron list --json | python3 -m json.tool

# Fix Telegram delivery
openclaw cron edit <job-id> --channel "telegram" --to "telegram:8269921691"

# Remove broken agent
openclaw cron rm <job-id>

# View logs
tail -50 /Users/claw/.openclaw/workspace/agents/<agent>/logs/*.log
```

---

**Commitment:** Chris should never discover agent failures through missing outputs - Forge will always be the first to detect and report issues.

**Last Updated:** April 8, 2026
**Next Review:** April 15, 2026 (weekly)