HOA_Financial_Platform/docs/AI_FEATURE_AUDIT.md
olsch01 07d15001ae fix: improve AI health score accuracy and consistency
Address 4 issues identified in AI feature audit:

1. Reduce temperature from 0.3 to 0.1 for health score calculations
   to reduce 16-40 point score volatility across runs

2. Add explicit cash runway classification rules to operating prompt
   preventing the model from rating sub-3-month runway as "positive"

3. Pre-compute total special assessment income in both operating and
   reserve prompts, eliminating per-unit vs total confusion ($300
   vs $20,100)

4. Make YTD budget comparison actuals-aware: only compare months with
   posted journal entries, show current month budget separately, and
   add prompt guidance about month-end posting cadence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 12:44:12 -05:00


AI Feature Audit Report

Audit Date: 2026-03-05
Tenant Under Test: Pine Creek HOA (tenant_pine_creek_hoa_q33i)
AI Model: Qwen 3.5-397B-A17B via NVIDIA NIM (Temperature: 0.3)
Auditor: Claude Opus 4.6 (automated)
Data Snapshot Date: 2026-03-04


Executive Summary

Three AI-powered features were audited against ground-truth database records: Operating Fund Health, Reserve Fund Health, and Investment Recommendations. Overall, the AI demonstrates strong financial reasoning and produces actionable, fiduciary-appropriate recommendations. However, score consistency across runs is a concern: the operating score spans 16 points even after excluding an outlier first run (40 points with it), and the reserve score spans 23 points. Several specific data interpretation issues were also identified.

| Feature | Latest Score/Grade | Concurrence | Verdict |
| --- | --- | --- | --- |
| Operating Fund Health | 88 / Good | 72% | Score ~10-15 pts high; cash runway below its own "Good" threshold |
| Reserve Fund Health | 45 / Needs Attention | 85% | Well-calibrated; minor data misquote on annual contributions |
| Investment Recommendations | 6 recommendations | 88% | Excellent specificity; all market rates verified accurate |

Data Foundation (Ground Truth)

Financial Position

| Metric | Value | Source |
| --- | --- | --- |
| Operating Cash (Checking) | $27,418.81 | GL balance |
| Reserve Cash (Savings) | $10,688.45 | GL balance |
| Reserve CD #1a (FCB) | $10,000 @ 3.67%, matures 6/19/26 | investment_accounts |
| Reserve CD #2a (FCB) | $8,000 @ 3.60%, matures 4/14/26 | investment_accounts |
| Reserve CD #3a (FCB) | $10,000 @ 3.67%, matures 8/18/26 | investment_accounts |
| Total Reserve Fund | $38,688.45 | Cash + Investments |
| Total Assets | $66,107.26 | Operating + Reserve |

Budget (FY2026)

| Category | Annual Total |
| --- | --- |
| Operating Income | $184,207.40 |
| Operating Expense | $139,979.95 |
| Net Operating Surplus | $44,227.45 |
| Monthly Expense Run Rate | $11,665.00 |
| Reserve Interest Income | $1,449.96 |
| Reserve Disbursements | $22,000.00 (Mar $13K, Apr $9K) |

Assessment Structure

  • 67 units at $2,328.14/year regular + $300.00/year special (annual frequency)
  • Total annual regular assessments: ~$155,985
  • Total annual special assessments: ~$20,100
  • Budget timing: assessments front-loaded in Mar-Jun

Actuals (YTD through March 4, 2026)

| Metric | Value |
| --- | --- |
| YTD Income | $88.16 (ARC fees $100 - $50 adj + $38.16 interest) |
| YTD Expenses | $1,850.42 (January only) |
| Delinquent Invoices | 0 ($0.00) |
| Journal Entries Posted | 4 (Jan actuals + Feb adjusting + Feb opening balances) |

Capital Projects (from projects table, 26 total)

| Project | Cost | Target | Funded % |
| --- | --- | --- | --- |
| Pond Spillway | $7,000 | Mar 2026 | 0% |
| Tuscany Drain Box | $5,500 | May 2026 | 0% |
| Front Entrance Power Washing | $1,500 | Mar 2027 | 0% |
| Irrigation Pump Replacement | $1,500 | Jun 2027 | 0% |
| Road Sealing - All Roads | $80,000 | Jun 2029 | 0% |
| Asphalt Repair - Creek Stone Dr | $43,000 | TBD | 0% |
| Pavilion & Vineyard Structures | $7,000 | Jun 2035 | 0% |
| 16 placeholder items | $1.00 each | TBD | 0% |
| Total Planned | $152,016 | | 0% |

Reserve Components

  • 0 components tracked (empty reserve_components table)

Market Rates (fetched 2026-03-04)

| Type | Top Rate | Bank | Term |
| --- | --- | --- | --- |
| CD | 4.10% | E*TRADE / Synchrony | 12-14 mo |
| High-Yield Savings | 4.09% | Openbank | Liquid |
| Money Market | 4.03% | Vio Bank | Liquid |

1. Operating Fund Health Score

Latest Score: 88 (Good), generated 2026-03-04T19:24:36Z
Score History: 48 → 78 → 72 → 78 → 72 → 88 (6 runs, March 2-4, in the chronological order of the run table under "Score Consistency Concern")
Overall Concurrence: 72%

Factor-by-Factor Analysis

Factor 1: "Projected Cash Flow" — Impact: Positive

"12-month forecast shows consistent positive liquidity, with cash balances never dipping below the starting $27,419 and peaking at $142,788 in June."

| Check | Result |
| --- | --- |
| Budget surplus ($184K income vs $140K expense) | Verified |
| Assessments front-loaded Mar-Jun | Verified (budget shows $48K Mar, $64K Apr, $32K May, $16K Jun) |
| Peak of ~$142K in June | Plausible ($27K + cumulative income through June) |
| Cash never below starting $27K | Plausible (expenses < income by month) |

Concurrence: 95% — Forecast logic is sound. The only risk is the assumption that assessments are collected on the exact budget schedule.


Factor 2: "Delinquency Rate" — Impact: Positive

"$0.00 in overdue invoices and a 0.0% delinquency rate."

Concurrence: 100% — Database confirms zero delinquent invoices.


Factor 3: "Budget Performance (Timing)" — Impact: Neutral

"YTD income is 99.8% below budget ($55k variance) primarily due to the timing of the large Special Assessment ($20,700) and regular assessments appearing in future projected months."

| Check | Result |
| --- | --- |
| YTD income $88.16 | Verified |
| Budget includes March ($55K) in YTD calc | Accurate: AI uses month 3 of 12, includes full March budget |
| Timing explanation | Reasonable: we're only 4 days into March |
| Rating as "neutral" vs "negative" | Appropriate: correctly avoids penalizing for calendar timing |

Concurrence: 80% — The variance is accurately computed but presenting a $55K "variance" when we're 4 days into March could alarm a board member. The YTD window through month 3 includes all of March's budget despite only 4 days having elapsed. Consider computing YTD budget pro-rata or through the prior complete month.

🔧 Tuning Suggestion: Add a note to the prompt about pro-rating the current month's budget, or instruct the AI to note "X days into the current month" when the variance is driven by incomplete-month timing.


Factor 4: "Cash Reserves" — Impact: Positive

"Current operating cash of $27,419 provides 2.4 months of runway based on the annual expense run rate."

| Check | Result |
| --- | --- |
| $27,419 / ($139,980 / 12) = 2.35 months | Math verified |
| Rated as "positive" | Questionable ⚠️ |

Concurrence: 60% — The math is correct, but rating 2.4 months as "positive" contradicts the scoring guidelines which state 2-3 months = "Fair" (60-74) and 3-6 months = "Good" (75-89). This factor should be "neutral" at best, and the overall score should reflect that the HOA is below the "Good" threshold for cash reserves.

🔧 Tuning Suggestion: Add explicit guidance in the prompt: "If cash runway is below 3 months, this factor MUST be neutral or negative, regardless of projected future inflows."


Factor 5: "Expense Management" — Impact: Positive

"YTD expenses are $36,313 under budget (4.8% of annual budget spent vs 25% of year elapsed)."

| Check | Result |
| --- | --- |
| YTD expenses $1,850.42 | Verified |
| Budget YTD (3 months): ~$38,164 | Correct |
| $1,850 / $38,164 = 4.85% | Math verified |
| "25% of year elapsed" | Correct (month 3 of 12) |
| Phrasing "of annual budget" | Misleading ⚠️: it's actually 4.8% of YTD budget, not annual |

Concurrence: 70%. The percentage is correctly calculated against the YTD budget, but the phrasing "of annual budget" is incorrect: measured against the $139,980 annual budget, $1,850 would be about 1.3%, not 4.8%. The low spend is also not necessarily positive. Only January actuals exist and February has not been posted yet, which the AI partially acknowledges with "or delayed billing cycles."


Recommendation Assessment

| # | Recommendation | Priority | Concurrence | Notes |
| --- | --- | --- | --- | --- |
| 1 | "Verify the posting schedule for the $20,700 Special Assessment" | Low | 90% | Valid; assessments are annual, collection timing matters |
| 2 | "Investigate the low YTD expense recognition ($1,850 vs $38,164)" | Medium | 95% | Excellent catch; Feb expenses not posted yet |
| 3 | "Consider moving excess cash over $100K in Q2 to interest-bearing account" | Low | 85% | Sound advice; aligns with HY Savings at 4.09% |

Recommendation Concurrence: 90% — All three recommendations are actionable and data-backed.


Score Assessment

Is 88 (Good) the right score?

| Scoring Criterion | Guidelines Say | Actual | Alignment |
| --- | --- | --- | --- |
| Cash reserves | 3-6 months for "Good" | 2.4 months | Below threshold |
| Income vs expenses | "Roughly matching" for Good | $184K vs $140K (surplus) | Exceeds |
| Delinquency | "Manageable" for Good | 0% | Excellent |
| Budget performance | No major overruns for Good | Under budget (timing) | Positive |
| Projected cash flow | Not explicitly in guidelines | Strong positive trajectory | Positive |

The cash runway of 2.4 months is below the stated "Good" (75-89) threshold of 3-6 months and technically falls in the "Fair" (60-74) range of 2-3 months. Earlier AI runs scored this 72-78, which better aligns with the guidelines. The 88 appears to overweight the projected future cash flow (which is speculative) vs the current actual position.

Suggested correct score: 74-80 (high end of Fair to low end of Good)


Score Consistency Concern

| Run Date | Score | Label |
| --- | --- | --- |
| Mar 2 15:07 | 48 | Needs Attention |
| Mar 2 15:12 | 78 | Good |
| Mar 2 15:36 | 72 | Fair |
| Mar 2 17:09 | 78 | Good |
| Mar 3 02:03 | 72 | Fair |
| Mar 4 19:24 | 88 | Good |

A 40-point spread (48-88) across 6 runs with essentially the same data is concerning. Even excluding the outlier first run (which noted a data config issue with "1 units"), the remaining 5 runs span 72-88 (16 points). At temperature 0.3, this suggests the model is not deterministic enough for financial scoring.

🔧 Tuning Suggestion: Consider lowering temperature to 0.1 for health score calculations to improve consistency. Alternatively, implement a moving average of the last 3 scores to smooth volatility.
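The spread figures and the proposed 3-run smoothing can be reproduced with a short sketch. The function names `score_spread` and `trailing_average` are hypothetical and are not taken from the platform's code; the scores are the six operating runs in the table above, and the window handling for the first two runs (averaging whatever history exists) is an assumption.

```python
def score_spread(scores):
    """Return the difference between the highest and lowest score."""
    return max(scores) - min(scores)


def trailing_average(scores, window=3):
    """Average each score with up to `window - 1` scores before it.

    Early entries use a shorter window because fewer runs exist yet.
    """
    smoothed = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        smoothed.append(round(sum(chunk) / len(chunk), 1))
    return smoothed


# Operating scores in chronological order, from the run table above.
operating_runs = [48, 78, 72, 78, 72, 88]

print(score_spread(operating_runs))      # spread across all six runs
print(score_spread(operating_runs[1:]))  # spread without the outlier first run
print(trailing_average(operating_runs))  # smoothed series a dashboard could display
```

Smoothing only changes what is displayed. It does not make an individual run more reliable, so it complements the temperature change and does not replace it.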


2. Reserve Fund Health Score

Latest Score: 45 (Needs Attention), generated 2026-03-04T19:24:50Z
Score History: 25 → 25 → 48 → 42 → 45 → 35 → 45 (7 runs, March 2-4, in the chronological order of the run table under "Score Consistency Concern")
Overall Concurrence: 85%

Factor-by-Factor Analysis

Factor 1: "Funded Ratio" — Impact: Negative

"Calculated at 0% because no reserve components have been inventoried or assigned replacement costs, making it impossible to measure true funding health against the $152,016 in planned projects."

| Check | Result |
| --- | --- |
| 0 reserve components in DB | Verified |
| $152,016 in planned projects | Verified (sum of all projects rows) |
| 0% funded ratio | Technically accurate (no denominator from components) |
| Distinction between components and projects | Well articulated |

Concurrence: 95% — The AI correctly identifies that the 0% is an artifact of missing reserve study data, not a literal lack of funds. It appropriately flags this as a governance failure.


Factor 2: "Projected Cash Flow" — Impact: Positive

"Strong immediate liquidity; cash balance is projected to rise from $10,688 to over $49,000 by May 2026 due to special assessment income covering the $12,500 in urgent 2026 project costs."

| Check | Result |
| --- | --- |
| Starting reserve cash $10,688 | Verified |
| 2026 project costs: $7K (Mar) + $5.5K (May) = $12,500 | Verified |
| Special assessment: $300 × 67 = $20,100/year | Verified |
| CD maturities: $8K (Apr), $10K (Jun), $10K (Aug) | Verified |
| Projected rise to $49K by May | Plausible (income + maturities - project costs) |

Concurrence: 85% — Math is directionally correct. However, the assessment is annual frequency so the full $20,100 may arrive in a single payment, not spread monthly. The timing assumption is critical.


Factor 3: "Component Tracking" — Impact: Negative

"Critical failure in governance: 'No reserve components tracked' means the association is flying blind on the condition and remaining useful life of major assets like roads and irrigation."

Concurrence: 100% — Database confirms 0 rows in reserve_components. This is objectively a critical gap.


Factor 4: "Annual Contributions" — Impact: Negative

"Recurring annual reserve income is only $300 (plus minimal interest), which is grossly insufficient to fund the $80,000 road sealing project due in 2029."

| Check | Result |
| --- | --- |
| Reserve budget income: $1,449.96/yr (interest only) | Verified |
| Special assessment: $300/unit × 67 = $20,100/yr | Verified |
| "$300" cited as annual reserve income | Incorrect ⚠️ |
| Road Sealing $80K in June 2029 | Verified |

Concurrence: 65% — The concern about insufficient contributions is valid, but the "$300" figure appears to confuse the per-unit special assessment amount ($300/unit) with the total annual reserve income. Actual annual reserve income = $1,450 (interest) + $20,100 (special assessments) = $21,550/yr. Even at $21,550/yr, the 3 years until Road Sealing would accumulate ~$64,650, still short of $80K. So the directional concern is correct, but the magnitude is significantly misstated.
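The corrected arithmetic can be checked with a minimal sketch. It uses the rounded figures quoted in the paragraph above and the same simplification: flat annual income for three years, ignoring the existing reserve balance, compounding, and other project draws. The function name `projected_contributions` is hypothetical.

```python
from decimal import Decimal


def projected_contributions(annual_income, years):
    """Total contributions over `years`, assuming flat annual income."""
    return annual_income * years


interest_income = Decimal("1450")               # rounded reserve interest per year
special_assessments = Decimal("300") * 67       # per-unit amount times 67 units
annual_reserve_income = interest_income + special_assessments

accumulated = projected_contributions(annual_reserve_income, 3)
shortfall = Decimal("80000") - accumulated      # against the Road Sealing cost

print(annual_reserve_income, accumulated, shortfall)
```

Under these assumptions, contributions alone leave a gap of roughly $15K against the $80,000 project, which supports the direction of the AI's concern while correcting its magnitude.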

🔧 Tuning Suggestion: The prompt should explicitly label the special assessment income total (not per-unit) in the data context. Currently the data says "$300.00/unit × 67 units (annual)" — the AI should compute $20,100 but sometimes fixates on the $300 per-unit figure. Consider pre-computing and passing the total.


Recommendation Assessment

| # | Recommendation | Priority | Concurrence | Notes |
| --- | --- | --- | --- | --- |
| 1 | "Commission a professional Reserve Study to inventory assets and establish funded ratio" | High | 100% | Critical and universally correct |
| 2 | "Develop a long-term funding plan for the $80,000 Road Sealing project (2029)" | High | 90% | Verified project exists; $80K with 0% funded |
| 3 | "Formalize collection of special assessments into the reserve fund vs operating" | Medium | 95% | Budget shows special assessments in operating income section |

Recommendation Concurrence: 95% — All recommendations are actionable, appropriately prioritized, and backed by database evidence.


Score Assessment

Is 45 (Needs Attention) the right score?

| Scoring Criterion | Guidelines Say | Actual | Alignment |
| --- | --- | --- | --- |
| Percent funded | 20-30% for "Needs Attention" | 0% (no components) | ⬇️ Worse than threshold |
| Contributions | "Inadequate" for Needs Attention | $21,550/yr for $152K in projects | ⚠️ Borderline |
| Component tracking | "Multiple urgent unfunded" | 0 tracked, 2 due in 2026 | Critical gap |
| Investments | Not scored negatively | 3 CDs earning 3.6-3.67% | Positive |
| Capital readiness | | $12.5K due soon, only $10.7K cash | ⚠️ Tight |

A score of 45 is reasonable. The 0% funded ratio technically suggests "At Risk" (20-39), but the presence of real assets ($38.7K), active investments, and manageable near-term liquidity justifies bumping it into the "Needs Attention" band. The AI's balancing of the artificial 0% metric against actual fund health shows good judgment.

Suggested correct score: 40-50 — the AI's 45 is well-calibrated.


Score Consistency Concern

| Run Date | Score | Label |
| --- | --- | --- |
| Mar 2 15:06 | 25 | At Risk |
| Mar 2 15:13 | 25 | At Risk |
| Mar 2 15:37 | 48 | Needs Attention |
| Mar 2 17:10 | 42 | Needs Attention |
| Mar 3 02:04 | 45 | Needs Attention |
| Mar 4 18:49 | 35 | At Risk |
| Mar 4 19:24 | 45 | Needs Attention |

A 23-point spread (25-48) across 7 runs. The scores oscillate between "At Risk" and "Needs Attention" — the model cannot consistently decide which band this falls into. The most recent 3 runs (35, 45, 45) are more stable.

🔧 Tuning Suggestion: Add boundary guidance to the prompt: "When the score falls within ±5 points of a threshold (40, 60, 75, 90), explicitly justify which side of the boundary the HOA falls on."
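A post-processing check along these lines could flag runs that need the extra justification. This is a sketch only: `nearby_thresholds` is a hypothetical helper, the thresholds and the ±5 margin come from the suggestion above, and treating a distance of exactly 5 as "near" is an assumption.

```python
BAND_THRESHOLDS = (40, 60, 75, 90)  # band boundaries named in the suggestion


def nearby_thresholds(score, margin=5, thresholds=BAND_THRESHOLDS):
    """Return the band boundaries within `margin` points of `score`."""
    return [t for t in thresholds if abs(score - t) <= margin]


# Reserve scores from the run table above.
reserve_runs = [25, 25, 48, 42, 45, 35, 45]
flagged = {score: nearby_thresholds(score) for score in sorted(set(reserve_runs))}

for score, boundaries in flagged.items():
    status = f"justify against {boundaries}" if boundaries else "clear of boundaries"
    print(score, status)
```

On the audited runs, 35, 42, and 45 all sit within 5 points of the 40 boundary, which matches the observed oscillation between "At Risk" and "Needs Attention".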


3. AI Investment Recommendations

Latest Run: 2026-03-04T19:28:22Z (3 runs saved)
Overall Concurrence: 88%

Overall Assessment

"The HOA has a healthy long-term cash flow outlook with significant surpluses projected by mid-2026, but faces an immediate liquidity pinch in the Reserve Fund for March/April capital projects. The current investment strategy relies on older, lower-yielding CDs (3.60-3.67%) that are maturing soon."

Concurrence: 92% — Every claim verified:

  • CDs are at 3.60-3.67% vs market 4.10% (verified)
  • March project ($7K) vs reserve cash ($10.7K) is tight (verified)
  • Long-term surplus projected from assessment income (verified from budget)

Recommendation-by-Recommendation Analysis

Rec 1: "Critical Reserve Shortfall for March Project" — HIGH / Liquidity Warning

Claim Database Value Match
Reserve cash = $10,688 $10,688.45 Exact
$7,000 Pond Spillway project due March Projects table: $7,000, Mar 2026 Exact
Shortfall risk $10,688 - $7,000 = $3,688 remaining — tight but feasible
Suggested action: expedite special assessment or transfer from operating Sound advice

Concurrence: 90% — The liquidity concern is real. After paying the $7K project, only $3.7K would remain in reserve cash before the $5.5K May project. The AI correctly flags the timing risk even though the fund is technically solvent.


Rec 2: "Reinvest Maturing CD #2a at Higher Rate" — HIGH / Maturity Action

| Claim | Database Value | Match |
| --- | --- | --- |
| CD #2a = $8,000 | $8,000.00 | Exact |
| Current rate = 3.60% | 3.60% | Exact |
| Maturity = April 14, 2026 | 2026-04-14 | Exact |
| Market rate = 4.10% (E*TRADE) | CD rates: E*TRADE 4.10%, 1 year, $0 min | Exact |
| Additional yield: ~$40/year per $8K | $8K × 0.50% = $40 | Math correct |

Concurrence: 95% — Textbook-correct recommendation. Every data point verified. The 50 bps improvement is risk-free income.
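The yield check in the table above is simple-interest arithmetic, sketched here for reference. The helper name `extra_annual_interest` is hypothetical, and the calculation ignores compounding and any early-withdrawal terms.

```python
from decimal import Decimal


def extra_annual_interest(principal, current_rate_pct, new_rate_pct):
    """Simple-interest gain per year from moving to a higher rate (rates in percent)."""
    gain = principal * (new_rate_pct - current_rate_pct) / Decimal(100)
    return gain.quantize(Decimal("0.01"))


# CD #2a: $8,000 at 3.60%, compared with the 4.10% market rate.
cd_2a_gain = extra_annual_interest(Decimal("8000"), Decimal("3.60"), Decimal("4.10"))
print(cd_2a_gain)
```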


Rec 3: "Establish 12-Month CD Ladder for Reserves" — MEDIUM / CD Ladder

Claim Database Value Match
~$38K total reserve portfolio $38,688.45 Exact
Suggest 4-rung ladder (3/6/9/12 mo) Standard strategy
Rates up to 4.10% Market data confirmed
$9K matures every quarter $38K / 4 = $9.5K per rung Approximate

Concurrence: 75% — Strategy is sound in principle, but the recommendation overlooks two constraints:

  1. Immediate project costs ($12.5K in 2026) must be reserved first, leaving ~$26K for laddering
  2. Investing the entire $38K is aggressive — some cash buffer should remain liquid

🔧 Tuning Suggestion: Add a constraint to the prompt: "When recommending CD ladders, always subtract upcoming project costs (next 12 months) and a minimum emergency reserve (1 month of budgeted reserve expenses) before calculating the investable amount."
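The constraint can be illustrated with a small sketch. The function `ladder_plan` is hypothetical, the project costs are the $12.5K of 2026 projects cited above, and taking one twelfth of the $22,000 FY2026 reserve disbursement budget as "one month of budgeted reserve expenses" is an assumption about how that buffer would be defined.

```python
from decimal import Decimal


def ladder_plan(total_reserve, upcoming_project_costs, emergency_reserve, rungs=4):
    """Return (investable amount, amount per rung) after setting aside commitments."""
    committed = sum(upcoming_project_costs, Decimal(0)) + emergency_reserve
    investable = max(total_reserve - committed, Decimal(0))
    per_rung = (investable / rungs).quantize(Decimal("0.01"))
    return investable, per_rung


total_reserve = Decimal("38688.45")                      # cash plus CDs
projects_2026 = [Decimal("7000"), Decimal("5500")]       # Pond Spillway, Tuscany Drain Box
emergency = (Decimal("22000") / 12).quantize(Decimal("0.01"))  # assumed buffer

investable, per_rung = ladder_plan(total_reserve, projects_2026, emergency)
print(investable, per_rung)
```

With these inputs each rung of a four-rung ladder is about $6.1K, compared with the roughly $9K to $9.5K per rung implied by laddering the full portfolio.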


Rec 4: "Deploy Excess Operating Cash to High-Yield Savings" — MEDIUM / New Investment

Claim Database Value Match
Operating cash = $27,418 $27,418.81 Exact
3-month buffer = ~$35,000 $11,665 × 3 = $34,995 Math correct
Current cash below buffer $27.4K < $35K Correctly identified
Openbank 4.09% APY Market data: Openbank 4.09%, $0.01 min Exact
Trigger: "As soon as balance exceeds $35K" Sound deferred recommendation

Concurrence: 90% — The AI correctly identifies the current shortfall and provides a forward-looking trigger. Well-structured advice that respects the liquidity constraint.


Rec 5: "Optimize Reserve Cash Yield Post-Project" — LOW / Reallocation

Claim Database Value Match
Vio Bank Money Market at 4.03% Market data: Vio Bank 4.03%, $0 min Exact
Post-project reserve cash deployment Appropriate timing
T+1 liquidity for emergencies Correct MM account characteristic

Concurrence: 85% — Reasonable low-priority optimization. Correctly uses market data.


Rec 6: "Formalize Special Assessment Collection for Reserves" — LOW / General

Claim Database Value Match
$300/unit special assessment Assessment groups: $300.00 special Exact
Risk of commingling with operating Budget shows special assessments in operating income Identified

Concurrence: 90% — Important governance recommendation. The budget structure does show special assessments as operating income, which could lead to improper fund commingling.


Risk Notes Assessment

| Risk Note | Verified | Concurrence |
| --- | --- | --- |
| "Reserve cash ($10.6K) barely sufficient for $7K + $5.5K projects" | $10,688 vs $12,500 in projects | 95% |
| "Concentration risk: CDs maturing in 4-month window (Apr-Aug)" | All 3 CDs mature Apr-Aug 2026 | 100% |
| "Operating cash ballooning to $140K+ without investment plan" | Budget shows large Q2 surplus | 85% |
| "Road Sealing $80K in 2029 needs dedicated savings plan" | Project exists, 0% funded | 95% |

Risk Notes Concurrence: 94% — All risk items are data-backed and appropriately flagged.


Cross-Run Consistency (Investment Recommendations)

Three runs were compared. Key observations:

  • Core recommendations are highly consistent across runs: CD reinvestment, HY savings for operating, CD ladder for reserves
  • Dollar amounts match exactly across all runs (same data inputs)
  • Bank name recommendations vary slightly (E*TRADE vs "Top CD Rate") — cosmetic, not substantive
  • Priority levels are stable (HIGH for liquidity warnings, MEDIUM for optimization)

Consistency Grade: A- — Investment recommendations show much better consistency than health scores, likely because the structured data (specific CDs, specific rates) constrains the output more than the subjective health scoring.


Cross-Cutting Issues

Issue 1: Score Volatility (MEDIUM Priority)

Health scores vary significantly across runs despite identical input data:

  • Operating: 40-point spread (48-88)
  • Reserve: 23-point spread (25-48)

Root Cause: Temperature 0.3 allows too much variance for numerical scoring. The model interprets guidelines subjectively.

Recommended Fix:

  1. Reduce temperature to 0.1 for health score calculations
  2. Implement a 3-run moving average to smooth individual run variance
  3. Add explicit boundary justification requirements to prompts

Issue 2: YTD Budget Calculation Includes Incomplete Month (LOW Priority)

The operating health score computes YTD budget through the current month (March), but actual data may only cover a few days. This creates alarming income variances (e.g., "$55K variance") that are pure timing artifacts.

Recommended Fix:

  • Compute YTD budget through the prior completed month (February)
  • OR pro-rate the current month's budget by days elapsed
  • Add a note to the prompt: "If the variance is driven by the current incomplete month, flag it as 'timing' and weight it minimally."
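The first two options above can be compared in a short sketch. The function names and the monthly budget figures are hypothetical and chosen only to keep the arithmetic easy to follow; they are not Pine Creek's budget.

```python
import calendar
from datetime import date


def ytd_budget_prior_month(monthly_budget, as_of):
    """Budget through the last fully completed month before `as_of`."""
    return sum(monthly_budget[: as_of.month - 1])


def ytd_budget_prorated(monthly_budget, as_of):
    """Completed months plus the current month scaled by days elapsed."""
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    partial = monthly_budget[as_of.month - 1] * as_of.day / days_in_month
    return ytd_budget_prior_month(monthly_budget, as_of) + partial


# Hypothetical income budget: small Jan/Feb, large March, zero afterwards.
monthly_income_budget = [100, 100, 3100] + [0] * 9
snapshot = date(2026, 3, 4)  # the audit's data snapshot date

full_month = sum(monthly_income_budget[: snapshot.month])
prior_only = ytd_budget_prior_month(monthly_income_budget, snapshot)
prorated = ytd_budget_prorated(monthly_income_budget, snapshot)
print(full_month, prior_only, prorated)
```

In this example the full-month method reports a YTD budget of 3,300 four days into March, while the prior-month and pro-rated methods report 200 and 600. That gap is the timing artifact described above.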

Issue 3: Per-Unit vs Total Confusion on Special Assessments (LOW Priority)

The AI sometimes quotes "$300" as the annual reserve income instead of $300 × 67 = $20,100. The data passed says "$300.00/unit × 67 units (annual)" but the model occasionally fixates on the per-unit figure.

Recommended Fix:

  • Pre-compute and include the total in the data: "Total Annual Special Assessment Income: $20,100.00"
  • Keep the per-unit breakdown for context but lead with the total
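A minimal sketch of the pre-computation, assuming the data context is assembled as lines of text before being placed in the prompt. The function name `special_assessment_context` and the exact wording of the breakdown line are hypothetical; only the "Total Annual Special Assessment Income" label comes from the recommendation above.

```python
from decimal import Decimal


def special_assessment_context(per_unit, unit_count):
    """Build prompt lines that lead with the total and keep the breakdown."""
    total = per_unit * unit_count
    return [
        f"Total Annual Special Assessment Income: ${total:,.2f}",
        f"  (breakdown: ${per_unit:,.2f}/unit x {unit_count} units, annual)",
    ]


context_lines = special_assessment_context(Decimal("300.00"), 67)
print("\n".join(context_lines))
```

Leading with the total means the model no longer has to do the multiplication itself, which is where the "$300" misquote originated.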

Issue 4: Cash Runway Classification Inconsistency (MEDIUM Priority)

The operating health score rates 2.4 months of cash runway as "positive" despite the scoring guidelines defining 2-3 months as "Fair" territory. This inflates the overall score.

Recommended Fix:

  • Add explicit prompt guidance: "Cash runway categorization: <2 months = negative, 2-3 months = neutral, 3-6 months = positive, 6+ months = strongly positive. Do NOT rate below-threshold runway as positive based on projected future inflows."
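The same mapping could also be enforced in code as a check on the model's output. The sketch below uses the bands from the guidance above; the function names are hypothetical, and treating each lower bound as inclusive (exactly 3 months counts as "positive") is an assumption the guidance does not settle.

```python
def cash_runway_months(operating_cash, annual_expenses):
    """Months of expenses covered by current cash at the budgeted run rate."""
    return operating_cash / (annual_expenses / 12)


def runway_impact(months):
    """Map runway to an impact label using the bands in the prompt guidance."""
    if months < 2:
        return "negative"
    if months < 3:
        return "neutral"
    if months < 6:
        return "positive"
    return "strongly positive"


# Ground-truth figures from the Data Foundation section.
months = cash_runway_months(27418.81, 139979.95)
print(round(months, 2), runway_impact(months))
```

For the audited snapshot this yields about 2.35 months and a "neutral" label, so a validator of this kind would have rejected the "positive" rating on Factor 4.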

Issue 5: Dual Project Tables (INFORMATIONAL)

The schema contains both capital_projects (empty) and projects (26 rows). The health score service correctly queries projects, but auditors initially checked capital_projects and found no data. This dual-table pattern could confuse future developers.

Recommended Fix:

  • Consolidate into a single table, OR
  • Add a comment/documentation clarifying the canonical source

Concurrence Summary by Recommendation

Operating Fund Health — Recommendations

| Recommendation | Concurrence |
| --- | --- |
| Verify posting schedule for $20,700 Special Assessment | 90% |
| Investigate low YTD expense recognition | 95% |
| Move excess cash to interest-bearing account | 85% |
| Average | 90% |

Reserve Fund Health — Recommendations

| Recommendation | Concurrence |
| --- | --- |
| Commission professional Reserve Study | 100% |
| Develop funding plan for $80K Road Sealing | 90% |
| Formalize special assessment collection for reserves | 95% |
| Average | 95% |

Investment Planning — Recommendations

| Recommendation | Concurrence |
| --- | --- |
| Critical Reserve Shortfall for March Project | 90% |
| Reinvest Maturing CD #2a at Higher Rate | 95% |
| Establish 12-Month CD Ladder | 75% |
| Deploy Operating Cash to HY Savings | 90% |
| Optimize Reserve Cash Post-Project | 85% |
| Formalize Special Assessment Collection | 90% |
| Average | 88% |

Final Grades

| Feature | Score Accuracy | Recommendation Quality | Data Fidelity | Consistency | Overall |
| --- | --- | --- | --- | --- | --- |
| Operating Fund Health | C+ (score ~10-15 pts high) | A (90%) | B+ (minor math phrasing) | C (16-pt spread excluding outlier run) | 72% (B-) |
| Reserve Fund Health | A- (well-calibrated) | A (95%) | B (per-unit confusion) | B- (23-pt spread) | 85% (B+) |
| Investment Recommendations | N/A (no single score) | A (88%) | A (exact data matches) | A- (stable across runs) | 88% (A-) |

Priority Action Items for Tuning

  1. [HIGH] Reduce AI temperature from 0.3 → 0.1 for health score calculations to reduce score volatility
  2. [MEDIUM] Add explicit cash-runway-to-impact mapping in operating prompt to prevent misclassification
  3. [MEDIUM] Pre-compute total special assessment income in data context (not just per-unit)
  4. [LOW] Adjust YTD budget calculation to use prior completed month or pro-rate current month
  5. [LOW] Add boundary justification requirement to scoring prompts
  6. [LOW] Consider implementing 3-run moving average for displayed health scores

Generated by Claude Opus 4.6 — Automated AI Feature Audit