fix: improve AI health score accuracy and consistency
Address 4 issues identified in AI feature audit:

1. Reduce temperature from 0.3 to 0.1 for health score calculations to
   reduce 16-40 point score volatility across runs
2. Add explicit cash runway classification rules to operating prompt,
   preventing the model from rating sub-3-month runway as "positive"
3. Pre-compute total special assessment income in both operating and
   reserve prompts, eliminating per-unit vs total confusion ($300 vs $20,100)
4. Make YTD budget comparison actuals-aware: only compare months with
   posted journal entries, show current month budget separately, and add
   prompt guidance about month-end posting cadence

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
545
docs/AI_FEATURE_AUDIT.md
Normal file
@@ -0,0 +1,545 @@
|
||||
# AI Feature Audit Report
|
||||
|
||||
**Audit Date:** 2026-03-05
|
||||
**Tenant Under Test:** Pine Creek HOA (`tenant_pine_creek_hoa_q33i`)
|
||||
**AI Model:** Qwen 3.5-397B-A17B via NVIDIA NIM (Temperature: 0.3)
|
||||
**Auditor:** Claude Opus 4.6 (automated)
|
||||
**Data Snapshot Date:** 2026-03-04
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Three AI-powered features were audited against ground-truth database records: **Operating Fund Health**, **Reserve Fund Health**, and **Investment Recommendations**. Overall, the AI demonstrates strong financial reasoning and produces actionable, fiduciary-appropriate recommendations. However, score consistency across runs is a concern (a 16-point spread on operating after excluding one outlier run, or 40 points with it, and a 23-point spread on reserve), and several specific data interpretation issues were identified.
|
||||
|
||||
| Feature | Latest Score/Grade | Concurrence | Verdict |
|---|---|---|---|
| Operating Fund Health | 88 / Good | **72%** | Score ~10-15 pts high; cash runway below its own "Good" threshold |
| Reserve Fund Health | 45 / Needs Attention | **85%** | Well-calibrated; minor data misquote on annual contributions |
| Investment Recommendations | 6 recommendations | **88%** | Excellent specificity; all market rates verified accurate |
|
||||
|
||||
---
|
||||
|
||||
## Data Foundation (Ground Truth)
|
||||
|
||||
### Financial Position
|
||||
|
||||
| Metric | Value | Source |
|---|---|---|
| Operating Cash (Checking) | $27,418.81 | GL balance |
| Reserve Cash (Savings) | $10,688.45 | GL balance |
| Reserve CD #1a (FCB) | $10,000 @ 3.67%, matures 6/19/26 | `investment_accounts` |
| Reserve CD #2a (FCB) | $8,000 @ 3.60%, matures 4/14/26 | `investment_accounts` |
| Reserve CD #3a (FCB) | $10,000 @ 3.67%, matures 8/18/26 | `investment_accounts` |
| Total Reserve Fund | $38,688.45 | Cash + Investments |
| Total Assets | $66,107.26 | Operating + Reserve |
|
||||
|
||||
### Budget (FY2026)
|
||||
|
||||
| Category | Annual Total |
|---|---|
| Operating Income | $184,207.40 |
| Operating Expense | $139,979.95 |
| **Net Operating Surplus** | **$44,227.45** |
| Monthly Expense Run Rate | $11,665.00 |
| Reserve Interest Income | $1,449.96 |
| Reserve Disbursements | $22,000.00 (Mar $13K, Apr $9K) |
|
||||
|
||||
### Assessment Structure
|
||||
|
||||
- **67 units** at $2,328.14/year regular + $300.00/year special (annual frequency)
|
||||
- Total annual regular assessments: ~$155,985
|
||||
- Total annual special assessments: ~$20,100
|
||||
- Budget timing: assessments front-loaded in Mar-Jun
|
||||
|
||||
### Actuals (YTD through March 4, 2026)
|
||||
|
||||
| Metric | Value |
|---|---|
| YTD Income | $88.16 (ARC fees $100 - $50 adj + $38.16 interest) |
| YTD Expenses | $1,850.42 (January only) |
| Delinquent Invoices | 0 ($0.00) |
| Journal Entries Posted | 4 (Jan actuals + Feb adjusting + Feb opening balances) |
|
||||
|
||||
### Capital Projects (from `projects` table, 26 total)
|
||||
|
||||
| Project | Cost | Target | Funded % |
|---|---|---|---|
| Pond Spillway | $7,000 | Mar 2026 | 0% |
| Tuscany Drain Box | $5,500 | May 2026 | 0% |
| Front Entrance Power Washing | $1,500 | Mar 2027 | 0% |
| Irrigation Pump Replacement | $1,500 | Jun 2027 | 0% |
| **Road Sealing - All Roads** | **$80,000** | **Jun 2029** | **0%** |
| Asphalt Repair - Creek Stone Dr | $43,000 | TBD | 0% |
| Pavilion & Vineyard Structures | $7,000 | Jun 2035 | 0% |
| 16 placeholder items | $1.00 each | TBD | 0% |
| **Total Planned** | **$152,016** | | **0%** |
|
||||
|
||||
### Reserve Components
|
||||
|
||||
- **0 components tracked** (empty `reserve_components` table)
|
||||
|
||||
### Market Rates (fetched 2026-03-04)
|
||||
|
||||
| Type | Top Rate | Bank | Term |
|---|---|---|---|
| CD | 4.10% | E*TRADE / Synchrony | 12-14 mo |
| High-Yield Savings | 4.09% | Openbank | Liquid |
| Money Market | 4.03% | Vio Bank | Liquid |
|
||||
|
||||
---
|
||||
|
||||
## 1. Operating Fund Health Score
|
||||
|
||||
**Latest Score:** 88 (Good) — Generated 2026-03-04T19:24:36Z
|
||||
**Score History:** 48 → 78 → 72 → 78 → 72 → **88** (6 runs, March 2-4)
|
||||
**Overall Concurrence: 72%**
|
||||
|
||||
### Factor-by-Factor Analysis
|
||||
|
||||
#### Factor 1: "Projected Cash Flow" — Impact: Positive
|
||||
> "12-month forecast shows consistent positive liquidity, with cash balances never dipping below the starting $27,419 and peaking at $142,788 in June."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| Budget surplus ($184K income vs $140K expense) | **Verified** ✅ |
|
||||
| Assessments front-loaded Mar-Jun | **Verified** ✅ (budget shows $48K Mar, $64K Apr, $32K May, $16K Jun) |
|
||||
| Peak of ~$142K in June | **Plausible** ✅ ($27K + cumulative income through June) |
|
||||
| Cash never below starting $27K | **Plausible** ✅ (expenses < income by month) |
|
||||
|
||||
**Concurrence: 95%** — Forecast logic is sound. The only risk is the assumption that assessments are collected on the exact budget schedule.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 2: "Delinquency Rate" — Impact: Positive
|
||||
> "$0.00 in overdue invoices and a 0.0% delinquency rate."
|
||||
|
||||
**Concurrence: 100%** ✅ — Database confirms zero delinquent invoices.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 3: "Budget Performance (Timing)" — Impact: Neutral
|
||||
> "YTD income is 99.8% below budget ($55k variance) primarily due to the timing of the large Special Assessment ($20,700) and regular assessments appearing in future projected months."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| YTD income $88.16 | **Verified** ✅ |
|
||||
| Budget includes March ($55K) in YTD calc | **Accurate** — AI uses month 3 of 12, includes full March budget |
|
||||
| Timing explanation | **Reasonable** — we're only 4 days into March |
|
||||
| Rating as "neutral" vs "negative" | **Appropriate** ✅ — correctly avoids penalizing for calendar timing |
|
||||
|
||||
**Concurrence: 80%** — The variance is accurately computed but presenting a $55K "variance" when we're 4 days into March could alarm a board member. The YTD window through month 3 includes all of March's budget despite only 4 days having elapsed. Consider computing YTD budget pro-rata or through the prior complete month.
|
||||
|
||||
**🔧 Tuning Suggestion:** Add a note to the prompt about pro-rating the current month's budget, or instruct the AI to note "X days into the current month" when the variance is driven by incomplete-month timing.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 4: "Cash Reserves" — Impact: Positive
|
||||
> "Current operating cash of $27,419 provides 2.4 months of runway based on the annual expense run rate."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| $27,419 / ($139,980 / 12) = 2.35 months | **Math verified** ✅ |
|
||||
| Rated as "positive" | **Questionable** ⚠️ |
|
||||
|
||||
**Concurrence: 60%** — The math is correct, but rating 2.4 months as "positive" contradicts the scoring guidelines which state 2-3 months = "Fair" (60-74) and 3-6 months = "Good" (75-89). This factor should be "neutral" at best, and the overall score should reflect that the HOA is *below* the "Good" threshold for cash reserves.
|
||||
|
||||
**🔧 Tuning Suggestion:** Add explicit guidance in the prompt: "If cash runway is below 3 months, this factor MUST be neutral or negative, regardless of projected future inflows."
|
||||
|
||||
---
|
||||
|
||||
#### Factor 5: "Expense Management" — Impact: Positive
|
||||
> "YTD expenses are $36,313 under budget (4.8% of annual budget spent vs 25% of year elapsed)."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| YTD expenses $1,850.42 | **Verified** ✅ |
|
||||
| Budget YTD (3 months): ~$38,164 | **Correct** ✅ |
|
||||
| $1,850 / $38,164 = 4.85% | **Math verified** ✅ |
|
||||
| "25% of year elapsed" | **Correct** (month 3 of 12) |
|
||||
| Phrasing "of annual budget" | **Misleading** ⚠️ — it's actually 4.8% of YTD budget, not annual |
|
||||
|
||||
**Concurrence: 70%** — The percentage is correctly calculated against YTD budget, but the phrasing "of annual budget" is incorrect. Also, the low spend is not necessarily positive — only January actuals exist; February hasn't been posted yet, which the AI partially acknowledges with "or delayed billing cycles."
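The gap between "share of YTD budget" and "share of annual budget" can be checked from the figures in the table above. The following is a minimal sketch; `percentOf` is a hypothetical helper written for illustration and is not taken from the audited service.

```typescript
// Hypothetical helper for illustration; not part of the audited codebase.
function percentOf(part: number, whole: number): number {
  if (whole <= 0) {
    throw new Error("whole must be positive");
  }
  return (part / whole) * 100;
}

// Figures quoted in the audit.
const ytdExpenseActual = 1850.42;      // January actuals only
const ytdExpenseBudget = 38164;        // budgeted expenses for months 1-3 (approximate)
const annualExpenseBudget = 139979.95; // FY2026 operating expense budget

const shareOfYtdBudget = percentOf(ytdExpenseActual, ytdExpenseBudget);
const shareOfAnnualBudget = percentOf(ytdExpenseActual, annualExpenseBudget);

console.log(`Share of YTD budget:    ${shareOfYtdBudget.toFixed(2)}%`);
console.log(`Share of annual budget: ${shareOfAnnualBudget.toFixed(2)}%`);
```

The first figure reproduces the AI's 4.8%; the second is noticeably smaller, which is why the "of annual budget" wording is misleading.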
|
||||
|
||||
---
|
||||
|
||||
### Recommendation Assessment
|
||||
|
||||
| # | Recommendation | Priority | Concurrence |
|
||||
|---|---|---|---|
|
||||
| 1 | "Verify the posting schedule for the $20,700 Special Assessment" | Low | **90%** ✅ Valid; assessments are annual, collection timing matters |
|
||||
| 2 | "Investigate the low YTD expense recognition ($1,850 vs $38,164)" | Medium | **95%** ✅ Excellent catch; Feb expenses not posted yet |
|
||||
| 3 | "Consider moving excess cash over $100K in Q2 to interest-bearing account" | Low | **85%** ✅ Sound advice; aligns with HY Savings at 4.09% |
|
||||
|
||||
**Recommendation Concurrence: 90%** — All three recommendations are actionable and data-backed.
|
||||
|
||||
---
|
||||
|
||||
### Score Assessment
|
||||
|
||||
**Is 88 (Good) the right score?**
|
||||
|
||||
| Scoring Criterion | Guidelines Say | Actual | Alignment |
|
||||
|---|---|---|---|
|
||||
| Cash reserves | 3-6 months for "Good" | 2.4 months | ❌ Below threshold |
|
||||
| Income vs expenses | "Roughly matching" for Good | $184K vs $140K (surplus) | ✅ Exceeds |
|
||||
| Delinquency | "Manageable" for Good | 0% | ✅ Excellent |
|
||||
| Budget performance | No major overruns for Good | Under budget (timing) | ✅ Positive |
|
||||
| Projected cash flow | Not explicitly in guidelines | Strong positive trajectory | ✅ Positive |
|
||||
|
||||
The cash runway of 2.4 months is below the stated "Good" (75-89) threshold of 3-6 months and technically falls in the "Fair" (60-74) range of 2-3 months. Earlier AI runs scored this 72-78, which better aligns with the guidelines. The 88 appears to overweight the projected future cash flow (which is speculative) vs the current actual position.
|
||||
|
||||
**Suggested correct score: 74-80** (high end of Fair to low end of Good)
|
||||
|
||||
---
|
||||
|
||||
### Score Consistency Concern
|
||||
|
||||
| Run Date | Score | Label |
|---|---|---|
| Mar 2 15:07 | 48 | Needs Attention |
| Mar 2 15:12 | 78 | Good |
| Mar 2 15:36 | 72 | Fair |
| Mar 2 17:09 | 78 | Good |
| Mar 3 02:03 | 72 | Fair |
| Mar 4 19:24 | 88 | Good |
|
||||
|
||||
A **40-point spread** (48-88) across 6 runs with essentially the same data is concerning. Even excluding the outlier first run (which noted a data config issue with "1 units"), the remaining 5 runs span 72-88 (16 points). At temperature 0.3, this suggests the model is not deterministic enough for financial scoring.
|
||||
|
||||
**🔧 Tuning Suggestion:** Consider lowering temperature to 0.1 for health score calculations to improve consistency. Alternatively, implement a moving average of the last 3 scores to smooth volatility.
|
||||
|
||||
---
|
||||
|
||||
## 2. Reserve Fund Health Score
|
||||
|
||||
**Latest Score:** 45 (Needs Attention) — Generated 2026-03-04T19:24:50Z
|
||||
**Score History:** 25 → 25 → 48 → 42 → 45 → 35 → **45** (7 runs, March 2-4)
|
||||
**Overall Concurrence: 85%**
|
||||
|
||||
### Factor-by-Factor Analysis
|
||||
|
||||
#### Factor 1: "Funded Ratio" — Impact: Negative
|
||||
> "Calculated at 0% because no reserve components have been inventoried or assigned replacement costs, making it impossible to measure true funding health against the $152,016 in planned projects."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| 0 reserve components in DB | **Verified** ✅ |
|
||||
| $152,016 in planned projects | **Verified** ✅ (sum of all `projects` rows) |
|
||||
| 0% funded ratio | **Technically accurate** ✅ (no denominator from components) |
|
||||
| Distinction between components and projects | **Well articulated** ✅ |
|
||||
|
||||
**Concurrence: 95%** — The AI correctly identifies that the 0% is an artifact of missing reserve study data, not a literal lack of funds. It appropriately flags this as a governance failure.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 2: "Projected Cash Flow" — Impact: Positive
|
||||
> "Strong immediate liquidity; cash balance is projected to rise from $10,688 to over $49,000 by May 2026 due to special assessment income covering the $12,500 in urgent 2026 project costs."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| Starting reserve cash $10,688 | **Verified** ✅ |
|
||||
| 2026 project costs: $7K (Mar) + $5.5K (May) = $12,500 | **Verified** ✅ |
|
||||
| Special assessment: $300 × 67 = $20,100/year | **Verified** ✅ |
|
||||
| CD maturities: $8K (Apr), $10K (Jun), $10K (Aug) | **Verified** ✅ |
|
||||
| Projected rise to $49K by May | **Plausible** ✅ (income + maturities - project costs) |
|
||||
|
||||
**Concurrence: 85%** — Math is directionally correct. However, the assessment is annual frequency so the full $20,100 may arrive in a single payment, not spread monthly. The timing assumption is critical.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 3: "Component Tracking" — Impact: Negative
|
||||
> "Critical failure in governance: 'No reserve components tracked' means the association is flying blind on the condition and remaining useful life of major assets like roads and irrigation."
|
||||
|
||||
**Concurrence: 100%** ✅ — Database confirms 0 rows in `reserve_components`. This is objectively a critical gap.
|
||||
|
||||
---
|
||||
|
||||
#### Factor 4: "Annual Contributions" — Impact: Negative
|
||||
> "Recurring annual reserve income is only $300 (plus minimal interest), which is grossly insufficient to fund the $80,000 road sealing project due in 2029."
|
||||
|
||||
| Check | Result |
|
||||
|---|---|
|
||||
| Reserve budget income: $1,449.96/yr (interest only) | **Verified** ✅ |
|
||||
| Special assessment: $300/unit × 67 = $20,100/yr | **Verified** ✅ |
|
||||
| "$300" cited as annual reserve income | **Incorrect** ⚠️ |
|
||||
| Road Sealing $80K in June 2029 | **Verified** ✅ |
|
||||
|
||||
**Concurrence: 65%** — The concern about insufficient contributions is valid, but the "$300" figure appears to confuse the per-unit special assessment amount ($300/unit) with the total annual reserve income. Actual annual reserve income = $1,450 (interest) + $20,100 (special assessments) = **$21,550/yr**. Even at $21,550/yr, the 3 years until Road Sealing would accumulate ~$64,650, still short of $80K. So the directional concern is correct, but the magnitude is significantly misstated.
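The arithmetic behind that correction can be written out as a short sketch. All variable names are illustrative, and the calculation deliberately mirrors the simplification used above: it ignores the existing reserve balance and the other planned projects.

```typescript
// Illustrative arithmetic only; names are hypothetical.
const reserveInterestPerYear = 1450;   // budgeted reserve interest ($1,449.96), rounded
const specialAssessmentPerUnit = 300;  // dollars per unit per year
const assessedUnitCount = 67;
const roadSealingCost = 80000;
const yearsUntilRoadSealing = 3;       // 2026 to mid-2029, as assumed in the audit

const annualSpecialAssessments = specialAssessmentPerUnit * assessedUnitCount;
const annualReserveIncome = reserveInterestPerYear + annualSpecialAssessments;
const accumulatedByDueDate = annualReserveIncome * yearsUntilRoadSealing;
const roadSealingShortfall = roadSealingCost - accumulatedByDueDate;

console.log(`Annual reserve income:  $${annualReserveIncome}`);
console.log(`Accumulated in 3 years: $${accumulatedByDueDate}`);
console.log(`Shortfall vs $80,000:   $${roadSealingShortfall}`);
```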
|
||||
|
||||
**🔧 Tuning Suggestion:** The prompt should explicitly label the special assessment income total (not per-unit) in the data context. Currently the data says "$300.00/unit × 67 units (annual)" — the AI should compute $20,100 but sometimes fixates on the $300 per-unit figure. Consider pre-computing and passing the total.
|
||||
|
||||
---
|
||||
|
||||
### Recommendation Assessment
|
||||
|
||||
| # | Recommendation | Priority | Concurrence |
|
||||
|---|---|---|---|
|
||||
| 1 | "Commission a professional Reserve Study to inventory assets and establish funded ratio" | High | **100%** ✅ Critical and universally correct |
|
||||
| 2 | "Develop a long-term funding plan for the $80,000 Road Sealing project (2029)" | High | **90%** ✅ Verified project exists; $80K with 0% funded |
|
||||
| 3 | "Formalize collection of special assessments into the reserve fund vs operating" | Medium | **95%** ✅ Budget shows special assessments in operating income section |
|
||||
|
||||
**Recommendation Concurrence: 95%** — All recommendations are actionable, appropriately prioritized, and backed by database evidence.
|
||||
|
||||
---
|
||||
|
||||
### Score Assessment
|
||||
|
||||
**Is 45 (Needs Attention) the right score?**
|
||||
|
||||
| Scoring Criterion | Guidelines Say | Actual | Alignment |
|
||||
|---|---|---|---|
|
||||
| Percent funded | 20-30% for "Needs Attention" | 0% (no components) | ⬇️ Worse than threshold |
|
||||
| Contributions | "Inadequate" for Needs Attention | $21,550/yr for $152K in projects | ⚠️ Borderline |
|
||||
| Component tracking | "Multiple urgent unfunded" | 0 tracked, 2 due in 2026 | ❌ Critical gap |
|
||||
| Investments | Not scored negatively | 3 CDs earning 3.6-3.67% | ✅ Positive |
|
||||
| Capital readiness | | $12.5K due soon, only $10.7K cash | ⚠️ Tight |
|
||||
|
||||
A score of 45 is reasonable. The 0% funded ratio technically suggests "At Risk" (20-39), but the presence of real assets ($38.7K), active investments, and manageable near-term liquidity justifies bumping it into the "Needs Attention" band. The AI's balancing of the artificial 0% metric against actual fund health shows good judgment.
|
||||
|
||||
**Suggested correct score: 40-50** — the AI's 45 is well-calibrated.
|
||||
|
||||
---
|
||||
|
||||
### Score Consistency Concern
|
||||
|
||||
| Run Date | Score | Label |
|---|---|---|
| Mar 2 15:06 | 25 | At Risk |
| Mar 2 15:13 | 25 | At Risk |
| Mar 2 15:37 | 48 | Needs Attention |
| Mar 2 17:10 | 42 | Needs Attention |
| Mar 3 02:04 | 45 | Needs Attention |
| Mar 4 18:49 | 35 | At Risk |
| Mar 4 19:24 | 45 | Needs Attention |
|
||||
|
||||
A **23-point spread** (25-48) across 7 runs. The scores oscillate between "At Risk" and "Needs Attention": the model cannot consistently decide which band this HOA falls into. The most recent 3 runs (45, 35, 45) are more stable, although they still straddle the band boundary at 40.
|
||||
|
||||
**🔧 Tuning Suggestion:** Add boundary guidance to the prompt: "When the score falls within ±5 points of a threshold (40, 60, 75, 90), explicitly justify which side of the boundary the HOA falls on."
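The boundary rule could also be enforced outside the prompt, as a post-processing check on the returned score. This is a sketch under that assumption; `nearestThresholdWithinMargin` and the constants are hypothetical names, and only the threshold values (40, 60, 75, 90) and the ±5 margin come from the suggestion above.

```typescript
// Band thresholds and margin from the tuning suggestion; names are hypothetical.
const BAND_THRESHOLDS: readonly number[] = [40, 60, 75, 90];
const BOUNDARY_MARGIN = 5;

// Returns the threshold a score sits close to, or null if it is clear of all of them.
// The thresholds are at least 15 points apart, so at most one can match.
function nearestThresholdWithinMargin(score: number): number | null {
  for (const threshold of BAND_THRESHOLDS) {
    if (Math.abs(score - threshold) <= BOUNDARY_MARGIN) {
      return threshold;
    }
  }
  return null;
}

// Reserve fund scores from the run table above.
const reserveRunScores = [25, 25, 48, 42, 45, 35, 45];
for (const score of reserveRunScores) {
  const threshold = nearestThresholdWithinMargin(score);
  const note = threshold === null ? "clear of boundaries" : `needs justification near ${threshold}`;
  console.log(`${score}: ${note}`);
}
```

A check like this would let the service reject or re-request a response that lands near a boundary without the required justification text.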
|
||||
|
||||
---
|
||||
|
||||
## 3. AI Investment Recommendations
|
||||
|
||||
**Latest Run:** 2026-03-04T19:28:22Z (3 runs saved)
|
||||
**Overall Concurrence: 88%**
|
||||
|
||||
### Overall Assessment
|
||||
> "The HOA has a healthy long-term cash flow outlook with significant surpluses projected by mid-2026, but faces an immediate liquidity pinch in the Reserve Fund for March/April capital projects. The current investment strategy relies on older, lower-yielding CDs (3.60-3.67%) that are maturing soon."
|
||||
|
||||
**Concurrence: 92%** ✅ — Every claim verified:
|
||||
- CDs are at 3.60-3.67% vs market 4.10% (verified)
|
||||
- March project ($7K) vs reserve cash ($10.7K) is tight (verified)
|
||||
- Long-term surplus projected from assessment income (verified from budget)
|
||||
|
||||
---
|
||||
|
||||
### Recommendation-by-Recommendation Analysis
|
||||
|
||||
#### Rec 1: "Critical Reserve Shortfall for March Project" — HIGH / Liquidity Warning
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| Reserve cash = $10,688 | $10,688.45 | ✅ Exact |
|
||||
| $7,000 Pond Spillway project due March | Projects table: $7,000, Mar 2026 | ✅ Exact |
|
||||
| Shortfall risk | $10,688 - $7,000 = $3,688 remaining — tight but feasible | ✅ |
|
||||
| Suggested action: expedite special assessment or transfer from operating | Sound advice | ✅ |
|
||||
|
||||
**Concurrence: 90%** — The liquidity concern is real. After paying the $7K project, only $3.7K would remain in reserve cash before the $5.5K May project. The AI correctly flags the timing risk even though the fund is technically solvent.
|
||||
|
||||
---
|
||||
|
||||
#### Rec 2: "Reinvest Maturing CD #2a at Higher Rate" — HIGH / Maturity Action
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| CD #2a = $8,000 | $8,000.00 | ✅ Exact |
|
||||
| Current rate = 3.60% | 3.60% | ✅ Exact |
|
||||
| Maturity = April 14, 2026 | 2026-04-14 | ✅ Exact |
|
||||
| Market rate = 4.10% (E*TRADE) | CD rates: E*TRADE 4.10%, 1 year, $0 min | ✅ Exact |
|
||||
| Additional yield: ~$40/year per $8K | $8K × 0.50% = $40 | ✅ Math correct |
|
||||
|
||||
**Concurrence: 95%** ✅ — Textbook-correct recommendation. Every data point verified. The 50 bps improvement is risk-free income.
|
||||
|
||||
---
|
||||
|
||||
#### Rec 3: "Establish 12-Month CD Ladder for Reserves" — MEDIUM / CD Ladder
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| ~$38K total reserve portfolio | $38,688.45 | ✅ Exact |
|
||||
| Suggest 4-rung ladder (3/6/9/12 mo) | Standard strategy | ✅ |
|
||||
| Rates up to 4.10% | Market data confirmed | ✅ |
|
||||
| $9K matures every quarter | $38K / 4 = $9.5K per rung | ✅ Approximate |
|
||||
|
||||
**Concurrence: 75%** — Strategy is sound in principle, but the recommendation overlooks two constraints:
|
||||
1. **Immediate project costs ($12.5K in 2026)** must be reserved first, leaving ~$26K for laddering
|
||||
2. **Investing the entire $38K** is aggressive — some cash buffer should remain liquid
|
||||
|
||||
**🔧 Tuning Suggestion:** Add a constraint to the prompt: "When recommending CD ladders, always subtract upcoming project costs (next 12 months) and a minimum emergency reserve (1 month of budgeted reserve expenses) before calculating the investable amount."
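The investable amount described in that constraint can be sketched as follows. `planCdLadder` and the `LadderPlan` shape are hypothetical, amounts are in integer cents to avoid floating-point drift, and the emergency reserve figure is an assumption: one twelfth of the $22,000 budgeted reserve disbursements.

```typescript
// Hypothetical sketch; names and the emergency-reserve figure are assumptions.
interface LadderPlan {
  investableCents: number;
  perRungCents: number;
}

function planCdLadder(
  totalReserveCents: number,
  upcomingProjectCents: number,
  emergencyReserveCents: number,
  rungCount: number,
): LadderPlan {
  if (rungCount <= 0 || rungCount % 1 !== 0) {
    throw new Error("rungCount must be a positive integer");
  }
  // Set aside near-term project costs and the liquid buffer before laddering.
  const investableCents = Math.max(
    0,
    totalReserveCents - upcomingProjectCents - emergencyReserveCents,
  );
  return {
    investableCents,
    perRungCents: Math.floor(investableCents / rungCount),
  };
}

const ladderPlan = planCdLadder(
  3868845, // $38,688.45 total reserve fund
  1250000, // $12,500.00 in 2026 project costs
  183333,  // assumed buffer: one month of the $22,000 budgeted disbursements
  4,       // 3/6/9/12-month rungs
);

console.log(`Investable: $${(ladderPlan.investableCents / 100).toFixed(2)}`);
console.log(`Per rung:   $${(ladderPlan.perRungCents / 100).toFixed(2)}`);
```

Under these assumptions each rung is closer to $6K than to the $9K the AI suggested.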
|
||||
|
||||
---
|
||||
|
||||
#### Rec 4: "Deploy Excess Operating Cash to High-Yield Savings" — MEDIUM / New Investment
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| Operating cash = $27,418 | $27,418.81 | ✅ Exact |
|
||||
| 3-month buffer = ~$35,000 | $11,665 × 3 = $34,995 | ✅ Math correct |
|
||||
| Current cash below buffer | $27.4K < $35K | ✅ Correctly identified |
|
||||
| Openbank 4.09% APY | Market data: Openbank 4.09%, $0.01 min | ✅ Exact |
|
||||
| Trigger: "As soon as balance exceeds $35K" | Sound deferred recommendation | ✅ |
|
||||
|
||||
**Concurrence: 90%** ✅ — The AI correctly identifies the current shortfall and provides a forward-looking trigger. Well-structured advice that respects the liquidity constraint.
|
||||
|
||||
---
|
||||
|
||||
#### Rec 5: "Optimize Reserve Cash Yield Post-Project" — LOW / Reallocation
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| Vio Bank Money Market at 4.03% | Market data: Vio Bank 4.03%, $0 min | ✅ Exact |
|
||||
| Post-project reserve cash deployment | Appropriate timing | ✅ |
|
||||
| T+1 liquidity for emergencies | Correct MM account characteristic | ✅ |
|
||||
|
||||
**Concurrence: 85%** ✅ — Reasonable low-priority optimization. Correctly uses market data.
|
||||
|
||||
---
|
||||
|
||||
#### Rec 6: "Formalize Special Assessment Collection for Reserves" — LOW / General
|
||||
|
||||
| Claim | Database Value | Match |
|
||||
|---|---|---|
|
||||
| $300/unit special assessment | Assessment groups: $300.00 special | ✅ Exact |
|
||||
| Risk of commingling with operating | Budget shows special assessments in operating income | ✅ Identified |
|
||||
|
||||
**Concurrence: 90%** ✅ — Important governance recommendation. The budget structure does show special assessments as operating income, which could lead to improper fund commingling.
|
||||
|
||||
---
|
||||
|
||||
### Risk Notes Assessment
|
||||
|
||||
| Risk Note | Verified | Concurrence |
|
||||
|---|---|---|
|
||||
| "Reserve cash ($10.6K) barely sufficient for $7K + $5.5K projects" | ✅ $10,688 vs $12,500 in projects | **95%** |
|
||||
| "Concentration risk: CDs maturing in 4-month window (Apr-Aug)" | ✅ All 3 CDs mature Apr-Aug 2026 | **100%** |
|
||||
| "Operating cash ballooning to $140K+ without investment plan" | ✅ Budget shows large Q2 surplus | **85%** |
|
||||
| "Road Sealing $80K in 2029 needs dedicated savings plan" | ✅ Project exists, 0% funded | **95%** |
|
||||
|
||||
**Risk Notes Concurrence: 94%** — All risk items are data-backed and appropriately flagged.
|
||||
|
||||
---
|
||||
|
||||
### Cross-Run Consistency (Investment Recommendations)
|
||||
|
||||
Three runs were compared. Key observations:
|
||||
- **Core recommendations are highly consistent** across runs: CD reinvestment, HY savings for operating, CD ladder for reserves
|
||||
- **Dollar amounts match exactly** across all runs (same data inputs)
|
||||
- **Bank name recommendations vary slightly** (E*TRADE vs "Top CD Rate") — cosmetic, not substantive
|
||||
- **Priority levels are stable** (HIGH for liquidity warnings, MEDIUM for optimization)
|
||||
|
||||
**Consistency Grade: A-** — Investment recommendations show much better consistency than health scores, likely because the structured data (specific CDs, specific rates) constrains the output more than the subjective health scoring.
|
||||
|
||||
---
|
||||
|
||||
## Cross-Cutting Issues
|
||||
|
||||
### Issue 1: Score Volatility (HIGH Priority)
|
||||
|
||||
Health scores vary significantly across runs despite identical input data:
|
||||
- Operating: 40-point spread (48-88)
|
||||
- Reserve: 23-point spread (25-48)
|
||||
|
||||
**Root Cause:** Temperature 0.3 allows too much variance for numerical scoring. The model interprets guidelines subjectively.
|
||||
|
||||
**Recommended Fix:**
|
||||
1. Reduce temperature to **0.1** for health score calculations
|
||||
2. Implement a **3-run moving average** to smooth individual run variance
|
||||
3. Add explicit **boundary justification** requirements to prompts
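The spread figures and the proposed 3-run moving average can be reproduced from the run tables. This is a minimal sketch with hypothetical function names; the score arrays follow the order of the timestamped run tables.

```typescript
// Hypothetical helpers for illustration.
function scoreSpread(scores: number[]): number {
  if (scores.length === 0) {
    throw new Error("scores must not be empty");
  }
  return Math.max(...scores) - Math.min(...scores);
}

// Average of the most recent `windowSize` scores (or fewer if not enough runs exist).
function trailingAverage(scores: number[], windowSize: number): number {
  if (scores.length === 0 || windowSize <= 0) {
    throw new Error("need at least one score and a positive window");
  }
  const recent = scores.slice(-windowSize);
  const total = recent.reduce((sum, score) => sum + score, 0);
  return total / recent.length;
}

const operatingRunScores = [48, 78, 72, 78, 72, 88];
const reserveFundRunScores = [25, 25, 48, 42, 45, 35, 45];

console.log(`Operating spread (all runs):      ${scoreSpread(operatingRunScores)}`);
console.log(`Operating spread (excl. 1st run): ${scoreSpread(operatingRunScores.slice(1))}`);
console.log(`Reserve spread:                   ${scoreSpread(reserveFundRunScores)}`);
console.log(`Operating 3-run average: ${trailingAverage(operatingRunScores, 3).toFixed(1)}`);
console.log(`Reserve 3-run average:   ${trailingAverage(reserveFundRunScores, 3).toFixed(1)}`);
```

Averaging smooths what is displayed but does not reduce the underlying variance, so it complements the temperature change rather than replacing it.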
|
||||
|
||||
### Issue 2: YTD Budget Calculation Includes Incomplete Month (LOW Priority)
|
||||
|
||||
The operating health score computes YTD budget through the current month (March), but actual data may only cover a few days. This creates alarming income variances (e.g., "$55K variance") that are pure timing artifacts.
|
||||
|
||||
**Recommended Fix:**
|
||||
- Compute YTD budget through the **prior completed month** (February)
|
||||
- OR pro-rate the current month's budget by days elapsed
|
||||
- Add a note to the prompt: "If the variance is driven by the current incomplete month, flag it as 'timing' and weight it minimally."
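Both budget options can be sketched in a few lines. The function names and the input shape (a 12-entry array, index 0 = January) are hypothetical. The March-June values follow the budget schedule quoted earlier; the January and February values are placeholders chosen only so that months 1-3 sum to roughly the $55K quoted.

```typescript
// Hypothetical sketch; function names and input shape are assumptions.
function daysInMonth(year: number, monthIndex: number): number {
  // Day 0 of the following month is the last day of this month.
  return new Date(Date.UTC(year, monthIndex + 1, 0)).getUTCDate();
}

// Option 1: budget for fully completed months only.
function ytdBudgetThroughPriorMonth(monthlyBudget: number[], monthIndex: number): number {
  return monthlyBudget.slice(0, monthIndex).reduce((sum, amount) => sum + amount, 0);
}

// Option 2: completed months plus the current month pro-rated by days elapsed.
function ytdBudgetProRated(
  monthlyBudget: number[],
  year: number,
  monthIndex: number,
  dayOfMonth: number,
): number {
  const completed = ytdBudgetThroughPriorMonth(monthlyBudget, monthIndex);
  const currentMonthBudget = monthlyBudget[monthIndex] ?? 0;
  const elapsedFraction = dayOfMonth / daysInMonth(year, monthIndex);
  return completed + currentMonthBudget * elapsedFraction;
}

// Jan/Feb are placeholders; Mar-Jun follow the quoted budget schedule.
const incomeBudgetByMonth = [3500, 3500, 48000, 64000, 32000, 16000, 0, 0, 0, 0, 0, 0];

// Snapshot date 2026-03-04: monthIndex 2 (March), day 4.
console.log(ytdBudgetThroughPriorMonth(incomeBudgetByMonth, 2));
console.log(ytdBudgetProRated(incomeBudgetByMonth, 2026, 2, 4).toFixed(2));
```

Either option shrinks the reported variance from roughly $55K to a figure that reflects only the elapsed part of the year.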
|
||||
|
||||
### Issue 3: Per-Unit vs Total Confusion on Special Assessments (MEDIUM Priority)
|
||||
|
||||
The AI sometimes quotes "$300" as the annual reserve income instead of $300 × 67 = $20,100. The data passed says "$300.00/unit × 67 units (annual)" but the model occasionally fixates on the per-unit figure.
|
||||
|
||||
**Recommended Fix:**
|
||||
- Pre-compute and include the total in the data: "Total Annual Special Assessment Income: $20,100.00"
|
||||
- Keep the per-unit breakdown for context but lead with the total
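One way to build that data-context text is sketched below. `formatUsd` and `specialAssessmentContextLines` are hypothetical names, not the service's actual prompt builder; amounts are passed as non-negative integer cents.

```typescript
// Hypothetical sketch of a prompt-context builder; assumes non-negative integer cents.
function formatUsd(cents: number): string {
  const dollars = Math.floor(cents / 100);
  const remainder = cents % 100;
  // Insert thousands separators into the whole-dollar part.
  const grouped = String(dollars).replace(/\B(?=(\d{3})+(?!\d))/g, ",");
  const centsText = remainder < 10 ? `0${remainder}` : String(remainder);
  return `$${grouped}.${centsText}`;
}

// Leads with the pre-computed total and keeps the per-unit breakdown as context.
function specialAssessmentContextLines(perUnitCents: number, unitCount: number): string[] {
  const totalCents = perUnitCents * unitCount;
  return [
    `Total Annual Special Assessment Income: ${formatUsd(totalCents)}`,
    `  (${formatUsd(perUnitCents)}/unit × ${unitCount} units, annual)`,
  ];
}

const contextLines = specialAssessmentContextLines(30000, 67);
console.log(contextLines.join("\n"));
```

Putting the total on its own labeled line means the model never has to multiply, which removes the opportunity to quote the per-unit figure as the total.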
|
||||
|
||||
### Issue 4: Cash Runway Classification Inconsistency (MEDIUM Priority)
|
||||
|
||||
The operating health score rates 2.4 months of cash runway as "positive" despite the scoring guidelines defining 2-3 months as "Fair" territory. This inflates the overall score.
|
||||
|
||||
**Recommended Fix:**
|
||||
- Add explicit prompt guidance: "Cash runway categorization: <2 months = negative, 2-3 months = neutral, 3-6 months = positive, 6+ months = strongly positive. Do NOT rate below-threshold runway as positive based on projected future inflows."
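The same mapping could be computed in code and passed to the model as a fixed input rather than left to interpretation. This is a sketch under that assumption; the names are hypothetical, and treating each lower bound as inclusive (exactly 3 months counts as "positive") is an assumption the guidance above does not spell out.

```typescript
// Hypothetical sketch; lower bounds are treated as inclusive (an assumption).
type RunwayImpact = "negative" | "neutral" | "positive" | "strongly positive";

function cashRunwayMonths(operatingCash: number, annualExpenses: number): number {
  if (annualExpenses <= 0) {
    throw new Error("annualExpenses must be positive");
  }
  return operatingCash / (annualExpenses / 12);
}

function classifyRunway(months: number): RunwayImpact {
  if (months < 2) return "negative";
  if (months < 3) return "neutral";
  if (months < 6) return "positive";
  return "strongly positive";
}

// Figures from the audit: $27,418.81 operating cash, $139,979.95 annual expenses.
const runwayMonths = cashRunwayMonths(27418.81, 139979.95);
console.log(`${runwayMonths.toFixed(2)} months -> ${classifyRunway(runwayMonths)}`);
```

With the audited figures the runway falls in the "neutral" band, consistent with the finding that the "positive" rating was too generous.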
|
||||
|
||||
### Issue 5: Dual Project Tables (INFORMATIONAL)
|
||||
|
||||
The schema contains both `capital_projects` (empty) and `projects` (26 rows). The health score service correctly queries `projects`, but auditors initially checked `capital_projects` and found no data. This dual-table pattern could confuse future developers.
|
||||
|
||||
**Recommended Fix:**
|
||||
- Consolidate into a single table, OR
|
||||
- Add a comment/documentation clarifying the canonical source
|
||||
|
||||
---
|
||||
|
||||
## Concurrence Summary by Recommendation
|
||||
|
||||
### Operating Fund Health — Recommendations
|
||||
| Recommendation | Concurrence |
|
||||
|---|---|
|
||||
| Verify posting schedule for $20,700 Special Assessment | 90% |
|
||||
| Investigate low YTD expense recognition | 95% |
|
||||
| Move excess cash to interest-bearing account | 85% |
|
||||
| **Average** | **90%** |
|
||||
|
||||
### Reserve Fund Health — Recommendations
|
||||
| Recommendation | Concurrence |
|
||||
|---|---|
|
||||
| Commission professional Reserve Study | 100% |
|
||||
| Develop funding plan for $80K Road Sealing | 90% |
|
||||
| Formalize special assessment collection for reserves | 95% |
|
||||
| **Average** | **95%** |
|
||||
|
||||
### Investment Planning — Recommendations
|
||||
| Recommendation | Concurrence |
|
||||
|---|---|
|
||||
| Critical Reserve Shortfall for March Project | 90% |
|
||||
| Reinvest Maturing CD #2a at Higher Rate | 95% |
|
||||
| Establish 12-Month CD Ladder | 75% |
|
||||
| Deploy Operating Cash to HY Savings | 90% |
|
||||
| Optimize Reserve Cash Post-Project | 85% |
|
||||
| Formalize Special Assessment Collection | 90% |
|
||||
| **Average** | **88%** |
|
||||
|
||||
---
|
||||
|
||||
## Final Grades
|
||||
|
||||
| Feature | Score Accuracy | Recommendation Quality | Data Fidelity | Consistency | **Overall** |
|
||||
|---|---|---|---|---|---|
|
||||
| Operating Fund Health | C+ (score ~15 pts high) | A (90%) | B+ (minor math phrasing) | C (16-pt spread) | **72% — B-** |
|
||||
| Reserve Fund Health | A- (well-calibrated) | A (95%) | B (per-unit confusion) | B- (23-pt spread) | **85% — B+** |
|
||||
| Investment Recommendations | N/A (no single score) | A (88%) | A (exact data matches) | A- (stable across runs) | **88% — A-** |
|
||||
|
||||
---
|
||||
|
||||
## Priority Action Items for Tuning
|
||||
|
||||
1. **[HIGH]** Reduce AI temperature from 0.3 → 0.1 for health score calculations to reduce score volatility
|
||||
2. **[MEDIUM]** Add explicit cash-runway-to-impact mapping in operating prompt to prevent misclassification
|
||||
3. **[MEDIUM]** Pre-compute total special assessment income in data context (not just per-unit)
|
||||
4. **[LOW]** Adjust YTD budget calculation to use prior completed month or pro-rate current month
|
||||
5. **[LOW]** Add boundary justification requirement to scoring prompts
|
||||
6. **[LOW]** Consider implementing 3-run moving average for displayed health scores
|
||||
|
||||
---
|
||||
|
||||
*Generated by Claude Opus 4.6 — Automated AI Feature Audit*
|
||||
@@ -1,587 +0,0 @@
|
||||
# HOA LedgerIQ — Deployment Guide
|
||||
|
||||
**Version:** 2026.3.2 (beta)
|
||||
**Last updated:** 2026-03-02
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Prerequisites](#prerequisites)
|
||||
2. [Deploy to a Fresh Docker Server](#deploy-to-a-fresh-docker-server)
|
||||
3. [Production Deployment](#production-deployment)
|
||||
4. [SSL with Certbot (Let's Encrypt)](#ssl-with-certbot-lets-encrypt)
|
||||
5. [Backup the Local Test Database](#backup-the-local-test-database)
|
||||
6. [Restore a Backup into the Staged Environment](#restore-a-backup-into-the-staged-environment)
|
||||
7. [Running Migrations on the Staged Environment](#running-migrations-on-the-staged-environment)
|
||||
8. [Verifying the Deployment](#verifying-the-deployment)
|
||||
9. [Environment Variable Reference](#environment-variable-reference)
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
On the **target server**, ensure the following are installed:
|
||||
|
||||
| Tool | Minimum Version |
|
||||
|-----------------|-----------------|
|
||||
| Docker Engine | 24+ |
|
||||
| Docker Compose | v2+ |
|
||||
| Git | 2.x |
|
||||
| `psql` (client) | 15+ *(optional, for manual DB work)* |
|
||||
|
||||
The app runs four containers in production — backend (NestJS), frontend
|
||||
(React/nginx), PostgreSQL 15, and Redis 7. A fifth nginx container is used
|
||||
in dev mode only. Total memory footprint is roughly **1–2 GB** idle.
|
||||
|
||||
For SSL, the server must also have:
|
||||
- A **public hostname** with a DNS A record pointing to the server's IP
|
||||
(e.g., `staging.yourdomain.com → 203.0.113.10`)
|
||||
- **Ports 80 and 443** open in any firewall / security group
|
||||
|
||||
---
|
||||
|
||||
## Deploy to a Fresh Docker Server
|
||||
|
||||
### 1. Clone the repository
|
||||
|
||||
```bash
|
||||
ssh your-staging-server
|
||||
|
||||
git clone <repo-url> /opt/hoa-ledgeriq
|
||||
cd /opt/hoa-ledgeriq
|
||||
```
|
||||
|
||||
### 2. Create the environment file
|
||||
|
||||
Copy the example and fill in real values:
|
||||
|
||||
```bash
|
||||
cp .env.example .env
|
||||
nano .env # or vi, your choice
|
||||
```
|
||||
|
||||
**Required changes from defaults:**
|
||||
|
||||
```dotenv
|
||||
# --- CHANGE THESE ---
|
||||
POSTGRES_PASSWORD=<strong-random-password>
|
||||
JWT_SECRET=<random-64-char-string>
|
||||
|
||||
# Database URL must match the password above
|
||||
DATABASE_URL=postgresql://hoafinance:<same-password>@postgres:5432/hoafinance
|
||||
|
||||
# AI features (get a key from build.nvidia.com)
|
||||
AI_API_KEY=nvapi-xxxxxxxxxxxx
|
||||
|
||||
# --- Usually fine as-is ---
|
||||
POSTGRES_USER=hoafinance
|
||||
POSTGRES_DB=hoafinance
|
||||
REDIS_URL=redis://redis:6379
|
||||
NODE_ENV=development # keep as development for staging
|
||||
AI_API_URL=https://integrate.api.nvidia.com/v1
|
||||
AI_MODEL=qwen/qwen3.5-397b-a17b
|
||||
AI_DEBUG=false
|
||||
```
|
||||
|
||||
> **Tip:** Generate secrets quickly:
|
||||
> ```bash
|
||||
> openssl rand -hex 32 # good for JWT_SECRET
|
||||
> openssl rand -base64 24 # good for POSTGRES_PASSWORD
|
||||
> ```
|
||||
|
||||
### 3. Build and start the stack
|
||||
|
||||
```bash
|
||||
docker compose up -d --build
|
||||
```
|
||||
|
||||
This will:
|
||||
- Build the backend and frontend images
|
||||
- Pull `postgres:15-alpine`, `redis:7-alpine`, and `nginx:alpine`
|
||||
- Initialize the PostgreSQL database with the shared schema (`db/init/00-init.sql`)
|
||||
- Start all services on the `hoanet` bridge network
|
||||
|
||||
### 4. Wait for healthy services
|
||||
|
||||
```bash
|
||||
docker compose ps
|
||||
```
|
||||
|
||||
All containers should show `Up` (postgres and redis should also show
|
||||
`(healthy)`). If the backend is restarting, check logs:
|
||||
|
||||
```bash
|
||||
docker compose logs backend --tail=50
|
||||
```
|
||||
|
||||
### 5. (Optional) Seed with demo data
|
||||
|
||||
If deploying a fresh environment for testing and you want the Sunrise Valley
|
||||
HOA demo tenant:
|
||||
|
||||
```bash
|
||||
docker compose exec -T postgres psql -U hoafinance -d hoafinance < db/seed/seed.sql
|
||||
```
|
||||
|
||||
This creates:
|
||||
- Platform admin: `admin@hoaledgeriq.com` / `password123`
|
||||
- Tenant admin: `admin@sunrisevalley.org` / `password123`
|
||||
- Tenant viewer: `viewer@sunrisevalley.org` / `password123`
|
||||
|
||||
### 6. Access the application
|
||||
|
||||
| Service | URL |
|
||||
|-----------|--------------------------------|
|
||||
| App (UI) | `http://<server-ip>` |
|
||||
| API | `http://<server-ip>/api` |
|
||||
| Postgres | `<server-ip>:5432` (direct) |
|
||||
|
||||
> At this point the app is running over **plain HTTP** in development mode.
|
||||
> For any environment that will serve real traffic, continue to the Production
|
||||
> Deployment section.
|
||||
|
||||
---
|
||||
|
||||
## Production Deployment
|
||||
|
||||
The base `docker-compose.yml` runs everything in **development mode** (Vite
|
||||
dev server, NestJS in watch mode, no connection pooling). This is fine for
|
||||
local development but will fail under even light production load.
|
||||
|
||||
`docker-compose.prod.yml` provides a production overlay that fixes this:
|
||||
|
||||
| Component | Dev mode | Production mode |
|
||||
|-----------|----------|-----------------|
|
||||
| Frontend | Vite dev server (single-threaded, HMR) | Static build served by nginx |
|
||||
| Backend | `nest start --watch` (ts-node, file watcher) | Compiled JS, clustered across CPU cores |
|
||||
| DB pooling | None (new connection per query) | Pool of 30 reusable connections |
|
||||
| Postgres | Default config (100 connections) | Tuned: 200 connections, optimized buffers |
|
||||
| Nginx | Docker nginx routes all traffic | Disabled — host nginx routes directly |
|
||||
| Restart | None | `unless-stopped` on all services |
|
||||
|
||||
### Deploy for production
|
||||
|
||||
```bash
|
||||
cd /opt/hoa-ledgeriq
|
||||
|
||||
# Ensure .env has NODE_ENV=production and strong secrets
|
||||
nano .env
|
||||
|
||||
# Build and start with the production overlay
|
||||
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d --build
|
||||
```
|
||||
|
||||
The production overlay **disables the Docker nginx container** — request routing
|
||||
and SSL are handled by the host-level nginx. Backend and frontend are exposed
|
||||
on `127.0.0.1` only (loopback), so they aren't publicly accessible without the
|
||||
host nginx in front.
|
||||
|
||||
### Host nginx setup (required for production)
|
||||
|
||||
A ready-to-use host nginx config is included at `nginx/host-production.conf`.
|
||||
It handles SSL termination, request routing, rate limiting, proxy buffering,
|
||||
and extended timeouts for AI endpoints.
|
||||
|
||||
```bash
|
||||
# Copy the reference config
|
||||
sudo cp nginx/host-production.conf /etc/nginx/sites-available/app.yourdomain.com
|
||||
|
||||
# Edit the hostname (replace all instances of app.yourdomain.com)
|
||||
sudo sed -i 's/app\.yourdomain\.com/YOUR_HOSTNAME/g' \
  /etc/nginx/sites-available/app.yourdomain.com
|
||||
|
||||
# Enable the site
|
||||
sudo ln -s /etc/nginx/sites-available/app.yourdomain.com /etc/nginx/sites-enabled/
|
||||
|
||||
# Get an SSL certificate (certbot modifies the config automatically)
|
||||
sudo certbot --nginx -d YOUR_HOSTNAME
|
||||
|
||||
# Test and reload
|
||||
sudo nginx -t && sudo systemctl reload nginx
|
||||
```
|
||||
|
||||
The host config routes traffic directly to the Docker services:
|
||||
- `/api/*` → `http://127.0.0.1:3000` (NestJS backend)
|
||||
- `/` → `http://127.0.0.1:3001` (React frontend served by nginx)
|
||||
|
||||
> See `nginx/host-production.conf` for the full config including rate limiting,
|
||||
> proxy buffering, and extended AI endpoint timeouts.
|
||||
|
||||
> **Tip:** Create a shell alias to avoid typing the compose files every time:
|
||||
> ```bash
|
||||
> echo 'alias dc="docker compose -f docker-compose.yml -f docker-compose.prod.yml"' >> ~/.bashrc
|
||||
> source ~/.bashrc
|
||||
> dc up -d --build
|
||||
> ```
|
||||
|
||||
### What the production overlay does
|
||||
|
||||
**Backend (`backend/Dockerfile`)**
|
||||
- Multi-stage build: compiles TypeScript once, runs `node dist/main`
|
||||
- No dev dependencies shipped (smaller image, faster startup)
|
||||
- Node.js clustering: forks one worker per CPU core (up to 4)
|
||||
- Connection pool: 30 reusable PostgreSQL connections per worker (each worker keeps its own pool)
|
||||
|
||||
**Frontend (`frontend/Dockerfile`)**
|
||||
- Multi-stage build: `npm run build` produces optimized static assets
|
||||
- Served by a lightweight nginx container (not Vite)
|
||||
- Static assets cached with immutable headers (Vite filename hashing)
|
||||
|
||||
**Host Nginx (`nginx/host-production.conf`)**
|
||||
- SSL termination + HTTP→HTTPS redirect (via certbot on host)
|
||||
- Rate limiting on API routes (10 req/s per IP, burst 30)
|
||||
- Proxy buffering to prevent 502s during slow responses
|
||||
- Extended timeouts for AI endpoints (180s for investment/health-score calls)
|
||||
- Routes `/api/*` → backend:3000, `/` → frontend:3001
|
||||
|
||||
**PostgreSQL**
|
||||
- `max_connections=200` (up from default 100)
|
||||
- `shared_buffers=256MB`, `effective_cache_size=512MB`
|
||||
- Tuned checkpoint, WAL, and memory settings
|
||||
|
||||
### Capacity guidelines
|
||||
|
||||
With the production stack on a 2-core / 4GB server:
|
||||
|
||||
| Metric | Expected capacity |
|
||||
|--------|-------------------|
|
||||
| Concurrent users | 50–100 |
|
||||
| API requests/sec | ~200 |
|
||||
| DB connections | 30 per backend worker × workers |
|
||||
| Frontend serving | Static files, effectively unlimited |
|
||||
|
||||
For higher loads, scale the backend horizontally with Docker Swarm or
|
||||
Kubernetes replicas.
|
||||
|
||||
---
|
||||
|
||||
## SSL with Certbot (Let's Encrypt)
|
||||
|
||||
SSL is handled entirely at the host level using certbot with the host nginx.
|
||||
No Docker containers are involved in SSL termination.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- A public hostname with DNS pointing to this server
|
||||
- Ports 80 and 443 open in the firewall
|
||||
- Host nginx installed: `sudo apt install nginx` (Ubuntu/Debian)
|
||||
- Certbot installed: `sudo apt install certbot python3-certbot-nginx`
|
||||
|
||||
### Obtain a certificate
|
||||
|
||||
If you followed the "Host nginx setup" section above, certbot was already
|
||||
run as part of that process. If not:
|
||||
|
||||
```bash
|
||||
# Ensure the host nginx config is in place first
|
||||
sudo certbot --nginx -d YOUR_HOSTNAME
|
||||
```
|
||||
|
||||
Certbot will:
|
||||
1. Verify domain ownership via an ACME challenge on port 80
|
||||
2. Obtain the certificate from Let's Encrypt
|
||||
3. Automatically modify the nginx config to enable SSL
|
||||
4. Set up an HTTP → HTTPS redirect
|
||||
|
||||
### Verify HTTPS
|
||||
|
||||
```bash
|
||||
# Should return 200 with SSL
|
||||
curl -I https://YOUR_HOSTNAME
|
||||
|
||||
# Should return 301 redirect to HTTPS
|
||||
curl -I http://YOUR_HOSTNAME
|
||||
```
|
||||
|
||||
### Auto-renewal
|
||||
|
||||
Certbot installs a systemd timer (or cron job) that checks for renewal
|
||||
twice daily. Verify it's active:
|
||||
|
||||
```bash
|
||||
sudo systemctl status certbot.timer
|
||||
```
|
||||
|
||||
To test renewal without actually renewing:
|
||||
|
||||
```bash
|
||||
sudo certbot renew --dry-run
|
||||
```
|
||||
|
||||
Certbot automatically reloads nginx after a successful renewal.
|
||||
|
||||
---
|
||||
|
||||
## Backup the Local Test Database
|
||||
|
||||
### Full database dump (recommended)
|
||||
|
||||
From your **local development machine** where the app is currently running:
|
||||
|
||||
```bash
cd /path/to/HOA_Financial_Platform

# Dump the entire database (all schemas and data; pg_dump does not include roles)
docker compose exec -T postgres pg_dump \
  -U hoafinance \
  -d hoafinance \
  --no-owner \
  --no-privileges \
  --format=custom \
  -f /tmp/hoafinance_backup.dump

# Copy the dump file out of the container
docker compose cp postgres:/tmp/hoafinance_backup.dump ./hoafinance_backup.dump
```
|
||||
|
||||
The `--format=custom` flag produces a compressed binary format that supports
|
||||
selective restore. The file is typically 50–80% smaller than plain SQL.
|
||||
|
||||
### Alternative: Plain SQL dump
|
||||
|
||||
If you prefer a human-readable SQL file:
|
||||
|
||||
```bash
docker compose exec -T postgres pg_dump \
  -U hoafinance \
  -d hoafinance \
  --no-owner \
  --no-privileges \
  > hoafinance_backup.sql
```
|
||||
|
||||
### Backup a single tenant schema
|
||||
|
||||
To export just one tenant (e.g., Pine Creek HOA):
|
||||
|
||||
```bash
docker compose exec -T postgres pg_dump \
  -U hoafinance \
  -d hoafinance \
  --no-owner \
  --no-privileges \
  --schema=tenant_pine_creek_hoa_q33i \
  > pine_creek_backup.sql
```
|
||||
|
||||
> **Finding a tenant's schema name:**
|
||||
> ```bash
|
||||
> docker compose exec -T postgres psql -U hoafinance -d hoafinance \
|
||||
> -c "SELECT name, schema_name FROM shared.organizations WHERE status = 'active';"
|
||||
> ```
|
||||
|
||||
---
|
||||
|
||||
## Restore a Backup into the Staged Environment
|
||||
|
||||
### 1. Transfer the backup to the staging server
|
||||
|
||||
```bash
|
||||
scp hoafinance_backup.dump user@staging-server:/opt/hoa-ledgeriq/
|
||||
```
|
||||
|
||||
### 2. Ensure the stack is running
|
||||
|
||||
```bash
|
||||
cd /opt/hoa-ledgeriq
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
### 3. Drop and recreate the database (clean slate)
|
||||
|
||||
```bash
# Connect to postgres and reset the database
docker compose exec -T postgres psql -U hoafinance -d postgres -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE datname = 'hoafinance' AND pid <> pg_backend_pid();
"
docker compose exec -T postgres dropdb -U hoafinance hoafinance
docker compose exec -T postgres createdb -U hoafinance hoafinance
```
|
||||
|
||||
### 4a. Restore from custom-format dump
|
||||
|
||||
```bash
# Copy the dump into the container
docker compose cp hoafinance_backup.dump postgres:/tmp/hoafinance_backup.dump

# Restore
docker compose exec -T postgres pg_restore \
  -U hoafinance \
  -d hoafinance \
  --no-owner \
  --no-privileges \
  /tmp/hoafinance_backup.dump
```
|
||||
|
||||
### 4b. Restore from plain SQL dump
|
||||
|
||||
```bash
docker compose exec -T postgres psql \
  -U hoafinance \
  -d hoafinance \
  < hoafinance_backup.sql
```
|
||||
|
||||
### 5. Restart the backend
|
||||
|
||||
After restoring, restart the backend so NestJS re-establishes its connection
|
||||
pool and picks up the restored schemas:
|
||||
|
||||
```bash
|
||||
docker compose restart backend
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Running Migrations on the Staged Environment
|
||||
|
||||
Migrations live in `db/migrations/` and are numbered sequentially. After
|
||||
restoring an older backup, you may need to apply newer migrations.
|
||||
|
||||
Check which migrations exist:
|
||||
|
||||
```bash
|
||||
ls -la db/migrations/
|
||||
```
|
||||
|
||||
Apply them in order:
|
||||
|
||||
```bash
# Run all migrations sequentially
for f in db/migrations/*.sql; do
  echo "Applying $f ..."
  docker compose exec -T postgres psql \
    -U hoafinance \
    -d hoafinance \
    < "$f"
done
```
|
||||
|
||||
Or apply a specific migration:
|
||||
|
||||
```bash
|
||||
docker compose exec -T postgres psql \
|
||||
-U hoafinance \
|
||||
-d hoafinance \
|
||||
< db/migrations/010-health-scores.sql
|
||||
```
|
||||
|
||||
> **Note:** Migrations are idempotent where possible (`IF NOT EXISTS`,
|
||||
> `DO $$ ... $$` blocks), so re-running one that has already been applied
|
||||
> is generally safe.
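
Because not every migration is guaranteed to be idempotent, it can help to stop at the first failure instead of continuing through the rest of the directory. The following is a minimal sketch under stated assumptions: `apply_migrations` and the `DRY_RUN` switch are hypothetical names introduced here, while `-v ON_ERROR_STOP=1` is a standard psql variable that makes psql exit non-zero on the first SQL error.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply every *.sql file in a directory in lexical
# order and stop at the first failure.
# Set DRY_RUN=1 to print the order without touching the database.
apply_migrations() {
  local dir="${1:-db/migrations}" f
  for f in "$dir"/*.sql; do
    [ -e "$f" ] || continue          # directory has no .sql files
    echo "Applying $f ..."
    if [ "${DRY_RUN:-0}" = "1" ]; then
      continue
    fi
    # ON_ERROR_STOP makes psql return a non-zero status on the first error.
    docker compose exec -T postgres psql \
      -v ON_ERROR_STOP=1 \
      -U hoafinance \
      -d hoafinance \
      < "$f" || return 1
  done
}
```

Running `DRY_RUN=1 apply_migrations db/migrations` first shows the order in which the files would be applied.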
|
||||
|
||||
---
|
||||
|
||||
## Verifying the Deployment
|
||||
|
||||
### Quick health checks
|
||||
|
||||
```bash
|
||||
# Backend is responding
|
||||
curl -s http://localhost:3000/api/auth/login | head -c 100
|
||||
|
||||
# Database is accessible
|
||||
docker compose exec -T postgres psql -U hoafinance -d hoafinance \
|
||||
-c "SELECT count(*) AS tenants FROM shared.organizations WHERE status = 'active';"
|
||||
|
||||
# Redis is working
|
||||
docker compose exec -T redis redis-cli ping
|
||||
```
|
||||
|
||||
### Full smoke test
|
||||
|
||||
1. Open `https://YOUR_HOSTNAME` (or `http://<server-ip>`) in a browser
|
||||
2. Log in with a known account
|
||||
3. Navigate to Dashboard — verify health scores load
|
||||
4. Navigate to Capital Planning — verify Kanban columns render
|
||||
5. Navigate to Projects — verify project list loads
|
||||
6. Check the Settings page — version should read **2026.3.2 (beta)**
|
||||
|
||||
### Verify SSL (if enabled)
|
||||
|
||||
```bash
|
||||
# Check certificate details
|
||||
echo | openssl s_client -connect YOUR_HOSTNAME:443 -servername YOUR_HOSTNAME 2>/dev/null \
|
||||
| openssl x509 -noout -subject -issuer -dates
|
||||
|
||||
# Check that HTTP redirects to HTTPS
|
||||
curl -sI http://YOUR_HOSTNAME | grep -E 'HTTP|Location'
|
||||
```
|
||||
|
||||
### View logs
|
||||
|
||||
```bash
|
||||
docker compose logs -f # all services
|
||||
docker compose logs -f backend # backend only
|
||||
docker compose logs -f postgres # database only
|
||||
docker compose logs -f frontend # frontend nginx
|
||||
sudo tail -f /var/log/nginx/access.log # host nginx access log
|
||||
sudo tail -f /var/log/nginx/error.log # host nginx error log
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Environment Variable Reference
|
||||
|
||||
| Variable | Required | Description |
|
||||
|-------------------|----------|----------------------------------------------------|
|
||||
| `POSTGRES_USER` | Yes | PostgreSQL username |
|
||||
| `POSTGRES_PASSWORD`| Yes | PostgreSQL password (**change from default**) |
|
||||
| `POSTGRES_DB` | Yes | Database name |
|
||||
| `DATABASE_URL` | Yes | Full connection string for the backend |
|
||||
| `REDIS_URL` | Yes | Redis connection string |
|
||||
| `JWT_SECRET` | Yes | Secret for signing JWT tokens (**change from default**) |
|
||||
| `NODE_ENV` | Yes | `development` or `production` |
|
||||
| `AI_API_URL` | Yes | OpenAI-compatible inference endpoint |
|
||||
| `AI_API_KEY` | Yes | API key for AI provider (Nvidia) |
|
||||
| `AI_MODEL` | Yes | Model identifier for AI calls |
|
||||
| `AI_DEBUG` | No | Set `true` to log raw AI prompts/responses |
|
||||
|
||||
---
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
```
Development:
                   ┌──────────────────┐
Browser ─────────► │    nginx :80     │
                   └────────┬─────────┘
                 ┌──────────┴──────────┐
                 ▼                     ▼
          ┌──────────────┐      ┌──────────────┐
          │ backend :3000│      │frontend :5173│
          │   (NestJS)   │      │ (Vite/React) │
          └──────┬───────┘      └──────────────┘
            ┌────┴────┐
            ▼         ▼
   ┌─────────────┐ ┌───────────┐
   │postgres:5432│ │redis :6379│
   │   (PG 15)   │ │ (Redis 7) │
   └─────────────┘ └───────────┘

Production (host nginx handles SSL + routing):
                   ┌────────────────────────────────┐
Browser ─────────► │  Host nginx :80/:443 (SSL)     │
                   │  /api/* → 127.0.0.1:3000       │
                   │  /*     → 127.0.0.1:3001       │
                   └────────┬───────────┬───────────┘
                            ▼           ▼
                    ┌──────────────┐  ┌────────────────┐
                    │ backend :3000│  │ frontend :3001 │
                    │  (compiled)  │  │ (static nginx) │
                    └──────┬───────┘  └────────────────┘
                      ┌────┴────┐
                      ▼         ▼
             ┌─────────────┐ ┌───────────┐
             │postgres:5432│ │redis :6379│
             │   (PG 15)   │ │ (Redis 7) │
             └─────────────┘ └───────────┘
```
|
||||
|
||||
**Multi-tenant isolation:** Each HOA organization gets its own PostgreSQL
|
||||
schema (e.g., `tenant_pine_creek_hoa_q33i`). The `shared` schema holds
|
||||
cross-tenant tables (users, organizations, market rates). Tenant context
|
||||
is resolved from the JWT token on every API request.
|
||||
532
docs/SCALING.md
@@ -1,532 +0,0 @@
|
||||
# HOA LedgerIQ — Scaling Guide
|
||||
|
||||
**Version:** 2026.3.2 (beta)
|
||||
**Last updated:** 2026-03-03
|
||||
**Current infrastructure:** 4 ARM cores, 24 GB RAM, single VM
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Current Architecture Baseline](#current-architecture-baseline)
|
||||
2. [Resource Budget — Where Your 24 GB Goes](#resource-budget--where-your-24-gb-goes)
|
||||
3. [Scaling Signals — When to Act](#scaling-signals--when-to-act)
|
||||
4. [Phase 1: Vertical Tuning (Same VM)](#phase-1-vertical-tuning-same-vm)
|
||||
5. [Phase 2: Offload Services (Managed DB + Cache)](#phase-2-offload-services-managed-db--cache)
|
||||
6. [Phase 3: Horizontal Scaling (Multiple Backend Instances)](#phase-3-horizontal-scaling-multiple-backend-instances)
|
||||
7. [Phase 4: Full Horizontal (Multi-Node)](#phase-4-full-horizontal-multi-node)
|
||||
8. [Component-by-Component Scaling Reference](#component-by-component-scaling-reference)
|
||||
9. [Docker Daemon Tuning](#docker-daemon-tuning)
|
||||
10. [Monitoring with New Relic](#monitoring-with-new-relic)
|
||||
|
||||
---
|
||||
|
||||
## Current Architecture Baseline
|
||||
|
||||
```
                Internet
                    │
                    ▼
┌─────────────────────────────────────────────────────────┐
│  Host VM (4 ARM cores, 24 GB RAM)                       │
│                                                         │
│  ┌──────────────────────────────────┐                   │
│  │  Host nginx :80/:443 (SSL)       │                   │
│  │  /api/* → 127.0.0.1:3000         │                   │
│  │  /*     → 127.0.0.1:3001         │                   │
│  └──────────┬───────────┬───────────┘                   │
│             ▼           ▼                               │
│  ┌──────────────┐  ┌──────────────┐   Docker (hoanet)   │
│  │ backend :3000│  │frontend :3001│                     │
│  │  4 workers   │  │ static nginx │                     │
│  │ 1024 MB cap  │  │ ~5 MB used   │                     │
│  └──────┬───────┘  └──────────────┘                     │
│    ┌────┴───────────┐                                   │
│    ▼                ▼                                   │
│  ┌────────────┐ ┌───────────┐                           │
│  │postgres    │ │redis      │                           │
│  │ 1024 MB cap│ │ 256 MB cap│                           │
│  └────────────┘ └───────────┘                           │
└─────────────────────────────────────────────────────────┘
```
|
||||
|
||||
**How requests flow:**
|
||||
|
||||
1. Browser hits host nginx (SSL termination, rate limiting)
|
||||
2. API requests proxy to the NestJS backend (4 clustered workers)
|
||||
3. Static asset requests proxy to the frontend nginx container
|
||||
4. Backend queries PostgreSQL and Redis over the Docker bridge network
|
||||
5. All inter-container traffic stays on the `hoanet` bridge (kernel-routed, no userland proxy)
|
||||
|
||||
**Key configuration facts:**
|
||||
|
||||
| Component | Current config | Bottleneck at scale |
|
||||
|-----------|---------------|---------------------|
|
||||
| Backend | 4 Node.js workers (1 per core) | CPU-bound under heavy API load |
|
||||
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory |
|
||||
| Redis | 256 MB maxmemory, LRU eviction | Memory, then network |
|
||||
| Frontend | Static nginx, ~5 MB memory | Effectively unlimited for static serving |
|
||||
| Host nginx | Rate limit: 10 req/s per IP, burst 30 | File descriptors, worker connections |
|
||||
|
||||
---
|
||||
|
||||
## Resource Budget — Where Your 24 GB Goes
|
||||
|
||||
| Component | Memory limit | Typical usage | Notes |
|-----------|-------------|---------------|-------|
| Backend | 1024 MB | 250–400 MB | 4 workers share one container limit |
| PostgreSQL | 1024 MB | 50–300 MB | Grows with active queries and shared_buffers |
| Redis | 256 MB | 3–10 MB | Very low until caching is heavily used |
| Frontend | None set | ~5 MB | Static nginx, negligible |
| Host nginx | N/A (host) | ~10 MB | Runs on the host, not in Docker |
| New Relic agent | (inside backend) | ~30–50 MB | Included in backend memory |
| **Total reserved** | **~2.3 GB** | **~500 MB idle** | **~21.5 GB available for growth** |
|
||||
|
||||
You have significant headroom. The current configuration is conservative and can handle considerably more load before any changes are needed.
|
||||
|
||||
---
|
||||
|
||||
## Scaling Signals — When to Act
|
||||
|
||||
Use these thresholds from New Relic and system metrics to decide when to scale:
|
||||
|
||||
### Immediate action required
|
||||
|
||||
| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 2 seconds | Backend CPU or DB queries |
| Error rate | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms | Connection pool exhaustion |
| Container OOM kills | Any occurrence | Memory limit too low |
|
||||
|
||||
### Plan scaling within 2–4 weeks
|
||||
|
||||
| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 500 ms sustained | Backend approaching CPU saturation |
| Backend CPU (container) | > 80% sustained | Need more workers or replicas |
| PostgreSQL CPU | > 70% sustained | Query optimization or read replicas |
| PostgreSQL connections | > 150 of 200 | Pool size or connection leaks |
| Redis memory | > 200 MB of 256 MB | Increase limit or review eviction |
| Host disk usage | > 80% | Postgres WAL or Docker image bloat |
|
||||
|
||||
### No action needed
|
||||
|
||||
| Signal | Range | Meaning |
|--------|-------|---------|
| Backend CPU | < 50% | Normal headroom |
| API response time (p95) | < 200 ms | Healthy |
| PostgreSQL connections | < 100 | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized |
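
The p95 response-time thresholds from the three tables above can be folded into one check. This is a sketch under assumptions: `classify_p95_ms` is a hypothetical helper, it takes whole milliseconds, it does not model the "sustained" condition, and the `watch` label for the 200–500 ms band is introduced here because the tables leave that range unnamed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a p95 API response time (integer ms) to the
# action level from the scaling-signal tables.
classify_p95_ms() {
  local p95="$1"
  if [ "$p95" -gt 2000 ]; then
    echo "immediate"   # > 2 seconds: immediate action required
  elif [ "$p95" -gt 500 ]; then
    echo "plan"        # > 500 ms: plan scaling within 2-4 weeks
  elif [ "$p95" -lt 200 ]; then
    echo "healthy"     # < 200 ms: no action needed
  else
    echo "watch"       # 200-500 ms: not covered by the tables (assumed label)
  fi
}

classify_p95_ms 650
```

The same pattern extends to the other signals, each with its own thresholds.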
|
||||
|
||||
---
|
||||
|
||||
## Phase 1: Vertical Tuning (Same VM)
|
||||
|
||||
**When:** 50–200 concurrent users, response times starting to climb.
|
||||
**Cost:** Free — just configuration changes.
|
||||
|
||||
### 1.1 Increase backend memory limit
|
||||
|
||||
The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses
|
||||
60–100 MB at baseline. Under load with New Relic active, they can reach
|
||||
150 MB each (600 MB total). Raise the limit to give headroom:
|
||||
|
||||
```yaml
# docker-compose.prod.yml
backend:
  deploy:
    resources:
      limits:
        memory: 2048M   # was 1024M
      reservations:
        memory: 512M    # was 256M
```
|
||||
|
||||
### 1.2 Tune PostgreSQL for available RAM
|
||||
|
||||
With 24 GB on the host, PostgreSQL can use significantly more memory. These
|
||||
settings assume PostgreSQL is the only memory-heavy workload besides the
|
||||
backend:
|
||||
|
||||
```yaml
# docker-compose.prod.yml
# Changes from the current values:
#   shared_buffers        256MB -> 1GB    (25% of 4GB rule of thumb)
#   effective_cache_size  512MB -> 4GB    (OS page cache estimate)
#   work_mem              4MB   -> 16MB   (per-sort memory)
#   maintenance_work_mem  64MB  -> 256MB  (VACUUM, CREATE INDEX)
#   wal_buffers           16MB  -> 64MB
# The notes live here rather than inline: inside the folded `command: >`
# scalar, `#` text is not a YAML comment and would be passed to postgres.
postgres:
  command: >
    postgres
    -c max_connections=200
    -c shared_buffers=1GB
    -c effective_cache_size=4GB
    -c work_mem=16MB
    -c maintenance_work_mem=256MB
    -c checkpoint_completion_target=0.9
    -c wal_buffers=64MB
    -c random_page_cost=1.1
  deploy:
    resources:
      limits:
        memory: 4096M   # was 1024M
      reservations:
        memory: 1024M   # was 512M
```
|
||||
|
||||
### 1.3 Increase Redis memory
|
||||
|
||||
If you start using Redis for session storage or response caching:
|
||||
|
||||
```yaml
# docker-compose.prod.yml
# If the redis service also has a container memory limit, raise it above
# the new maxmemory value as well.
redis:
  command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru
```
|
||||
|
||||
### 1.4 Tune host nginx worker connections
|
||||
|
||||
```nginx
# /etc/nginx/nginx.conf (host)
worker_processes auto;          # matches CPU cores (4)
events {
    worker_connections 2048;    # default is often 768
    multi_accept on;
}
```
|
||||
|
||||
### Phase 1 capacity estimate
|
||||
|
||||
| Metric | Estimate |
|
||||
|--------|----------|
|
||||
| Concurrent users | 200–500 |
|
||||
| API requests/sec | 400–800 |
|
||||
| Tenants | 50–100 |
|
||||
|
||||
---
|
||||
|
||||
## Phase 2: Offload Services (Managed DB + Cache)
|
||||
|
||||
**When:** 500+ concurrent users, or you need high availability / automated backups.
|
||||
**Cost:** $50–200/month depending on provider and tier.
|
||||
|
||||
### 2.1 Move PostgreSQL to a managed service
|
||||
|
||||
Replace the Docker PostgreSQL container with a managed instance:
|
||||
- **AWS:** RDS for PostgreSQL (db.t4g.medium — 2 vCPU, 4 GB, ~$70/mo)
|
||||
- **GCP:** Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
|
||||
- **DigitalOcean:** Managed Databases ($60/mo for 2 vCPU / 4 GB)
|
||||
|
||||
**Changes required:**
|
||||
|
||||
1. Update `.env` to point `DATABASE_URL` at the managed instance
|
||||
2. In `docker-compose.prod.yml`, disable the postgres container:
|
||||
   ```yaml
   postgres:
     deploy:
       replicas: 0
   ```
|
||||
3. Remove the `depends_on: postgres` from the backend service
|
||||
4. Ensure the managed DB allows connections from your VM's IP
|
||||
|
||||
**Benefits:** Automated backups, point-in-time recovery, read replicas,
|
||||
automatic failover, no memory/CPU contention with the application.
|
||||
|
||||
### 2.2 Move Redis to a managed service
|
||||
|
||||
Replace the Docker Redis container similarly:
|
||||
- **AWS:** ElastiCache (cache.t4g.micro, ~$15/mo)
|
||||
- **DigitalOcean:** Managed Redis ($15/mo)
|
||||
|
||||
Update `REDIS_URL` in `.env` and disable the container.
|
||||
|
||||
### Phase 2 resource reclaim
|
||||
|
||||
Offloading DB and cache frees ~5 GB of reserved memory on the VM,
|
||||
leaving the full 24 GB available for backend scaling (Phase 3).
|
||||
|
||||
---
|
||||
|
||||
## Phase 3: Horizontal Scaling (Multiple Backend Instances)
|
||||
|
||||
**When:** Single backend container hits CPU ceiling (4 workers maxed),
|
||||
or you need zero-downtime deployments.
|
||||
|
||||
### 3.1 Run multiple backend replicas with Docker Compose
|
||||
|
||||
```yaml
# docker-compose.prod.yml
backend:
  deploy:
    replicas: 2   # 2 containers × 4 workers = 8 workers
    resources:
      limits:
        memory: 2048M
      reservations:
        memory: 512M
```
|
||||
|
||||
**Important:** With more than one replica, the replicas cannot all bind the same fixed host port in `ports:`. Publish each replica on its own loopback port and list every port in the host nginx upstream:

```nginx
# /etc/nginx/sites-available/your-site
upstream backend {
    # One `server` line per replica. The ports are examples; use the
    # host ports the replicas are actually published on.
    server 127.0.0.1:3000;
    server 127.0.0.1:3010;  # second replica on different host port
}
```
|
||||
|
||||
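Alternatively, publish a host port range and let Docker assign each replica a port from it. Check the assigned ports with `docker compose ps` before updating the upstream. The example range overlaps the frontend's production port 3001, so adjust the range if the two collide:

```yaml
backend:
  ports:
    - "127.0.0.1:3000-3009:3000"
  deploy:
    replicas: 2
```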
|
||||
|
||||
### 3.2 Connection pool considerations
|
||||
|
||||
Each backend container runs up to 4 workers, each with its own connection
|
||||
pool. With the default pool size of 30:
|
||||
|
||||
| Replicas | Workers | Max DB connections |
|----------|---------|-------------------|
| 1 | 4 | 120 |
| 2 | 8 | 240 |
| 3 | 12 | 360 |
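
The table is the product of replicas, workers per replica, and pool size. The sketch below reproduces that arithmetic and compares it with the `max_connections=200` setting used elsewhere in this guide. The function names are hypothetical and exist only for illustration.

```bash
#!/usr/bin/env bash
# Worst-case connections = replicas × workers per replica × pool size.
max_db_connections() {
  local replicas="$1" workers_per_replica="$2" pool_size="$3"
  echo $(( replicas * workers_per_replica * pool_size ))
}

# Exit status 0 when the worst case fits under the server's max_connections.
fits_max_connections() {
  local needed="$1" max_connections="$2"
  [ "$needed" -le "$max_connections" ]
}

for r in 1 2 3; do
  n=$(max_db_connections "$r" 4 30)
  if fits_max_connections "$n" 200; then status="fits"; else status="exceeds"; fi
  echo "replicas=$r connections=$n ($status max_connections=200)"
done
```

With these figures only the single-replica case (120) fits under 200 connections, which is why two or more replicas need either a higher `max_connections` or a pooler.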
|
||||
|
||||
If using managed PostgreSQL, ensure `max_connections` on the DB is high
|
||||
enough. For > 2 replicas, consider adding **PgBouncer** as a connection
|
||||
pooler (transaction-mode pooling) to multiplex connections:
|
||||
|
||||
```
|
||||
Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
|
||||
```
|
||||
|
||||
### 3.3 Session and state considerations
|
||||
|
||||
The application currently uses **stateless JWT authentication** — no
|
||||
server-side sessions. This means backend replicas can handle any request
|
||||
without sticky sessions. Redis is used for caching only. This architecture
|
||||
is already horizontal-ready.
|
||||
|
||||
### Phase 3 capacity estimate
|
||||
|
||||
| Replicas | Concurrent users | API req/sec |
|
||||
|----------|-----------------|-------------|
|
||||
| 2 | 500–1,000 | 800–1,500 |
|
||||
| 3 | 1,000–2,000 | 1,500–2,500 |
|
||||
|
||||
---
|
||||
|
||||
## Phase 4: Full Horizontal (Multi-Node)

**When:** Single VM resources exhausted, or you need geographic distribution
and high availability.

### 4.1 Docker Swarm (simplest multi-node)

Docker Swarm is the easiest migration from Docker Compose. The compose
files are already compatible:

```bash
# On the manager node
docker swarm init

# On worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq
```

Scale the backend across nodes:

```bash
docker service scale hoaledgeriq_backend=4
```

Swarm handles load balancing across nodes via its built-in ingress network.

### 4.2 Kubernetes (full orchestration)

For larger deployments, migrate to Kubernetes:

- **Backend:** Deployment with HPA (Horizontal Pod Autoscaler) on CPU
- **Frontend:** Deployment with 2+ replicas behind a Service
- **PostgreSQL:** External managed service (not in the cluster)
- **Redis:** External managed service or StatefulSet
- **Ingress:** nginx-ingress or cloud load balancer

This is a significant migration but provides auto-scaling, self-healing,
rolling deployments, and multi-region capability.

### 4.3 CDN for static assets

At any point in the scaling journey, a CDN provides the biggest return on
investment for frontend performance:

- **Cloudflare** (free tier works): Proxy DNS, caches static assets at edge
- **AWS CloudFront** or **GCP Cloud CDN**: More control, ~$0.085/GB

This eliminates nearly all load on the frontend nginx container and reduces
latency for geographically distributed users. Static assets (JS, CSS,
images) are served from edge nodes instead of your VM.

---

## Component-by-Component Scaling Reference

### Backend (NestJS)

| Approach | When | How |
|----------|------|-----|
| Tune worker count | CPU underused | Set `WORKERS` env var or modify `main.ts` cap |
| Increase memory limit | OOM or >80% usage | Raise `deploy.resources.limits.memory` |
| Add replicas | CPU maxed at 4 workers | `deploy.replicas: N` in compose |
| Move to separate VM | VM resources exhausted | Run backend on dedicated compute |

**Current clustering logic** (from `backend/src/main.ts`):

- Production: `Math.min(os.cpus().length, 4)` workers
- Development: 1 worker
- To allow more than 4 workers, change the cap in `main.ts`

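
The clustering rule above can be sketched as a pure function. This is an illustration, not the contents of `backend/src/main.ts`: the `resolveWorkerCount` name, its parameters, and the way a `WORKERS` override is parsed and applied are assumptions. Only the `Math.min(cpus, 4)` production rule and the single development worker come from this document.

```typescript
// Hypothetical sketch of the worker-count rule; not copied from main.ts.
function resolveWorkerCount(
  cpuCount: number,          // e.g. os.cpus().length
  isProduction: boolean,
  workersOverride?: string,  // assumed to carry the WORKERS env var, if set
  cap: number = 4            // the cap this document says main.ts applies
): number {
  if (!isProduction) {
    return 1; // development always runs a single worker
  }
  const requested = parseInt(workersOverride || "", 10);
  if (!isNaN(requested) && requested > 0) {
    return requested; // explicit override wins (assumed behaviour)
  }
  return Math.min(cpuCount, cap);
}

console.log(resolveWorkerCount(8, true));      // 4
console.log(resolveWorkerCount(2, true));      // 2
console.log(resolveWorkerCount(8, false));     // 1
console.log(resolveWorkerCount(8, true, "6")); // 6
```

Keeping the rule in one function makes it easy to see the effect of raising `cap` before changing the real bootstrap code.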

### PostgreSQL

| Approach | When | How |
|----------|------|-----|
| Increase shared_buffers | Cache hit ratio < 99% | Tune postgres command args |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica | Read-heavy workload | Managed DB feature or streaming replication |
| Vertical scale | Query latency high | Larger managed DB instance |

**Key queries to monitor:**
```sql
-- Connection usage
SELECT count(*) AS active, max_conn FROM pg_stat_activity,
  (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') s
GROUP BY max_conn;

-- Cache hit ratio (should be > 99%); nullif avoids division by zero
-- on a freshly started server with no block reads yet
SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;

-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Redis

| Approach | When | How |
|----------|------|-----|
| Increase maxmemory | Evictions happening frequently | Change `--maxmemory` in compose command |
| Move to managed | Need persistence guarantees | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica | Read-heavy caching | Managed service with read replicas |

### Host Nginx

| Approach | When | How |
|----------|------|-----|
| Tune worker_connections | Connection refused errors | Increase in `/etc/nginx/nginx.conf` |
| Add upstream servers | Multiple backend replicas | upstream block with multiple servers |
| Move to load balancer | Multi-node deployment | Cloud LB (ALB, GCP LB) or HAProxy |
| Add CDN | Static asset latency | Cloudflare, CloudFront, etc. |

---

## Docker Daemon Tuning

These settings are applied on the host in `/etc/docker/daemon.json`:

```json
{
  "userland-proxy": false,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
```

| Setting | Purpose |
|---------|---------|
| `userland-proxy: false` | Kernel-level port forwarding instead of userspace Go proxy (already applied) |
| `log-opts` | Prevents Docker container logs from filling the disk |
| `default-ulimits.nofile` | Raises file descriptor limit for containers handling many connections |

After changing, restart Docker: `sudo systemctl restart docker`

---

## Monitoring with New Relic

New Relic is deployed on the backend via the conditional preload
(`NEW_RELIC_ENABLED=true` in `.env`). Key dashboards to set up:

### Alerts to configure

| Alert | Condition | Priority |
|-------|-----------|----------|
| High error rate | > 1% for 5 minutes | Critical |
| Slow transactions | p95 > 2s for 5 minutes | Critical |
| Apdex score drop | < 0.7 for 10 minutes | Warning |
| Memory usage | > 80% of container limit for 10 minutes | Warning |
| Transaction throughput drop | > 50% decrease vs. baseline | Warning |

### Key transactions to monitor

| Endpoint | Why |
|----------|-----|
| `POST /api/auth/login` | Authentication performance, first thing every user hits |
| `GET /api/journal-entries` | Heaviest read query (double-entry bookkeeping with lines) |
| `POST /api/investment-planning/recommendations` | AI endpoint, 30–180s response time, external dependency |
| `GET /api/reports/*` | Financial reports with aggregate queries |
| `GET /api/projects` | Includes real-time funding computation across all reserve projects |

### Infrastructure metrics to export

If you later add the New Relic Infrastructure agent to the host VM,
you can correlate application performance with system metrics:

```bash
# Install on the host (not in Docker)
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-key> NEW_RELIC_ACCOUNT_ID=<your-id> \
  /usr/local/bin/newrelic install -n infrastructure-agent-installer
```

This provides host-level CPU, memory, disk, and network metrics alongside
your application telemetry.

---

## Quick Reference: Scaling Decision Tree

```
Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│   ├── Yes → Phase 1: Already at 4 workers?
│   │   ├── Yes → Phase 3: Add backend replicas
│   │   └── No → Raise worker cap in main.ts
│   └── No → Is PostgreSQL slow?
│       ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│       └── No → Profile the slow endpoints in New Relic
└── No → Is memory > 80% on any container?
    ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
    └── No → Is disk > 80%?
        ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
        └── No → No scaling needed
```
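
The same tree can be expressed as a function, which is convenient for runbooks or automated checks. This is a sketch under stated assumptions: the `Metrics` shape, its field names, and `nextScalingStep` are hypothetical and not part of the application; the thresholds and recommendations mirror the tree above.

```typescript
// Hypothetical snapshot of the metrics the decision tree asks about.
interface Metrics {
  p95LatencyMs: number;          // API response time, 95th percentile
  backendCpuPct: number;         // backend container CPU usage
  workersPerReplica: number;     // workers currently running per replica
  postgresSlow: boolean;         // outcome of reviewing the PostgreSQL queries
  maxContainerMemoryPct: number; // highest memory usage across containers
  diskPct: number;               // host disk usage
}

// Walks the decision tree top to bottom and returns the suggested action.
function nextScalingStep(m: Metrics): string {
  if (m.p95LatencyMs > 500) {
    if (m.backendCpuPct > 80) {
      return m.workersPerReplica >= 4
        ? "Phase 3: add backend replicas"
        : "Raise worker cap in main.ts";
    }
    return m.postgresSlow
      ? "Phase 1: tune PG memory, or Phase 2: managed DB"
      : "Profile the slow endpoints in New Relic";
  }
  if (m.maxContainerMemoryPct > 80) {
    return "Phase 1: raise memory limits";
  }
  if (m.diskPct > 80) {
    return "Clean Docker images, tune PG WAL retention, add log rotation";
  }
  return "No scaling needed";
}

console.log(nextScalingStep({
  p95LatencyMs: 650, backendCpuPct: 90, workersPerReplica: 4,
  postgresSlow: false, maxContainerMemoryPct: 40, diskPct: 30,
}));
```

Latency is checked first, so a system that is both slow and low on disk is routed to the latency branch, exactly as in the tree.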