From 3790a3bd9e74c806443415d1b0ba6ee233db6687 Mon Sep 17 00:00:00 2001
From: olsch01
Date: Tue, 3 Mar 2026 15:06:22 -0500
Subject: [PATCH] docs: add scaling guide for production infrastructure

Covers vertical tuning, managed service offloading, horizontal scaling
with replicas, and multi-node strategies. Includes resource budgets for
the current 4-core/24GB VM, monitoring thresholds for New Relic alerts,
PostgreSQL/Redis tuning values, and a scaling decision tree.

Co-Authored-By: Claude Opus 4.6
---
 docs/SCALING.md | 532 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 532 insertions(+)
 create mode 100644 docs/SCALING.md

diff --git a/docs/SCALING.md b/docs/SCALING.md
new file mode 100644
index 0000000..b0014dd
--- /dev/null
+++ b/docs/SCALING.md
@@ -0,0 +1,532 @@
# HOA LedgerIQ — Scaling Guide

**Version:** 2026.3.2 (beta)
**Last updated:** 2026-03-03
**Current infrastructure:** 4 ARM cores, 24 GB RAM, single VM

---

## Table of Contents

1. [Current Architecture Baseline](#current-architecture-baseline)
2. [Resource Budget — Where Your 24 GB Goes](#resource-budget--where-your-24-gb-goes)
3. [Scaling Signals — When to Act](#scaling-signals--when-to-act)
4. [Phase 1: Vertical Tuning (Same VM)](#phase-1-vertical-tuning-same-vm)
5. [Phase 2: Offload Services (Managed DB + Cache)](#phase-2-offload-services-managed-db--cache)
6. [Phase 3: Horizontal Scaling (Multiple Backend Instances)](#phase-3-horizontal-scaling-multiple-backend-instances)
7. [Phase 4: Full Horizontal (Multi-Node)](#phase-4-full-horizontal-multi-node)
8. [Component-by-Component Scaling Reference](#component-by-component-scaling-reference)
9. [Docker Daemon Tuning](#docker-daemon-tuning)
10. 
[Monitoring with New Relic](#monitoring-with-new-relic)

---

## Current Architecture Baseline

```
                     Internet
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│  Host VM (4 ARM cores, 24 GB RAM)                       │
│                                                         │
│   ┌──────────────────────────────────┐                  │
│   │ Host nginx :80/:443 (SSL)        │                  │
│   │  /api/* → 127.0.0.1:3000         │                  │
│   │  /*     → 127.0.0.1:3001         │                  │
│   └──────────┬───────────┬───────────┘                  │
│              ▼           ▼                              │
│   ┌──────────────┐  ┌──────────────┐   Docker (hoanet)  │
│   │ backend :3000│  │frontend :3001│                    │
│   │  4 workers   │  │ static nginx │                    │
│   │  1024 MB cap │  │  ~5 MB used  │                    │
│   └──────┬───────┘  └──────────────┘                    │
│     ┌────┴────────┐                                     │
│     ▼             ▼                                     │
│  ┌────────────┐ ┌───────────┐                          │
│  │ postgres   │ │ redis     │                          │
│  │ 1024 MB cap│ │ 256 MB cap│                          │
│  └────────────┘ └───────────┘                          │
└─────────────────────────────────────────────────────────┘
```

**How requests flow:**

1. Browser hits host nginx (SSL termination, rate limiting)
2. API requests proxy to the NestJS backend (4 clustered workers)
3. Static asset requests proxy to the frontend nginx container
4. Backend queries PostgreSQL and Redis over the Docker bridge network
5. All inter-container traffic stays on the `hoanet` bridge (kernel-routed, no userland proxy)

**Key configuration facts:**

| Component | Current config | Bottleneck at scale |
|-----------|---------------|---------------------|
| Backend | 4 Node.js workers (1 per core) | CPU-bound under heavy API load |
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory |
| Redis | 256 MB maxmemory, LRU eviction | Memory, then network |
| Frontend | Static nginx, ~5 MB memory | Effectively unlimited for static serving |
| Host nginx | Rate limit: 10 req/s per IP, burst 30 | File descriptors, worker connections |

---

## Resource Budget — Where Your 24 GB Goes

| Component | Memory limit | Typical usage | Notes |
|-----------|-------------|---------------|-------|
| Backend | 1024 MB | 250–400 MB | 4 workers share one container limit |
| PostgreSQL | 1024 MB | 50–300 MB | Grows with active queries and shared_buffers |
| Redis | 256 MB | 3–10 MB | Very low until caching is heavily used |
| Frontend | None set | ~5 MB | Static nginx, negligible |
| Host nginx | N/A (host) | ~10 MB | Runs on the host, not in Docker |
| New Relic agent | (inside backend) | ~30–50 MB | Included in backend memory |
| **Total reserved** | **~2.3 GB** | **~500 MB idle** | **~21.7 GB available for growth** |

You have significant headroom. The current configuration is conservative and can handle considerably more load before any changes are needed.
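A quick way to sanity-check the budget table is to sum the container limits and compare against host RAM. A small sketch in shell (limit values taken from the table above; `docker stats --no-stream` shows live usage if you want actuals):

```shell
# Per-container memory limits from the budget table, in MB
backend=1024; postgres=1024; redis=256

reserved=$((backend + postgres + redis))   # sum of hard caps
total=$((24 * 1024))                       # 24 GB host, in MB

echo "reserved: ${reserved} MB"            # 2304 MB, i.e. ~2.3 GB
echo "headroom: $((total - reserved)) MB"  # 22272 MB, i.e. ~21.7 GB before OS overhead
```

The host nginx and OS are outside these caps, so treat the headroom figure as an upper bound, not a promise.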
---

## Scaling Signals — When to Act

Use these thresholds from New Relic and system metrics to decide when to scale:

### Immediate action required

| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 2 seconds | Backend CPU or DB queries |
| Error rate | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms | Connection pool exhaustion |
| Container OOM kills | Any occurrence | Memory limit too low |

### Plan scaling within 2–4 weeks

| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 500 ms sustained | Backend approaching CPU saturation |
| Backend CPU (container) | > 80% sustained | Need more workers or replicas |
| PostgreSQL CPU | > 70% sustained | Query optimization or read replicas |
| PostgreSQL connections | > 150 of 200 | Pool size or connection leaks |
| Redis memory | > 200 MB of 256 MB | Increase limit or review eviction |
| Host disk usage | > 80% | Postgres WAL or Docker image bloat |

### No action needed

| Signal | Range | Meaning |
|--------|-------|---------|
| Backend CPU | < 50% | Normal headroom |
| API response time (p95) | < 200 ms | Healthy |
| PostgreSQL connections | < 100 | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized |

---

## Phase 1: Vertical Tuning (Same VM)

**When:** 50–200 concurrent users, response times starting to climb.
**Cost:** Free — just configuration changes.

### 1.1 Increase backend memory limit

The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses 60–100 MB at baseline. Under load with New Relic active, they can reach 150 MB each (600 MB total). Raise the limit to give headroom:

```yaml
# docker-compose.prod.yml
backend:
  deploy:
    resources:
      limits:
        memory: 2048M        # was 1024M
      reservations:
        memory: 512M         # was 256M
```

### 1.2 Tune PostgreSQL for available RAM

With 24 GB on the host, PostgreSQL can use significantly more memory. These settings assume PostgreSQL is the only memory-heavy workload besides the backend. Note that `#` comments cannot go inside the folded `command:` scalar — YAML would pass them to postgres as literal arguments — so the previous values are listed above the block instead:

```yaml
# docker-compose.prod.yml
# Previous values: shared_buffers 256MB (25% of 4GB rule of thumb),
# effective_cache_size 512MB (OS page cache estimate),
# work_mem 4MB (per-sort memory), maintenance_work_mem 64MB
# (VACUUM, CREATE INDEX), wal_buffers 16MB
postgres:
  command: >
    postgres
    -c max_connections=200
    -c shared_buffers=1GB
    -c effective_cache_size=4GB
    -c work_mem=16MB
    -c maintenance_work_mem=256MB
    -c checkpoint_completion_target=0.9
    -c wal_buffers=64MB
    -c random_page_cost=1.1
  deploy:
    resources:
      limits:
        memory: 4096M        # was 1024M
      reservations:
        memory: 1024M        # was 512M
```

### 1.3 Increase Redis memory

If you start using Redis for session storage or response caching:

```yaml
# docker-compose.prod.yml
redis:
  command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru
  deploy:
    resources:
      limits:
        memory: 1536M        # keep the container cap above maxmemory
```

Raise the container limit together with `maxmemory`: a 1 GB Redis inside the original 256 MB container would be OOM-killed.

### 1.4 Tune host nginx worker connections

```nginx
# /etc/nginx/nginx.conf (host)
worker_processes auto;        # matches CPU cores (4)
events {
    worker_connections 2048;  # default is often 768
    multi_accept on;
}
```

### Phase 1 capacity estimate

| Metric | Estimate |
|--------|----------|
| Concurrent users | 200–500 |
| API requests/sec | 400–800 |
| Tenants | 50–100 |

---

## Phase 2: Offload Services (Managed DB + Cache)

**When:** 500+ concurrent users, or you need high availability / automated backups.
**Cost:** $50–200/month depending on provider and tier.
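The cutovers in this phase mostly come down to swapping connection strings in `.env`, so it is worth confirming a new `DATABASE_URL` is well-formed before restarting anything. A minimal sketch in shell; the URL below is a placeholder, not a real instance:

```shell
# Hypothetical managed-Postgres URL: host, user, and password are placeholders
DATABASE_URL="postgresql://app:secret@db.example.com:5432/hoaledgeriq?sslmode=require"

# Extract host:port with parameter expansion (no external tools needed)
hostport="${DATABASE_URL#*@}"   # drop scheme and credentials
hostport="${hostport%%/*}"      # drop database name and query string
echo "will connect to: ${hostport}"

# With the postgres client tools installed, a live check would be:
#   pg_isready -d "$DATABASE_URL"
```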
### 2.1 Move PostgreSQL to a managed service

Replace the Docker PostgreSQL container with a managed instance:

- **AWS:** RDS for PostgreSQL (db.t4g.medium — 2 vCPU, 4 GB, ~$70/mo)
- **GCP:** Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
- **DigitalOcean:** Managed Databases ($60/mo for 2 vCPU / 4 GB)

**Changes required:**

1. Update `.env` to point `DATABASE_URL` at the managed instance
2. In `docker-compose.prod.yml`, disable the postgres container:
   ```yaml
   postgres:
     deploy:
       replicas: 0
   ```
3. Remove the `depends_on: postgres` from the backend service
4. Ensure the managed DB allows connections from your VM's IP

**Benefits:** Automated backups, point-in-time recovery, read replicas, automatic failover, no memory/CPU contention with the application.

### 2.2 Move Redis to a managed service

Replace the Docker Redis container similarly:

- **AWS:** ElastiCache (cache.t4g.micro, ~$15/mo)
- **DigitalOcean:** Managed Redis ($15/mo)

Update `REDIS_URL` in `.env` and disable the container.

### Phase 2 resource reclaim

Offloading the DB and cache frees roughly 5 GB of reserved memory on the VM (at the Phase 1 limits), leaving nearly the full 24 GB available for backend scaling (Phase 3).

---

## Phase 3: Horizontal Scaling (Multiple Backend Instances)

**When:** Single backend container hits its CPU ceiling (4 workers maxed), or you need zero-downtime deployments.

### 3.1 Run multiple backend replicas with Docker Compose

```yaml
# docker-compose.prod.yml
backend:
  deploy:
    replicas: 2              # 2 containers × 4 workers = 8 workers
    resources:
      limits:
        memory: 2048M
      reservations:
        memory: 512M
```

**Important:** With replicas > 1 you cannot bind a single fixed `ports:` mapping, and host nginx runs outside Docker, so it cannot resolve Docker's internal DNS. Instead, point its upstream at one published host port per replica:

```nginx
# /etc/nginx/sites-available/your-site
upstream backend {
    # One entry per replica; the ports come from the
    # Compose port-range mapping shown below.
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;   # second replica
}
```

To publish one host port per replica, use a Docker Compose port range:

```yaml
backend:
  ports:
    - "127.0.0.1:3000-3009:3000"
  deploy:
    replicas: 2
```

### 3.2 Connection pool considerations

Each backend container runs up to 4 workers, each with its own connection pool. With the default pool size of 30:

| Replicas | Workers | Max DB connections |
|----------|---------|-------------------|
| 1 | 4 | 120 |
| 2 | 8 | 240 |
| 3 | 12 | 360 |

If using managed PostgreSQL, ensure `max_connections` on the DB is high enough. For > 2 replicas, consider adding **PgBouncer** as a connection pooler (transaction-mode pooling) to multiplex connections:

```
Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
```

### 3.3 Session and state considerations

The application currently uses **stateless JWT authentication** — no server-side sessions. This means backend replicas can handle any request without sticky sessions. Redis is used for caching only. This architecture is already horizontal-ready.

### Phase 3 capacity estimate

| Replicas | Concurrent users | API req/sec |
|----------|-----------------|-------------|
| 2 | 500–1,000 | 800–1,500 |
| 3 | 1,000–2,000 | 1,500–2,500 |

---

## Phase 4: Full Horizontal (Multi-Node)

**When:** Single VM resources exhausted, or you need geographic distribution and high availability.

### 4.1 Docker Swarm (simplest multi-node)

Docker Swarm is the easiest migration from Docker Compose.
The compose files are already compatible:

```bash
# On the manager node
docker swarm init

# On worker nodes (the token and manager IP are printed by `swarm init`)
docker swarm join --token <worker-token> <manager-ip>:2377

# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq
```

Scale the backend across nodes:

```bash
docker service scale hoaledgeriq_backend=4
```

Swarm handles load balancing across nodes via its built-in ingress network.

### 4.2 Kubernetes (full orchestration)

For larger deployments, migrate to Kubernetes:

- **Backend:** Deployment with HPA (Horizontal Pod Autoscaler) on CPU
- **Frontend:** Deployment with 2+ replicas behind a Service
- **PostgreSQL:** External managed service (not in the cluster)
- **Redis:** External managed service or StatefulSet
- **Ingress:** nginx-ingress or cloud load balancer

This is a significant migration but provides auto-scaling, self-healing, rolling deployments, and multi-region capability.

### 4.3 CDN for static assets

At any point in the scaling journey, a CDN provides the biggest return on investment for frontend performance:

- **Cloudflare** (free tier works): Proxy DNS, caches static assets at edge
- **AWS CloudFront** or **GCP Cloud CDN**: More control, ~$0.085/GB

This eliminates nearly all load on the frontend nginx container and reduces latency for geographically distributed users. Static assets (JS, CSS, images) are served from edge nodes instead of your VM.
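A CDN can only offload what the origin marks as cacheable. A sketch of the relevant headers for the frontend nginx; the location patterns are illustrative and assume the build emits content-hashed asset filenames:

```nginx
# Long-lived caching for hashed bundles (safe: filenames change on each deploy)
location ~* \.(js|css|woff2|png|svg)$ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

# Keep HTML revalidating so new deploys propagate immediately
location = /index.html {
    add_header Cache-Control "no-cache";
}
```

With these in place, edge nodes serve repeat asset requests without touching the VM, while `index.html` stays fresh.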
---

## Component-by-Component Scaling Reference

### Backend (NestJS)

| Approach | When | How |
|----------|------|-----|
| Tune worker count | CPU underused | Set `WORKERS` env var or modify `main.ts` cap |
| Increase memory limit | OOM or >80% usage | Raise `deploy.resources.limits.memory` |
| Add replicas | CPU maxed at 4 workers | `deploy.replicas: N` in compose |
| Move to separate VM | VM resources exhausted | Run backend on dedicated compute |

**Current clustering logic** (from `backend/src/main.ts`):

- Production: `Math.min(os.cpus().length, 4)` workers
- Development: 1 worker
- To allow more than 4 workers, change the cap in `main.ts`

### PostgreSQL

| Approach | When | How |
|----------|------|-----|
| Increase shared_buffers | Cache hit ratio < 99% | Tune postgres command args |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica | Read-heavy workload | Managed DB feature or streaming replication |
| Vertical scale | Query latency high | Larger managed DB instance |

**Key queries to monitor:**

```sql
-- Connection usage
SELECT count(*) AS active, max_conn
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') s
GROUP BY max_conn;

-- Cache hit ratio (should be > 0.99; cast to avoid integer division always returning 0)
SELECT
  sum(heap_blks_hit)::float / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS ratio
FROM pg_statio_user_tables;

-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Redis

| Approach | When | How |
|----------|------|-----|
| Increase maxmemory | Evictions happening frequently | Change `--maxmemory` in compose command |
| Move to managed | Need persistence guarantees | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica | Read-heavy caching | Managed service with read replicas |

### Host Nginx

| Approach | When | How |
|----------|------|-----|
| Tune worker_connections | Connection refused errors | Increase in `/etc/nginx/nginx.conf` |
| Add upstream servers | Multiple backend replicas | upstream block with multiple servers |
| Move to load balancer | Multi-node deployment | Cloud LB (ALB, GCP LB) or HAProxy |
| Add CDN | Static asset latency | Cloudflare, CloudFront, etc. |

---

## Docker Daemon Tuning

These settings are applied on the host in `/etc/docker/daemon.json`:

```json
{
  "userland-proxy": false,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
```

| Setting | Purpose |
|---------|---------|
| `userland-proxy: false` | Kernel-level port forwarding instead of userspace Go proxy (already applied) |
| `log-opts` | Prevents Docker container logs from filling the disk |
| `default-ulimits.nofile` | Raises file descriptor limit for containers handling many connections |

After changing, restart Docker: `sudo systemctl restart docker`

---

## Monitoring with New Relic

New Relic is deployed on the backend via the conditional preload (`NEW_RELIC_ENABLED=true` in `.env`). Key dashboards to set up:

### Alerts to configure

| Alert | Condition | Priority |
|-------|-----------|----------|
| High error rate | > 1% for 5 minutes | Critical |
| Slow transactions | p95 > 2s for 5 minutes | Critical |
| Apdex score drop | < 0.7 for 10 minutes | Warning |
| Memory usage | > 80% of container limit for 10 minutes | Warning |
| Transaction throughput drop | > 50% decrease vs. baseline | Warning |

### Key transactions to monitor

| Endpoint | Why |
|----------|-----|
| `POST /api/auth/login` | Authentication performance, first thing every user hits |
| `GET /api/journal-entries` | Heaviest read query (double-entry bookkeeping with lines) |
| `POST /api/investment-planning/recommendations` | AI endpoint, 30–180s response time, external dependency |
| `GET /api/reports/*` | Financial reports with aggregate queries |
| `GET /api/projects` | Includes real-time funding computation across all reserve projects |

### Infrastructure metrics to export

If you later add the New Relic Infrastructure agent to the host VM, you can correlate application performance with system metrics:

```bash
# Install on the host (not in Docker); substitute your own license details
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-api-key> NEW_RELIC_ACCOUNT_ID=<your-account-id> \
  /usr/local/bin/newrelic install -n infrastructure-agent-installer
```

This provides host-level CPU, memory, disk, and network metrics alongside your application telemetry.

---

## Quick Reference — Scaling Decision Tree

```
Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│         ├── Yes → Already at 4 workers?
│         │         ├── Yes → Phase 3: Add backend replicas
│         │         └── No  → Phase 1: Raise worker cap in main.ts
│         └── No  → Is PostgreSQL slow?
│                   ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│                   └── No  → Profile the slow endpoints in New Relic
└── No → Is memory > 80% on any container?
         ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
         └── No  → Is disk > 80%?
                   ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
                   └── No  → No scaling needed
```
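The tree above can be condensed into a small triage helper for on-call use. A rough sketch in shell: the thresholds mirror the tables in this guide, the inputs are values you read off New Relic and `df`, and the function name is made up:

```shell
# triage <p95_ms> <backend_cpu_pct> <mem_pct> <disk_pct>
# Prints the suggested next step from the decision tree.
triage() {
  p95=$1; cpu=$2; mem=$3; disk=$4
  if [ "$p95" -gt 500 ]; then
    if [ "$cpu" -gt 80 ]; then
      echo "Phase 3: add backend replicas (or raise the worker cap)"
    else
      echo "Check PostgreSQL; profile slow endpoints in New Relic"
    fi
  elif [ "$mem" -gt 80 ]; then
    echo "Phase 1: raise container memory limits"
  elif [ "$disk" -gt 80 ]; then
    echo "Clean Docker images, tune PG WAL retention, rotate logs"
  else
    echo "No scaling needed"
  fi
}

triage 650 85 40 30   # CPU-bound backend: add replicas
triage 120 30 40 30   # healthy: no action
```

It deliberately collapses the "already at 4 workers?" sub-question into one branch; consult the full tree before acting on its output.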