docs: add scaling guide for production infrastructure

Covers vertical tuning, managed service offloading, horizontal scaling
with replicas, and multi-node strategies. Includes resource budgets for
the current 4-core/24GB VM, monitoring thresholds for New Relic alerts,
PostgreSQL/Redis tuning values, and a scaling decision tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 15:06:22 -05:00
parent 0a07c61ca3
commit 3790a3bd9e

docs/SCALING.md

@@ -0,0 +1,532 @@
# HOA LedgerIQ — Scaling Guide
**Version:** 2026.3.2 (beta)
**Last updated:** 2026-03-03
**Current infrastructure:** 4 ARM cores, 24 GB RAM, single VM
---
## Table of Contents
1. [Current Architecture Baseline](#current-architecture-baseline)
2. [Resource Budget — Where Your 24 GB Goes](#resource-budget--where-your-24-gb-goes)
3. [Scaling Signals — When to Act](#scaling-signals--when-to-act)
4. [Phase 1: Vertical Tuning (Same VM)](#phase-1-vertical-tuning-same-vm)
5. [Phase 2: Offload Services (Managed DB + Cache)](#phase-2-offload-services-managed-db--cache)
6. [Phase 3: Horizontal Scaling (Multiple Backend Instances)](#phase-3-horizontal-scaling-multiple-backend-instances)
7. [Phase 4: Full Horizontal (Multi-Node)](#phase-4-full-horizontal-multi-node)
8. [Component-by-Component Scaling Reference](#component-by-component-scaling-reference)
9. [Docker Daemon Tuning](#docker-daemon-tuning)
10. [Monitoring with New Relic](#monitoring-with-new-relic)
---
## Current Architecture Baseline
```
Internet
┌─────────────────────────────────────────────────────────┐
│ Host VM (4 ARM cores, 24 GB RAM) │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Host nginx :80/:443 (SSL) │ │
│ │ /api/* → 127.0.0.1:3000 │ │
│ │ /* → 127.0.0.1:3001 │ │
│ └──────────┬───────────┬──────────┘ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ Docker (hoanet) │
│ │ backend :3000│ │frontend :3001│ │
│ │ 4 workers │ │ static nginx │ │
│ │ 1024 MB cap │ │ ~5 MB used │ │
│ └──────┬───────┘ └──────────────┘ │
│ ┌────┴────┐ │
│ ▼ ▼ │
│ ┌────────────┐ ┌───────────┐ │
│ │postgres │ │redis │ │
│ │ 1024 MB cap│ │ 256 MB cap│ │
│ └────────────┘ └───────────┘ │
└─────────────────────────────────────────────────────────┘
```
**How requests flow:**
1. Browser hits host nginx (SSL termination, rate limiting)
2. API requests proxy to the NestJS backend (4 clustered workers)
3. Static asset requests proxy to the frontend nginx container
4. Backend queries PostgreSQL and Redis over the Docker bridge network
5. All inter-container traffic stays on the `hoanet` bridge (kernel-routed, no userland proxy)
**Key configuration facts:**
| Component | Current config | Bottleneck at scale |
|-----------|---------------|---------------------|
| Backend | 4 Node.js workers (1 per core) | CPU-bound under heavy API load |
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory |
| Redis | 256 MB maxmemory, LRU eviction | Memory, then network |
| Frontend | Static nginx, ~5 MB memory | Effectively unlimited for static serving |
| Host nginx | Rate limit: 10 req/s per IP, burst 30 | File descriptors, worker connections |
---
## Resource Budget — Where Your 24 GB Goes
| Component | Memory limit | Typical usage | Notes |
|-----------|-------------|---------------|-------|
| Backend | 1024 MB | 250–400 MB | 4 workers share one container limit |
| PostgreSQL | 1024 MB | 50–300 MB | Grows with active queries and shared_buffers |
| Redis | 256 MB | 3–10 MB | Very low until caching is heavily used |
| Frontend | None set | ~5 MB | Static nginx, negligible |
| Host nginx | N/A (host) | ~10 MB | Runs on the host, not in Docker |
| New Relic agent | (inside backend) | ~30–50 MB | Included in backend memory |
| **Total reserved** | **~2.3 GB** | **~500 MB idle** | **~21.5 GB available for growth** |
You have significant headroom. The current configuration is conservative and can handle considerably more load before any changes are needed.
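The reserved total in the table is simply the sum of the configured container limits; a quick check of the arithmetic:

```shell
# Sum the memory limits from docker-compose.prod.yml (values in MB)
backend=1024
postgres=1024
redis=256
total_mb=$((backend + postgres + redis))
echo "${total_mb} MB reserved"   # 2304 MB, i.e. ~2.3 GB
```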
---
## Scaling Signals — When to Act
Use these thresholds from New Relic and system metrics to decide when to scale:
### Immediate action required
| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 2 seconds | Backend CPU or DB queries |
| Error rate | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms | Connection pool exhaustion |
| Container OOM kills | Any occurrence | Memory limit too low |
### Plan scaling within 2–4 weeks
| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 500 ms sustained | Backend approaching CPU saturation |
| Backend CPU (container) | > 80% sustained | Need more workers or replicas |
| PostgreSQL CPU | > 70% sustained | Query optimization or read replicas |
| PostgreSQL connections | > 150 of 200 | Pool size or connection leaks |
| Redis memory | > 200 MB of 256 MB | Increase limit or review eviction |
| Host disk usage | > 80% | Postgres WAL or Docker image bloat |
### No action needed
| Signal | Range | Meaning |
|--------|-------|---------|
| Backend CPU | < 50% | Normal headroom |
| API response time (p95) | < 200 ms | Healthy |
| PostgreSQL connections | < 100 | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized |
---
## Phase 1: Vertical Tuning (Same VM)
**When:** 50–200 concurrent users, response times starting to climb.
**Cost:** Free; only configuration changes.
### 1.1 Increase backend memory limit
The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses
60–100 MB at baseline. Under load with New Relic active, they can reach
150 MB each (600 MB total). Raise the limit to give headroom:
```yaml
# docker-compose.prod.yml
backend:
deploy:
resources:
limits:
memory: 2048M # was 1024M
reservations:
memory: 512M # was 256M
```
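To sanity-check the new limit against the worst case described above (the per-worker figure is the estimate from this section, not a measurement):

```shell
workers=4
peak_per_worker_mb=150   # under load, with the New Relic agent active
limit_mb=2048
peak_mb=$((workers * peak_per_worker_mb))
echo "worst-case usage: ${peak_mb} MB of ${limit_mb} MB"   # 600 MB, under a third of the limit
```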
### 1.2 Tune PostgreSQL for available RAM
With 24 GB on the host, PostgreSQL can use significantly more memory. These
settings assume PostgreSQL is the only memory-heavy workload besides the
backend:
```yaml
# docker-compose.prod.yml
postgres:
command: >
postgres
-c max_connections=200
-c shared_buffers=1GB # was 256MB (25% of 4GB rule of thumb)
-c effective_cache_size=4GB # was 512MB (OS page cache estimate)
-c work_mem=16MB # was 4MB (per-sort memory)
-c maintenance_work_mem=256MB # was 64MB (VACUUM, CREATE INDEX)
-c checkpoint_completion_target=0.9
-c wal_buffers=64MB # was 16MB
-c random_page_cost=1.1
deploy:
resources:
limits:
memory: 4096M # was 1024M
reservations:
memory: 1024M # was 512M
```
### 1.3 Increase Redis memory
If you start using Redis for session storage or response caching:
```yaml
# docker-compose.prod.yml
redis:
command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru
```
### 1.4 Tune host nginx worker connections
```nginx
# /etc/nginx/nginx.conf (host)
worker_processes auto; # matches CPU cores (4)
events {
worker_connections 2048; # default is often 768
multi_accept on;
}
```
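As a rough capacity check: nginx can hold at most `worker_processes × worker_connections` open connections, and a proxied request consumes roughly two of them (client side plus upstream side). The arithmetic:

```shell
worker_processes=4
worker_connections=2048
max_connections=$((worker_processes * worker_connections))
max_proxied=$((max_connections / 2))   # each proxied request holds ~2 connections
echo "connection ceiling: ${max_connections}"
echo "proxied-request ceiling: ~${max_proxied}"
```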
### Phase 1 capacity estimate
| Metric | Estimate |
|--------|----------|
| Concurrent users | 200–500 |
| API requests/sec | 400–800 |
| Tenants | 50–100 |
---
## Phase 2: Offload Services (Managed DB + Cache)
**When:** 500+ concurrent users, or you need high availability / automated backups.
**Cost:** $50–200/month depending on provider and tier.
### 2.1 Move PostgreSQL to a managed service
Replace the Docker PostgreSQL container with a managed instance:
- **AWS:** RDS for PostgreSQL (db.t4g.medium: 2 vCPU, 4 GB, ~$70/mo)
- **GCP:** Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
- **DigitalOcean:** Managed Databases ($60/mo for 2 vCPU / 4 GB)
**Changes required:**
1. Update `.env` to point `DATABASE_URL` at the managed instance
2. In `docker-compose.prod.yml`, disable the postgres container:
```yaml
postgres:
deploy:
replicas: 0
```
3. Remove the `depends_on: postgres` from the backend service
4. Ensure the managed DB allows connections from your VM's IP
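For step 1, the `.env` change is a single connection-string swap. The hostname, port, and credentials below are placeholders, not real values:

```bash
# .env — point DATABASE_URL at the managed instance instead of the container.
# sslmode=require is typical for managed PostgreSQL offerings.
DATABASE_URL=postgresql://app_user:<password>@<managed-db-host>:5432/hoaledgeriq?sslmode=require
```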
**Benefits:** Automated backups, point-in-time recovery, read replicas,
automatic failover, no memory/CPU contention with the application.
### 2.2 Move Redis to a managed service
Replace the Docker Redis container similarly:
- **AWS:** ElastiCache (cache.t4g.micro, ~$15/mo)
- **DigitalOcean:** Managed Redis ($15/mo)
Update `REDIS_URL` in `.env` and disable the container.
### Phase 2 resource reclaim
Offloading DB and cache frees ~5 GB of reserved memory on the VM,
leaving the full 24 GB available for backend scaling (Phase 3).
---
## Phase 3: Horizontal Scaling (Multiple Backend Instances)
**When:** Single backend container hits CPU ceiling (4 workers maxed),
or you need zero-downtime deployments.
### 3.1 Run multiple backend replicas with Docker Compose
```yaml
# docker-compose.prod.yml
backend:
deploy:
replicas: 2 # 2 containers × 4 workers = 8 workers
resources:
limits:
memory: 2048M
reservations:
memory: 512M
```
**Important:** With replicas > 1 you cannot use `ports:` directly.
Switch the host nginx upstream to use Docker's internal DNS:
```nginx
# /etc/nginx/sites-available/your-site
upstream backend {
# Each replica must publish its own host port (see the port-range
# mapping alternative below); list one server entry per replica.
server 127.0.0.1:3000;
server 127.0.0.1:3010; # second replica on different host port
}
```
Alternatively, use Docker Compose port ranges:
```yaml
backend:
ports:
- "127.0.0.1:3000-3009:3000"
deploy:
replicas: 2
```
### 3.2 Connection pool considerations
Each backend container runs up to 4 workers, each with its own connection
pool. With the default pool size of 30:
| Replicas | Workers | Max DB connections |
|----------|---------|-------------------|
| 1 | 4 | 120 |
| 2 | 8 | 240 |
| 3 | 12 | 360 |
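The table follows directly from `replicas × workers_per_replica × pool_size`:

```shell
replicas=3
workers_per_replica=4
pool_size=30
echo $((replicas * workers_per_replica * pool_size))   # 360, matching the 3-replica row
```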
If using managed PostgreSQL, ensure `max_connections` on the DB is high
enough. For > 2 replicas, consider adding **PgBouncer** as a connection
pooler (transaction-mode pooling) to multiplex connections:
```
Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
```
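A minimal PgBouncer sketch for this topology, assuming transaction-mode pooling in front of a managed instance. Host, database name, and auth details are placeholders:

```ini
; /etc/pgbouncer/pgbouncer.ini
[databases]
hoaledgeriq = host=<managed-db-host> port=5432 dbname=hoaledgeriq

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
pool_mode = transaction        ; multiplex many client connections over few server ones
default_pool_size = 50         ; server connections per database/user pair
max_client_conn = 400          ; covers 12 workers x 30-connection client pools
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

The backend's `DATABASE_URL` then points at `127.0.0.1:6432` instead of the database host. Note that transaction-mode pooling breaks session-level features such as `LISTEN/NOTIFY`, session advisory locks, and prepared statements held across transactions; verify the ORM's settings are compatible before enabling it.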
### 3.3 Session and state considerations
The application currently uses **stateless JWT authentication** — no
server-side sessions. This means backend replicas can handle any request
without sticky sessions. Redis is used for caching only. This architecture
is already horizontal-ready.
### Phase 3 capacity estimate
| Replicas | Concurrent users | API req/sec |
|----------|-----------------|-------------|
| 2 | 500–1,000 | 800–1,500 |
| 3 | 1,000–2,000 | 1,500–2,500 |
---
## Phase 4: Full Horizontal (Multi-Node)
**When:** Single VM resources exhausted, or you need geographic distribution
and high availability.
### 4.1 Docker Swarm (simplest multi-node)
Docker Swarm is the easiest migration from Docker Compose. The compose
files are already compatible:
```bash
# On the manager node
docker swarm init
# On worker nodes
docker swarm join --token <token> <manager-ip>:2377
# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq
```
Scale the backend across nodes:
```bash
docker service scale hoaledgeriq_backend=4
```
Swarm handles load balancing across nodes via its built-in ingress network.
### 4.2 Kubernetes (full orchestration)
For larger deployments, migrate to Kubernetes:
- **Backend:** Deployment with HPA (Horizontal Pod Autoscaler) on CPU
- **Frontend:** Deployment with 2+ replicas behind a Service
- **PostgreSQL:** External managed service (not in the cluster)
- **Redis:** External managed service or StatefulSet
- **Ingress:** nginx-ingress or cloud load balancer
This is a significant migration but provides auto-scaling, self-healing,
rolling deployments, and multi-region capability.
### 4.3 CDN for static assets
At any point in the scaling journey, a CDN provides the biggest return on
investment for frontend performance:
- **Cloudflare** (free tier works): Proxy DNS, caches static assets at edge
- **AWS CloudFront** or **GCP Cloud CDN**: More control, ~$0.085/GB
This eliminates nearly all load on the frontend nginx container and reduces
latency for geographically distributed users. Static assets (JS, CSS,
images) are served from edge nodes instead of your VM.
---
## Component-by-Component Scaling Reference
### Backend (NestJS)
| Approach | When | How |
|----------|------|-----|
| Tune worker count | CPU underused | Set `WORKERS` env var or modify `main.ts` cap |
| Increase memory limit | OOM or >80% usage | Raise `deploy.resources.limits.memory` |
| Add replicas | CPU maxed at 4 workers | `deploy.replicas: N` in compose |
| Move to separate VM | VM resources exhausted | Run backend on dedicated compute |
**Current clustering logic** (from `backend/src/main.ts`):
- Production: `Math.min(os.cpus().length, 4)` workers
- Development: 1 worker
- To allow more than 4 workers, change the cap in `main.ts`
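If the `WORKERS` variable from the table above is wired through, the compose-side override might look like this. Whether `main.ts` currently reads `WORKERS` is an assumption; the hard cap of 4 must still be raised there either way:

```yaml
# docker-compose.prod.yml
backend:
  environment:
    WORKERS: "6"   # takes effect only once the Math.min(..., 4) cap in main.ts is raised
```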
### PostgreSQL
| Approach | When | How |
|----------|------|-----|
| Increase shared_buffers | Cache hit ratio < 99% | Tune postgres command args |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica | Read-heavy workload | Managed DB feature or streaming replication |
| Vertical scale | Query latency high | Larger managed DB instance |
**Key queries to monitor:**
```sql
-- Connection usage
SELECT count(*) AS active, max_conn FROM pg_stat_activity,
(SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') s
GROUP BY max_conn;
-- Cache hit ratio (should be > 99%)
SELECT
sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```
### Redis
| Approach | When | How |
|----------|------|-----|
| Increase maxmemory | Evictions happening frequently | Change `--maxmemory` in compose command |
| Move to managed | Need persistence guarantees | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica | Read-heavy caching | Managed service with read replicas |
### Host Nginx
| Approach | When | How |
|----------|------|-----|
| Tune worker_connections | Connection refused errors | Increase in `/etc/nginx/nginx.conf` |
| Add upstream servers | Multiple backend replicas | upstream block with multiple servers |
| Move to load balancer | Multi-node deployment | Cloud LB (ALB, GCP LB) or HAProxy |
| Add CDN | Static asset latency | Cloudflare, CloudFront, etc. |
---
## Docker Daemon Tuning
These settings are applied on the host in `/etc/docker/daemon.json`:
```json
{
"userland-proxy": false,
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "3"
},
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Hard": 65536,
"Soft": 65536
}
}
}
```
| Setting | Purpose |
|---------|---------|
| `userland-proxy: false` | Kernel-level port forwarding instead of userspace Go proxy (already applied) |
| `log-opts` | Prevents Docker container logs from filling the disk |
| `default-ulimits.nofile` | Raises file descriptor limit for containers handling many connections |
After changing, restart Docker: `sudo systemctl restart docker`
---
## Monitoring with New Relic
New Relic is deployed on the backend via the conditional preload
(`NEW_RELIC_ENABLED=true` in `.env`). Key dashboards to set up:
### Alerts to configure
| Alert | Condition | Priority |
|-------|-----------|----------|
| High error rate | > 1% for 5 minutes | Critical |
| Slow transactions | p95 > 2s for 5 minutes | Critical |
| Apdex score drop | < 0.7 for 10 minutes | Warning |
| Memory usage | > 80% of container limit for 10 minutes | Warning |
| Transaction throughput drop | > 50% decrease vs. baseline | Warning |
### Key transactions to monitor
| Endpoint | Why |
|----------|-----|
| `POST /api/auth/login` | Authentication performance, first thing every user hits |
| `GET /api/journal-entries` | Heaviest read query (double-entry bookkeeping with lines) |
| `POST /api/investment-planning/recommendations` | AI endpoint, 30–180 s response time, external dependency |
| `GET /api/reports/*` | Financial reports with aggregate queries |
| `GET /api/projects` | Includes real-time funding computation across all reserve projects |
### Infrastructure metrics to export
If you later add the New Relic Infrastructure agent to the host VM,
you can correlate application performance with system metrics:
```bash
# Install on the host (not in Docker)
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-key> NEW_RELIC_ACCOUNT_ID=<your-id> \
/usr/local/bin/newrelic install -n infrastructure-agent-installer
```
This provides host-level CPU, memory, disk, and network metrics alongside
your application telemetry.
---
## Quick Reference — Scaling Decision Tree
```
Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│ ├── Yes → Phase 1: Already at 4 workers?
│ │ ├── Yes → Phase 3: Add backend replicas
│ │ └── No → Raise worker cap in main.ts
│ └── No → Is PostgreSQL slow?
│ ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│ └── No → Profile the slow endpoints in New Relic
├── No → Is memory > 80% on any container?
│ ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
│ └── No → Is disk > 80%?
│ ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
│ └── No → No scaling needed
```