HOA LedgerIQ — Scaling Guide

Version: 2026.3.2 (beta)
Last updated: 2026-03-03
Current infrastructure: 4 ARM cores, 24 GB RAM, single VM


Table of Contents

  1. Current Architecture Baseline
  2. Resource Budget — Where Your 24 GB Goes
  3. Scaling Signals — When to Act
  4. Phase 1: Vertical Tuning (Same VM)
  5. Phase 2: Offload Services (Managed DB + Cache)
  6. Phase 3: Horizontal Scaling (Multiple Backend Instances)
  7. Phase 4: Full Horizontal (Multi-Node)
  8. Component-by-Component Scaling Reference
  9. Docker Daemon Tuning
  10. Monitoring with New Relic

Current Architecture Baseline

  Internet
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│  Host VM  (4 ARM cores, 24 GB RAM)                      │
│                                                         │
│  ┌──────────────────────────────────┐                   │
│  │  Host nginx :80/:443 (SSL)       │                   │
│  │  /api/* → 127.0.0.1:3000         │                   │
│  │  /*     → 127.0.0.1:3001         │                   │
│  └──────────┬───────────┬───────────┘                   │
│             ▼           ▼                               │
│  ┌──────────────┐ ┌──────────────┐    Docker (hoanet)   │
│  │ backend :3000│ │frontend :3001│                      │
│  │  4 workers   │ │ static nginx │                      │
│  │  1024 MB cap │ │  ~5 MB used  │                      │
│  └──────┬───────┘ └──────────────┘                      │
│    ┌────┴────┐                                          │
│    ▼         ▼                                          │
│  ┌────────────┐ ┌───────────┐                           │
│  │postgres    │ │redis      │                           │
│  │ 1024 MB cap│ │ 256 MB cap│                           │
│  └────────────┘ └───────────┘                           │
└─────────────────────────────────────────────────────────┘

How requests flow:

  1. Browser hits host nginx (SSL termination, rate limiting)
  2. API requests proxy to the NestJS backend (4 clustered workers)
  3. Static asset requests proxy to the frontend nginx container
  4. Backend queries PostgreSQL and Redis over the Docker bridge network
  5. All inter-container traffic stays on the hoanet bridge (kernel-routed, no userland proxy)
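
The routing in steps 1–3 maps to a host nginx server block along these lines. This is a hedged sketch, not the deployed config: the domain, certificate setup, and rate-limit zone name are placeholders; only the ports and the 10 req/s / burst 30 limit come from this document.

```nginx
# Illustrative host nginx fragment — adapt names and paths to the real site config.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name hoa.example.com;           # placeholder domain

    location /api/ {
        limit_req zone=api_limit burst=30;
        proxy_pass http://127.0.0.1:3000;  # NestJS backend container
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass http://127.0.0.1:3001;  # static frontend nginx container
    }
}
```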

Key configuration facts:

| Component  | Current config                             | Bottleneck at scale                      |
|------------|--------------------------------------------|------------------------------------------|
| Backend    | 4 Node.js workers (1 per core)             | CPU-bound under heavy API load           |
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory            |
| Redis      | 256 MB maxmemory, LRU eviction             | Memory, then network                     |
| Frontend   | Static nginx, ~5 MB memory                 | Effectively unlimited for static serving |
| Host nginx | Rate limit: 10 req/s per IP, burst 30      | File descriptors, worker connections     |

Resource Budget — Where Your 24 GB Goes

| Component       | Memory limit     | Typical usage | Notes                                        |
|-----------------|------------------|---------------|----------------------------------------------|
| Backend         | 1024 MB          | 250–400 MB    | 4 workers share one container limit          |
| PostgreSQL      | 1024 MB          | 50–300 MB     | Grows with active queries and shared_buffers |
| Redis           | 256 MB           | 3–10 MB       | Very low until caching is heavily used       |
| Frontend        | None set         | ~5 MB         | Static nginx, negligible                     |
| Host nginx      | N/A (host)       | ~10 MB        | Runs on the host, not in Docker              |
| New Relic agent | (inside backend) | ~30–50 MB     | Included in backend memory                   |
| Total reserved  | ~2.3 GB          | ~500 MB idle  | ~21.5 GB available for growth                |

You have significant headroom. The current configuration is conservative and can handle considerably more load before any changes are needed.


Scaling Signals — When to Act

Use these thresholds from New Relic and system metrics to decide when to scale:

Immediate action required

| Signal                          | Threshold        | Likely bottleneck                       |
|---------------------------------|------------------|-----------------------------------------|
| API response time (p95)         | > 2 seconds      | Backend CPU or DB queries               |
| Error rate                      | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms         | Connection pool exhaustion              |
| Container OOM kills             | Any occurrence   | Memory limit too low                    |

Plan scaling within 2–4 weeks

| Signal                   | Threshold          | Likely bottleneck                    |
|--------------------------|--------------------|--------------------------------------|
| API response time (p95)  | > 500 ms sustained | Backend approaching CPU saturation   |
| Backend CPU (container)  | > 80% sustained    | Need more workers or replicas        |
| PostgreSQL CPU           | > 70% sustained    | Query optimization or read replicas  |
| PostgreSQL connections   | > 150 of 200       | Pool size or connection leaks        |
| Redis memory             | > 200 MB of 256 MB | Increase limit or review eviction    |
| Host disk usage          | > 80%              | Postgres WAL or Docker image bloat   |

No action needed

| Signal                        | Range           | Meaning            |
|-------------------------------|-----------------|--------------------|
| Backend CPU                   | < 50%           | Normal headroom    |
| API response time (p95)       | < 200 ms        | Healthy            |
| PostgreSQL connections        | < 100           | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized         |

Phase 1: Vertical Tuning (Same VM)

When: 50–200 concurrent users, response times starting to climb. Cost: Free — just configuration changes.

1.1 Increase backend memory limit

The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses 60–100 MB at baseline. Under load with New Relic active, they can reach 150 MB each (600 MB total). Raise the limit to give headroom:

# docker-compose.prod.yml
backend:
  deploy:
    resources:
      limits:
        memory: 2048M      # was 1024M
      reservations:
        memory: 512M       # was 256M

1.2 Tune PostgreSQL for available RAM

With 24 GB on the host, PostgreSQL can use significantly more memory. These settings assume PostgreSQL is the only memory-heavy workload besides the backend:

# docker-compose.prod.yml
postgres:
  command: >
    postgres
      -c max_connections=200
      -c shared_buffers=1GB           # was 256MB (25% of 4GB rule of thumb)
      -c effective_cache_size=4GB     # was 512MB (OS page cache estimate)
      -c work_mem=16MB                # was 4MB (per-sort memory)
      -c maintenance_work_mem=256MB   # was 64MB (VACUUM, CREATE INDEX)
      -c checkpoint_completion_target=0.9
      -c wal_buffers=64MB             # was 16MB
      -c random_page_cost=1.1
  deploy:
    resources:
      limits:
        memory: 4096M                 # was 1024M
      reservations:
        memory: 1024M                 # was 512M

1.3 Increase Redis memory

If you start using Redis for session storage or response caching:

# docker-compose.prod.yml
redis:
  command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru

1.4 Tune host nginx worker connections

# /etc/nginx/nginx.conf (host)
worker_processes auto;          # matches CPU cores (4)
events {
    worker_connections 2048;    # default is often 768
    multi_accept on;
}

Phase 1 capacity estimate

| Metric           | Estimate |
|------------------|----------|
| Concurrent users | 200–500  |
| API requests/sec | 400–800  |
| Tenants          | 50–100   |

Phase 2: Offload Services (Managed DB + Cache)

When: 500+ concurrent users, or you need high availability / automated backups. Cost: $50–200/month depending on provider and tier.

2.1 Move PostgreSQL to a managed service

Replace the Docker PostgreSQL container with a managed instance:

  • AWS: RDS for PostgreSQL (db.t4g.medium — 2 vCPU, 4 GB, ~$70/mo)
  • GCP: Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
  • DigitalOcean: Managed Databases ($60/mo for 2 vCPU / 4 GB)

Changes required:

  1. Update .env to point DATABASE_URL at the managed instance
  2. In docker-compose.prod.yml, disable the postgres container:
    postgres:
      deploy:
        replicas: 0
    
  3. Remove the depends_on: postgres from the backend service
  4. Ensure the managed DB allows connections from your VM's IP
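
Step 1 is typically a one-line change. Everything in the value below — user, host, database name, and the sslmode parameter — is an illustrative placeholder, not a value from this project:

```shell
# .env — point the backend at the managed instance (placeholder values)
DATABASE_URL=postgresql://hoa_app:<password>@<managed-db-host>:5432/hoaledgeriq?sslmode=require
```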

Benefits: Automated backups, point-in-time recovery, read replicas, automatic failover, no memory/CPU contention with the application.

2.2 Move Redis to a managed service

Replace the Docker Redis container similarly:

  • AWS: ElastiCache (cache.t4g.micro, ~$15/mo)
  • DigitalOcean: Managed Redis ($15/mo)

Update REDIS_URL in .env and disable the container.

Phase 2 resource reclaim

Offloading DB and cache frees ~5 GB of reserved memory on the VM, leaving the full 24 GB available for backend scaling (Phase 3).


Phase 3: Horizontal Scaling (Multiple Backend Instances)

When: Single backend container hits CPU ceiling (4 workers maxed), or you need zero-downtime deployments.

3.1 Run multiple backend replicas with Docker Compose

# docker-compose.prod.yml
backend:
  deploy:
    replicas: 2                       # 2 containers × 4 workers = 8 workers
    resources:
      limits:
        memory: 2048M
      reservations:
        memory: 512M

Important: With replicas > 1 you cannot map every replica to the same fixed host port. Publish each replica on its own host port and list them in the host nginx upstream:

# /etc/nginx/sites-available/your-site
upstream backend {
    # One server entry per backend replica; each replica
    # is published on its own host port (3000, 3010, ...).
    server 127.0.0.1:3000;
    server 127.0.0.1:3010;    # second replica on a different host port
}

Alternatively, use Docker Compose port ranges:

backend:
  ports:
    - "127.0.0.1:3000-3009:3000"
  deploy:
    replicas: 2

3.2 Connection pool considerations

Each backend container runs up to 4 workers, each with its own connection pool. With the default pool size of 30:

| Replicas | Workers | Max DB connections |
|----------|---------|--------------------|
| 1        | 4       | 120                |
| 2        | 8       | 240                |
| 3        | 12      | 360                |

If using managed PostgreSQL, ensure max_connections on the DB is high enough. For > 2 replicas, consider adding PgBouncer as a connection pooler (transaction-mode pooling) to multiplex connections:

Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
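
A minimal PgBouncer setup for that topology could look like the sketch below. The database name, host, and auth details are assumptions to adapt; only transaction-mode pooling and the 50 server connections reflect the text above:

```ini
; pgbouncer.ini — illustrative sketch, not a drop-in config
[databases]
hoaledgeriq = host=<managed-db-host> port=5432 dbname=hoaledgeriq

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; required for connection multiplexing
default_pool_size = 50         ; server connections held to PostgreSQL
max_client_conn = 500          ; client (worker) connections can far exceed the pool
```

The backend then connects to port 6432 instead of the database directly.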

3.3 Session and state considerations

The application currently uses stateless JWT authentication — no server-side sessions. This means backend replicas can handle any request without sticky sessions. Redis is used for caching only. This architecture is already horizontal-ready.

Phase 3 capacity estimate

| Replicas | Concurrent users | API req/sec |
|----------|------------------|-------------|
| 2        | 500–1,000        | 800–1,500   |
| 3        | 1,000–2,000      | 1,500–2,500 |

Phase 4: Full Horizontal (Multi-Node)

When: Single VM resources exhausted, or you need geographic distribution and high availability.

4.1 Docker Swarm (simplest multi-node)

Docker Swarm is the easiest migration from Docker Compose. The compose files are already compatible:

# On the manager node
docker swarm init

# On worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq

Scale the backend across nodes:

docker service scale hoaledgeriq_backend=4

Swarm handles load balancing across nodes via its built-in ingress network.

4.2 Kubernetes (full orchestration)

For larger deployments, migrate to Kubernetes:

  • Backend: Deployment with HPA (Horizontal Pod Autoscaler) on CPU
  • Frontend: Deployment with 2+ replicas behind a Service
  • PostgreSQL: External managed service (not in the cluster)
  • Redis: External managed service or StatefulSet
  • Ingress: nginx-ingress or cloud load balancer

This is a significant migration but provides auto-scaling, self-healing, rolling deployments, and multi-region capability.

4.3 CDN for static assets

At any point in the scaling journey, a CDN provides the biggest return on investment for frontend performance:

  • Cloudflare (free tier works): Proxy DNS, caches static assets at edge
  • AWS CloudFront or GCP Cloud CDN: More control, ~$0.085/GB

This eliminates nearly all load on the frontend nginx container and reduces latency for geographically distributed users. Static assets (JS, CSS, images) are served from edge nodes instead of your VM.
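
CDN caching works best when the frontend container sends long-lived cache headers for fingerprinted assets. A hedged sketch for the frontend nginx config — file patterns and TTLs are assumptions; content-hashed filenames are what make a one-year TTL safe:

```nginx
# Illustrative frontend nginx fragment — adjust patterns to the actual build output.
location ~* \.(js|css|png|jpg|svg|woff2)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

location = /index.html {
    add_header Cache-Control "no-cache";   # HTML revalidates so deploys propagate
}
```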


Component-by-Component Scaling Reference

Backend (NestJS)

| Approach              | When                   | How                                           |
|-----------------------|------------------------|-----------------------------------------------|
| Tune worker count     | CPU underused          | Set WORKERS env var or modify the main.ts cap |
| Increase memory limit | OOM or > 80% usage     | Raise deploy.resources.limits.memory          |
| Add replicas          | CPU maxed at 4 workers | deploy.replicas: N in compose                 |
| Move to separate VM   | VM resources exhausted | Run backend on dedicated compute              |

Current clustering logic (from backend/src/main.ts):

  • Production: Math.min(os.cpus().length, 4) workers
  • Development: 1 worker
  • To allow more than 4 workers, change the cap in main.ts
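
The rule above can be written as a small pure function. This is a hypothetical sketch of the described behavior, not the actual backend/src/main.ts; the WORKERS env override is an assumed convention:

```typescript
import * as os from "os";

// Hypothetical sketch of the worker-count rule described above.
// WORKERS is an assumed env override, not confirmed project code.
export function workerCount(
  nodeEnv: string | undefined,
  cpuCount: number = os.cpus().length,
  workersEnv?: string
): number {
  const override = Number(workersEnv);
  if (workersEnv !== undefined && Number.isInteger(override) && override > 0) {
    return override;              // explicit override wins
  }
  if (nodeEnv !== "production") {
    return 1;                     // development: single worker
  }
  return Math.min(cpuCount, 4);   // production: one per core, capped at 4
}
```

On the current 4-core VM this evaluates to 4 workers in production; lifting the cap means changing the literal 4 (or setting the override).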

PostgreSQL

| Approach                 | When                   | How                                          |
|--------------------------|------------------------|----------------------------------------------|
| Increase shared_buffers  | Cache hit ratio < 99%  | Tune postgres command args                   |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica         | Read-heavy workload    | Managed DB feature or streaming replication  |
| Vertical scale           | Query latency high     | Larger managed DB instance                   |

Key queries to monitor:

-- Connection usage
SELECT count(*) AS active, max_conn FROM pg_stat_activity,
  (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') s
GROUP BY max_conn;

-- Cache hit ratio (should be > 99%)
SELECT
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;

-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Redis

| Approach           | When                           | How                                          |
|--------------------|--------------------------------|----------------------------------------------|
| Increase maxmemory | Evictions happening frequently | Change --maxmemory in compose command        |
| Move to managed    | Need persistence guarantees    | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica        | Read-heavy caching             | Managed service with read replicas           |

Host Nginx

| Approach                | When                       | How                                  |
|-------------------------|----------------------------|--------------------------------------|
| Tune worker_connections | Connection refused errors  | Increase in /etc/nginx/nginx.conf    |
| Add upstream servers    | Multiple backend replicas  | upstream block with multiple servers |
| Move to load balancer   | Multi-node deployment      | Cloud LB (ALB, GCP LB) or HAProxy    |
| Add CDN                 | Static asset latency       | Cloudflare, CloudFront, etc.         |

Docker Daemon Tuning

These settings are applied on the host in /etc/docker/daemon.json:

{
  "userland-proxy": false,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}

| Setting                | Purpose                                                                    |
|------------------------|----------------------------------------------------------------------------|
| userland-proxy: false  | Kernel-level port forwarding instead of userspace Go proxy (already applied) |
| log-opts               | Prevents Docker container logs from filling the disk                       |
| default-ulimits.nofile | Raises file descriptor limit for containers handling many connections      |

After changing, restart Docker: sudo systemctl restart docker


Monitoring with New Relic

New Relic is deployed on the backend via the conditional preload (NEW_RELIC_ENABLED=true in .env). Key alerts and transactions to set up:

Alerts to configure

| Alert                       | Condition                               | Priority |
|-----------------------------|-----------------------------------------|----------|
| High error rate             | > 1% for 5 minutes                      | Critical |
| Slow transactions           | p95 > 2 s for 5 minutes                 | Critical |
| Apdex score drop            | < 0.7 for 10 minutes                    | Warning  |
| Memory usage                | > 80% of container limit for 10 minutes | Warning  |
| Transaction throughput drop | > 50% decrease vs. baseline             | Warning  |
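
The first two alert conditions can be prototyped as NRQL queries before wiring them into alert policies. This sketch assumes the default Transaction event attributes reported by the Node.js APM agent (error, duration):

```sql
-- Error rate over the last 5 minutes
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction SINCE 5 minutes ago;

-- p95 transaction duration over the last 5 minutes
SELECT percentile(duration, 95)
FROM Transaction SINCE 5 minutes ago;
```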

Key transactions to monitor

| Endpoint                                      | Why                                                              |
|-----------------------------------------------|------------------------------------------------------------------|
| POST /api/auth/login                          | Authentication performance, first thing every user hits          |
| GET /api/journal-entries                      | Heaviest read query (double-entry bookkeeping with lines)        |
| POST /api/investment-planning/recommendations | AI endpoint, 30–180 s response time, external dependency         |
| GET /api/reports/*                            | Financial reports with aggregate queries                         |
| GET /api/projects                             | Includes real-time funding computation across all reserve projects |

Infrastructure metrics to export

If you later add the New Relic Infrastructure agent to the host VM, you can correlate application performance with system metrics:

# Install on the host (not in Docker)
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-key> NEW_RELIC_ACCOUNT_ID=<your-id> \
  /usr/local/bin/newrelic install -n infrastructure-agent-installer

This provides host-level CPU, memory, disk, and network metrics alongside your application telemetry.


Quick Reference — Scaling Decision Tree

Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│   ├── Yes → Phase 1: Already at 4 workers?
│   │   ├── Yes → Phase 3: Add backend replicas
│   │   └── No  → Raise worker cap in main.ts
│   └── No  → Is PostgreSQL slow?
│       ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│       └── No  → Profile the slow endpoints in New Relic
└── No  → Is memory > 80% on any container?
    ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
    └── No  → Is disk > 80%?
        ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
        └── No  → No scaling needed