HOA LedgerIQ — Scaling Guide

Version: 2026.3.2 (beta)
Last updated: 2026-03-03
Current infrastructure: 4 ARM cores, 24 GB RAM, single VM


Table of Contents

  1. Current Architecture Baseline
  2. Resource Budget — Where Your 24 GB Goes
  3. Scaling Signals — When to Act
  4. Phase 1: Vertical Tuning (Same VM)
  5. Phase 2: Offload Services (Managed DB + Cache)
  6. Phase 3: Horizontal Scaling (Multiple Backend Instances)
  7. Phase 4: Full Horizontal (Multi-Node)
  8. Component-by-Component Scaling Reference
  9. Docker Daemon Tuning
  10. Monitoring with New Relic

Current Architecture Baseline

  Internet
     │
     ▼
┌─────────────────────────────────────────────────────────┐
│  Host VM  (4 ARM cores, 24 GB RAM)                      │
│                                                         │
│  ┌──────────────────────────────────┐                   │
│  │  Host nginx :80/:443 (SSL)       │                   │
│  │  /api/* → 127.0.0.1:3000         │                   │
│  │  /*     → 127.0.0.1:3001         │                   │
│  └──────────┬───────────┬───────────┘                   │
│             ▼           ▼                               │
│  ┌──────────────┐ ┌──────────────┐    Docker (hoanet)   │
│  │ backend :3000│ │frontend :3001│                      │
│  │  4 workers   │ │ static nginx │                      │
│  │  1024 MB cap │ │  ~5 MB used  │                      │
│  └──────┬───────┘ └──────────────┘                      │
│    ┌────┴────┐                                          │
│    ▼         ▼                                          │
│  ┌────────────┐ ┌───────────┐                           │
│  │postgres    │ │redis      │                           │
│  │ 1024 MB cap│ │ 256 MB cap│                           │
│  └────────────┘ └───────────┘                           │
└─────────────────────────────────────────────────────────┘

How requests flow:

  1. Browser hits host nginx (SSL termination, rate limiting)
  2. API requests proxy to the NestJS backend (4 clustered workers)
  3. Static asset requests proxy to the frontend nginx container
  4. Backend queries PostgreSQL and Redis over the Docker bridge network
  5. All inter-container traffic stays on the hoanet bridge (kernel-routed, no userland proxy)
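
The routing in steps 1–3 maps to a host nginx server block along these lines. This is a hedged sketch, not the deployed config: the domain, certificate setup, and rate-limit zone name are placeholders; only the ports and the 10 req/s / burst 30 limit come from this document.

```nginx
# Illustrative host nginx fragment — adapt names and paths to the real site config.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name hoa.example.com;           # placeholder domain

    location /api/ {
        limit_req zone=api_limit burst=30;
        proxy_pass http://127.0.0.1:3000;  # NestJS backend container
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass http://127.0.0.1:3001;  # static frontend nginx container
    }
}
```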

Key configuration facts:

| Component  | Current config                             | Bottleneck at scale                      |
|------------|--------------------------------------------|------------------------------------------|
| Backend    | 4 Node.js workers (1 per core)             | CPU-bound under heavy API load           |
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory            |
| Redis      | 256 MB maxmemory, LRU eviction             | Memory, then network                     |
| Frontend   | Static nginx, ~5 MB memory                 | Effectively unlimited for static serving |
| Host nginx | Rate limit: 10 req/s per IP, burst 30      | File descriptors, worker connections     |

Resource Budget — Where Your 24 GB Goes

| Component       | Memory limit     | Typical usage | Notes                                        |
|-----------------|------------------|---------------|----------------------------------------------|
| Backend         | 1024 MB          | 250–400 MB    | 4 workers share one container limit          |
| PostgreSQL      | 1024 MB          | 50–300 MB     | Grows with active queries and shared_buffers |
| Redis           | 256 MB           | 3–10 MB       | Very low until caching is heavily used       |
| Frontend        | None set         | ~5 MB         | Static nginx, negligible                     |
| Host nginx      | N/A (host)       | ~10 MB        | Runs on the host, not in Docker              |
| New Relic agent | (inside backend) | ~30–50 MB     | Included in backend memory                   |
| Total reserved  | ~2.3 GB          | ~500 MB idle  | ~21.5 GB available for growth                |

You have significant headroom. The current configuration is conservative and can handle considerably more load before any changes are needed.


Scaling Signals — When to Act

Use these thresholds from New Relic and system metrics to decide when to scale:

Immediate action required

| Signal                          | Threshold        | Likely bottleneck                       |
|---------------------------------|------------------|-----------------------------------------|
| API response time (p95)         | > 2 seconds      | Backend CPU or DB queries               |
| Error rate                      | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms         | Connection pool exhaustion              |
| Container OOM kills             | Any occurrence   | Memory limit too low                    |

Plan scaling within 2–4 weeks

| Signal                   | Threshold          | Likely bottleneck                    |
|--------------------------|--------------------|--------------------------------------|
| API response time (p95)  | > 500 ms sustained | Backend approaching CPU saturation   |
| Backend CPU (container)  | > 80% sustained    | Need more workers or replicas        |
| PostgreSQL CPU           | > 70% sustained    | Query optimization or read replicas  |
| PostgreSQL connections   | > 150 of 200       | Pool size or connection leaks        |
| Redis memory             | > 200 MB of 256 MB | Increase limit or review eviction    |
| Host disk usage          | > 80%              | Postgres WAL or Docker image bloat   |

No action needed

| Signal                        | Range           | Meaning            |
|-------------------------------|-----------------|--------------------|
| Backend CPU                   | < 50%           | Normal headroom    |
| API response time (p95)       | < 200 ms        | Healthy            |
| PostgreSQL connections        | < 100           | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized         |

Phase 1: Vertical Tuning (Same VM)

When: 50–200 concurrent users, response times starting to climb. Cost: Free — just configuration changes.

1.1 Increase backend memory limit

The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses 60–100 MB at baseline. Under load with New Relic active, they can reach 150 MB each (600 MB total). Raise the limit to give headroom:

# docker-compose.prod.yml
backend:
  deploy:
    resources:
      limits:
        memory: 2048M      # was 1024M
      reservations:
        memory: 512M       # was 256M

1.2 Tune PostgreSQL for available RAM

With 24 GB on the host, PostgreSQL can use significantly more memory. These settings assume PostgreSQL is the only memory-heavy workload besides the backend:

# docker-compose.prod.yml
postgres:
  command: >
    postgres
      -c max_connections=200
      -c shared_buffers=1GB           # was 256MB (25% of 4GB rule of thumb)
      -c effective_cache_size=4GB     # was 512MB (OS page cache estimate)
      -c work_mem=16MB                # was 4MB (per-sort memory)
      -c maintenance_work_mem=256MB   # was 64MB (VACUUM, CREATE INDEX)
      -c checkpoint_completion_target=0.9
      -c wal_buffers=64MB             # was 16MB
      -c random_page_cost=1.1
  deploy:
    resources:
      limits:
        memory: 4096M                 # was 1024M
      reservations:
        memory: 1024M                 # was 512M

1.3 Increase Redis memory

If you start using Redis for session storage or response caching:

# docker-compose.prod.yml
redis:
  command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru

1.4 Tune host nginx worker connections

# /etc/nginx/nginx.conf (host)
worker_processes auto;          # matches CPU cores (4)
events {
    worker_connections 2048;    # default is often 768
    multi_accept on;
}

Phase 1 capacity estimate

| Metric           | Estimate |
|------------------|----------|
| Concurrent users | 200–500  |
| API requests/sec | 400–800  |
| Tenants          | 50–100   |

Phase 2: Offload Services (Managed DB + Cache)

When: 500+ concurrent users, or you need high availability / automated backups. Cost: $50–200/month depending on provider and tier.

2.1 Move PostgreSQL to a managed service

Replace the Docker PostgreSQL container with a managed instance:

  • AWS: RDS for PostgreSQL (db.t4g.medium — 2 vCPU, 4 GB, ~$70/mo)
  • GCP: Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
  • DigitalOcean: Managed Databases ($60/mo for 2 vCPU / 4 GB)

Changes required:

  1. Update .env to point DATABASE_URL at the managed instance
  2. In docker-compose.prod.yml, disable the postgres container:
    postgres:
      deploy:
        replicas: 0
    
  3. Remove the depends_on: postgres from the backend service
  4. Ensure the managed DB allows connections from your VM's IP
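
Step 1 is typically a one-line change. Everything in the value below — user, host, database name, and the sslmode parameter — is an illustrative placeholder, not a value from this project:

```shell
# .env — point the backend at the managed instance (placeholder values)
DATABASE_URL=postgresql://hoa_app:<password>@<managed-db-host>:5432/hoaledgeriq?sslmode=require
```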

Benefits: Automated backups, point-in-time recovery, read replicas, automatic failover, no memory/CPU contention with the application.

2.2 Move Redis to a managed service

Replace the Docker Redis container similarly:

  • AWS: ElastiCache (cache.t4g.micro, ~$15/mo)
  • DigitalOcean: Managed Redis ($15/mo)

Update REDIS_URL in .env and disable the container.

Phase 2 resource reclaim

Offloading DB and cache frees ~5 GB of reserved memory on the VM, leaving the full 24 GB available for backend scaling (Phase 3).


Phase 3: Horizontal Scaling (Multiple Backend Instances)

When: Single backend container hits CPU ceiling (4 workers maxed), or you need zero-downtime deployments.

3.1 Run multiple backend replicas with Docker Compose

# docker-compose.prod.yml
backend:
  deploy:
    replicas: 2                       # 2 containers × 4 workers = 8 workers
    resources:
      limits:
        memory: 2048M
      reservations:
        memory: 512M

Important: With replicas > 1 you cannot map every replica to the same fixed host port. Publish each replica on its own host port and list them in the host nginx upstream:

# /etc/nginx/sites-available/your-site
upstream backend {
    # One server entry per backend replica; each replica
    # is published on its own host port (3000, 3010, ...).
    server 127.0.0.1:3000;
    server 127.0.0.1:3010;    # second replica on a different host port
}

Alternatively, use Docker Compose port ranges:

backend:
  ports:
    - "127.0.0.1:3000-3009:3000"
  deploy:
    replicas: 2

3.2 Connection pool considerations

Each backend container runs up to 4 workers, each with its own connection pool. With the default pool size of 30:

| Replicas | Workers | Max DB connections |
|----------|---------|--------------------|
| 1        | 4       | 120                |
| 2        | 8       | 240                |
| 3        | 12      | 360                |

If using managed PostgreSQL, ensure max_connections on the DB is high enough. For > 2 replicas, consider adding PgBouncer as a connection pooler (transaction-mode pooling) to multiplex connections:

Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
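
A minimal PgBouncer setup for that topology could look like the sketch below. The database name, host, and auth details are assumptions to adapt; only transaction-mode pooling and the 50 server connections reflect the text above:

```ini
; pgbouncer.ini — illustrative sketch, not a drop-in config
[databases]
hoaledgeriq = host=<managed-db-host> port=5432 dbname=hoaledgeriq

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; required for connection multiplexing
default_pool_size = 50         ; server connections held to PostgreSQL
max_client_conn = 500          ; client (worker) connections can far exceed the pool
```

The backend then connects to port 6432 instead of the database directly.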

3.3 Session and state considerations

The application currently uses stateless JWT authentication — no server-side sessions. This means backend replicas can handle any request without sticky sessions. Redis is used for caching only. This architecture is already horizontal-ready.

Phase 3 capacity estimate

| Replicas | Concurrent users | API req/sec |
|----------|------------------|-------------|
| 2        | 500–1,000        | 800–1,500   |
| 3        | 1,000–2,000      | 1,500–2,500 |

Phase 4: Full Horizontal (Multi-Node)

When: Single VM resources exhausted, or you need geographic distribution and high availability.

4.1 Docker Swarm (simplest multi-node)

Docker Swarm is the easiest migration from Docker Compose. The compose files are already compatible:

# On the manager node
docker swarm init

# On worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq

Scale the backend across nodes:

docker service scale hoaledgeriq_backend=4

Swarm handles load balancing across nodes via its built-in ingress network.

4.2 Kubernetes (full orchestration)

For larger deployments, migrate to Kubernetes:

  • Backend: Deployment with HPA (Horizontal Pod Autoscaler) on CPU
  • Frontend: Deployment with 2+ replicas behind a Service
  • PostgreSQL: External managed service (not in the cluster)
  • Redis: External managed service or StatefulSet
  • Ingress: nginx-ingress or cloud load balancer

This is a significant migration but provides auto-scaling, self-healing, rolling deployments, and multi-region capability.

4.3 CDN for static assets

At any point in the scaling journey, a CDN provides the biggest return on investment for frontend performance:

  • Cloudflare (free tier works): Proxy DNS, caches static assets at edge
  • AWS CloudFront or GCP Cloud CDN: More control, ~$0.085/GB

This eliminates nearly all load on the frontend nginx container and reduces latency for geographically distributed users. Static assets (JS, CSS, images) are served from edge nodes instead of your VM.
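
CDN caching works best when the frontend container sends long-lived cache headers for fingerprinted assets. A hedged sketch for the frontend nginx config — file patterns and TTLs are assumptions; content-hashed filenames are what make a one-year TTL safe:

```nginx
# Illustrative frontend nginx fragment — adjust patterns to the actual build output.
location ~* \.(js|css|png|jpg|svg|woff2)$ {
    expires 1y;
    add_header Cache-Control "public, immutable";
}

location = /index.html {
    add_header Cache-Control "no-cache";   # HTML revalidates so deploys propagate
}
```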


Component-by-Component Scaling Reference

Backend (NestJS)

| Approach              | When                   | How                                           |
|-----------------------|------------------------|-----------------------------------------------|
| Tune worker count     | CPU underused          | Set WORKERS env var or modify the main.ts cap |
| Increase memory limit | OOM or > 80% usage     | Raise deploy.resources.limits.memory          |
| Add replicas          | CPU maxed at 4 workers | deploy.replicas: N in compose                 |
| Move to separate VM   | VM resources exhausted | Run backend on dedicated compute              |

Current clustering logic (from backend/src/main.ts):

  • Production: Math.min(os.cpus().length, 4) workers
  • Development: 1 worker
  • To allow more than 4 workers, change the cap in main.ts
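
The rule above can be written as a small pure function. This is a hypothetical sketch of the described behavior, not the actual backend/src/main.ts; the WORKERS env override is an assumed convention:

```typescript
import * as os from "os";

// Hypothetical sketch of the worker-count rule described above.
// WORKERS is an assumed env override, not confirmed project code.
export function workerCount(
  nodeEnv: string | undefined,
  cpuCount: number = os.cpus().length,
  workersEnv?: string
): number {
  const override = Number(workersEnv);
  if (workersEnv !== undefined && Number.isInteger(override) && override > 0) {
    return override;              // explicit override wins
  }
  if (nodeEnv !== "production") {
    return 1;                     // development: single worker
  }
  return Math.min(cpuCount, 4);   // production: one per core, capped at 4
}
```

On the current 4-core VM this evaluates to 4 workers in production; lifting the cap means changing the literal 4 (or setting the override).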

PostgreSQL

| Approach                 | When                   | How                                          |
|--------------------------|------------------------|----------------------------------------------|
| Increase shared_buffers  | Cache hit ratio < 99%  | Tune postgres command args                   |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica         | Read-heavy workload    | Managed DB feature or streaming replication  |
| Vertical scale           | Query latency high     | Larger managed DB instance                   |

Key queries to monitor:

-- Connection usage
SELECT count(*) AS active, max_conn FROM pg_stat_activity,
  (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') s
GROUP BY max_conn;

-- Cache hit ratio (should be > 99%)
SELECT
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;

-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Redis

| Approach           | When                           | How                                          |
|--------------------|--------------------------------|----------------------------------------------|
| Increase maxmemory | Evictions happening frequently | Change --maxmemory in compose command        |
| Move to managed    | Need persistence guarantees    | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica        | Read-heavy caching             | Managed service with read replicas           |

Host Nginx

| Approach                | When                       | How                                  |
|-------------------------|----------------------------|--------------------------------------|
| Tune worker_connections | Connection refused errors  | Increase in /etc/nginx/nginx.conf    |
| Add upstream servers    | Multiple backend replicas  | upstream block with multiple servers |
| Move to load balancer   | Multi-node deployment      | Cloud LB (ALB, GCP LB) or HAProxy    |
| Add CDN                 | Static asset latency       | Cloudflare, CloudFront, etc.         |

Docker Daemon Tuning

These settings are applied on the host in /etc/docker/daemon.json:

{
  "userland-proxy": false,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}

| Setting                | Purpose                                                                    |
|------------------------|----------------------------------------------------------------------------|
| userland-proxy: false  | Kernel-level port forwarding instead of userspace Go proxy (already applied) |
| log-opts               | Prevents Docker container logs from filling the disk                       |
| default-ulimits.nofile | Raises file descriptor limit for containers handling many connections      |

After changing, restart Docker: sudo systemctl restart docker


Monitoring with New Relic

New Relic is deployed on the backend via the conditional preload (NEW_RELIC_ENABLED=true in .env). Key alerts and transactions to set up:

Alerts to configure

| Alert                       | Condition                               | Priority |
|-----------------------------|-----------------------------------------|----------|
| High error rate             | > 1% for 5 minutes                      | Critical |
| Slow transactions           | p95 > 2 s for 5 minutes                 | Critical |
| Apdex score drop            | < 0.7 for 10 minutes                    | Warning  |
| Memory usage                | > 80% of container limit for 10 minutes | Warning  |
| Transaction throughput drop | > 50% decrease vs. baseline             | Warning  |
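
The first two alert conditions can be prototyped as NRQL queries before wiring them into alert policies. This sketch assumes the default Transaction event attributes reported by the Node.js APM agent (error, duration):

```sql
-- Error rate over the last 5 minutes
SELECT percentage(count(*), WHERE error IS true)
FROM Transaction SINCE 5 minutes ago;

-- p95 transaction duration over the last 5 minutes
SELECT percentile(duration, 95)
FROM Transaction SINCE 5 minutes ago;
```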

Key transactions to monitor

| Endpoint                                      | Why                                                              |
|-----------------------------------------------|------------------------------------------------------------------|
| POST /api/auth/login                          | Authentication performance, first thing every user hits          |
| GET /api/journal-entries                      | Heaviest read query (double-entry bookkeeping with lines)        |
| POST /api/investment-planning/recommendations | AI endpoint, 30–180 s response time, external dependency         |
| GET /api/reports/*                            | Financial reports with aggregate queries                         |
| GET /api/projects                             | Includes real-time funding computation across all reserve projects |

Infrastructure metrics to export

If you later add the New Relic Infrastructure agent to the host VM, you can correlate application performance with system metrics:

# Install on the host (not in Docker)
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-key> NEW_RELIC_ACCOUNT_ID=<your-id> \
  /usr/local/bin/newrelic install -n infrastructure-agent-installer

This provides host-level CPU, memory, disk, and network metrics alongside your application telemetry.


Quick Reference — Scaling Decision Tree

Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│   ├── Yes → Phase 1: Already at 4 workers?
│   │   ├── Yes → Phase 3: Add backend replicas
│   │   └── No  → Raise worker cap in main.ts
│   └── No  → Is PostgreSQL slow?
│       ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│       └── No  → Profile the slow endpoints in New Relic
└── No  → Is memory > 80% on any container?
    ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
    └── No  → Is disk > 80%?
        ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
        └── No  → No scaling needed