docs: add scaling guide for production infrastructure
Covers vertical tuning, managed service offloading, horizontal scaling with replicas, and multi-node strategies. Includes resource budgets for the current 4-core/24GB VM, monitoring thresholds for New Relic alerts, PostgreSQL/Redis tuning values, and a scaling decision tree.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# HOA LedgerIQ — Scaling Guide

**Version:** 2026.3.2 (beta)
**Last updated:** 2026-03-03
**Current infrastructure:** 4 ARM cores, 24 GB RAM, single VM

---

## Table of Contents

1. [Current Architecture Baseline](#current-architecture-baseline)
2. [Resource Budget — Where Your 24 GB Goes](#resource-budget--where-your-24-gb-goes)
3. [Scaling Signals — When to Act](#scaling-signals--when-to-act)
4. [Phase 1: Vertical Tuning (Same VM)](#phase-1-vertical-tuning-same-vm)
5. [Phase 2: Offload Services (Managed DB + Cache)](#phase-2-offload-services-managed-db--cache)
6. [Phase 3: Horizontal Scaling (Multiple Backend Instances)](#phase-3-horizontal-scaling-multiple-backend-instances)
7. [Phase 4: Full Horizontal (Multi-Node)](#phase-4-full-horizontal-multi-node)
8. [Component-by-Component Scaling Reference](#component-by-component-scaling-reference)
9. [Docker Daemon Tuning](#docker-daemon-tuning)
10. [Monitoring with New Relic](#monitoring-with-new-relic)

---

## Current Architecture Baseline

```
                         Internet
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│ Host VM (4 ARM cores, 24 GB RAM)                        │
│                                                         │
│ ┌──────────────────────────────────┐                    │
│ │ Host nginx :80/:443 (SSL)        │                    │
│ │   /api/*  →  127.0.0.1:3000      │                    │
│ │   /*      →  127.0.0.1:3001      │                    │
│ └──────────┬───────────┬───────────┘                    │
│            ▼           ▼                                │
│ ┌──────────────┐  ┌──────────────┐   Docker (hoanet)    │
│ │ backend :3000│  │frontend :3001│                      │
│ │  4 workers   │  │ static nginx │                      │
│ │  1024 MB cap │  │  ~5 MB used  │                      │
│ └──────┬───────┘  └──────────────┘                      │
│   ┌────┴──────────┐                                     │
│   ▼               ▼                                     │
│ ┌────────────┐  ┌───────────┐                           │
│ │ postgres   │  │ redis     │                           │
│ │ 1024 MB cap│  │ 256 MB cap│                           │
│ └────────────┘  └───────────┘                           │
└─────────────────────────────────────────────────────────┘
```

**How requests flow:**

1. Browser hits host nginx (SSL termination, rate limiting)
2. API requests proxy to the NestJS backend (4 clustered workers)
3. Static asset requests proxy to the frontend nginx container
4. Backend queries PostgreSQL and Redis over the Docker bridge network
5. All inter-container traffic stays on the `hoanet` bridge (kernel-routed, no userland proxy)

**Key configuration facts:**

| Component | Current config | Bottleneck at scale |
|-----------|---------------|---------------------|
| Backend | 4 Node.js workers (1 per core) | CPU-bound under heavy API load |
| PostgreSQL | 200 max connections, 256 MB shared_buffers | Connection count, then memory |
| Redis | 256 MB maxmemory, LRU eviction | Memory, then network |
| Frontend | Static nginx, ~5 MB memory | Effectively unlimited for static serving |
| Host nginx | Rate limit: 10 req/s per IP, burst 30 | File descriptors, worker connections |

---

## Resource Budget — Where Your 24 GB Goes

| Component | Memory limit | Typical usage | Notes |
|-----------|--------------|---------------|-------|
| Backend | 1024 MB | 250–400 MB | 4 workers share one container limit |
| PostgreSQL | 1024 MB | 50–300 MB | Grows with active queries and shared_buffers |
| Redis | 256 MB | 3–10 MB | Very low until caching is heavily used |
| Frontend | None set | ~5 MB | Static nginx, negligible |
| Host nginx | N/A (host) | ~10 MB | Runs on the host, not in Docker |
| New Relic agent | (inside backend) | ~30–50 MB | Included in backend memory |
| **Total reserved** | **~2.3 GB** | **~500 MB idle** | **~21.5 GB available for growth** |

You have significant headroom. The current configuration is conservative and can
handle considerably more load before any changes are needed.

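The reserved total comes straight from the three hard container limits (host nginx, the frontend, and the New Relic agent are negligible or already counted):

```shell
# Sum of the hard memory limits currently configured (in MB)
backend=1024
postgres=1024
redis=256
total=$(( backend + postgres + redis ))
echo "${total} MB reserved"   # about 2.3 GB of the 24 GB host
```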
---

## Scaling Signals — When to Act

Use these thresholds from New Relic and system metrics to decide when to scale:

### Immediate action required

| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 2 seconds | Backend CPU or DB queries |
| Error rate | > 1% of requests | Backend memory, DB connections, or bugs |
| PostgreSQL connection wait time | > 100 ms | Connection pool exhaustion |
| Container OOM kills | Any occurrence | Memory limit too low |

### Plan scaling within 2–4 weeks

| Signal | Threshold | Likely bottleneck |
|--------|-----------|-------------------|
| API response time (p95) | > 500 ms sustained | Backend approaching CPU saturation |
| Backend CPU (container) | > 80% sustained | Need more workers or replicas |
| PostgreSQL CPU | > 70% sustained | Query optimization or read replicas |
| PostgreSQL connections | > 150 of 200 | Pool size or connection leaks |
| Redis memory | > 200 MB of 256 MB | Increase limit or review eviction |
| Host disk usage | > 80% | Postgres WAL or Docker image bloat |

### No action needed

| Signal | Range | Meaning |
|--------|-------|---------|
| Backend CPU | < 50% | Normal headroom |
| API response time (p95) | < 200 ms | Healthy |
| PostgreSQL connections | < 100 | Plenty of capacity |
| Memory usage (all containers) | < 60% of limits | Well-sized |

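The memory thresholds above can be checked mechanically. This is a sketch: in production you would pipe real numbers from `docker stats --no-stream` into it; the here-doc below is hypothetical stand-in data (name, used MiB, limit MiB) so the logic is runnable as-is:

```shell
# Flag any container above 80% of its memory limit.
awk '{
  used = $2 + 0; limit = $3 + 0            # values in MiB
  if (limit > 0 && used / limit > 0.8)
    print $1 " is at " int(100 * used / limit) "% of its memory limit"
}' <<'EOF'
backend 900 1024
postgres 300 1024
redis 8 256
EOF
```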
---

## Phase 1: Vertical Tuning (Same VM)

**When:** 50–200 concurrent users, response times starting to climb.
**Cost:** Free — just configuration changes.

### 1.1 Increase backend memory limit

The backend runs 4 workers in a 1024 MB container. Each Node.js worker uses
60–100 MB at baseline. Under load with New Relic active, each can reach
150 MB (600 MB total). Raise the limit to give headroom:

```yaml
# docker-compose.prod.yml
backend:
  deploy:
    resources:
      limits:
        memory: 2048M      # was 1024M
      reservations:
        memory: 512M       # was 256M
```

### 1.2 Tune PostgreSQL for available RAM

With 24 GB on the host, PostgreSQL can use significantly more memory. These
settings assume PostgreSQL is the only memory-heavy workload besides the
backend. `shared_buffers` follows the usual 25%-of-RAM rule of thumb against
the new 4 GB container limit; `effective_cache_size` estimates the OS page
cache; `work_mem` applies per sort/hash operation; `maintenance_work_mem`
covers VACUUM and CREATE INDEX. (Previous values: shared_buffers 256MB,
effective_cache_size 512MB, work_mem 4MB, maintenance_work_mem 64MB,
wal_buffers 16MB.) Keep annotations out of the folded `command:` scalar:
YAML treats `#` there as literal text, and the stray tokens would be passed
to postgres as arguments.

```yaml
# docker-compose.prod.yml
postgres:
  command: >
    postgres
    -c max_connections=200
    -c shared_buffers=1GB
    -c effective_cache_size=4GB
    -c work_mem=16MB
    -c maintenance_work_mem=256MB
    -c checkpoint_completion_target=0.9
    -c wal_buffers=64MB
    -c random_page_cost=1.1
  deploy:
    resources:
      limits:
        memory: 4096M    # was 1024M
      reservations:
        memory: 1024M    # was 512M
```

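After restarting the container, it is worth confirming the new values are actually live. A quick check against the standard `pg_settings` catalog:

```sql
-- Show the effective values of the tuned parameters
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'work_mem', 'maintenance_work_mem', 'wal_buffers');
```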
### 1.3 Increase Redis memory

If you start using Redis for session storage or response caching, raise
`maxmemory`, and raise the container limit with it, so Redis is not
OOM-killed before its own eviction policy kicks in:

```yaml
# docker-compose.prod.yml
redis:
  command: redis-server --appendonly yes --maxmemory 1gb --maxmemory-policy allkeys-lru
  deploy:
    resources:
      limits:
        memory: 1280M    # headroom above maxmemory; was 256M
```

### 1.4 Tune host nginx worker connections

```nginx
# /etc/nginx/nginx.conf (host)
worker_processes auto;        # matches CPU cores (4)
events {
    worker_connections 2048;  # default is often 768
    multi_accept on;
}
```

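With those values, the theoretical ceiling is worker_processes × worker_connections simultaneous connections. Note that each proxied request also consumes an upstream-side connection, so the practical budget is roughly half:

```shell
# Host nginx connection ceiling with the settings above
worker_processes=4
worker_connections=2048
echo $(( worker_processes * worker_connections ))   # total slots across workers
```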
### Phase 1 capacity estimate

| Metric | Estimate |
|--------|----------|
| Concurrent users | 200–500 |
| API requests/sec | 400–800 |
| Tenants | 50–100 |

---

## Phase 2: Offload Services (Managed DB + Cache)

**When:** 500+ concurrent users, or you need high availability / automated backups.
**Cost:** $50–200/month depending on provider and tier.

### 2.1 Move PostgreSQL to a managed service

Replace the Docker PostgreSQL container with a managed instance:

- **AWS:** RDS for PostgreSQL (db.t4g.medium — 2 vCPU, 4 GB, ~$70/mo)
- **GCP:** Cloud SQL for PostgreSQL (db-custom-2-4096, ~$65/mo)
- **DigitalOcean:** Managed Databases ($60/mo for 2 vCPU / 4 GB)

**Changes required:**

1. Update `.env` to point `DATABASE_URL` at the managed instance
2. In `docker-compose.prod.yml`, disable the postgres container:

   ```yaml
   postgres:
     deploy:
       replicas: 0
   ```

3. Remove the `depends_on: postgres` from the backend service
4. Ensure the managed DB allows connections from your VM's IP

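For step 1, only the connection string changes. The values below are placeholders, not real endpoints; fill in whatever your provider issues:

```bash
# .env — every bracketed value is a placeholder
DATABASE_URL=postgresql://<user>:<password>@<managed-db-host>:5432/<database>?sslmode=require
```

Most managed providers require TLS, hence `sslmode=require`.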
**Benefits:** Automated backups, point-in-time recovery, read replicas,
automatic failover, and no memory/CPU contention with the application.

### 2.2 Move Redis to a managed service

Replace the Docker Redis container similarly:

- **AWS:** ElastiCache (cache.t4g.micro, ~$15/mo)
- **DigitalOcean:** Managed Redis ($15/mo)

Update `REDIS_URL` in `.env` and disable the container.

### Phase 2 resource reclaim

Offloading the database and cache frees roughly 5 GB of reserved memory on
the VM (the Phase 1 PostgreSQL and Redis limits), leaving nearly the full
24 GB available for backend scaling (Phase 3).

---

## Phase 3: Horizontal Scaling (Multiple Backend Instances)

**When:** A single backend container hits its CPU ceiling (4 workers maxed),
or you need zero-downtime deployments.

### 3.1 Run multiple backend replicas with Docker Compose

```yaml
# docker-compose.prod.yml
backend:
  deploy:
    replicas: 2        # 2 containers × 4 workers = 8 workers
    resources:
      limits:
        memory: 2048M
      reservations:
        memory: 512M
```

**Important:** With `replicas` greater than 1 you cannot publish a single
fixed host port, because the replicas would collide binding it. Publish a
port range instead, so each replica gets its own sequential host port:

```yaml
backend:
  ports:
    - "127.0.0.1:3000-3009:3000"
  deploy:
    replicas: 2
```

Then list each published port in the host nginx upstream:

```nginx
# /etc/nginx/sites-available/your-site
upstream backend {
    # One entry per published replica port from the range above
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
}
```

### 3.2 Connection pool considerations

Each backend container runs up to 4 workers, each with its own connection
pool. With the default pool size of 30:

| Replicas | Workers | Max DB connections |
|----------|---------|--------------------|
| 1 | 4 | 120 |
| 2 | 8 | 240 |
| 3 | 12 | 360 |

If you use managed PostgreSQL, ensure `max_connections` on the DB is high
enough. For more than 2 replicas, consider adding **PgBouncer** as a
connection pooler (transaction-mode pooling) to multiplex connections:

```
Backend workers (12) → PgBouncer (50 server connections) → PostgreSQL
```

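A minimal PgBouncer configuration for that topology might look like the sketch below. Hostnames, database names, and pool sizes are illustrative assumptions, not tested values:

```ini
; pgbouncer.ini — illustrative sketch
[databases]
app = host=<postgres-host> port=5432 dbname=<database>

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction     ; multiplex many clients over few server conns
default_pool_size = 50      ; server connections per database/user pair
max_client_conn = 400       ; all backend workers combined
```

The backend then points `DATABASE_URL` at port 6432 instead of PostgreSQL directly.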
### 3.3 Session and state considerations

The application currently uses **stateless JWT authentication** — no
server-side sessions. This means backend replicas can handle any request
without sticky sessions. Redis is used for caching only. This architecture
is already horizontal-ready.

### Phase 3 capacity estimate

| Replicas | Concurrent users | API req/sec |
|----------|------------------|-------------|
| 2 | 500–1,000 | 800–1,500 |
| 3 | 1,000–2,000 | 1,500–2,500 |

---

## Phase 4: Full Horizontal (Multi-Node)

**When:** Single VM resources exhausted, or you need geographic distribution
and high availability.

### 4.1 Docker Swarm (simplest multi-node)

Docker Swarm is the easiest migration from Docker Compose. The compose
files are already compatible:

```bash
# On the manager node
docker swarm init

# On worker nodes
docker swarm join --token <token> <manager-ip>:2377

# Deploy the stack
docker stack deploy -c docker-compose.yml -c docker-compose.prod.yml hoaledgeriq
```

Scale the backend across nodes:

```bash
docker service scale hoaledgeriq_backend=4
```

Swarm handles load balancing across nodes via its built-in ingress network.

### 4.2 Kubernetes (full orchestration)

For larger deployments, migrate to Kubernetes:

- **Backend:** Deployment with HPA (Horizontal Pod Autoscaler) on CPU
- **Frontend:** Deployment with 2+ replicas behind a Service
- **PostgreSQL:** External managed service (not in the cluster)
- **Redis:** External managed service or StatefulSet
- **Ingress:** nginx-ingress or cloud load balancer

This is a significant migration but provides auto-scaling, self-healing,
rolling deployments, and multi-region capability.

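As a sketch of the first bullet, a CPU-based HPA for a backend Deployment could look like this. The resource names and targets are illustrative assumptions, not values from this repository:

```yaml
# hpa.yaml — illustrative sketch; assumes a Deployment named "backend"
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # mirrors the "backend CPU > 80%" signal, with margin
```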
### 4.3 CDN for static assets

At any point in the scaling journey, a CDN provides the biggest return on
investment for frontend performance:

- **Cloudflare** (free tier works): Proxy DNS, caches static assets at edge
- **AWS CloudFront** or **GCP Cloud CDN**: More control, ~$0.085/GB

This eliminates nearly all load on the frontend nginx container and reduces
latency for geographically distributed users. Static assets (JS, CSS,
images) are served from edge nodes instead of your VM.

---

## Component-by-Component Scaling Reference

### Backend (NestJS)

| Approach | When | How |
|----------|------|-----|
| Tune worker count | CPU underused | Set `WORKERS` env var or modify `main.ts` cap |
| Increase memory limit | OOM or > 80% usage | Raise `deploy.resources.limits.memory` |
| Add replicas | CPU maxed at 4 workers | `deploy.replicas: N` in compose |
| Move to separate VM | VM resources exhausted | Run backend on dedicated compute |

**Current clustering logic** (from `backend/src/main.ts`):

- Production: `Math.min(os.cpus().length, 4)` workers
- Development: 1 worker
- To allow more than 4 workers, change the cap in `main.ts`

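The production rule above can be sanity-checked from a shell on any host. This sketch mirrors the `Math.min(os.cpus().length, 4)` cap:

```shell
# Worker count the backend would choose on this host:
# workers = min(number of CPU cores, 4)
cores=$(nproc)
workers=$(( cores < 4 ? cores : 4 ))
echo "workers=$workers"
```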
### PostgreSQL

| Approach | When | How |
|----------|------|-----|
| Increase shared_buffers | Cache hit ratio < 99% | Tune postgres command args |
| Increase max_connections | Pool exhaustion errors | Increase in postgres command + add PgBouncer |
| Add read replica | Read-heavy workload | Managed DB feature or streaming replication |
| Vertical scale | Query latency high | Larger managed DB instance |

**Key queries to monitor:**

```sql
-- Connection usage
SELECT count(*) AS active, max_conn
FROM pg_stat_activity,
     (SELECT setting::int AS max_conn FROM pg_settings WHERE name = 'max_connections') s
GROUP BY max_conn;

-- Cache hit ratio (should be > 99%)
SELECT
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;

-- Slow queries (if pg_stat_statements is enabled)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Redis

| Approach | When | How |
|----------|------|-----|
| Increase maxmemory | Evictions happening frequently | Change `--maxmemory` in compose command |
| Move to managed | Need persistence guarantees | AWS ElastiCache / DigitalOcean Managed Redis |
| Add replica | Read-heavy caching | Managed service with read replicas |

### Host Nginx

| Approach | When | How |
|----------|------|-----|
| Tune worker_connections | Connection refused errors | Increase in `/etc/nginx/nginx.conf` |
| Add upstream servers | Multiple backend replicas | upstream block with multiple servers |
| Move to load balancer | Multi-node deployment | Cloud LB (ALB, GCP LB) or HAProxy |
| Add CDN | Static asset latency | Cloudflare, CloudFront, etc. |

---

## Docker Daemon Tuning

These settings are applied on the host in `/etc/docker/daemon.json`:

```json
{
  "userland-proxy": false,
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  },
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 65536,
      "Soft": 65536
    }
  }
}
```

| Setting | Purpose |
|---------|---------|
| `userland-proxy: false` | Kernel-level port forwarding instead of a userspace Go proxy (already applied) |
| `log-opts` | Prevents Docker container logs from filling the disk |
| `default-ulimits.nofile` | Raises the file descriptor limit for containers handling many connections |

After changing, restart Docker: `sudo systemctl restart docker`

---

## Monitoring with New Relic

New Relic is deployed on the backend via the conditional preload
(`NEW_RELIC_ENABLED=true` in `.env`). Key alerts and transactions to set up:

### Alerts to configure

| Alert | Condition | Priority |
|-------|-----------|----------|
| High error rate | > 1% for 5 minutes | Critical |
| Slow transactions | p95 > 2s for 5 minutes | Critical |
| Apdex score drop | < 0.7 for 10 minutes | Warning |
| Memory usage | > 80% of container limit for 10 minutes | Warning |
| Transaction throughput drop | > 50% decrease vs. baseline | Warning |

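The first two conditions map to NRQL roughly as sketched below; the `appName` value is a placeholder and thresholds go in the alert condition, not the query. Treat these as starting points to adapt, not verified account configuration:

```sql
SELECT percentile(duration, 95) FROM Transaction WHERE appName = '<your-app>'
```

```sql
SELECT percentage(count(*), WHERE error IS true) FROM Transaction WHERE appName = '<your-app>'
```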
### Key transactions to monitor

| Endpoint | Why |
|----------|-----|
| `POST /api/auth/login` | Authentication performance, first thing every user hits |
| `GET /api/journal-entries` | Heaviest read query (double-entry bookkeeping with lines) |
| `POST /api/investment-planning/recommendations` | AI endpoint, 30–180s response time, external dependency |
| `GET /api/reports/*` | Financial reports with aggregate queries |
| `GET /api/projects` | Includes real-time funding computation across all reserve projects |

### Infrastructure metrics to export

If you later add the New Relic Infrastructure agent to the host VM, you can
correlate application performance with system metrics:

```bash
# Install on the host (not in Docker)
curl -Ls https://download.newrelic.com/install/newrelic-cli/scripts/install.sh | bash
sudo NEW_RELIC_API_KEY=<your-key> NEW_RELIC_ACCOUNT_ID=<your-id> \
  /usr/local/bin/newrelic install -n infrastructure-agent-installer
```

This provides host-level CPU, memory, disk, and network metrics alongside
your application telemetry.

---

## Quick Reference — Scaling Decision Tree

```
Is API response time (p95) > 500ms?
├── Yes → Is backend CPU > 80%?
│         ├── Yes → Already at 4 workers?
│         │         ├── Yes → Phase 3: Add backend replicas
│         │         └── No  → Phase 1: Raise worker cap in main.ts
│         └── No  → Is PostgreSQL slow?
│                   ├── Yes → Phase 1: Tune PG memory, or Phase 2: Managed DB
│                   └── No  → Profile the slow endpoints in New Relic
└── No  → Is memory > 80% on any container?
          ├── Yes → Phase 1: Raise memory limits (you have 21+ GB free)
          └── No  → Is disk > 80%?
                    ├── Yes → Clean Docker images, tune PG WAL retention, add log rotation
                    └── No  → No scaling needed
```