HOA_Financial_Platform/docs/shadow-ai-benchmarking-plan.md
JoeBot 5ae6f672be feat: add shadow AI benchmarking for admin model comparison
Add a new admin-only feature that allows the platform owner to benchmark
the production AI model against up to 2 alternate models (any OpenAI-compatible
API) using real tenant data, without impacting users.

Backend:
- Shared AI caller utility (ai-caller.ts) for OpenAI-compatible endpoints
- Shadow AI module with service, controller, and 3 entities
- 6 admin API endpoints for model config CRUD, run trigger, and history
- Auto-creates shadow_ai_models, shadow_runs, shadow_run_results tables
- Exposes health-scores and investment-planning prompt builders for reuse

Frontend:
- New admin page at /admin/shadow-ai with 3 tabs:
  - Model Configuration (production + 2 alternate slots)
  - Run Comparison (tenant select, feature select, side-by-side results)
  - History (filterable run log with detail drill-down)
- Full side-by-side output display with diff highlighting
- Sidebar navigation link for AI Benchmarking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-05 07:16:38 -04:00


# Shadow AI Benchmarking Feature
## Context
The platform uses a single AI model (Qwen 3.5 via NVIDIA NIM) for three features: Operating Health Score, Reserve Health Score, and Investment Recommendations. The platform owner needs a way to evaluate alternate models (different providers, different versions) against the production model using real tenant data — without impacting users. This enables informed model migration decisions by comparing outputs side-by-side.
## Architecture Overview
- **New admin page** at `/admin/shadow-ai` with model configuration, run trigger, and history
- **New backend module** `shadow-ai` with controller, service, and 3 entities
- **3 new DB tables** in the `shared` schema for model configs, runs, and results
- **Shared AI caller utility** to avoid duplicating HTTP logic
- **Minimal changes** to existing services: make prompt-building methods public and export modules
---
## Phase 1: Shared AI Caller Utility
### New file: `backend/src/common/utils/ai-caller.ts`
Extract the HTTP POST logic (currently duplicated in both `callAI()` methods) into a reusable function:
```typescript
export async function callOpenAICompatible(params: {
  apiUrl: string;
  apiKey: string;
  model: string;
  messages: Array<{ role: string; content: string }>;
  temperature: number;
  maxTokens: number;
  timeoutMs?: number; // default 600000 (10 minutes)
}): Promise<{
  content: string; // cleaned JSON string (fences + <think> stripped)
  usage?: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
  responseTimeMs: number;
}>
```
Handles: HTTPS POST to `{apiUrl}/chat/completions`, timeout, markdown fence stripping, `<think>` block removal, timing.
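The cleanup step can be sketched as a small pure helper. This is a sketch only: the real `ai-caller.ts` also performs the HTTPS POST, bearer auth, and timeout handling, and `cleanModelOutput` is an illustrative name rather than a mandated export.

```typescript
// Sketch of the response-cleanup step: strip markdown code fences and
// <think>…</think> reasoning blocks so the remainder parses as JSON.
export function cleanModelOutput(raw: string): string {
  return raw
    // Remove reasoning blocks, including multi-line content.
    .replace(/<think>[\s\S]*?<\/think>/g, '')
    // Remove code fences (three backticks, optionally followed by a language tag).
    .replace(/`{3}[a-zA-Z]*\n?/g, '')
    .trim();
}
```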
## Phase 2: Expose Existing Prompt Builders
### `backend/src/modules/health-scores/health-scores.service.ts`
- Change `private` → `public` on:
  - `gatherOperatingData(qr)` (line 252)
  - `gatherReserveData(qr)` (line 523)
  - `buildOperatingPrompt(data)` (line 790)
  - `buildReservePrompt(data)` (line 930)
  - `checkDataReadiness(qr, scoreType)` (used to validate data exists)
### `backend/src/modules/health-scores/health-scores.module.ts`
- Add `exports: [HealthScoresService]`
### `backend/src/modules/investment-planning/investment-planning.service.ts`
- Add a new public method `buildPromptForSchema(schemaName: string)` that:
  1. Creates a query runner and sets `search_path` to the tenant schema
  2. Runs the same data-gathering queries (financial snapshot, market rates, monthly forecast) using the query runner directly (bypassing the request-scoped `TenantService`)
  3. Calls the existing `buildPromptMessages()` with the gathered data
  4. Returns `Array<{ role: string; content: string }>`
- Change `buildPromptMessages()` from `private` to `public` (line 880)
### `backend/src/modules/investment-planning/investment-planning.module.ts`
- Add `exports: [InvestmentPlanningService]`
## Phase 3: Database Tables & Entities
### 3 new tables in `shared` schema
**`shared.shadow_ai_models`** — Alternate model configurations (slots A and B)
| Column | Type | Notes |
|--------|------|-------|
| id | UUID PK | |
| slot | VARCHAR(10) | CHECK IN ('A', 'B'), UNIQUE |
| name | VARCHAR(100) | Display label |
| api_url | VARCHAR(500) | OpenAI-compatible endpoint |
| api_key | VARCHAR(500) | Bearer token |
| model_name | VARCHAR(200) | Model identifier |
| is_active | BOOLEAN | Default true |
| created_at | TIMESTAMPTZ | |
| updated_at | TIMESTAMPTZ | |
**`shared.shadow_runs`** — One row per comparison execution
| Column | Type | Notes |
|--------|------|-------|
| id | UUID PK | |
| tenant_id | UUID FK | → shared.organizations |
| feature | VARCHAR(30) | CHECK IN ('operating_health', 'reserve_health', 'investment_recommendations') |
| status | VARCHAR(20) | CHECK IN ('running', 'completed', 'partial', 'failed') |
| triggered_by | UUID FK | → shared.users |
| prompt_messages | JSONB | Exact messages sent to all models (proof of identical input) |
| started_at | TIMESTAMPTZ | |
| completed_at | TIMESTAMPTZ | |
| created_at | TIMESTAMPTZ | |
**`shared.shadow_run_results`** — One row per model per run (up to 3 per run)
| Column | Type | Notes |
|--------|------|-------|
| id | UUID PK | |
| run_id | UUID FK | → shadow_runs ON DELETE CASCADE |
| model_role | VARCHAR(20) | CHECK IN ('production', 'alternate_a', 'alternate_b'), UNIQUE(run_id, model_role) |
| model_name | VARCHAR(200) | Snapshot of model used |
| api_url | VARCHAR(500) | Snapshot of endpoint used |
| raw_response | TEXT | Unprocessed AI response |
| parsed_response | JSONB | Validated structured output |
| response_time_ms | INTEGER | |
| token_usage | JSONB | { prompt_tokens, completion_tokens, total_tokens } |
| status | VARCHAR(20) | CHECK IN ('pending', 'running', 'success', 'error') |
| error_message | TEXT | |
| created_at | TIMESTAMPTZ | |
### Entity files
- `backend/src/modules/shadow-ai/entities/shadow-ai-model.entity.ts`
- `backend/src/modules/shadow-ai/entities/shadow-run.entity.ts`
- `backend/src/modules/shadow-ai/entities/shadow-run-result.entity.ts`
All use `@Entity({ schema: 'shared', name: '...' })` pattern.
## Phase 4: Shadow AI Backend Module
### New directory: `backend/src/modules/shadow-ai/`
### `shadow-ai.service.ts`
**Model CRUD:**
- `getModels()` — Return both slots, mask API keys (show last 4 chars)
- `upsertModel(slot, dto)` — INSERT/UPDATE config for slot A or B
- `deleteModel(slot)` — Remove model config
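The key masking in `getModels()` can be a one-line helper. `maskApiKey` is a hypothetical helper name for illustration, not part of the planned service API:

```typescript
// Sketch: never return a stored API key in full; show only the last 4 chars.
export function maskApiKey(key: string): string {
  if (key.length <= 4) return '****';
  return '****' + key.slice(-4);
}
```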
**Run Execution:**
- `triggerRun(tenantId, feature, userId)`:
  1. Look up the tenant's `schema_name` from `shared.organizations`
  2. Build prompt messages by calling the appropriate exposed method:
     - `operating_health`: create query runner → set `search_path` → `healthScoresService.gatherOperatingData(qr)` → `healthScoresService.buildOperatingPrompt(data)`
     - `reserve_health`: same pattern with the reserve methods
     - `investment_recommendations`: `investmentPlanningService.buildPromptForSchema(schemaName)`
  3. Insert a `shadow_runs` row with `prompt_messages` stored as JSONB
  4. Get the production config from env vars and the alternate configs from the DB
  5. Insert 1-3 `shadow_run_results` rows as 'pending' (production + active alternates)
  6. Return `{ runId }` immediately
  7. Fire-and-forget: call all models in parallel using `callOpenAICompatible()`
     - Per feature: operating/reserve use temperature 0.1, max_tokens 2048; investment uses temperature 0.3, max_tokens 4096
  8. Update each result row as it completes (success/error, parsed response, timing)
  9. Update the run status when all calls complete
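The parallel fan-out in steps 7-9 can be sketched as below. This is a sketch under stated assumptions: `executeShadowRun`, `onResult` (standing in for the per-row DB update), and the target shape are illustrative names, not the service's actual signatures.

```typescript
type Outcome =
  | { status: 'success'; value: unknown }
  | { status: 'error'; error: string };

// Call every configured model in parallel, report each outcome as it
// settles, and derive the run's final status from the success count.
export async function executeShadowRun(
  targets: Array<{ role: string; call: () => Promise<unknown> }>,
  onResult: (role: string, outcome: Outcome) => void,
): Promise<'completed' | 'partial' | 'failed'> {
  const flags = await Promise.all(
    targets.map(async ({ role, call }) => {
      try {
        onResult(role, { status: 'success', value: await call() });
        return true;
      } catch (e) {
        onResult(role, { status: 'error', error: e instanceof Error ? e.message : String(e) });
        return false;
      }
    }),
  );
  const ok = flags.filter(Boolean).length;
  return ok === targets.length ? 'completed' : ok > 0 ? 'partial' : 'failed';
}
```

One failed model should not fail the run, which is why each call catches its own error before `Promise.all` sees it; that maps directly onto the 'partial' run status.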
**History:**
- `getRunHistory(page, limit, tenantFilter?, featureFilter?)` — Paginated list with tenant name JOIN
- `getRunDetail(runId)` — Full run + all results
### `shadow-ai.controller.ts`
All endpoints use `@UseGuards(JwtAuthGuard)` + `requireSuperadmin(req)` pattern from `admin.controller.ts`.
| Method | Path | Body/Params |
|--------|------|-------------|
| GET | `/admin/shadow-ai/models` | — |
| PUT | `/admin/shadow-ai/models/:slot` | `{ name, apiUrl, apiKey, modelName, isActive }` |
| DELETE | `/admin/shadow-ai/models/:slot` | — |
| POST | `/admin/shadow-ai/runs` | `{ tenantId, feature }` |
| GET | `/admin/shadow-ai/runs` | `?page&limit&tenantId&feature` |
| GET | `/admin/shadow-ai/runs/:id` | — |
### `shadow-ai.module.ts`
```typescript
@Module({
  imports: [
    TypeOrmModule.forFeature([ShadowAiModel, ShadowRun, ShadowRunResult]),
    HealthScoresModule,
    InvestmentPlanningModule,
    UsersModule,
  ],
  controllers: [ShadowAiController],
  providers: [ShadowAiService],
})
```
### Register in `backend/src/app.module.ts`
- Add `import { ShadowAiModule }` and include in the `imports` array
## Phase 5: Frontend — Admin Shadow AI Page
### New file: `frontend/src/pages/admin/AdminShadowAiPage.tsx`
**Layout**: Mantine `Tabs` with 3 tabs
#### Tab 1: "Model Configuration"
- Three `Card` components in a `SimpleGrid cols={3}`:
  - **Production** (read-only): shows the model name and API URL, either from a dedicated endpoint or as a hardcoded "From environment config" label
  - **Alternate A**: form with `TextInput` (name, API URL, model name), `PasswordInput` (API key), `Switch` (active), and Save/Delete buttons
  - **Alternate B**: the same form
- Fetches via `GET /api/admin/shadow-ai/models`
- Saves via `PUT /api/admin/shadow-ai/models/A` or `/B`
#### Tab 2: "Run Comparison"
- `Select` dropdown for tenant (reuse `GET /api/admin/organizations` already used by AdminPage)
- `Select` for feature type (Operating Health / Reserve Health / Investment Recommendations)
- `Button` "Run Shadow Comparison"
- On trigger: `POST /api/admin/shadow-ai/runs` → get `runId`
- Poll `GET /api/admin/shadow-ai/runs/:id` every 3s via `refetchInterval` until status !== 'running'
- Show per-model progress indicators during run
- Once complete, render results using shared comparison component (below)
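The wait-for-completion logic above can be expressed framework-agnostically. In the page itself this is expected to be handled by the query library's `refetchInterval`; the standalone loop below (illustrative names throughout) only makes the stop condition explicit: keep fetching until the run leaves 'running'.

```typescript
// Poll a run endpoint until its status is no longer 'running'.
export async function pollUntilDone<T extends { status: string }>(
  fetchRun: () => Promise<T>,
  intervalMs = 3000,
  maxAttempts = 200, // give up after roughly 10 minutes at the default interval
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await fetchRun();
    if (run.status !== 'running') return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('Shadow run polling timed out');
}
```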
#### Tab 3: "History"
- `Table` with columns: Date, Tenant, Feature, Status (Badge), Duration
- Filter controls: tenant Select, feature Select
- Click row → expand detail or modal showing full comparison
- Pagination
#### Shared Component: Side-by-Side Results Display
- `SimpleGrid cols={3}` (or fewer columns if only some models were configured)
- Each column:
  - Header: model name + response-time `Badge`
  - **For health scores**: score with `RingProgress`, label `Badge`, summary text, factors list (color-coded by impact), recommendations list (color-coded by priority)
  - **For investment**: overall assessment text, recommendation cards with type/priority badges, risk notes
  - Collapsible raw JSON via `Accordion`
- **Diff highlighting**: Where parsed values differ across models, apply subtle background highlight (e.g., `yellow.0` in Mantine theme). Simple recursive comparison of JSON keys/values.
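The "simple recursive comparison of JSON keys/values" can be sketched as a pure function that returns the paths at which the parsed outputs disagree; the UI can then highlight any value whose path is in that set. `diffPaths` is an illustrative name, not the component's actual API.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Recursively compare N parsed outputs and collect the dot-paths where
// they differ. A missing key is treated as null, so a value present in
// only some outputs is flagged as a difference.
export function diffPaths(outputs: Json[], prefix = ''): string[] {
  const here = prefix || '$';
  const first = outputs[0];
  const isObj = (v: Json) => v !== null && typeof v === 'object' && !Array.isArray(v);
  // Mixed shapes (object vs. array vs. scalar) count as a difference here.
  if (!outputs.every((o) => isObj(o) === isObj(first) && Array.isArray(o) === Array.isArray(first))) {
    return [here];
  }
  if (Array.isArray(first)) {
    const len = Math.max(0, ...outputs.map((o) => (o as Json[]).length));
    const paths: string[] = [];
    for (let i = 0; i < len; i++) {
      paths.push(...diffPaths(outputs.map((o) => (o as Json[])[i] ?? null), `${here}[${i}]`));
    }
    return paths;
  }
  if (isObj(first)) {
    const keys = new Set(outputs.flatMap((o) => Object.keys(o as object)));
    const paths: string[] = [];
    for (const k of keys) {
      paths.push(
        ...diffPaths(
          outputs.map((o) => (o as Record<string, Json>)[k] ?? null),
          prefix ? `${prefix}.${k}` : k,
        ),
      );
    }
    return paths;
  }
  return outputs.every((o) => o === first) ? [] : [here];
}
```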
### Route addition: `frontend/src/App.tsx`
Within the `/admin` route group (after `<Route index element={<AdminPage />} />`):
```tsx
<Route path="shadow-ai" element={<AdminShadowAiPage />} />
```
### Sidebar nav: `frontend/src/components/layout/Sidebar.tsx`
In the `isAdminOnly` section (after the "Admin Panel" NavLink, around line 134):
```tsx
<NavLink
  label="AI Benchmarking"
  leftSection={<IconScale size={18} />}
  active={location.pathname === '/admin/shadow-ai'}
  onClick={() => go('/admin/shadow-ai')}
  color="violet"
/>
```
## Implementation Order
1. **`ai-caller.ts`** — Shared utility (no dependencies)
2. **Health scores + investment planning** — Make methods public, add exports, add `buildPromptForSchema`
3. **Entities** — 3 TypeORM entity files
4. **Service + Controller + Module** — Shadow AI backend
5. **Register module** in `app.module.ts`
6. **Frontend page** — `AdminShadowAiPage.tsx`
7. **Route + Sidebar** — Wire up navigation
## Verification
1. **Backend**: Start server, confirm no TypeORM errors for new entities
2. **Model config**: Use admin UI to save/load/delete alternate model configs
3. **Run comparison**: Select a tenant, trigger a run, verify all 3 models are called with identical prompts
4. **Results display**: Confirm side-by-side output renders correctly for all 3 feature types
5. **History**: Verify past runs are persisted and browsable
6. **Auth**: Confirm non-superadmin users get 403 on all shadow-ai endpoints
7. **Production safety**: Verify no changes to production AI behavior — shadow runs are completely isolated
## Key Files to Modify
- `backend/src/modules/health-scores/health-scores.service.ts` — Make 5 methods public
- `backend/src/modules/health-scores/health-scores.module.ts` — Add exports
- `backend/src/modules/investment-planning/investment-planning.service.ts` — Add `buildPromptForSchema()`, make `buildPromptMessages()` public
- `backend/src/modules/investment-planning/investment-planning.module.ts` — Add exports
- `backend/src/app.module.ts` — Register ShadowAiModule
- `frontend/src/App.tsx` — Add route
- `frontend/src/components/layout/Sidebar.tsx` — Add nav item
## New Files
- `backend/src/common/utils/ai-caller.ts`
- `backend/src/modules/shadow-ai/shadow-ai.module.ts`
- `backend/src/modules/shadow-ai/shadow-ai.service.ts`
- `backend/src/modules/shadow-ai/shadow-ai.controller.ts`
- `backend/src/modules/shadow-ai/entities/shadow-ai-model.entity.ts`
- `backend/src/modules/shadow-ai/entities/shadow-run.entity.ts`
- `backend/src/modules/shadow-ai/entities/shadow-run-result.entity.ts`
- `frontend/src/pages/admin/AdminShadowAiPage.tsx`