@linear[bot]3mo ago
Objective
Run 5% of all workflows on multiple alternative models alongside the default model (Claude Sonnet 4.5) to compare quality, latency, and cost. All experiment PRs should be marked do-not-merge.
Models to Test
| Model | Provider | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| Claude Sonnet 4.5 (baseline) | Anthropic | $3.00 | $15.00 | Current default |
| Kimi K2.5 | Fireworks (OpenAI-compatible) | $0.60 | $2.50 | 1T params, 262K context, ~80% cheaper than Claude |
| GLM-5 | ZAI | $1.00 | $3.20 | 744B MoE (40B active), 200K context, MIT license |
| GPT-5.3 Codex | OpenAI | $1.75 | $14.00 | SOTA on SWE-Bench Pro & Terminal-Bench 2.0, 400K context |
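
To make the cost comparison concrete, here is a minimal sketch of how per-run cost could be derived from the list prices in the table above. The model keys and the `run_cost_usd` helper are hypothetical names for illustration; only the prices come from the table, and token counts would have to come from whatever usage data each provider returns.

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
# The dictionary keys are hypothetical identifiers, not confirmed model IDs.
PRICES_PER_MILLION = {
    "claude-sonnet-4.5": (3.00, 15.00),
    "kimi-k2.5": (0.60, 2.50),
    "glm-5": (1.00, 3.20),
    "gpt-5.3-codex": (1.75, 14.00),
}


def run_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one run in USD from token counts and list prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For example, a run with 200,000 input tokens and 20,000 output tokens would come to about $0.90 on the baseline and about $0.17 on Kimi K2.5 at these prices.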
Implementation Approach
- 5% traffic routing: Randomly select 5% of workflow runs to also run on an alternative model
- Shadow experiment: Both default and alternative model run; alternative model PRs tagged `do-not-merge`
- No production impact: Experiment runs are purely for comparison, not replacing production output
- Metrics collection: Log model used, cost, latency, and quality scores for each run
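
The approach above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RunPlan`, `plan_run`, `metrics_record`, and the model identifiers are hypothetical names, and picking the alternative model uniformly at random is an assumption, since the issue does not say how the alternative is chosen.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

EXPERIMENT_RATE = 0.05  # 5% of workflow runs, per the issue
DEFAULT_MODEL = "claude-sonnet-4.5"  # hypothetical identifier
ALTERNATIVE_MODELS = ("kimi-k2.5", "glm-5", "gpt-5.3-codex")  # hypothetical identifiers


@dataclass(frozen=True)
class RunPlan:
    default_model: str                      # always runs; production output comes from here
    shadow_model: Optional[str] = None      # None means no experiment run
    shadow_pr_labels: Tuple[str, ...] = ()  # labels for the shadow PR only


def plan_run(rng: random.Random) -> RunPlan:
    """Decide whether this workflow run also gets a shadow run."""
    if rng.random() < EXPERIMENT_RATE:
        # Assumption: choose one alternative uniformly at random.
        shadow = rng.choice(ALTERNATIVE_MODELS)
        return RunPlan(DEFAULT_MODEL, shadow, ("do-not-merge",))
    return RunPlan(DEFAULT_MODEL)


def metrics_record(model: str, cost_usd: float, latency_s: float,
                   quality_score: float) -> dict:
    """Build one log record with the fields listed under metrics collection."""
    return {
        "model": model,
        "cost_usd": cost_usd,
        "latency_s": latency_s,
        "quality_score": quality_score,
    }
```

Passing the random generator in explicitly keeps the 5% selection testable with a fixed seed. The default model is part of every plan, so the shadow run never replaces production output.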
Scope
- All workflows (create-app, fix-action, test-and-fix-action, finder, etc.)
- All GenericAgent instances within those workflows
- Exclude: `test_and_fix_agent` and `fix_action_agent` from model swap (mission-critical, per Samvit)
- Exclude: `response_schema_*` agents (separate optimization track)
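
The exclusions above could be checked with a small name filter like the one below. This is a sketch that assumes agents can be identified by a string name; `is_eligible_for_model_swap` and the example agent names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the exclusion list above; "*" matches any suffix.
EXCLUDED_AGENT_PATTERNS = (
    "test_and_fix_agent",
    "fix_action_agent",
    "response_schema_*",
)


def is_eligible_for_model_swap(agent_name: str) -> bool:
    """Return True if the agent may be run on an alternative model."""
    return not any(
        fnmatchcase(agent_name, pattern) for pattern in EXCLUDED_AGENT_PATTERNS
    )
```

With this filter, `is_eligible_for_model_swap("fix_action_agent")` is `False`, and so is any name starting with `response_schema_`, while a name such as `"finder_agent"` (a made-up example) passes.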