feat: Add alternative model experiments for A/B testing (INT-1337)

@AgentWrapper · checks: n/a · feat/INT-1337-alternative-model-experiments → next · 8 files · +337 −29 · updated 3mo ago

Description

Why

We need to evaluate alternative models (Kimi K2.5, GLM-5, GPT-5.3 Codex) for cost and quality tradeoffs. This PR implements infrastructure to run 5% of workflows on alternative models for A/B testing.

What

Feature flags: Added configuration to enable alternative model experiments with 5% rollout
Model support: Integrated Kimi K2.5 (Fireworks), GLM-5 (ZAI), and GPT-5.3 Codex (OpenAI) with OpenAI-compatible clients
Smart routing: Automatically routes workflows to alternative models while excluding critical agents (test_and_fix_agent, fix_action_agent)
PR labeling: Automatically adds do-not-merge label to PRs created during experiments
Metrics tracking: Tags all metrics with alternative_model and stores selection in database
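
The workflow-level rollout decision described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `select_alternative_model` and the injectable `rng` parameter are hypothetical, and only the idea of an enable flag, a rollout percentage, and a model list comes from the description.

```python
import random
from typing import Optional, Sequence


def select_alternative_model(
    enabled: bool,
    rollout_pct: float,
    models: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return an alternative model name for this workflow run, or None for the default.

    Hypothetical helper: the real selection logic in the PR may differ.
    """
    if not enabled or not models or rollout_pct <= 0:
        return None
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so rollout_pct=100 always selects.
    if rng.random() * 100 >= rollout_pct:
        return None
    # Pick uniformly among the configured alternative models (an assumption).
    return rng.choice(list(models))
```

Passing a seeded `random.Random` makes the decision reproducible in tests, while production callers can omit it.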

How to Test

1. Set feature flags in workflow config (a rollout of 100 forces selection for testing):

{
  "feature_flags": {
    "alternative_models_enabled": true,
    "alternative_models_rollout_pct": 100,
    "alternative_models_list": ["kimi-k2.5"]
  }
}
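
One way to read and validate these flags is sketched below. The `AlternativeModelFlags` dataclass and `parse_flags` function are hypothetical names, and the defaults (disabled, 5% rollout, empty model list) are assumptions based on the notes in this description rather than confirmed behavior.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class AlternativeModelFlags:
    # Defaults are assumptions: disabled unless explicitly enabled, 5% rollout.
    enabled: bool = False
    rollout_pct: float = 5.0
    models: Tuple[str, ...] = ()


def parse_flags(config: Dict[str, Any]) -> AlternativeModelFlags:
    """Extract the alternative-model flags from a workflow config dict."""
    flags = config.get("feature_flags", {})
    pct = float(flags.get("alternative_models_rollout_pct", 5))
    if not 0 <= pct <= 100:
        raise ValueError(f"rollout pct must be within 0..100, got {pct}")
    return AlternativeModelFlags(
        enabled=bool(flags.get("alternative_models_enabled", False)),
        rollout_pct=pct,
        models=tuple(flags.get("alternative_models_list", [])),
    )
```

Rejecting an out-of-range percentage early keeps a typo in the config from silently routing every workflow, or none, to an experiment.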

2. Set required environment variables:

FIREWORKS_API_KEY for Kimi K2.5
ZAI_API_KEY for GLM-5
OPENAI_API_KEY for GPT-5.3 Codex
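
A quick pre-flight check for these variables might look like the sketch below. The environment variable names come from the list above; the function name `missing_api_keys` is hypothetical, and the model identifiers `glm-5` and `gpt-5.3-codex` are assumptions (only `kimi-k2.5` appears in the example config).

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence

# Model identifier -> required environment variable.
# "glm-5" and "gpt-5.3-codex" are assumed identifiers, not confirmed by the PR.
REQUIRED_KEY_BY_MODEL: Dict[str, str] = {
    "kimi-k2.5": "FIREWORKS_API_KEY",
    "glm-5": "ZAI_API_KEY",
    "gpt-5.3-codex": "OPENAI_API_KEY",
}


def missing_api_keys(
    models: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the env var names that are unset or empty for the given models."""
    env = os.environ if env is None else env
    missing: List[str] = []
    for model in models:
        if model not in REQUIRED_KEY_BY_MODEL:
            raise ValueError(f"unknown alternative model: {model}")
        var = REQUIRED_KEY_BY_MODEL[model]
        if not env.get(var):
            missing.append(var)
    return missing
```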

3. Trigger a workflow and verify:

Check logs for "Alternative model selected"
Verify PR has do-not-merge label
Check Datadog metrics for alternative_model tag
Query database for params_json.alternative_model

Pre-Review Checklist

I have self-reviewed this PR

Notes

This is infrastructure only: no workflows will use alternative models until the feature flag is enabled
Agent exclusions ensure critical agents (test_and_fix, fix_action) always use the default model
All PRs from experiments are tagged do-not-merge for manual review
The 5% rollout percentage is configurable via feature flags

Linear: INT-1337

@linear[bot] · 3mo ago

Objective

Run 5% of all workflows on multiple alternative models alongside the default model (Claude Sonnet 4.5) to compare quality, latency, and cost. All experiment PRs should be marked do-not-merge.

Models to Test

Model	Provider	Input $/M	Output $/M	Notes
Claude Sonnet 4.5 (baseline)	Anthropic	$3.00	$15.00	Current default
Kimi K2.5	Fireworks (OpenAI-compatible)	$0.60	$2.50	1T params, 262K context, ~80% cheaper than Claude
GLM-5	ZAI	$1.00	$3.20	744B MoE (40B active), 200K context, MIT license
GPT-5.3 Codex	OpenAI	$1.75	$14.00	SOTA on SWE-Bench Pro & Terminal-Bench 2.0, 400K context
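
To make the price comparison concrete, the sketch below computes the cost of a run from the per-million-token prices in the table and the relative saving against the baseline. The function names are hypothetical; the prices are copied from the table above.

```python
from typing import Dict, Tuple

# (input $/M tokens, output $/M tokens), taken from the table above.
PRICES_PER_MILLION: Dict[str, Tuple[float, float]] = {
    "Claude Sonnet 4.5": (3.00, 15.00),
    "Kimi K2.5": (0.60, 2.50),
    "GLM-5": (1.00, 3.20),
    "GPT-5.3 Codex": (1.75, 14.00),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one run with the given token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_vs_baseline(
    model: str,
    input_tokens: int,
    output_tokens: int,
    baseline: str = "Claude Sonnet 4.5",
) -> float:
    """Fraction saved relative to the baseline (0.8 means 80% cheaper)."""
    base = run_cost(baseline, input_tokens, output_tokens)
    if base == 0:
        return 0.0
    return 1.0 - run_cost(model, input_tokens, output_tokens) / base
```

On input tokens alone, Kimi K2.5 comes out 80% cheaper than the baseline ($0.60 vs $3.00 per million), which matches the note in the table; on output tokens alone the saving is about 83% ($2.50 vs $15.00 per million), so the actual figure depends on the token mix of each workflow.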

Implementation Approach

5% traffic routing: Randomly select 5% of workflow runs to also run on an alternative model
Shadow experiment: Both default and alternative model run; alternative model PRs tagged do-not-merge
No production impact: Experiment runs are purely for comparison, not replacing production output
Metrics collection: Log model used, cost, latency, and quality scores for each run
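
The per-run record and tagging described above might be shaped like the sketch below. Everything here is illustrative: `RunRecord`, `metric_tags`, and `pr_labels` are hypothetical names, and the `workflow` and `model` tags are assumptions. Only the `alternative_model` tag and the `do-not-merge` label come from the PR description.

```python
from dataclasses import dataclass
from typing import List, Optional

DO_NOT_MERGE_LABEL = "do-not-merge"


@dataclass(frozen=True)
class RunRecord:
    workflow: str
    model: str
    is_experiment: bool  # True when the run used an alternative model
    cost_usd: float
    latency_s: float
    quality_score: Optional[float] = None  # may be filled in after evaluation


def metric_tags(record: RunRecord) -> List[str]:
    """Build key:value tags for the run's metrics."""
    tags = [f"workflow:{record.workflow}", f"model:{record.model}"]
    if record.is_experiment:
        tags.append(f"alternative_model:{record.model}")
    return tags


def pr_labels(record: RunRecord) -> List[str]:
    """Experiment PRs are labeled so they are never merged by accident."""
    return [DO_NOT_MERGE_LABEL] if record.is_experiment else []
```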

Scope

All workflows (create-app, fix-action, test-and-fix-action, finder, etc.)
All GenericAgent instances within those workflows
Exclude: test_and_fix_agent and fix_action_agent from model swap (mission-critical, per Samvit)
Exclude: response_schema_* agents (separate optimization track)
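
The exclusion rules in this scope could be expressed as a small pattern match, sketched below. The agent names and the `response_schema_*` pattern come from the list above; the function names `is_excluded` and `model_for_agent` are hypothetical, and glob-style matching is an assumption about how the wildcard is meant to be read.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Exact names and one glob pattern, taken from the Scope list.
EXCLUDED_AGENT_PATTERNS = (
    "test_and_fix_agent",
    "fix_action_agent",
    "response_schema_*",
)


def is_excluded(agent_name: str) -> bool:
    """True if the agent must stay on the default model."""
    return any(fnmatchcase(agent_name, pattern) for pattern in EXCLUDED_AGENT_PATTERNS)


def model_for_agent(
    agent_name: str,
    default_model: str,
    alternative_model: Optional[str],
) -> str:
    """Excluded agents keep the default model even inside an experiment run."""
    if alternative_model is None or is_excluded(agent_name):
        return default_model
    return alternative_model
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.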
