PR Custody

feat(classification): bug classifier workflow using Claude Agents SDK

@AgentWrapper · checks n/a · feat/stream3-claude-sdk-classifier → next · 21 files · +2842 −27 · updated 2w ago

Description · 2 comments

Summary

Bug classification workflow using claude_agent_sdk.query() for lightweight, stateless 3-way evidence cross-reference. This is the "how we classified" code that powers the classification results in PR #1381.
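
The one-shot pattern described above might look like the following sketch. It assumes the `claude_agent_sdk` package exports `query`, `ClaudeAgentOptions`, `AssistantMessage`, and `TextBlock` as shown in its public README. `classify_once` and `parse_reply` are hypothetical names for illustration, not the functions in this PR, and the JSON reply shape is an assumption.

```python
import json


async def classify_once(user_prompt: str, system_prompt: str) -> str:
    """Run one stateless query and return the concatenated text reply."""
    # Imported lazily so this module still imports where the SDK is absent.
    from claude_agent_sdk import (
        AssistantMessage,
        ClaudeAgentOptions,
        TextBlock,
        query,
    )

    # max_turns=1 keeps the call one-shot: no follow-up turns, no session state.
    options = ClaudeAgentOptions(system_prompt=system_prompt, max_turns=1)
    chunks = []
    async for message in query(prompt=user_prompt, options=options):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    chunks.append(block.text)
    return "".join(chunks)


def parse_reply(reply: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    return json.loads(reply[start : end + 1])
```

A caller would drive it with something like `asyncio.run(classify_once(prompt, system_prompt))` from inside a logged-in Claude Code session, then pass the returned text to `parse_reply`.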

For each of 701 bugs, the workflow reads THREE evidence sources:

The reported error (from ci_classifications_final.json)
The full ClickHouse log payload (request/response bodies, headers, status codes)
The Mercury action source code (the Python execute() method)

It then classifies each bug into one of 5 responsibility buckets:

STALE_TEST_DATA — non-issue, close the bug report
EXPIRED_AUTH — non-issue, re-auth to retest
UPSTREAM_API_BEHAVIOR — not our fault, acknowledge
TOOLKIT_CODE_BUG — real code bug, engineering fix
INVESTIGATE_INDIVIDUALLY — evidence insufficient
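
As a rough illustration of the evidence and bucket shapes above, here is a minimal sketch using stdlib `dataclasses` and `enum` rather than the Pydantic models in `models.py`. The field names on `BugEvidence` and the `FOLLOW_UP` mapping are assumptions made for this example; only the bucket names and their descriptions come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(str, Enum):
    """The five responsibility buckets listed in the summary."""

    STALE_TEST_DATA = "STALE_TEST_DATA"
    EXPIRED_AUTH = "EXPIRED_AUTH"
    UPSTREAM_API_BEHAVIOR = "UPSTREAM_API_BEHAVIOR"
    TOOLKIT_CODE_BUG = "TOOLKIT_CODE_BUG"
    INVESTIGATE_INDIVIDUALLY = "INVESTIGATE_INDIVIDUALLY"


# Follow-up per bucket, paraphrasing the descriptions in the list above.
FOLLOW_UP = {
    Bucket.STALE_TEST_DATA: "non-issue, close the bug report",
    Bucket.EXPIRED_AUTH: "non-issue, re-auth to retest",
    Bucket.UPSTREAM_API_BEHAVIOR: "not our fault, acknowledge",
    Bucket.TOOLKIT_CODE_BUG: "real code bug, engineering fix",
    Bucket.INVESTIGATE_INDIVIDUALLY: "evidence insufficient",
}


@dataclass(frozen=True)
class BugEvidence:
    """Three evidence sources for one bug (field names are hypothetical)."""

    reported_error: str  # from ci_classifications_final.json
    log_payload: str     # full ClickHouse log payload
    action_source: str   # source of the Mercury action's execute() method
```

Subclassing `str` keeps the enum values JSON-serializable as plain strings, which is convenient when the model's reply is validated with `Bucket(reply["bucket"])`.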

Architecture

File	Purpose
`cortex/agents/bug_classifier/agent.py`	Uses `query()` from claude_agent_sdk for one-shot classification
`cortex/agents/bug_classifier/prompt.py`	System prompt + user prompt template with 3-way evidence
`cortex/agents/bug_classifier/models.py`	Pydantic models (BugEvidence, BugClassification, BatchClassificationResponse)
`cortex/agents/bug_classifier/runner.py`	Workflow orchestrator: reads inputs, batches by (tool,action), resolves Mercury files, runs agent, writes output
`watchdog/pipelines/bug_classification/cluster.py`	10-cluster taxonomy (shared with #1381)
`watchdog/pipelines/bug_classification/categorize.py`	5-bucket responsibility mapping (shared with #1381)

Key design decisions

query() not ClaudeSDKClient — lightweight, stateless, no MCP server overhead. Each batch is an independent one-shot query using the logged-in Claude Code session. No separate API key.
Batches by (tool, action) — each Mercury action file is read once, then all bugs sharing that action are classified together (up to 8 per batch).
temperature=0 to keep classifications as reproducible as possible (this reduces sampling variance but does not guarantee identical output across runs).
Fuzzy file resolution: when the action enum → file path mapping doesn't resolve exactly, the runner falls back to difflib.get_close_matches against the directory listing.
DATA_QUALITY_GAP is deterministic: bugs without log data skip the LLM entirely and are assigned DATA_QUALITY_GAP directly.
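
The batching, fuzzy-resolution, and no-log short-circuit decisions above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `runner.py`: bugs are modeled as plain dicts with hypothetical `tool`, `action`, and `log` keys, and the enum → file mapping is assumed to be the lowercased action name plus `.py`.

```python
import difflib
from collections import defaultdict
from itertools import islice

MAX_BATCH = 8  # "up to 8 per batch" per the description


def split_bugs(bugs):
    """Separate bugs with no log data from bugs that need the LLM."""
    gaps = [b for b in bugs if not b.get("log")]
    rest = [b for b in bugs if b.get("log")]
    return gaps, rest


def batch_by_action(bugs, size=MAX_BATCH):
    """Yield ((tool, action), chunk) pairs with at most `size` bugs each."""
    groups = defaultdict(list)
    for bug in bugs:
        groups[(bug["tool"], bug["action"])].append(bug)
    for key, group in groups.items():
        it = iter(group)
        while chunk := list(islice(it, size)):
            yield key, chunk


def resolve_action_file(action, filenames, cutoff=0.6):
    """Return the exact `<action>.py` name, else the closest match, else None."""
    wanted = f"{action.lower()}.py"  # assumed naming convention
    if wanted in filenames:
        return wanted
    close = difflib.get_close_matches(wanted, filenames, n=1, cutoff=cutoff)
    return close[0] if close else None
```

Bugs in `gaps` would be labeled DATA_QUALITY_GAP without any model call, while each chunk from `batch_by_action` would become one prompt that embeds the resolved action file once. The `cutoff` of 0.6 is simply the `difflib` default; a stricter value would trade missed matches for fewer wrong files.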

Usage

python -m cortex.agents.bug_classifier.runner \
    --classifications /tmp/ci_classifications_final.json \
    --logs /path/to/all_logs.json \
    --mercury-root /Users/equinox/mercury \
    --output bug_logs/dashboard_backup_v7/bug_categorization.json \
    --tc-row-map /tmp/tc_row_map.json
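
For orientation, the flags in the command above could be parsed along these lines. This is a sketch with `argparse`, not the runner's actual CLI code: which flags are required, the `Path` types, and the help strings are assumptions inferred from the summary, and the purpose of `--tc-row-map` is not stated in this description.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the usage command."""
    parser = argparse.ArgumentParser(prog="cortex.agents.bug_classifier.runner")
    parser.add_argument("--classifications", type=Path, required=True,
                        help="reported errors (ci_classifications_final.json)")
    parser.add_argument("--logs", type=Path, required=True,
                        help="exported log payloads")
    parser.add_argument("--mercury-root", type=Path, required=True,
                        help="checkout containing the Mercury action sources")
    parser.add_argument("--output", type=Path, required=True,
                        help="where to write the categorization JSON")
    # Treated as optional here; the description does not say what it maps.
    parser.add_argument("--tc-row-map", type=Path, default=None)
    return parser
```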

Relationship to PR #1381

PR #1381 has the classification results (dashboard, reports, seed data). This PR has the classification methodology (the agent that produces those results). They share the taxonomy code (cluster.py, categorize.py), which is included in both PRs for self-containment.

The previous iteration used gpt-4o via OPENAI_API_KEY; this replaces it with Claude via the Claude Agents SDK, following the patterns established in cortex/agents/ and customer-support.

Test plan

ruff format + ruff check pass
Code compiles and imports successfully
End-to-end run against the 701-bug dataset (blocked on Claude API access from this session — the query() function needs a live Claude Code session to execute)
Verify bucket distribution is ballpark-consistent with the gpt-4o baseline (STALE_TEST_DATA ~221, TOOLKIT_CODE_BUG ~131, EXPIRED_AUTH ~98)
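
The "ballpark-consistent" check in the last item could be automated roughly as follows. The baseline counts are the approximate figures quoted above; the 25% relative tolerance and the function name `distribution_drift` are assumptions chosen for illustration.

```python
from collections import Counter

# Approximate gpt-4o baseline counts quoted in the test plan.
BASELINE = {"STALE_TEST_DATA": 221, "TOOLKIT_CODE_BUG": 131, "EXPIRED_AUTH": 98}


def distribution_drift(buckets, baseline=BASELINE, tolerance=0.25):
    """Return {bucket: (expected, actual)} for buckets outside the tolerance."""
    counts = Counter(buckets)
    drift = {}
    for bucket, expected in baseline.items():
        actual = counts.get(bucket, 0)
        # Flag a bucket when it deviates by more than tolerance * expected.
        if abs(actual - expected) > tolerance * expected:
            drift[bucket] = (expected, actual)
    return drift
```

An empty result would mean every baseline bucket landed within tolerance; any entries point at the buckets worth inspecting by hand.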

🤖 Generated with Claude Code

@AgentWrapper1mo ago

@cursor review

@cursor[bot]agent1mo ago

Skipping Bugbot: Bugbot is disabled for this repository. Visit the Bugbot dashboard to update your settings.
