feat(hygiene): test fixture & data hygiene audit pipeline (Stream 3)

@AgentWrapper · checks n/a · feat/stream3-test-fixture-hygiene → next · 19 files · +66063 −0 · updated 1mo ago

Description

Summary

Stream 3 of the bug-analysis effort, rescoped in response to the user's reframing: auth recycling and fixture rotation are platform team responsibilities, not toolkit-team responsibilities. Toolkit maintainers' job is to classify bugs and escalate them to the right owner with a clear rationale, not to build infrastructure that probes the platform's auth/fixture state.

This PR ships:

A 10-cluster classification taxonomy that re-runs the historical bug corpus and bins each failure into a narrow technical category (AUTH_EXPIRED, PERMISSION_DENIED, RESOURCE_NOT_FOUND, INVALID_FIXTURE, ALREADY_EXISTS, RATE_LIMITED, QUOTA_EXHAUSTED, UPSTREAM_5XX, NO_DATA, OTHER).
A 5-bucket responsibility mapping that promotes each cluster to a top-level owner: PLATFORM_AUTH_ISSUE, PLATFORM_FIXTURE_ISSUE, UPSTREAM_API_BEHAVIOR, TOOLKIT_IMPLEMENTATION_BUG, DATA_QUALITY_GAP, plus an INVESTIGATE_INDIVIDUALLY fallback.
A per-bug rationale field auto-generated from the matched pattern reasons — the dashboard surfaces it directly so a reviewer sees WHY each bug was classified.
A toolkit-bug spot-check: when a PERMISSION_DENIED or INVALID_FIXTURE row's error text matches a "wrong on our side" pattern (insufficient_scope, Unexpected field, etc.), the row is demoted from its default platform bucket into TOOLKIT_IMPLEMENTATION_BUG. This is the precision pass that finds the actual code bugs we own.
bug_logs/dashboard_backup_v7/: copy of int-1's v6 dashboard with a new "Categorization" tab that visualises the bucket breakdown, supports click-to-filter rows, and shows a per-bug table with rationale + spreadsheet/log links.
bug_logs/platform_escalation_report.md: auto-generated escalation document the user can hand off to the platform team. Per-bucket sections with sample log IDs, top affected tools, and copy-pasteable escalation messages.
bug_logs/toolkit_bugs_to_investigate.md: small actionable list of actual toolkit code bugs the spot-check found.
A class-completeness regression test layer that fails make chk if a new cluster is added without a default bucket mapping, or if a regex change lets bugs regress into OTHER.
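
The classify, map, and spot-check flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in cluster.py or categorize.py: the regexes, the `Verdict` type, and the `classify` function are hypothetical, and the cluster-to-bucket dictionary is an assumption based on the bucket names. Only the cluster names, bucket names, and the two "wrong on our side" strings (insufficient_scope, Unexpected field) come from this description.

```python
import re
from typing import NamedTuple

# Hypothetical patterns, checked in order; the real regexes live in cluster.py.
CLUSTER_PATTERNS = [
    ("INVALID_FIXTURE", re.compile(r"\{\{\s*\w+\s*\}\}|\b422\b"), "unsubstituted template variable or rejected payload"),
    ("AUTH_EXPIRED", re.compile(r"\b401\b|token expired", re.I), "credential rejected or expired"),
    ("PERMISSION_DENIED", re.compile(r"\b403\b|permission denied", re.I), "account lacks permission"),
    ("RESOURCE_NOT_FOUND", re.compile(r"\b404\b|no longer exists", re.I), "referenced resource is gone"),
    ("RATE_LIMITED", re.compile(r"\b429\b|rate limit", re.I), "upstream throttled the call"),
]

# Assumed default bucket per cluster; anything unmapped falls through to manual triage.
DEFAULT_BUCKET = {
    "AUTH_EXPIRED": "PLATFORM_AUTH_ISSUE",
    "PERMISSION_DENIED": "PLATFORM_AUTH_ISSUE",
    "RESOURCE_NOT_FOUND": "PLATFORM_FIXTURE_ISSUE",
    "INVALID_FIXTURE": "PLATFORM_FIXTURE_ISSUE",
    "RATE_LIMITED": "UPSTREAM_API_BEHAVIOR",
    "NO_DATA": "DATA_QUALITY_GAP",
}

# The two "wrong on our side" strings named in this description.
TOOLKIT_BUG = re.compile(r"insufficient_scope|Unexpected field")


class Verdict(NamedTuple):
    cluster: str
    bucket: str
    rationale: str


def classify(error_text: str) -> Verdict:
    """Bin one error text into a cluster, then promote it to an owner bucket."""
    if not error_text.strip():
        return Verdict("NO_DATA", DEFAULT_BUCKET["NO_DATA"], "log payload is empty")
    cluster, rationale = "OTHER", "no pattern matched"
    for name, pattern, reason in CLUSTER_PATTERNS:
        if pattern.search(error_text):
            cluster, rationale = name, reason
            break
    bucket = DEFAULT_BUCKET.get(cluster, "INVESTIGATE_INDIVIDUALLY")
    # Spot-check: only these two clusters can be demoted into the toolkit bucket.
    if cluster in {"PERMISSION_DENIED", "INVALID_FIXTURE"}:
        hit = TOOLKIT_BUG.search(error_text)
        if hit:
            bucket = "TOOLKIT_IMPLEMENTATION_BUG"
            rationale += f"; demoted because error text contains '{hit.group(0)}'"
    return Verdict(cluster, bucket, rationale)
```

With these assumed patterns, `classify('422 "Unexpected field: description"')` lands in TOOLKIT_IMPLEMENTATION_BUG, while a plain 404 stays in PLATFORM_FIXTURE_ISSUE.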

What this PR does NOT do (rescope rationale)

The previous shape — daily audit cron, fixture validation pipeline, ClickHouse loader, Slack notifier — was the wrong shape. Removed:

❌ Daily cron audit that probed Composio's /connected_accounts endpoint
❌ Fixture validation against the live API
❌ Slack notifier for stale fixtures
❌ Ongoing test environment hygiene runbook
❌ ClickHouse loader specific to the audit
❌ The _CLUSTER_TO_CHECK dispatch and per-cluster check modules

The user's exact framing: "Whatever pipeline you're thinking of asking Stream 3 to make, let's not do all of that crap; that's not our responsibility." And: "If auth recycling is completely broken, you should just store that in some report, but we don't need to fix it, and I can escalate on my end to the platform team."

The classification work is the right shape and is preserved; the audit pipeline scaffolding around it is not. The Mercury testing-agents extension is documented as a Stream 8 follow-up for the cross-tool prevention surface that the integrator repo lacks.

What this PR fixes — with evidence

This PR addresses 701 bugs (475 canonical TEST_DATA + 223 reclassified from upstream RUNTIME_API_BEHAVIOR). Each bug traces to a specific production failure whose payload lives in bug_logs/all_logs.json. The evidence links below let reviewers dig in.

Why this fix

The original analysis identified 475 bugs as "test data" failures (expired tokens, deleted resources, account permission gaps, stale fixture IDs, hardcoded sentinels, unsubstituted template variables). These bugs need to be classified by responsibility so the right team can own each subset:

The platform team handles auth recycling + fixture rotation (461 bugs across PLATFORM_AUTH_ISSUE + PLATFORM_FIXTURE_ISSUE).
The observability/data-pipeline team handles the 152 NO_DATA bugs caused by the bug-corpus pipeline dropping payloads.
Engineering handles the 5 actual toolkit code bugs the spot-check identified (all felt/FELT_CREATE_PROJECT request schema bugs).
Upstream providers are responsible for 78 rate-limit / quota / 5xx bugs we just have to live with.

The dashboard, escalation report, and toolkit bugs list each surface the right slice of this for the right audience.
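
As a rough illustration of how per-bucket sections in the escalation report could be generated from per-bug records, here is a minimal sketch. The record fields (`log_id`, `tool`, `bucket`), the function name, and the exact markdown layout are assumptions for illustration; the sample log IDs are taken from this description.

```python
from collections import Counter, defaultdict


def render_escalation_report(bugs, sample_size=3):
    """Group bug records by bucket and render one markdown section per bucket."""
    by_bucket = defaultdict(list)
    for bug in bugs:
        by_bucket[bug["bucket"]].append(bug)

    lines = ["# Platform escalation report", ""]
    # Largest buckets first so the biggest hand-off leads the document.
    for bucket, rows in sorted(by_bucket.items(), key=lambda kv: -len(kv[1])):
        top_tools = Counter(row["tool"] for row in rows).most_common(3)
        lines.append(f"## {bucket} ({len(rows)} bugs)")
        lines.append("Top tools: " + ", ".join(f"{tool} {n}" for tool, n in top_tools))
        lines.append("Sample logs: " + ", ".join(row["log_id"] for row in rows[:sample_size]))
        lines.append("")
    return "\n".join(lines)


# Hypothetical records; field names are illustrative only.
SAMPLE = [
    {"log_id": "log_yxHPq7Pfbb3e", "tool": "zendesk", "bucket": "PLATFORM_FIXTURE_ISSUE"},
    {"log_id": "log_pxlkSjvqi_1e", "tool": "confluence", "bucket": "PLATFORM_FIXTURE_ISSUE"},
    {"log_id": "log_JHK8IsTTd8W5", "tool": "yelp", "bucket": "PLATFORM_AUTH_ISSUE"},
]
```

Calling `render_escalation_report(SAMPLE)` yields one section per bucket, ordered by bug count.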

Bonus finding: upstream classifier precision

The audit found that the upstream classifier mislabelled 223 bugs as RUNTIME_API_BEHAVIOR when they're actually TEST_DATA in disguise (401/403/404 with embedded fixture IDs, unsubstituted {{template_var}} strings, googledocs Permission denied to copy document with ID '...' errors with literal IDs). The reclassify driver re-routes them automatically. This is roughly 16% of the RUNTIME_API_BEHAVIOR bucket — much higher than my earlier 5.3% sampling estimate.
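
The "TEST_DATA in disguise" signals named above (an unsubstituted template variable, or a 401/403/404 that embeds a literal ID) could be detected with a check along these lines. This is a sketch under stated assumptions: the function name, both regexes, and the returned reason strings are hypothetical and are not taken from reclassify.py.

```python
import re
from typing import Optional

# Assumed shape of an unsubstituted template variable, e.g. {{clickmeeting_conference_id}}.
TEMPLATE_VAR = re.compile(r"\{\{\s*\w+\s*\}\}")
# Assumed shape of an embedded literal ID: "id '...'" or "id 12345" (5+ digits).
LITERAL_ID = re.compile(r"\bid\s+(?:'[^']+'|\d{5,})", re.I)


def looks_like_test_data(status: int, message: str) -> Optional[str]:
    """Return a reason string if the failure looks fixture-driven, else None."""
    if TEMPLATE_VAR.search(message):
        return "unsubstituted template variable reached the upstream API"
    if status in (401, 403, 404) and LITERAL_ID.search(message):
        return f"{status} response embeds a literal fixture id"
    return None
```

A row that returns a reason would be a candidate for re-routing out of RUNTIME_API_BEHAVIOR; a row that returns None keeps its upstream label.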

Where to find the evidence

Spreadsheet: bug analysis tracking sheet
Bug corpus (full payloads): /Users/equinox/.worktrees/integrator/int-1/bug_logs/all_logs.json — 1259 logs with full request/response bodies
Per-log HTML pages: /tmp/bug_logs_html/log_<id>.html (1253 individual evidence pages; these live in int-1)
Canonical classifications: /tmp/ci_classifications_final.json — the upstream LLM classifier output
Dashboard with new categorization tab: bug_logs/dashboard_backup_v7/bug_analysis.html
Categorization data (programmatic): bug_logs/dashboard_backup_v7/bug_categorization.json
Platform escalation document: bug_logs/platform_escalation_report.md
Toolkit bugs to investigate: bug_logs/toolkit_bugs_to_investigate.md

Bucket distribution

Bucket	Count	Owner	Action
`PLATFORM_FIXTURE_ISSUE`	304	`platform-team`	escalate
`PLATFORM_AUTH_ISSUE`	157	`platform-team`	escalate
`DATA_QUALITY_GAP`	152	`observability-team`	re-fetch source data
`UPSTREAM_API_BEHAVIOR`	78	`upstream-provider`	acknowledge
`TOOLKIT_IMPLEMENTATION_BUG`	5	`toolkit-team`	fix
`INVESTIGATE_INDIVIDUALLY`	5	`manual-triage`	manual triage
Total	701

Cluster distribution

Cluster	Count	Δ vs canonical
`INVALID_FIXTURE`	181	+91 (template-vars + reclassified)
`NO_DATA`	152	unchanged
`PERMISSION_DENIED`	128	+13
`RESOURCE_NOT_FOUND`	124	+41
`UPSTREAM_5XX`	33	new from reclassification
`RATE_LIMITED`	30	new from reclassification
`AUTH_EXPIRED`	29	+7
`QUOTA_EXHAUSTED`	15	new cluster from audit
`OTHER`	5	-36 (88% reduction from 41 baseline)
`ALREADY_EXISTS`	4	+3
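
The cluster and bucket tables can be cross-checked arithmetically. The sketch below rolls the cluster counts up into buckets and compares the result with the bucket distribution. The counts are copied from the two tables; the cluster-to-bucket mapping is inferred from those totals rather than read from categorize.py, and attributing all 5 demoted rows to INVALID_FIXTURE is an assumption (it is the only attribution that keeps PLATFORM_AUTH_ISSUE at 157).

```python
from collections import Counter

CLUSTER_COUNTS = {
    "INVALID_FIXTURE": 181, "NO_DATA": 152, "PERMISSION_DENIED": 128,
    "RESOURCE_NOT_FOUND": 124, "UPSTREAM_5XX": 33, "RATE_LIMITED": 30,
    "AUTH_EXPIRED": 29, "QUOTA_EXHAUSTED": 15, "OTHER": 5, "ALREADY_EXISTS": 4,
}

BUCKET_COUNTS = {
    "PLATFORM_FIXTURE_ISSUE": 304, "PLATFORM_AUTH_ISSUE": 157,
    "DATA_QUALITY_GAP": 152, "UPSTREAM_API_BEHAVIOR": 78,
    "TOOLKIT_IMPLEMENTATION_BUG": 5, "INVESTIGATE_INDIVIDUALLY": 5,
}

# Inferred mapping (assumption): chosen so the roll-up reproduces the bucket table.
INFERRED_BUCKET = {
    "INVALID_FIXTURE": "PLATFORM_FIXTURE_ISSUE",
    "RESOURCE_NOT_FOUND": "PLATFORM_FIXTURE_ISSUE",
    "ALREADY_EXISTS": "PLATFORM_FIXTURE_ISSUE",
    "AUTH_EXPIRED": "PLATFORM_AUTH_ISSUE",
    "PERMISSION_DENIED": "PLATFORM_AUTH_ISSUE",
    "NO_DATA": "DATA_QUALITY_GAP",
    "UPSTREAM_5XX": "UPSTREAM_API_BEHAVIOR",
    "RATE_LIMITED": "UPSTREAM_API_BEHAVIOR",
    "QUOTA_EXHAUSTED": "UPSTREAM_API_BEHAVIOR",
    "OTHER": "INVESTIGATE_INDIVIDUALLY",
}

# Assumption: all 5 spot-check demotions came out of INVALID_FIXTURE.
DEMOTED = {"INVALID_FIXTURE": 5}


def roll_up(cluster_counts, mapping, demoted):
    """Sum cluster counts per bucket, moving demoted rows to the toolkit bucket."""
    totals = Counter()
    for cluster, count in cluster_counts.items():
        moved = demoted.get(cluster, 0)
        totals[mapping[cluster]] += count - moved
        totals["TOOLKIT_IMPLEMENTATION_BUG"] += moved
    return dict(totals)
```

Under these assumptions `roll_up(CLUSTER_COUNTS, INFERRED_BUCKET, DEMOTED)` equals `BUCKET_COUNTS`, and both tables sum to 701.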

Sample log IDs per bucket

PLATFORM_AUTH_ISSUE (157 bugs across 18 connected accounts; top tools: figma 79, formbricks 22, splitwise 9, yelp 6):

log_wPN47eZXq8kw — google_classroom/GOOGLE_CLASSROOM_COURSES_STUDENTS_CREATE (403)
log_JHK8IsTTd8W5 — yelp/YELP_GET_REVIEW_HIGHLIGHTS (403 NOT_AUTHORIZED)
log_U_iThStLgezU — fireflies/FIREFLIES_SET_USER_ROLE ("must have at least one admin")

PLATFORM_FIXTURE_ISSUE (304 bugs across many tools; top tools: clickmeeting 32, googledocs 23, google_classroom 22):

log_yxHPq7Pfbb3e — zendesk/ZENDESK_UPDATE_ZENDESK_ORGANIZATION — 404 on org id 32633044521757
log_pxlkSjvqi_1e — confluence/CONFLUENCE_ADD_CONTENT_LABEL — page id 9469953 no longer exists
Plus many bugs with unsubstituted {{template_var}} strings reaching the upstream API

TOOLKIT_IMPLEMENTATION_BUG (5 bugs, all in one tool):

log_iTvd-jlerJu9 — felt/FELT_CREATE_PROJECT — 422 "Unexpected field: description"
log_hPAHO-8x4NyJ — felt/FELT_CREATE_PROJECT — 422 "Unexpected field: organization_id"
Plus 3 more with similar unexpected-field errors.

The felt toolkit's RequestSchema declares description and organization_id but the upstream API rejects both. Engineering should fix the felt request schema to match the upstream API contract.

DATA_QUALITY_GAP (152 bugs):

log_ZNNdq4uCN-Uv, log_SSZt8aKo_1UR, log_1hEXRbxtx-I0 — empty log payloads in the source ClickHouse export. Observability team should re-run the export with the request/response columns included.

Misclassified upstream bugs (sample)

These were labelled RUNTIME_API_BEHAVIOR upstream but the pipeline correctly re-routes them:

log_LGw_L7ldR19P — google_classroom/GOOGLE_CLASSROOM_COURSES_ALIASES_CREATE (403) → PERMISSION_DENIED → PLATFORM_AUTH_ISSUE
log_LM3gY65V2r7z — clickmeeting/GET_SESSION_ATTENDEE_DETAILS — 404 with literal stale fixture id '9640999' → RESOURCE_NOT_FOUND → PLATFORM_FIXTURE_ISSUE
log_7ku09bAhphdW — clickmeeting/GET_SESSION_ATTENDEE_DETAILS — 404 with unsubstituted {{clickmeeting_conference_id}} → INVALID_FIXTURE → PLATFORM_FIXTURE_ISSUE

Class-completeness regression layer

watchdog/tests/test_bug_classification_completeness.py — 16 tests that fail make chk if:

Test	Catches
`test_every_cluster_has_a_default_bucket_or_falls_through_to_investigate`	New `CLUSTER_FOO` constant added without a bucket mapping
`test_every_cluster_has_a_rationale_string`	Cluster missing its dashboard rationale text
`test_every_bucket_has_owner_action_and_description`	Bucket missing owner / action / description
`test_dispatch_keys_only_reference_real_clusters`	Stale dispatch entry pointing at deleted cluster
`test_dispatch_values_only_reference_real_buckets`	Bucket value typo in dispatch dict
`test_dashboard_other_count_does_not_regress`	Regex change letting bugs regress into `OTHER` (baseline = 5)
`test_reclassify_taxonomy_runs_against_synthetic_inputs`	End-to-end smoke test that doesn't depend on the seed file
4 direct cluster regression tests	Regressions on the misclassification patterns the audit found (quota / template var / permission plural / sentinel-in-URL / lowercase-aaaa false positive)
4 toolkit-bug spot-check tests	Regressions in the toolkit-bug demotion logic
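
To show what the structural checks in this table amount to, here is a minimal sketch of three of them. The constants (`CLUSTERS`, `BUCKETS`, `DISPATCH`, `ALLOWED_FALLTHROUGH`) and the helper `unmapped_clusters` are hypothetical stand-ins for whatever the real modules export; only the cluster names, bucket names, and test names come from this description, and the dispatch entries are an assumed mapping.

```python
CLUSTERS = {
    "AUTH_EXPIRED", "PERMISSION_DENIED", "RESOURCE_NOT_FOUND", "INVALID_FIXTURE",
    "ALREADY_EXISTS", "RATE_LIMITED", "QUOTA_EXHAUSTED", "UPSTREAM_5XX",
    "NO_DATA", "OTHER",
}
BUCKETS = {
    "PLATFORM_AUTH_ISSUE", "PLATFORM_FIXTURE_ISSUE", "UPSTREAM_API_BEHAVIOR",
    "TOOLKIT_IMPLEMENTATION_BUG", "DATA_QUALITY_GAP", "INVESTIGATE_INDIVIDUALLY",
}
# Assumed dispatch; OTHER is deliberately unmapped and falls through to manual triage.
DISPATCH = {
    "AUTH_EXPIRED": "PLATFORM_AUTH_ISSUE",
    "PERMISSION_DENIED": "PLATFORM_AUTH_ISSUE",
    "RESOURCE_NOT_FOUND": "PLATFORM_FIXTURE_ISSUE",
    "INVALID_FIXTURE": "PLATFORM_FIXTURE_ISSUE",
    "ALREADY_EXISTS": "PLATFORM_FIXTURE_ISSUE",
    "RATE_LIMITED": "UPSTREAM_API_BEHAVIOR",
    "QUOTA_EXHAUSTED": "UPSTREAM_API_BEHAVIOR",
    "UPSTREAM_5XX": "UPSTREAM_API_BEHAVIOR",
    "NO_DATA": "DATA_QUALITY_GAP",
}
ALLOWED_FALLTHROUGH = {"OTHER"}


def unmapped_clusters(clusters, dispatch, allowed):
    """Clusters that have no bucket and are not explicitly allowed to fall through."""
    return set(clusters) - set(dispatch) - set(allowed)


def test_every_cluster_has_a_default_bucket_or_falls_through_to_investigate():
    assert unmapped_clusters(CLUSTERS, DISPATCH, ALLOWED_FALLTHROUGH) == set()


def test_dispatch_keys_only_reference_real_clusters():
    assert set(DISPATCH) <= CLUSTERS


def test_dispatch_values_only_reference_real_buckets():
    assert set(DISPATCH.values()) <= BUCKETS
```

The point of the explicit fall-through allowlist is that a newly added cluster fails the first test until someone either maps it or consciously adds it to the allowlist.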

What ships

Path	Purpose
`watchdog/pipelines/bug_classification/cluster.py`	10-cluster taxonomy + per-bug rationale generation
`watchdog/pipelines/bug_classification/categorize.py`	Top-level bucket mapping + toolkit-bug spot-check
`watchdog/pipelines/bug_classification/reclassify.py`	One-shot driver: reclassify the historical corpus
`watchdog/pipelines/bug_classification/dashboard_injector.py`	Idempotent HTML splicer for the v6→v7 dashboard tab
`watchdog/pipelines/bug_classification/cli.py`	Argparse CLI
`watchdog/tests/test_bug_classification_*.py`	34 pytest tests (cluster + completeness layers)
`bug_logs/dashboard_backup_v7/bug_analysis.html`	v7 dashboard (v6 + new Categorization tab)
`bug_logs/dashboard_backup_v7/bug_categorization.json`	Per-bug categorization data
`bug_logs/platform_escalation_report.md`	Hand-off doc for the platform team
`bug_logs/toolkit_bugs_to_investigate.md`	Engineering todo list (5 felt bugs at audit time)
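
One common way to make an HTML splice idempotent, as dashboard_injector.py is described as being, is to wrap the injected block in marker comments and replace the marked region on later runs. The sketch below illustrates that technique only; the marker strings and the `inject_tab` function are hypothetical, and the real injector may work differently.

```python
# Hypothetical marker comments that delimit the injected tab.
BEGIN = "<!-- categorization-tab:begin -->"
END = "<!-- categorization-tab:end -->"


def inject_tab(html: str, tab_html: str) -> str:
    """Insert tab_html before </body>, or replace a previously injected block."""
    block = f"{BEGIN}\n{tab_html}\n{END}"
    start = html.find(BEGIN)
    end = html.find(END)
    if start != -1 and end > start:
        # A block from an earlier run exists: overwrite it in place.
        return html[:start] + block + html[end + len(END):]
    body_close = html.rfind("</body>")
    if body_close == -1:
        # No closing body tag: append at the end of the document.
        return html + block
    return html[:body_close] + block + "\n" + html[body_close:]
```

Running `inject_tab` on its own output returns the same string, which is the property the test plan checks for the dashboard.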

Test plan

ruff format + ruff check + import sort all clean
All 34 pytest cases pass (18 cluster + 16 completeness)
Reclassify driver runs end-to-end against /tmp/ci_classifications_final.json + bug_logs/all_logs.json and produces all three artefacts
Dashboard injection is idempotent (running twice produces the same output)
OTHER bucket regression baseline tightened to 5
PR description includes evidence section per the standing principle
User reviews the platform escalation report and forwards to the platform team
Engineering reviews toolkit_bugs_to_investigate.md and files tickets for the 5 felt bugs

Refs

/Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_brief.md (Stream 3 section)
Cross-stream coordination: /Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_log.md

Rescope changelog

Earlier rounds of this PR built an audit pipeline + daily cron + Slack notifier on the assumption that toolkit maintainers should fix stale fixtures and expired auth tokens. That assumption was wrong. The user's reframing — auth recycling and fixture rotation are platform-managed, not toolkit-team-managed — is correct. This commit removes all the wrong-scope infrastructure (audit/cron/backfill/notifier/checks/runbook) and replaces it with the right shape: a classification taxonomy + categorization layer + escalation documents that surface the issues to the right owners.

Total tests passing now: 34 (down from 38; the 4 removed tests were deleted along with the audit modules). Cluster taxonomy precision: unchanged (OTHER still at 5 of 475 canonical bugs). Dashboard surface: one new top-level "Categorization" tab in v7. Escalation documents: 2 new self-contained markdown files ready for hand-off.
