Stream 3 of the bug-analysis effort, rescoped in response to the user's reframing: auth recycling and fixture rotation are platform team responsibilities, not toolkit-team responsibilities. Toolkit maintainers' job is to classify bugs and escalate them to the right owner with a clear rationale, not to build infrastructure that probes the platform's auth/fixture state.
This PR ships:
- A 10-cluster taxonomy (AUTH_EXPIRED, PERMISSION_DENIED, RESOURCE_NOT_FOUND, INVALID_FIXTURE, ALREADY_EXISTS, RATE_LIMITED, QUOTA_EXHAUSTED, UPSTREAM_5XX, NO_DATA, OTHER).
- Top-level buckets: PLATFORM_AUTH_ISSUE, PLATFORM_FIXTURE_ISSUE, UPSTREAM_API_BEHAVIOR, TOOLKIT_IMPLEMENTATION_BUG, DATA_QUALITY_GAP, plus an INVESTIGATE_INDIVIDUALLY fallback.
- A toolkit-bug spot-check: when a PERMISSION_DENIED or INVALID_FIXTURE row's error text matches a "wrong on our side" pattern (insufficient_scope, Unexpected field, etc.), the row is demoted from its default platform bucket into TOOLKIT_IMPLEMENTATION_BUG. This is the precision pass that finds the actual code bugs we own.
- bug_logs/dashboard_backup_v7/: copy of int-1's v6 dashboard with a new "Categorization" tab that shows the bucket breakdown, click-to-filter rows, and a per-bug table with rationale + spreadsheet/log links.
- bug_logs/platform_escalation_report.md: auto-generated escalation document the user can hand off to the platform team. Per-bucket sections with sample log IDs, top affected tools, and copy-pasteable escalation messages.
- bug_logs/toolkit_bugs_to_investigate.md: small actionable list of actual toolkit code bugs the spot-check found.
- Completeness tests that fail make chk if a new cluster is added without a default bucket mapping, or if a regex change lets bugs regress into OTHER.

The previous shape (daily audit cron, fixture validation pipeline, ClickHouse loader, Slack notifier) was the wrong shape. Removed:
- /connected_accounts endpoint
- _CLUSTER_TO_CHECK dispatch and per-cluster check modules

The user's exact framing: "Whatever pipeline you're thinking of asking Stream 3 to make, let's not do all of that crap; that's not our responsibility." And: "If auth recycling is completely broken, you should just store that in some report, but we don't need to fix it, and I can escalate on my end to the platform team."
The classification work is the right shape and is preserved; the audit pipeline scaffolding around it is not. The Mercury testing-agents extension is documented as a Stream 8 follow-up, covering the cross-tool prevention surface that the integrator repo doesn't have.
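As a rough illustration of the spot-check demotion described above, the sketch below maps a cluster to a default bucket and demotes a row into TOOLKIT_IMPLEMENTATION_BUG when its error text matches a "wrong on our side" pattern. The names categorize, DEFAULT_BUCKET, and TOOLKIT_BUG_PATTERNS are hypothetical, and the cluster-to-bucket pairings are inferred from the counts and sample rows in this description rather than copied from categorize.py. Only the cluster names, bucket names, and the two pattern strings come from this PR.

```python
import re

# Assumed default dispatch, inferred from the counts in this PR description.
# OTHER is deliberately absent so it falls through to the fallback bucket.
DEFAULT_BUCKET = {
    "AUTH_EXPIRED": "PLATFORM_AUTH_ISSUE",
    "PERMISSION_DENIED": "PLATFORM_AUTH_ISSUE",
    "RESOURCE_NOT_FOUND": "PLATFORM_FIXTURE_ISSUE",
    "INVALID_FIXTURE": "PLATFORM_FIXTURE_ISSUE",
    "ALREADY_EXISTS": "PLATFORM_FIXTURE_ISSUE",
    "RATE_LIMITED": "UPSTREAM_API_BEHAVIOR",
    "QUOTA_EXHAUSTED": "UPSTREAM_API_BEHAVIOR",
    "UPSTREAM_5XX": "UPSTREAM_API_BEHAVIOR",
    "NO_DATA": "DATA_QUALITY_GAP",
}
FALLBACK_BUCKET = "INVESTIGATE_INDIVIDUALLY"

# Only these two clusters are spot-checked for toolkit-side bugs.
SPOT_CHECKED_CLUSTERS = {"PERMISSION_DENIED", "INVALID_FIXTURE"}

# The two pattern strings named in the PR; the real list may be longer.
TOOLKIT_BUG_PATTERNS = [
    re.compile(r"insufficient_scope", re.IGNORECASE),
    re.compile(r"Unexpected field", re.IGNORECASE),
]


def categorize(cluster: str, error_text: str) -> str:
    """Return the top-level bucket for one bug row."""
    if cluster in SPOT_CHECKED_CLUSTERS and any(
        pattern.search(error_text) for pattern in TOOLKIT_BUG_PATTERNS
    ):
        # Demote from the default platform bucket: this one is ours to fix.
        return "TOOLKIT_IMPLEMENTATION_BUG"
    return DEFAULT_BUCKET.get(cluster, FALLBACK_BUCKET)


if __name__ == "__main__":
    print(categorize("INVALID_FIXTURE", '422 "Unexpected field: description"'))
    print(categorize("PERMISSION_DENIED", "403 NOT_AUTHORIZED"))
    print(categorize("OTHER", "unrecognised failure"))
```

Keeping the spot-check as a separate step after the default lookup means a new pattern can be added without touching the dispatch table.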
This PR addresses 701 bugs (475 canonical TEST_DATA + 223 reclassified from upstream RUNTIME_API_BEHAVIOR). Bugs trace to specific production failures whose payloads live in bug_logs/all_logs.json. Below are evidence links so reviewers can dig in.
The original analysis identified 475 bugs as "test data" failures (expired tokens, deleted resources, account permission gaps, stale fixture IDs, hardcoded sentinels, unsubstituted template variables). These bugs need to be classified by responsibility so the right team can own each subset:
- felt/FELT_CREATE_PROJECT request schema bugs).

The dashboard, escalation report, and toolkit bugs list each surface the right slice of this for the right audience.
The audit found that the upstream classifier mislabelled 223 bugs as RUNTIME_API_BEHAVIOR when they're actually TEST_DATA in disguise (401/403/404 with embedded fixture IDs, unsubstituted {{template_var}} strings, googledocs Permission denied to copy document with ID '...' errors with literal IDs). The reclassify driver re-routes them automatically. This is roughly 16% of the RUNTIME_API_BEHAVIOR bucket — much higher than my earlier 5.3% sampling estimate.
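The "TEST_DATA in disguise" signals named above (a 401/403/404 carrying a literal fixture ID, or an unsubstituted template variable) could be detected along these lines. This is a minimal sketch, not the logic in reclassify.py: the function name looks_like_test_data and both regexes are assumptions, and the definition of a "literal ID" (a quoted token of six or more ID characters, or a bare run of five or more digits) is a guess chosen to fit the sample errors in this description.

```python
import re

# An unsubstituted placeholder such as {{clickmeeting_conference_id}}.
TEMPLATE_VAR = re.compile(r"\{\{\s*\w+\s*\}\}")

# Assumed shape of a literal fixture ID: a quoted ID-like token,
# or a bare run of five or more digits.
LITERAL_ID = re.compile(r"'[\w-]{6,}'|\b\d{5,}\b")

# Statuses that usually point at stale auth or stale fixtures.
FIXTURE_STATUSES = {401, 403, 404}


def looks_like_test_data(status: int, error_text: str) -> bool:
    """Heuristic: is this RUNTIME_API_BEHAVIOR row really a TEST_DATA bug?"""
    if TEMPLATE_VAR.search(error_text):
        # A placeholder reached the upstream API, whatever the status code.
        return True
    return status in FIXTURE_STATUSES and bool(LITERAL_ID.search(error_text))


if __name__ == "__main__":
    samples = [
        (404, "conference {{clickmeeting_conference_id}} not found"),
        (404, "no session for fixture id '9640999'"),
        (500, "internal server error"),
    ]
    for status, text in samples:
        print(status, looks_like_test_data(status, text), text)
```

A heuristic like this trades recall for simplicity, so rows it rejects would still need the normal cluster rules.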
- /Users/equinox/.worktrees/integrator/int-1/bug_logs/all_logs.json: 1259 logs with full request/response bodies
- /tmp/bug_logs_html/log_<id>.html: 1253 individual evidence pages (lives in int-1)
- /tmp/ci_classifications_final.json: the upstream LLM classifier output
- bug_logs/dashboard_backup_v7/bug_analysis.html
- bug_logs/dashboard_backup_v7/bug_categorization.json
- bug_logs/platform_escalation_report.md
- bug_logs/toolkit_bugs_to_investigate.md

| Bucket | Count | Owner | Action |
|---|---|---|---|
| PLATFORM_FIXTURE_ISSUE | 304 | platform-team | escalate |
| PLATFORM_AUTH_ISSUE | 157 | platform-team | escalate |
| DATA_QUALITY_GAP | 152 | observability-team | re-fetch source data |
| UPSTREAM_API_BEHAVIOR | 78 | upstream-provider | acknowledge |
| TOOLKIT_IMPLEMENTATION_BUG | 5 | toolkit-team | fix |
| INVESTIGATE_INDIVIDUALLY | 5 | manual-triage | manual triage |
| Total | 701 | | |

| Cluster | Count | Δ vs canonical |
|---|---|---|
| INVALID_FIXTURE | 181 | +91 (template-vars + reclassified) |
| NO_DATA | 152 | unchanged |
| PERMISSION_DENIED | 128 | +13 |
| RESOURCE_NOT_FOUND | 124 | +41 |
| UPSTREAM_5XX | 33 | new from reclassification |
| RATE_LIMITED | 30 | new from reclassification |
| AUTH_EXPIRED | 29 | +7 |
| QUOTA_EXHAUSTED | 15 | new bucket from audit |
| OTHER | 5 | -36 (88% reduction from 41 baseline) |
| ALREADY_EXISTS | 4 | +3 |

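The two tables above can be cross-checked with a few lines of arithmetic: both should account for the same 701 bugs, and the OTHER reduction should match the quoted 88%. The sketch below only restates the counts from the tables; the helper name table_total is hypothetical.

```python
# Counts copied from the bucket table.
BUCKET_COUNTS = {
    "PLATFORM_FIXTURE_ISSUE": 304,
    "PLATFORM_AUTH_ISSUE": 157,
    "DATA_QUALITY_GAP": 152,
    "UPSTREAM_API_BEHAVIOR": 78,
    "TOOLKIT_IMPLEMENTATION_BUG": 5,
    "INVESTIGATE_INDIVIDUALLY": 5,
}

# Counts copied from the cluster table.
CLUSTER_COUNTS = {
    "INVALID_FIXTURE": 181,
    "NO_DATA": 152,
    "PERMISSION_DENIED": 128,
    "RESOURCE_NOT_FOUND": 124,
    "UPSTREAM_5XX": 33,
    "RATE_LIMITED": 30,
    "AUTH_EXPIRED": 29,
    "QUOTA_EXHAUSTED": 15,
    "OTHER": 5,
    "ALREADY_EXISTS": 4,
}


def table_total(counts: dict[str, int]) -> int:
    """Sum one table's Count column."""
    return sum(counts.values())


# OTHER went from a baseline of 41 down to 5.
other_baseline = 41
other_reduction_pct = round(
    100 * (other_baseline - CLUSTER_COUNTS["OTHER"]) / other_baseline
)

if __name__ == "__main__":
    print("bucket total:", table_total(BUCKET_COUNTS))
    print("cluster total:", table_total(CLUSTER_COUNTS))
    print("OTHER reduction (%):", other_reduction_pct)
```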
PLATFORM_AUTH_ISSUE (157 bugs across 18 connected accounts; top tools: figma 79, formbricks 22, splitwise 9, yelp 6):

- log_wPN47eZXq8kw: google_classroom/GOOGLE_CLASSROOM_COURSES_STUDENTS_CREATE (403)
- log_JHK8IsTTd8W5: yelp/YELP_GET_REVIEW_HIGHLIGHTS (403 NOT_AUTHORIZED)
- log_U_iThStLgezU: fireflies/FIREFLIES_SET_USER_ROLE ("must have at least one admin")

PLATFORM_FIXTURE_ISSUE (304 bugs across many tools; top tools: clickmeeting 32, googledocs 23, google_classroom 22):

- log_yxHPq7Pfbb3e: zendesk/ZENDESK_UPDATE_ZENDESK_ORGANIZATION, 404 on org id 32633044521757
- log_pxlkSjvqi_1e: confluence/CONFLUENCE_ADD_CONTENT_LABEL, page id 9469953 no longer exists
- {{template_var}} strings reaching the upstream API

TOOLKIT_IMPLEMENTATION_BUG (5 bugs, all in one tool):

- log_iTvd-jlerJu9: felt/FELT_CREATE_PROJECT, 422 "Unexpected field: description"
- log_hPAHO-8x4NyJ: felt/FELT_CREATE_PROJECT, 422 "Unexpected field: organization_id"
- felt toolkit's RequestSchema declares description and organization_id but the upstream API rejects both. Engineering should fix the felt request schema to match the upstream API contract.

DATA_QUALITY_GAP (152 bugs):

- log_ZNNdq4uCN-Uv, log_SSZt8aKo_1UR, log_1hEXRbxtx-I0: empty log payloads in the source ClickHouse export. Observability team should re-run the export with the request/response columns included.

These were labelled RUNTIME_API_BEHAVIOR upstream but the pipeline correctly re-routes them:
- log_LGw_L7ldR19P: google_classroom/GOOGLE_CLASSROOM_COURSES_ALIASES_CREATE (403) → PERMISSION_DENIED → PLATFORM_AUTH_ISSUE
- log_LM3gY65V2r7z: clickmeeting/GET_SESSION_ATTENDEE_DETAILS, 404 with literal stale fixture id '9640999' → RESOURCE_NOT_FOUND → PLATFORM_FIXTURE_ISSUE
- log_7ku09bAhphdW: clickmeeting/GET_SESSION_ATTENDEE_DETAILS, 404 with unsubstituted {{clickmeeting_conference_id}} → INVALID_FIXTURE → PLATFORM_FIXTURE_ISSUE

watchdog/tests/test_bug_classification_completeness.py contains 16 tests that fail make chk on the regressions below:

| Test | Catches |
|---|---|
| test_every_cluster_has_a_default_bucket_or_falls_through_to_investigate | New CLUSTER_FOO constant added without a bucket mapping |
| test_every_cluster_has_a_rationale_string | Cluster missing its dashboard rationale text |
| test_every_bucket_has_owner_action_and_description | Bucket missing owner / action / description |
| test_dispatch_keys_only_reference_real_clusters | Stale dispatch entry pointing at deleted cluster |
| test_dispatch_values_only_reference_real_buckets | Bucket value typo in dispatch dict |
| test_dashboard_other_count_does_not_regress | Regex change letting bugs regress into OTHER (baseline = 5) |
| test_reclassify_taxonomy_runs_against_synthetic_inputs | End-to-end smoke test that doesn't depend on the seed file |
| 4 direct cluster regression tests | Re-introducing the patterns the audit found (quota / template var / permission plural / sentinel-in-URL / lowercase-aaaa false positive) |
| 4 toolkit-bug spot-check tests | Re-introducing the toolkit-bug demotion logic |

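To make the completeness checks in the table concrete, here is a hedged sketch of how a few of them might look in pytest style. The test names match the table, but the constants (CLUSTERS, BUCKETS, DISPATCH, FALLTHROUGH_CLUSTERS, OTHER_BASELINE), the helper functions, and the cluster-to-bucket pairings are assumptions for illustration and are not taken from the real test module.

```python
# Hypothetical stand-ins for the constants the real modules would export.
CLUSTERS = frozenset({
    "AUTH_EXPIRED", "PERMISSION_DENIED", "RESOURCE_NOT_FOUND",
    "INVALID_FIXTURE", "ALREADY_EXISTS", "RATE_LIMITED",
    "QUOTA_EXHAUSTED", "UPSTREAM_5XX", "NO_DATA", "OTHER",
})
BUCKETS = frozenset({
    "PLATFORM_AUTH_ISSUE", "PLATFORM_FIXTURE_ISSUE", "UPSTREAM_API_BEHAVIOR",
    "TOOLKIT_IMPLEMENTATION_BUG", "DATA_QUALITY_GAP",
    "INVESTIGATE_INDIVIDUALLY",
})
# Clusters allowed to have no dispatch entry (they reach the fallback bucket).
FALLTHROUGH_CLUSTERS = frozenset({"OTHER"})
DISPATCH = {
    "AUTH_EXPIRED": "PLATFORM_AUTH_ISSUE",
    "PERMISSION_DENIED": "PLATFORM_AUTH_ISSUE",
    "RESOURCE_NOT_FOUND": "PLATFORM_FIXTURE_ISSUE",
    "INVALID_FIXTURE": "PLATFORM_FIXTURE_ISSUE",
    "ALREADY_EXISTS": "PLATFORM_FIXTURE_ISSUE",
    "RATE_LIMITED": "UPSTREAM_API_BEHAVIOR",
    "QUOTA_EXHAUSTED": "UPSTREAM_API_BEHAVIOR",
    "UPSTREAM_5XX": "UPSTREAM_API_BEHAVIOR",
    "NO_DATA": "DATA_QUALITY_GAP",
}
OTHER_BASELINE = 5


def unmapped_clusters(clusters, dispatch, fallthrough):
    """Clusters with neither a dispatch entry nor an explicit fall-through."""
    return sorted(c for c in clusters if c not in dispatch and c not in fallthrough)


def stale_dispatch_keys(clusters, dispatch):
    """Dispatch keys that no longer name a real cluster."""
    return sorted(k for k in dispatch if k not in clusters)


def unknown_dispatch_values(buckets, dispatch):
    """Dispatch values that are not real buckets (typos, deleted buckets)."""
    return sorted({v for v in dispatch.values() if v not in buckets})


def test_every_cluster_has_a_default_bucket_or_falls_through_to_investigate():
    assert unmapped_clusters(CLUSTERS, DISPATCH, FALLTHROUGH_CLUSTERS) == []


def test_dispatch_keys_only_reference_real_clusters():
    assert stale_dispatch_keys(CLUSTERS, DISPATCH) == []


def test_dispatch_values_only_reference_real_buckets():
    assert unknown_dispatch_values(BUCKETS, DISPATCH) == []


def test_dashboard_other_count_does_not_regress():
    # The real test would presumably read this from bug_categorization.json.
    cluster_counts = {"OTHER": 5}
    assert cluster_counts["OTHER"] <= OTHER_BASELINE
```

Having the helpers return the offending names, instead of a bare boolean, makes a failing assertion show exactly which cluster or bucket broke the invariant.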
| Path | Purpose |
|---|---|
| watchdog/pipelines/bug_classification/cluster.py | 10-cluster taxonomy + per-bug rationale generation |
| watchdog/pipelines/bug_classification/categorize.py | Top-level bucket mapping + toolkit-bug spot-check |
| watchdog/pipelines/bug_classification/reclassify.py | One-shot driver: reclassify the historical corpus |
| watchdog/pipelines/bug_classification/dashboard_injector.py | Idempotent HTML splicer for the v6→v7 dashboard tab |
| watchdog/pipelines/bug_classification/cli.py | Argparse CLI |
| watchdog/tests/test_bug_classification_*.py | 34 pytest tests (cluster + completeness layers) |
| bug_logs/dashboard_backup_v7/bug_analysis.html | v7 dashboard (v6 + new Categorization tab) |
| bug_logs/dashboard_backup_v7/bug_categorization.json | Per-bug categorization data |
| bug_logs/platform_escalation_report.md | Hand-off doc for the platform team |
| bug_logs/toolkit_bugs_to_investigate.md | Engineering todo list (5 felt bugs at audit time) |

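The table describes dashboard_injector.py as an idempotent HTML splicer. One common way to get idempotence is to wrap the injected block in marker comments and replace whatever sits between them on later runs. The sketch below shows that technique under stated assumptions: the marker strings, the function name inject_tab, and the choice to insert before the closing body tag are all hypothetical, and the real injector may work differently.

```python
# Hypothetical marker comments that delimit the injected block.
BEGIN_MARKER = "<!-- categorization-tab:begin -->"
END_MARKER = "<!-- categorization-tab:end -->"


def inject_tab(html: str, tab_html: str) -> str:
    """Splice tab_html into html; re-running replaces the block in place."""
    block = f"{BEGIN_MARKER}\n{tab_html}\n{END_MARKER}"

    start = html.find(BEGIN_MARKER)
    end = html.find(END_MARKER)
    if start != -1 and end > start:
        # A previous run already injected a block: swap it out.
        return html[:start] + block + html[end + len(END_MARKER):]

    # First run: insert just before the closing body tag if there is one.
    body_close = html.lower().rfind("</body>")
    if body_close == -1:
        return html + "\n" + block + "\n"
    return html[:body_close] + block + "\n" + html[body_close:]


if __name__ == "__main__":
    v6 = "<html><body><p>v6 dashboard</p></body></html>"
    tab = '<section id="categorization">bucket breakdown</section>'
    v7 = inject_tab(v6, tab)
    # Running the injector a second time leaves the document unchanged.
    print(inject_tab(v7, tab) == v7)
```

With this approach the dashboard can be regenerated after every reclassification run without accumulating duplicate tabs.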
- ruff format + ruff check + import sort all clean
- /tmp/ci_classifications_final.json + bug_logs/all_logs.json and produces all three artefacts
- OTHER bucket regression baseline tightened to 5
- toolkit_bugs_to_investigate.md and files tickets for the 5 felt bugs

- /Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_brief.md (Stream 3 section)
- /Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_log.md

Earlier rounds of this PR built an audit pipeline + daily cron + Slack notifier on the assumption that toolkit maintainers should fix stale fixtures and expired auth tokens. That assumption was wrong. The user's reframing (auth recycling and fixture rotation are platform-managed, not toolkit-team-managed) is correct. This commit removes all the wrong-scope infrastructure (audit/cron/backfill/notifier/checks/runbook) and replaces it with the right shape: a classification taxonomy + categorization layer + escalation documents that surface the issues to the right owners.
Total tests passing now: 34 (down from 38; the 4 missing tests were deleted along with the audit modules). Total cluster taxonomy precision: unchanged (OTHER still at 5 of 475 canonical bugs). Total dashboard surface: +1 new top-level "Categorization" tab in v7. Total escalation documents: 2 new self-contained markdown files ready for hand-off.