PR Custody

feat(bug-triage): add behavioral triage framework (stream 6)

@AgentWrapper · checks n/a · feat/stream6-behavioral-triage → next · 5 files · +13319 −0 · updated 2mo ago

Summary

Ships the Stream 6 deliverable from the 1,403-bug orchestrator brief: a standalone HTML triage surface for the 106 bugs coarsely classified as NOT_A_FAILURE (HTTP 2xx responses that QA flagged as behaviourally suspicious, plus a long tail of 4xx/5xx that the classifier could not place), plus the post-triage feedback script that closes the prevention loop (apply_triage_verdicts.py).

Zero runtime dependencies. Vanilla JS, inline CSS, no CDN, no build step. Opens directly with file://.
Cluster-first layout. Bugs grouped by group_key, largest clusters first (biggest is chatbotkit_response_mismatch × 10), so reviewers bulk-verdict whole clusters with one click where justified.
Every card shows request payload, full action response body, upstream HTTP calls with status codes, and the action's documented Pydantic response model — extracted via AST walk against a local mercury checkout (98/106 actions resolved automatically; the remaining 8 fall back to a manual GitHub link).
Links required by the brief wired up on every card: tc_id → spreadsheet, log_id → /tmp/bug_logs_html/log_<id>.html, action name + file path → mercury GitHub. Missing per-log files are now downgraded to plain-text "(missing on disk)" markers — never a dead link.
Verdicts, notes, reviewer name persist to localStorage. Export CSV / Import CSV with per-field merge so a local notes draft is never silently wiped.
Keyboard shortcuts: 1 / 2 / 3 = working / real bug / investigate, 0 clears, N jumps to the next unreviewed card below the current scroll position (with wrap-around).
Filters: free-text (tool, action, TC, log_id, group), unreviewed checkbox, verdict dropdown.
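
The AST walk mentioned in the list above could look roughly like the following. This is a minimal sketch, not the generator's actual code: the function name `extract_response_model` and the convention that a response model's class name ends in `Response` are assumptions made for illustration.

```python
import ast


def extract_response_model(source: str, suffix: str = "Response") -> dict:
    """Return {field_name: annotation} for the first class whose name ends with `suffix`.

    The `suffix` naming convention is an assumption of this sketch.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name.endswith(suffix):
            fields = {}
            for stmt in node.body:
                # Pydantic fields are annotated assignments: `name: type = default`.
                if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name):
                    fields[stmt.target.id] = ast.unparse(stmt.annotation)
            return fields
    return {}  # nothing found: the caller would fall back to a manual link


# Hypothetical action module used only to exercise the sketch.
SAMPLE = '''
from pydantic import BaseModel

class GetThingRequest(BaseModel):
    thing_id: str

class GetThingResponse(BaseModel):
    id: str
    tags: list[str] = []
'''
```

Because the source is only parsed and never imported, a walk like this works without the action's dependencies installed. An empty result corresponds to the manual-link fallback described above.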

What this PR fixes — with evidence

This PR addresses 106 bugs coarsely classified as DOESNT_NEED_CI / NOT_A_FAILURE by int-1's upstream classifier — HTTP 200/201 responses where QA flagged a perceived behavioural mismatch, plus a long tail of 4xx/5xx that the classifier could not place. Every bug in this bucket is inherently a human-judgement call: no static lint can decide whether an action's success response "looks wrong" without an oracle. The bugs trace to specific production failures captured in the 1,259-log ClickHouse export. Below are direct evidence links so reviewers can dig in.

Why this cleanup is needed

106 bugs sit in the backlog as "needs human" with nowhere to go — they block any future classifier re-run from shrinking the unresolved bucket, and the longer they stay unresolved the more the test suite drifts away from the actions' actual contract. This PR ships the triage surface that lets one reviewer power through them at scale, plus the feedback script (apply_triage_verdicts.py) that closes the loop by merging human verdicts back into the canonical classifications file. Without the feedback script the surface is a dead end; without the surface the feedback script has nothing to consume.

Sample evidence — top 10 clusters (total 52/106 bugs)

Every row lists the cluster name, its bug count, three sample log_ids (with file:// evidence pages in /tmp/bug_logs_html/), and the representative TC_* spreadsheet row keys. The spreadsheet lives at: https://docs.google.com/spreadsheets/d/1IgDRdSCjFbafOooYmT7KThN4kEtOKZGUSQeFLC3AZWA/edit?gid=1296040646

Cluster	N	Sample `log_id`s	Representative TC rows
`chatbotkit_response_mismatch`	10	`log_aHtH1VloNpVe`, `log_Na2AuIOP0rIH`, `log_3LrrUewAk-p8`	TC_Chatbotkit_071–080
`bamboohr_behavioral` ⚠️	9	`log_VzjD1W8lf_1E`, `log_Jmny5Cbc-Wvk`, `log_UAn5L89A1s8w`	TC028, TC029, TC098–100, TC145, TC146, TC172, TC188
`attio_missing_test_fields`	6	`log_VSwi4r0PpokA`, `log_Y4z6DrBReRF6`, `log_8vCB5o4hCNNo`	2, 13, 49, 51, 54, 63
`NOCLUSTER:clickmeeting:GET_CONFERENCE_FILES`	5	`log_TS1DrskgGTC1` †, `log_Ur8MoGcBSVIq`, `log_ORpvlcvQhCKG` †	TC61, TC62, TC63, TC64, TC66
`NOCLUSTER:campayn:CAMPAYN_GET_CONTACT`	4	`log_wjkGDK8NzNXL`, `log_MwB_r_EetKY8`, `log_Wz4ihNtUT928`	(no TC keys recorded)
`one_drive_500`	4	`log_y7kbac3AT3Jo`, `log_L0jN9vrsJ19W`, `log_uPxWQdEgJjl2`	TC131, TC132, TC134
`shotstack_response_mismatch`	4	`log_xKabzw_QraBy`, `log_SoJrvjAWab6E`, `log_5hLs6LTMXuzy`	TC16, TC35, TC56
`typefully_behavioral`	4	`log_tUlzRpVA598E`, `log_0Ge9xxbG2rTU`, `log_IHP6s9iINC3Z`	TC_TYPEFULLY_007, _035, _037
`NOCLUSTER:campayn:CAMPAYN_GET_REPORTS`	3	`log_xI9Gxw7pCMYm`, `log_ptNQgz9OHXGs`, `log_2dOiAmNs3gu` †	(no TC keys recorded)
`NOCLUSTER:clickmeeting:GET_SESSION_DETAILS`	3

Legend: ⚠️ = cluster flagged as mis-clustered (see finding below). † = per-log HTML evidence file missing on disk (rendered as a plain-text "(missing on disk)" marker on the page — see gap closure below).

To inspect any row in a browser, open file:///tmp/bug_logs_html/<log_id>.html directly. The triage page does this for you — the Log: field on each card links to the on-disk evidence file when it exists.

Cluster-integrity finding — `bamboohr_behavioral` (⚠️)

A spot-check of the 4 largest clusters surfaced one issue that is not a Stream 6 fix but that future readers of this PR should know about. bamboohr_behavioral lumps together bugs from 5 different actions, and the 3 BAMBOOHR_CREATE_FILE_CATEGORY bugs (log_UAn5L89A1s8w plus 2 siblings; TC098–100) report an XML-parse failure: "Failed to parse XML response for company files after create: not well-formed (invalid token): line 1, column 0". That looks like a real code bug, not a judgement call. The triage UI's per-bug verdict overrides are the correct escape hatch: a reviewer working the bamboohr_behavioral cluster can verdict those 3 bugs as real_bug individually (the bulk button stays available), and apply_triage_verdicts.py will reclassify them to NEEDS_CI on the next feedback run. The upstream re-clustering fix is owned by int-1; this PR flags it via a discovery inbox line and documents it here so the audit trail survives.

Class-completeness gap closed inline — 5 missing log evidence files

Five of the 106 bugs had no per-log HTML file in /tmp/bug_logs_html/ because the original ClickHouse export did not capture them. These bugs were already in the classifications: their error_summary field literally reads "Log data not found in ClickHouse export" (or "status=None, no error message" for one outlier). Before this fix, the page linked to file:// URLs that did not resolve. This PR detects the absence at generation time and downgrades those five entries to plain-text "(missing on disk)" markers with a tooltip. The affected bugs are still rendered on the page in full (the request, response, and response-model sections work), just without a working evidence link:

`log_id`	Tool / action	TC	`error_summary`
`log_6j8js1o3FHvO`	`google_classroom/GOOGLE_CLASSROOM_COURSE_WORK_STUDENT_SUBMISSIONS_RECLAIM`	TC_370	Log data not found in ClickHouse export
`log_TS1DrskgGTC1`	`clickmeeting/GET_CONFERENCE_FILES`	TC61	Log data not found in ClickHouse export
`log_ORpvlcvQhCKG`	`clickmeeting/GET_CONFERENCE_FILES`	TC63	Log data not found in ClickHouse export
`log_2dOiAmNs3gu`	`campayn/CAMPAYN_GET_REPORTS`	—	Log data not found in ClickHouse export
`log_oUgwmb3t8Qpu`	`bamboohr/BAMBOOHR_UPDATE_TIME_OFF_REQUEST`	TC146	status=None, no error message
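
The generation-time existence check described above might be sketched as follows. The helper name `render_log_link` and the exact markup are hypothetical; only the "(missing on disk)" marker text, the tooltip, and the `<log_id>.html` file naming come from this description.

```python
import html
from pathlib import Path


def render_log_link(log_id: str, html_dir: Path) -> str:
    """Return an anchor when the evidence file exists, else a plain-text marker."""
    evidence = html_dir / f"{log_id}.html"
    safe_id = html.escape(log_id)
    if evidence.is_file():
        # resolve() makes the path absolute, which as_uri() requires.
        return f'<a href="{evidence.resolve().as_uri()}">{safe_id}</a>'
    # No href is emitted, so the card cannot carry a dead link.
    return f'<span title="evidence file not found">{safe_id} (missing on disk)</span>'
```

Checking at generation time means a file that appears later needs a regeneration to become a link. The sketch accepts that trade-off so the static page makes no filesystem probes.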

8 bugs with unresolved mercury source (still rendered in full)

Eight bugs cannot be resolved to a mercury source file — their action either (a) has been renamed since the test case was recorded (which is Stream 4's slug-freshness audit territory) or (b) has a toolkit field in the classifications JSON that is a human-readable label instead of the canonical slug. Those cards still render with every other section intact; only the "Documented response model" panel falls back to "Response model unavailable — open on GitHub".

`log_id`	Tool / action	TC
`log_2r8PIiLvvYvQ`	`zoho/ZOHO_CONVERT_ZOHO_LEAD`	TC_ZOHO_088
`log_nsRi_aGUhRKs`	`zoho/ZOHO_CONVERT_ZOHO_LEAD`	TC_ZOHO_089
`log_HX6EFyuiHeNv`	`zoho/ZOHO_CONVERT_ZOHO_LEAD`	TC_ZOHO_090
`log_xabgu2Qyp3KV`	`api_sports/BOTPRESS_GET_TABLE_ROW` ‡	—
`log_07XvE4oAx00H`	`api_sports/BOTPRESS_DELETE_INTEGRATION_SHAREABLE_ID` ‡	—
`log_OfQ6FxggKtGE`	`booqable/BOOQABLE_GET_PRODUCTS`	TC_023
`log_xxdcRD_cqqtp`	`enginemailer/GET_CHECKEXPORT`	TC_ENGINEMAILER_171
`log_dISSGWwJttW3`	`figma/Get component set` ‡	TC472

Legend: ‡ = the classifier recorded a tool_slug that does not match the real mercury app (api_sports vs botpress) or used a human-readable label (Get component set). Surfaced to int-1's slug-freshness audit (Stream 4) via the inbox.

Evidence inventory pointers (for future readers)

Bug corpus (full request/response payloads, 1,259 logs): /Users/equinox/.worktrees/integrator/int-1/bug_logs/all_logs.json
Per-log HTML pages (1,253 individual evidence files): /tmp/bug_logs_html/log_<id>.html
Canonical classifications: /tmp/ci_classifications_final.json (filter bug_classifications where category=="DOESNT_NEED_CI" AND doesnt_need_ci_reason=="NOT_A_FAILURE")
Bug analysis dashboard: /Users/equinox/.worktrees/integrator/int-1/bug_logs/dashboard_backup_v6/bug_analysis.html
Orchestrator brief (Stream 6 section): /Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_brief.md
Post-triage feedback script: bug_logs/apply_triage_verdicts.py (this PR). Run python3 bug_logs/apply_triage_verdicts.py --dry-run to see the verdict→category mapping applied to a CSV without writing the output file.
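
For readers who want to reproduce the 106-bug selection from the canonical classifications, the filter named above can be sketched like this. It assumes `bug_classifications` is a top-level list of objects in the JSON file; that layout and the function names are assumptions, not something this PR confirms.

```python
import json
from pathlib import Path


def select_not_a_failure(doc: dict) -> list:
    """Keep bugs with category DOESNT_NEED_CI and reason NOT_A_FAILURE."""
    return [
        bug
        for bug in doc.get("bug_classifications", [])  # assumed top-level list
        if bug.get("category") == "DOESNT_NEED_CI"
        and bug.get("doesnt_need_ci_reason") == "NOT_A_FAILURE"
    ]


def load_not_a_failure(path: str) -> list:
    """Read the classifications file and apply the same filter."""
    return select_not_a_failure(json.loads(Path(path).read_text()))
```

Calling `load_not_a_failure("/tmp/ci_classifications_final.json")` would be expected to return the bugs rendered on the triage page, provided the file layout matches the assumption.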

Class-completeness audit

Run before merge per the cleanup-class principle (fix half and prevention half ship together):

Check	Result
Coverage — all 106 `NOT_A_FAILURE` bugs from `/tmp/ci_classifications_final.json` rendered	✅ 106/106, including the 8 with no auto-extracted response model (rendered with the "Response model unavailable" fallback + manual GitHub link)
Hyperlink — TC → spreadsheet	✅ present on all 106 cards
Hyperlink — `log_id` → per-log HTML	✅ 101/106 link to a real on-disk file; 5 render as "(missing on disk)" plain-text markers (four have an `error_summary` of "Log data not found in ClickHouse export", the fifth reads "status=None, no error message"); fixed in this PR by checking existence at generation time so no link is dead
Hyperlink — action → mercury GitHub	✅ 98/106 resolve to a real `apps/<slug>/actions/<file>.py` path; 8 unresolved (fallback "source not resolved" — these are mostly actions that have been renamed since the test cases were recorded, which is Stream 4's job to track down)
Cluster integrity (spot check)	⚠️ Findings below

Cluster-integrity findings

Spot-checked the 4 largest clusters + a singleton. The 46 clusters come pre-baked from int-1's upstream classifier (group_key field on each bug); Stream 6 faithfully renders them but did not derive them.

Cluster	Coherent?	Notes
`chatbotkit_response_mismatch` (10)	✅	Same tool/action × 10 test cases, same generic error
`bamboohr_behavioral` (9)	❌	Lumps 5 different actions; the 3 `BAMBOOHR_CREATE_FILE_CATEGORY` bugs report an XML-parse failure that looks like a real bug, not a behavioral judgment call
`attio_missing_test_fields` (6)	✅	Different actions but share "missing required fields" pattern (test-data issue)
`NOCLUSTER:clickmeeting:GET_CONFERENCE_FILES` (5)	⚠️	Same tool/action but mixed root causes (missing fields, 404, 200, log-data-not-found)
22 singletons	⚠️	All `NOCLUSTER:*` — upstream classifier punted on grouping for these

This is an upstream-classifier finding, not a Stream 6 fix-half gap. The triage UI's per-bug verdict overrides are the right escape hatch for a mixed cluster (a reviewer can click through the cluster and verdict bugs differently). Surfaced to int-1 as a discovery line in the orchestrator inbox.
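
A spot check like the one above can be partly automated by counting distinct actions per cluster. The sketch below is illustrative only: `group_key` is the field named in this PR, while the `action` field name and the `mixed_clusters` helper are assumptions.

```python
from collections import defaultdict


def mixed_clusters(bugs: list, min_actions: int = 2) -> dict:
    """Map each group_key to its sorted distinct actions, keeping only mixed clusters."""
    actions_by_cluster = defaultdict(set)
    for bug in bugs:
        # "action" is a hypothetical field name for the action slug.
        actions_by_cluster[bug["group_key"]].add(bug["action"])
    return {
        key: sorted(actions)
        for key, actions in actions_by_cluster.items()
        if len(actions) >= min_actions
    }
```

A cluster spanning several actions is a prompt for a closer look, not a verdict: attio_missing_test_fields also spans different actions yet is coherent, and GET_CONFERENCE_FILES is a single action with mixed root causes that this check would not catch.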

Prevention layer

For a behavioural-judgment bug class like NOT_A_FAILURE there is no clean static lint that prevents the class — these are HTTP 200 responses where QA flagged a perceived mismatch, with no oracle for "right" without a spec. The prevention half therefore takes a different shape: a feedback loop that ensures every triaged verdict reaches the next iteration of the classifier, so a bug a human flips to NEEDS_CI doesn't keep showing up as NOT_A_FAILURE on every re-run.

bug_logs/apply_triage_verdicts.py is that loop:

python3 bug_logs/apply_triage_verdicts.py \
    --classifications /tmp/ci_classifications_final.json \
    --csv bug_logs/behavioral_triage_results.csv \
    --out /tmp/ci_classifications_v7.json

CSV verdict	Action on the bug	New `category` / `doesnt_need_ci_reason`
`working_as_intended`	Stays in `NOT_A_FAILURE`, gains a `human_triage` audit stamp	`DOESNT_NEED_CI` / `NOT_A_FAILURE`
`real_bug`	Reclassified to `NEEDS_CI` (lint design left to int-1's slate, `proposed_lint=null`)	`NEEDS_CI` / `null`
`needs_investigation`	Reclassified to the human-review bucket	`DOESNT_NEED_CI` / `NEEDS_HUMAN`
(blank)	Untouched (no-op)	unchanged
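
The mapping in the table above can be expressed as a small lookup. This is a sketch under stated assumptions, not the script's implementation: the helper name `apply_verdict`, the shape of the `human_triage` stamp, and stamping every non-blank verdict (the table only mentions the stamp for `working_as_intended`) are all illustrative choices.

```python
import copy

# verdict -> (category, doesnt_need_ci_reason), taken from the table above
VERDICT_MAP = {
    "working_as_intended": ("DOESNT_NEED_CI", "NOT_A_FAILURE"),
    "real_bug": ("NEEDS_CI", None),
    "needs_investigation": ("DOESNT_NEED_CI", "NEEDS_HUMAN"),
}


def apply_verdict(bug: dict, verdict: str, reviewer: str = "") -> dict:
    """Return a reclassified copy of `bug`; a blank verdict returns it untouched."""
    verdict = verdict.strip()
    if not verdict:
        return bug  # blank row: no-op
    category, reason = VERDICT_MAP[verdict]  # raises KeyError on an unknown verdict
    updated = copy.deepcopy(bug)
    updated["category"] = category
    updated["doesnt_need_ci_reason"] = reason
    if verdict == "real_bug":
        updated["proposed_lint"] = None  # lint design is left to int-1
    # Assumed stamp shape; the real script may record different fields.
    updated["human_triage"] = {"verdict": verdict, "reviewer": reviewer}
    return updated
```

Returning a copy keeps the input record intact, which mirrors the rule that the classifications file is never rewritten in place.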

Hard rules baked in:

Stays in scope. Only bugs currently in NOT_A_FAILURE are eligible — CSV rows targeting bugs from other buckets are reported in skipped_off_scope rather than silently rewritten.
Never overwrites in place. Output goes to --out (default: ci_classifications_v7.json next to the input).
Idempotent. Re-running with the same CSV is a no-op.
Audit trail. Every run appends to a top-level triage_runs list on the output JSON.
Dry-run. --dry-run prints the summary table without writing the output file.
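
Two of the rules above, the scope guard and the default output path, can be sketched as below. The CSV column name `log_id`, the helper names, and the choice to report unknown log_ids as off-scope are assumptions for illustration; `skipped_off_scope` and the default file name come from the rules themselves.

```python
from pathlib import Path

IN_SCOPE = ("DOESNT_NEED_CI", "NOT_A_FAILURE")


def partition_rows(bugs_by_id: dict, rows: list) -> tuple:
    """Split CSV rows into (eligible rows, skipped_off_scope log_ids)."""
    eligible, skipped_off_scope = [], []
    for row in rows:
        bug = bugs_by_id.get(row["log_id"])  # "log_id" column name is assumed
        if bug and (bug.get("category"), bug.get("doesnt_need_ci_reason")) == IN_SCOPE:
            eligible.append(row)
        else:
            # Reported rather than silently rewritten.
            skipped_off_scope.append(row["log_id"])
    return eligible, skipped_off_scope


def default_out_path(classifications: Path) -> Path:
    """Default output sits next to the input, so the input file is never overwritten."""
    return classifications.with_name("ci_classifications_v7.json")
```

Partitioning first keeps the reclassification step simple: it only ever sees rows that target bugs currently in NOT_A_FAILURE.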

Files

Path	What it is
`bug_logs/build_behavioral_triage.py`	Generator CLI. All paths overridable via CLI flags or env vars (`CI_CLASSIFICATIONS`, `INTEGRATOR_BUG_LOGS`, `MERCURY_REPO`, `BUG_LOGS_HTML_DIR`). Pure stdlib.
`bug_logs/dashboard_backup_v6/behavioral_triage.html`	Generated surface — 106 cards, 46 clusters, 424 verdict buttons.
`bug_logs/behavioral_triage_results.csv`	Blank CSV template with the canonical column order.
`bug_logs/apply_triage_verdicts.py`	Prevention-layer script. Reads the triage CSV, emits an updated `ci_classifications_v7.json`.
`bug_logs/README.md`	Generator usage, reviewer workflow, post-triage feedback flow, design notes.
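
The "CLI flags or env vars" override for the generator could be wired up roughly as follows with the standard library. The environment variable names come from the table above; the flag names, the fallback defaults, and the `build_parser` helper are hypothetical.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Build the behavioral triage page (sketch)")
    # Flag names and fallback paths are assumptions; env var names are from the Files table.
    parser.add_argument("--classifications",
                        default=env.get("CI_CLASSIFICATIONS", "/tmp/ci_classifications_final.json"))
    parser.add_argument("--bug-logs", default=env.get("INTEGRATOR_BUG_LOGS"))
    parser.add_argument("--mercury-repo", default=env.get("MERCURY_REPO"))
    parser.add_argument("--html-dir",
                        default=env.get("BUG_LOGS_HTML_DIR", "/tmp/bug_logs_html"))
    return parser
```

With this arrangement an explicit flag wins over the environment variable, which in turn wins over the built-in default.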

Test plan

🤖 Generated with Claude Code
