Ships the Stream 6 deliverable from the 1,403-bug orchestrator brief: a
standalone HTML triage surface for the 106 bugs coarsely classified as
NOT_A_FAILURE (HTTP 2xx responses that QA flagged as behaviourally
suspicious, plus a long tail of 4xx/5xx that the classifier could not
place), plus the post-triage feedback script that closes the prevention
loop (apply_triage_verdicts.py).

- Standalone HTML page that opens directly from `file://`.
- Cards are grouped by `group_key`, largest clusters first (biggest is `chatbotkit_response_mismatch` × 10), so reviewers bulk-verdict whole clusters with one click where justified.
- Hyperlinks: `tc_id` → spreadsheet, `log_id` → `/tmp/bug_logs_html/log_<id>.html`, action name + file path → mercury GitHub. Missing per-log files are now downgraded to plain-text "(missing on disk)" markers, never a dead link.
- Verdicts and notes persist in `localStorage`. Export CSV / Import CSV with per-field merge so a local notes draft is never silently wiped.
- Keyboard shortcuts: `1` / `2` / `3` = working / real bug / investigate, `0` clears, `N` jumps to the next unreviewed card below the current scroll position (with wrap-around).

This PR addresses 106 bugs coarsely classified as DOESNT_NEED_CI / NOT_A_FAILURE
by int-1's upstream classifier — HTTP 200/201 responses where QA flagged a
perceived behavioural mismatch, plus a long tail of 4xx/5xx that the
classifier could not place. Every bug in this bucket is inherently a
human-judgement call: no static lint can decide whether an action's success
response "looks wrong" without an oracle. The bugs trace to specific
production failures captured in the 1,259-log ClickHouse export. Below are
direct evidence links so reviewers can dig in.

These 106 bugs sit in the backlog as "needs human" with nowhere to go. They
block any future classifier re-run from shrinking the unresolved bucket,
and the longer they stay unresolved the more the test suite drifts away
from the actions' actual contract. This PR ships the triage surface that
lets one reviewer power through them at scale, plus the feedback
script (apply_triage_verdicts.py) that closes the loop by merging
human verdicts back into the canonical classifications file. Without the
feedback script the surface is a dead end; without the surface the
feedback script has nothing to consume.

Each row of the table below lists the cluster name, its bug count, three sample log_ids
(with file:// evidence pages in /tmp/bug_logs_html/), and the
representative TC_* spreadsheet row keys. The spreadsheet lives at:
https://docs.google.com/spreadsheets/d/1IgDRdSCjFbafOooYmT7KThN4kEtOKZGUSQeFLC3AZWA/edit?gid=1296040646

| Cluster | N | Sample log_ids | Representative TC rows |
|---|---|---|---|
| chatbotkit_response_mismatch | 10 | log_aHtH1VloNpVe, log_Na2AuIOP0rIH, log_3LrrUewAk-p8 | TC_Chatbotkit_071–080 |
| bamboohr_behavioral ⚠️ | 9 | log_VzjD1W8lf_1E, log_Jmny5Cbc-Wvk, log_UAn5L89A1s8w | TC028, TC029, TC098–100, TC145, TC146, TC172, TC188 |
| attio_missing_test_fields | 6 | log_VSwi4r0PpokA, log_Y4z6DrBReRF6, log_8vCB5o4hCNNo | 2, 13, 49, 51, 54, 63 |
| NOCLUSTER:clickmeeting:GET_CONFERENCE_FILES | 5 | log_TS1DrskgGTC1 †, log_Ur8MoGcBSVIq, log_ORpvlcvQhCKG † | TC61, TC62, TC63, TC64, TC66 |
| NOCLUSTER:campayn:CAMPAYN_GET_CONTACT | 4 | log_wjkGDK8NzNXL, log_MwB_r_EetKY8, log_Wz4ihNtUT928 | (no TC keys recorded) |
| one_drive_500 | 4 | log_y7kbac3AT3Jo, log_L0jN9vrsJ19W, log_uPxWQdEgJjl2 | TC131, TC132, TC134 |
| shotstack_response_mismatch | 4 | log_xKabzw_QraBy, log_SoJrvjAWab6E, log_5hLs6LTMXuzy | TC16, TC35, TC56 |
| typefully_behavioral | 4 | log_tUlzRpVA598E, log_0Ge9xxbG2rTU, log_IHP6s9iINC3Z | TC_TYPEFULLY_007, _035, _037 |
| NOCLUSTER:campayn:CAMPAYN_GET_REPORTS | 3 | log_xI9Gxw7pCMYm, log_ptNQgz9OHXGs, log_2dOiAmNs3gu † | (no TC keys recorded) |
| NOCLUSTER:clickmeeting:GET_SESSION_DETAILS | 3 | log_ftyzZljhj7iq, log_pifGwBgaxkZT, log_rVgrMoAgmNX0 | TC116, TC117, TC118 |

Legend: ⚠️ = cluster flagged as mis-clustered (see finding below). † = per-log HTML evidence file missing on disk (rendered as a plain-text "(missing on disk)" marker on the page — see gap closure below).
To inspect any row in a browser, open file:///tmp/bug_logs_html/<log_id>.html
directly. The triage page does this for you — the Log: field on each
card links to the on-disk evidence file when it exists.

bamboohr_behavioral (⚠️)

Spot-check of the 4 largest clusters surfaced one real issue that is NOT
a Stream 6 fix but that future readers of this PR should know about:
bamboohr_behavioral lumps bugs across 5 different actions, and the
3 BAMBOOHR_CREATE_FILE_CATEGORY bugs (log_UAn5L89A1s8w, plus 2
siblings; TC098–100) report an XML-parse failure — "Failed to parse XML response for company files after create: not well-formed (invalid token): line 1, column 0" — that looks like a real code bug, not a judgement
call. The triage UI's per-bug verdict overrides are the correct escape
hatch: a reviewer working the bamboohr_behavioral cluster can verdict
those 3 bugs as real_bug individually (the bulk button stays
available), and apply_triage_verdicts.py will reclassify them to
NEEDS_CI on the next feedback run. The upstream re-clustering fix is
owned by int-1; this PR flags it via a discovery inbox line and
documents it here so the audit trail survives.

Five of the 106 bugs had their per-log HTML file absent from
/tmp/bug_logs_html/ because the original ClickHouse export did not
capture them. These bugs were already in the classifications; their
error_summary field literally reads "Log data not found in ClickHouse export" (or "status=None, no error message" for one outlier). Before this fix,
the page linked to file:// URLs that did not resolve. This PR detects the absence at
generation time and downgrades those five entries to plain-text
"(missing on disk)" markers with a tooltip. The affected bugs are
still rendered on the page in full (request/response/response-model
sections work), just without a working evidence link:

| log_id | Tool / action | TC | error_summary |
|---|---|---|---|
| log_6j8js1o3FHvO | google_classroom/GOOGLE_CLASSROOM_COURSE_WORK_STUDENT_SUBMISSIONS_RECLAIM | TC_370 | Log data not found in ClickHouse export |
| log_TS1DrskgGTC1 | clickmeeting/GET_CONFERENCE_FILES | TC61 | Log data not found in ClickHouse export |
| log_ORpvlcvQhCKG | clickmeeting/GET_CONFERENCE_FILES | TC63 | Log data not found in ClickHouse export |
| log_2dOiAmNs3gu | campayn/CAMPAYN_GET_REPORTS | (no TC keys recorded) | Log data not found in ClickHouse export |
| log_oUgwmb3t8Qpu | bamboohr/BAMBOOHR_UPDATE_TIME_OFF_REQUEST | TC146 | status=None, no error message |

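The generation-time existence check described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_behavioral_triage.py: the function name `render_log_link`, the `missing-log` class, and the tooltip wording are assumptions made for the example.

```python
import html
from pathlib import Path


def render_log_link(log_id: str, html_dir: Path) -> str:
    """Return an <a> tag when the evidence file exists, else a plain-text marker."""
    # log_id values in this PR already carry the "log_" prefix, so the
    # evidence file is "<log_id>.html" inside the per-log HTML directory.
    evidence = html_dir / f"{log_id}.html"
    safe_id = html.escape(log_id)
    if evidence.is_file():
        # as_uri() needs an absolute path, hence resolve().
        return f'<a href="{evidence.resolve().as_uri()}">{safe_id}</a>'
    # Hypothetical markup: the real class name and tooltip text are not
    # stated in this PR.
    return (
        '<span class="missing-log" title="No per-log HTML file was exported">'
        f"{safe_id} (missing on disk)</span>"
    )
```

Checking at generation time rather than in the browser keeps the page static: a script on a file:// page cannot reliably probe whether a sibling file exists.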
Eight bugs cannot be resolved to a mercury source file — their action
either (a) has been renamed since the test case was recorded (which is
Stream 4's slug-freshness audit territory) or (b) has a toolkit field
in the classifications JSON that is a human-readable label instead of
the canonical slug. Those cards still render with every other section
intact; only the "Documented response model" panel falls back to
"Response model unavailable — open on GitHub".

| log_id | Tool / toolkit | TC |
|---|---|---|
| log_2r8PIiLvvYvQ | zoho/ZOHO_CONVERT_ZOHO_LEAD | TC_ZOHO_088 |
| log_nsRi_aGUhRKs | zoho/ZOHO_CONVERT_ZOHO_LEAD | TC_ZOHO_089 |
| log_HX6EFyuiHeNv | zoho/ZOHO_CONVERT_ZOHO_LEAD | TC_ZOHO_090 |
| log_xabgu2Qyp3KV | api_sports/BOTPRESS_GET_TABLE_ROW ‡ | (no TC keys recorded) |
| log_07XvE4oAx00H | api_sports/BOTPRESS_DELETE_INTEGRATION_SHAREABLE_ID ‡ | (no TC keys recorded) |
| log_OfQ6FxggKtGE | booqable/BOOQABLE_GET_PRODUCTS | TC_023 |
| log_xxdcRD_cqqtp | enginemailer/GET_CHECKEXPORT | TC_ENGINEMAILER_171 |
| log_dISSGWwJttW3 | figma/Get component set ‡ | TC472 |

Legend: ‡ = the classifier recorded a tool_slug that does not
match the real mercury app (api_sports vs botpress) or used a
human-readable label (Get component set). Surfaced to int-1's
slug-freshness audit (Stream 4) via the inbox.
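
To make the "unresolved source" case concrete, here is a rough sketch of a resolver that returns a path only when the file exists under `apps/<slug>/actions/`. The function name and the rule that the file name is the lowercased action name are assumptions for illustration; the PR does not say how the generator maps an action to its file.

```python
from pathlib import Path
from typing import Optional


def resolve_action_source(mercury_root: Path, tool_slug: str, action: str) -> Optional[Path]:
    """Return the action's source file if it exists, else None."""
    # Assumption: the file is named after the lowercased action, e.g.
    # BOOQABLE_GET_PRODUCTS -> booqable_get_products.py.
    candidate = mercury_root / "apps" / tool_slug / "actions" / f"{action.lower()}.py"
    return candidate if candidate.is_file() else None
```

Under this scheme, both a mismatched slug (`api_sports` for a Botpress action) and a human-readable label (`Get component set`) simply fail the existence check, so the caller can fall back to the "Response model unavailable" panel.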

- `/Users/equinox/.worktrees/integrator/int-1/bug_logs/all_logs.json`
- `/tmp/bug_logs_html/log_<id>.html`
- `/tmp/ci_classifications_final.json` (filter `bug_classifications` where `category=="DOESNT_NEED_CI"` AND `doesnt_need_ci_reason=="NOT_A_FAILURE"`)
- `/Users/equinox/.worktrees/integrator/int-1/bug_logs/dashboard_backup_v6/bug_analysis.html`
- `/Users/equinox/.worktrees/integrator/int-1/bug_logs/orchestrator_brief.md`
- `bug_logs/apply_triage_verdicts.py` (this PR). Run `python3 bug_logs/apply_triage_verdicts.py --dry-run` to see the verdict→category mapping applied to a CSV without writing the output file.

Run before merge per the cleanup-class principle (fix half and prevention half ship together):

| Check | Result |
|---|---|
| Coverage: all 106 NOT_A_FAILURE bugs from `/tmp/ci_classifications_final.json` rendered | ✅ 106/106, including the 8 with no auto-extracted response model (rendered with the "Response model unavailable" fallback + manual GitHub link) |
| Hyperlink: TC → spreadsheet | ✅ present on all 106 cards |
| Hyperlink: log_id → per-log HTML | ✅ 101/106 link to a real on-disk file; 5 render as "(missing on disk)" plain-text markers (4 whose error_summary reads "Log data not found in ClickHouse export", plus the one "status=None, no error message" outlier). Fixed in this PR by checking existence at generation time, so no link is dead. |
| Hyperlink: action → mercury GitHub | ✅ 98/106 resolve to a real `apps/<slug>/actions/<file>.py` path; 8 unresolved (fallback "source not resolved"; these are mostly actions that have been renamed since the test cases were recorded, which is Stream 4's job to track down) |
| Cluster integrity (spot check) | ⚠️ Findings below |

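
The coverage check depends on selecting exactly the NOT_A_FAILURE bucket from the classifications file. A minimal sketch of that filter follows; it assumes `bug_classifications` is a list of dicts carrying `category` and `doesnt_need_ci_reason` keys (the field names come from this PR, the list-of-dicts shape and the function name are assumptions).

```python
def select_not_a_failure(classifications: dict) -> list:
    """Pick the bugs with category DOESNT_NEED_CI and reason NOT_A_FAILURE."""
    # Assumption: bug_classifications is a list of per-bug dicts.
    bugs = classifications.get("bug_classifications", [])
    return [
        bug
        for bug in bugs
        if bug.get("category") == "DOESNT_NEED_CI"
        and bug.get("doesnt_need_ci_reason") == "NOT_A_FAILURE"
    ]
```

Comparing `len(select_not_a_failure(data))` with the number of rendered cards is one way to reproduce the 106/106 figure after loading the file with `json.load`.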
Spot-checked the 4 largest clusters + a singleton. The 46 clusters come pre-baked from int-1's upstream classifier (group_key field on each bug); Stream 6 faithfully renders them but did not derive them.

| Cluster | Coherent? | Notes |
|---|---|---|
| chatbotkit_response_mismatch (10) | ✅ | Same tool/action × 10 test cases, same generic error |
| bamboohr_behavioral (9) | ❌ | Lumps 5 different actions; the 3 BAMBOOHR_CREATE_FILE_CATEGORY bugs report an XML-parse failure that looks like a real bug, not a behavioral judgment call |
| attio_missing_test_fields (6) | ✅ | Different actions but share "missing required fields" pattern (test-data issue) |
| NOCLUSTER:clickmeeting:GET_CONFERENCE_FILES (5) | ⚠️ | Same tool/action but mixed root causes (missing fields, 404, 200, log-data-not-found) |
| 22 singletons | ⚠️ | All NOCLUSTER:*; the upstream classifier punted on grouping for these |

This is an upstream-classifier finding, not a Stream 6 fix-half gap. The triage UI's per-bug verdict overrides are the right escape hatch for a mixed cluster (a reviewer can click through the cluster and verdict bugs differently). Surfaced to int-1 as a discovery line in the orchestrator inbox.
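
A spot check like this can be seeded mechanically. The sketch below lists clusters whose bugs span more than one action, which is how a cluster like bamboohr_behavioral would stand out. The `group_key` field is named in this PR; the `action` key and the function name are assumptions about the record shape.

```python
from collections import defaultdict


def find_multi_action_clusters(bugs: list) -> dict:
    """Map each group_key that covers more than one action to its sorted actions."""
    actions_by_cluster = defaultdict(set)
    for bug in bugs:
        # "action" is a hypothetical key name for the action identifier.
        actions_by_cluster[bug["group_key"]].add(bug["action"])
    return {
        key: sorted(actions)
        for key, actions in actions_by_cluster.items()
        if len(actions) > 1
    }
```

This is only a shortlist for human review: attio_missing_test_fields also spans several actions yet is coherent, so a hit here is a prompt to look, not a verdict.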
For a behavioural-judgment bug class like NOT_A_FAILURE there is no clean static lint that prevents the class — these are HTTP 200 responses where QA flagged a perceived mismatch, with no oracle for "right" without a spec. The prevention half therefore takes a different shape: a feedback loop that ensures every triaged verdict reaches the next iteration of the classifier, so a bug a human flips to NEEDS_CI doesn't keep showing up as NOT_A_FAILURE on every re-run.
bug_logs/apply_triage_verdicts.py is that loop:

    python3 bug_logs/apply_triage_verdicts.py \
        --classifications /tmp/ci_classifications_final.json \
        --csv bug_logs/behavioral_triage_results.csv \
        --out /tmp/ci_classifications_v7.json

| CSV verdict | Action on the bug | New category / doesnt_need_ci_reason |
|---|---|---|
| working_as_intended | Stays in NOT_A_FAILURE, gains a human_triage audit stamp | DOESNT_NEED_CI / NOT_A_FAILURE |
| real_bug | Reclassified to NEEDS_CI (lint design left to int-1's slate, proposed_lint=null) | NEEDS_CI / null |
| needs_investigation | Reclassified to the human-review bucket | DOESNT_NEED_CI / NEEDS_HUMAN |
| (blank) | Untouched (no-op) | unchanged |

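
The mapping in the table can be expressed as a small lookup plus one pure function. This is an illustrative sketch rather than the script's actual code: the function name, the shape of the `human_triage` stamp, and stamping every verdict (the table only states it for working_as_intended) are assumptions.

```python
VERDICT_TO_CLASSIFICATION = {
    "working_as_intended": ("DOESNT_NEED_CI", "NOT_A_FAILURE"),
    "real_bug": ("NEEDS_CI", None),
    "needs_investigation": ("DOESNT_NEED_CI", "NEEDS_HUMAN"),
}


def apply_verdict(bug: dict, verdict: str) -> dict:
    """Return a copy of bug with the verdict applied; blank verdicts are a no-op."""
    verdict = verdict.strip()
    if not verdict:
        return bug  # blank row: leave the bug untouched
    if verdict not in VERDICT_TO_CLASSIFICATION:
        raise ValueError(f"unknown verdict: {verdict!r}")
    category, reason = VERDICT_TO_CLASSIFICATION[verdict]
    updated = dict(bug)  # never mutate the caller's record
    updated["category"] = category
    updated["doesnt_need_ci_reason"] = reason
    if verdict == "real_bug":
        updated["proposed_lint"] = None  # lint design is left to int-1
    # Hypothetical stamp shape; the real audit fields are not listed in this PR.
    updated["human_triage"] = {"verdict": verdict}
    return updated
```

Returning a new dict instead of mutating in place keeps the input classifications intact, which makes a dry run and a real run share the same code path.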
Hard rules baked in:

- Only bugs currently classified as NOT_A_FAILURE are eligible; CSV rows targeting bugs from other buckets are reported in `skipped_off_scope` rather than silently rewritten.
- Output is written to `--out` (default: `ci_classifications_v7.json` next to the input).
- Each run is recorded in a `triage_runs` list on the output JSON.
- `--dry-run` prints the summary table without writing the output file.

| Path | What it is |
|---|---|
| bug_logs/build_behavioral_triage.py | Generator CLI. All paths overridable via CLI flags or env vars (CI_CLASSIFICATIONS, INTEGRATOR_BUG_LOGS, MERCURY_REPO, BUG_LOGS_HTML_DIR). Pure stdlib. |
| bug_logs/dashboard_backup_v6/behavioral_triage.html | Generated surface: 106 cards, 46 clusters, 424 verdict buttons. |
| bug_logs/behavioral_triage_results.csv | Blank CSV template with the canonical column order. |
| bug_logs/apply_triage_verdicts.py | Prevention-layer script. Reads the triage CSV, emits an updated ci_classifications_v7.json. |
| bug_logs/README.md | Generator usage, reviewer workflow, post-triage feedback flow, design notes. |

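
Because the generated page carries its data as JSON inside a `<script>` element, the serialised text must not contain `</` or `<!--`. A minimal sketch of one way to do that in pure stdlib follows; the function name is hypothetical and the generator's actual escaping may differ.

```python
import json


def embed_json_for_script(data) -> str:
    """Serialise data as JSON that is safe to place inside a <script> element."""
    text = json.dumps(data)
    # "\/" is a legal JSON escape for "/", so "</script>" becomes "<\/script>".
    text = text.replace("</", "<\\/")
    # "\u003c" is the JSON escape for "<", which defuses "<!--".
    text = text.replace("<!--", "\\u003c!--")
    return text
```

Both substitutions stay within the JSON grammar, so the browser side can still parse the payload with `JSON.parse` and get the original strings back.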

- Clicking a verdict updates `localStorage`, card class, pill text, TOC dot, and summary counters.
- Bulk verdict on a cluster works (`chatbotkit_response_mismatch` went reviewed=10/working=10 in one click).
- The `N` shortcut advances forward instead of jumping back.
- Embedded `<script>` JSON escapes `</` and `<!--` so a future data refresh containing `</script>` cannot break out.
- Ran `apply_triage_verdicts.py` against a synthetic CSV exercising all six code paths (kept_NOT_A_FAILURE, reclassified_NEEDS_CI, reclassified_NEEDS_HUMAN, skipped_blank, skipped_unknown_log_id, skipped_off_scope); every transformation correct.
- Re-running `apply_triage_verdicts.py` against its own output makes no further changes.
- `python3 bug_logs/build_behavioral_triage.py --mercury NONE --bug-logs-html-dir NONE` runs clean (response models + log links gracefully skipped).

🤖 Generated with Claude Code
make chk