PR Custody

feat(bug-triage): action slug freshness audit + fix-action lifecycle (streams 4+5)

@AgentWrapper · checks n/a · feat/stream4-5-slug-and-fix-action → next · 20 files · +8062 −0 · updated 2mo ago

Description

Summary

Ships Stream 4 (action-slug freshness audit) and Stream 5 (fix-action workflow lifecycle) from the bug-analysis orchestration brief, plus the class-completeness audit + prevention layers for both halves.

What this PR fixes — with evidence

This PR addresses two cleanup classes: (a) action-slug drift in the bug analysis pipeline (114 bugs across 48 analyzed slugs whose slugs no longer resolved to live actions in the prod API), and (b) the absence of any visibility into the fix-action workflow lifecycle for the 5 RUN_FIXER bugs the orchestrator earmarked. The bugs trace to specific production failures in our test runs; the evidence links below take a reviewer from PR → cluster → spreadsheet row → per-log HTML payloads.

Stream 4 — slug audit evidence

Source data: int-1/bug_logs/all_logs.json (1,259 ClickHouse log records, ~5MB) and dashboard_backup_v6/action_level_analysis.json (624 unique action slugs, 1,498 total bugs).

Scope: 579 of the 624 slugs were analyzed (the remainder lacked usable data). Of those:

519 already fresh in prod
60 stale → 46 auto-accepted mappings (33 by structural matcher + 13 human-verified) + 4 mis-attributed in source + 9 needs-human + 1 unprobeable

Bugs whose evidence links now resolve correctly after the in-place patches: 114 bugs across 48 analyzed slug rows (out of 1,498 total bugs in the corpus). Sample evidence:

canonical mapping	tool	sample log	sample TC	per-log HTML	rationale
`HEYGEN_ADD_NEW_ASSET` → `HEYGEN_UPLOAD_ASSET`	heygen	`log_FzZ6dgIu-fN_`	TC-435 Heygen	`/tmp/bug_logs_html/log_FzZ6dgIu-fN_.html`	verb ADD→UPLOAD rename; the only Stream 5 SUCCESS depended on this Stream 4 mapping
`MICROSOFT_TEAMS_TEAMS_LIST_PEOPLE` → `MICROSOFT_TEAMS_LIST_PEOPLE`	microsoft_teams	`log_fTpCIgalywbF`	TC_MICROSOFTTEAMS_014	`/tmp/bug_logs_html/log_fTpCIgalywbF.html`	duplicate-segment collapse
`TODOIST_DELETE_SECTION` → `TODOIST_DELETE_SECTION2`	todoist	`log_Jr3QAU3c-xUK`	TC_TODOIST_290	`/tmp/bug_logs_html/log_Jr3QAU3c-xUK.html`	trailing `2` suffix added (pattern: 4 todoist actions like this)
`BOOQABLE_GET_ORDERS` → `BOOQABLE_GET_ORDER`	booqable	`log_07FOUy8UFOOj`	—	`/tmp/bug_logs_html/log_07FOUy8UFOOj.html`	plural→singular rename
`CRAWL_API` → `BRIGHTDATA_CRAWL_API`	brightdata	—	TC_BRIGHTDATA-003	—	toolkit prefix added (pattern: 4 brightdata slugs missing the prefix)
`BOTPRESS_CREATE_CONVERSATION` (mis-attributed)	api_sports → botpress	`log_jBX-bbhdaKy1`, `log_dx2TZra1FW_J`	—	`/tmp/bug_logs_html/log_jBX-bbhdaKy1.html`	source-data quality bug — these BOTPRESS_* slugs are real, current actions under the `botpress` toolkit, but the analysis attributed them to `api_sports`
`DOPPLER_SECRETOPS_ENVIRONMENTS_CREATE` (unprobeable)	doppler_secretops	`log_TOO6Kg42Nk_f`	TC_DOPPLER_032	`/tmp/bug_logs_html/log_TOO6Kg42Nk_f.html`	toolkit exists in live catalog (29 tools) but lacks a connected account in our envs, so the validator short-circuits before action validation. Plausible candidate: `DOPPLER_ENVIRONMENTS_CREATE` under the newer `doppler` toolkit (62 tools, created 2026-01-10)

Spreadsheet rows: the test cases above are in the integrator bug-analysis sheet (filter by TC ID).

Canonical artifacts in this PR:

bug_logs/stale_action_slugs.json — the canonical mapping (46 auto-accepted + 4 mis-attributed + 9 needs-human + 1 unprobeable, with full disposition rationale)
bug_logs/stale_action_slugs_report.md — human-readable report
bug_logs/live_action_slugs_by_toolkit.json — raw probe cache (1,024 toolkits across the live catalog; 85 with bug data)
bug_logs/slug_patch_report.json — substitution audit trail with backup paths

Stream 5 — fix-action lifecycle evidence

Source data: the 5 RUN_FIXER bugs from int-1/bug_logs/dashboard_backup_v6/fixer_dive.html (10 bugs total in the dive; 5 had RUN_FIXER verdicts).

bug	tool	action	TC	log id	per-log HTML	workflow_id	terminal state	mercury PR	reviewer
#2	fireflies	`FIREFLIES_FETCH_AI_APP_OUTPUTS2`	TC_FF_001	`log_2MZj1WXiJ2wf`	`/tmp/bug_logs_html/log_2MZj1WXiJ2wf.html`	`2ukwrpfk`	FIX_REJECTED	—	`reject` (instruction described non-existent bug; useful negative signal)
#3	zendesk	`ZENDESK_CREATE_ZENDESK_USER`	TC_ZENDESK_064	`log_YwezMRkKTvcc`	`/tmp/bug_logs_html/log_YwezMRkKTvcc.html`	`dtn3qmpb`	FIX_REJECTED	—	`reject` (no code changes were made; useful negative signal about fix-brief generation quality)
#4	zoho_books	`ZOHO_BOOKS_CREATE_USER`	TC_ZOHOB_038	`log_l2u1Yr76yZTH`	`/tmp/bug_logs_html/log_l2u1Yr76yZTH.html`	`pcgrymzh`	FAILED_AUTH_ISSUE	—	n/a (env issue, not instruction defect)
#6	googlesheets	`GOOGLESHEETS_ADD_SHEET`	TC_38	`log_dRTD64JwvFSz`	`/tmp/bug_logs_html/log_dRTD64JwvFSz.html`	`3wt9r8l7`	STALLED_INDEFINITELY (8h+ silent at Reviewer step; will hit 36h TIMED_OUT budget by 2026-04-11T03:47Z)	—	n/a
#9	heygen	`HEYGEN_UPLOAD_ASSET` (rewritten from `HEYGEN_ADD_NEW_ASSET` via Stream 4)	TC-435 Heygen	`log_FzZ6dgIu-fN_`	`/tmp/bug_logs_html/log_FzZ6dgIu-fN_.html`	`83mxzcdu`	SUCCESS	#20702	`high` (the "questionable" one — test sent a `.txt` as a video — actually fixed by the agent inferring the right mimetype)

Cluster: these 5 bugs are the RUN_FIXER verdict cluster from fixer_dive.html (the 10-bug deep dive of FIXER_AGENT-classified bugs from /tmp/ci_classifications_final.json). Cluster size: 5 RUN_FIXER + 5 reclassified (DONT_RUN_FIXER: 2 NEEDS_HUMAN, 2 NOT_A_BUG, 1 TEST_AND_FIX). All 5 RUN_FIXER triggered, 0 missing, 0 extra.

Lifecycle artifacts in this PR:

bug_logs/fix_action_runs.json — schema 1.1 record per workflow with workflow_id, original_action, slug_rewritten, triggered_at, terminal_state_reached_at, mercury_branch_name, audit_state, full evidence fields
bug_logs/stream5_phase2_payloads/bug{3,4,6,9}_*.json — the exact instruction payloads sent to each workflow (audit trail)

Why this fix

Action slugs drift over time (e.g. FIREFLIES_FETCH_AI_APP_OUTPUTS → ..._OUTPUTS2, HEYGEN_ADD_NEW_ASSET → HEYGEN_UPLOAD_ASSET), and our analysis data accumulates stale references that break dashboard links and confuse remediation workflows. Stream 4 produces a canonical slug mapping (46 verified renames covering 114 stale-slug bug citations), patches the affected analysis files in-place, and ships a weekly cron that surfaces new drift within a week of it appearing. Stream 5 separately probes the fix-action workflow lifecycle by triggering the 5 RUN_FIXER bugs end-to-end: 1 SUCCESS (heygen → mercury #20702), 2 FIX_REJECTED (the reviewer correctly rejected bad proposals, which is a useful negative signal about fix-brief generation quality), 1 FAILED_AUTH_ISSUE, 1 STALLED_INDEFINITELY. Together with the new check_stalled_runs.py lifecycle monitor, we now have both retroactive remediation and forward-looking prevention for both classes.

Stream 4 — Action slug freshness audit

Walks every unique action slug from the 1,403-bug analysis and checks it against the live prod API.

579 slugs / 85 toolkits probed via POST /workflows/fix-action/run (using a deliberately bogus action_name + placeholder instruction). The Pydantic validator rejects the bogus name with a 422 whose message includes Available actions: [...], which is the live action list for that toolkit. A 200/201/202 would mean a real workflow was triggered, so the audit aborts on the first 2xx and never retries silently.
Final result: 519 fresh / 46 auto-accepted (33 matcher + 13 human-verified) / 4 mis-attributed in source / 9 needs-human / 1 unprobeable / 0 orphan.
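For reviewers, a minimal sketch of how one probe response could be classified under the rules above. It is illustrative, not the code in audit_slugs.py: the function name, the exception class, and the assumption that the 422 body embeds the list as a Python-style literal after `Available actions:` are all hypothetical.

```python
import ast
import re
from typing import List, Optional


class UnexpectedSuccess(RuntimeError):
    """Raised when a probe that should have been rejected returned 2xx."""


def parse_probe_response(status: int, body: str) -> Optional[List[str]]:
    """Return the live action slugs from a 422, or None if unprobeable."""
    if 200 <= status < 300:
        # A 2xx means a real workflow may have started: abort, never retry.
        raise UnexpectedSuccess(f"probe returned HTTP {status}; aborting sweep")
    if status != 422:
        # e.g. the validator short-circuits before action validation
        return None
    # Assumed message shape: "... Available actions: ['SLUG_A', 'SLUG_B']"
    match = re.search(r"Available actions:\s*(\[.*?\])", body, re.DOTALL)
    if match is None:
        return None
    return sorted(ast.literal_eval(match.group(1)))
```

Raising on any 2xx, rather than returning a status, keeps the abort impossible to ignore at the call site.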

5-rule structural matcher → noun-stemming + verb-compatible fuzzy fallback:

rule	example
prefix added	`CRAWL_API` → `BRIGHTDATA_CRAWL_API`
version suffix added	`TODOIST_DELETE_SECTION` → `TODOIST_DELETE_SECTION2`
version suffix stripped	`TODOIST_CREATE_COMMENT` → `TODOIST_CREATE_COMMENT_V1`
duplicate-segment collapsed	`MICROSOFT_TEAMS_TEAMS_LIST_PEOPLE` → `MICROSOFT_TEAMS_LIST_PEOPLE`
article removed	`WRIKE_CREATE_A_FOLDER` → `WRIKE_CREATE_FOLDER`
verb+noun preserving fuzzy	`BOOQABLE_GET_PRODUCTS` → `BOOQABLE_GET_PRODUCT`
human verified	`WRIKE_GET_GROUP_BY_ID` → `WRIKE_QUERY_SPECIFIC_GROUP` (and 12 more)

Verb-compatible pairs allow safe substitutions like GET ↔ FETCH, UPDATE ↔ PATCH ↔ MODIFY, ADD ↔ UPLOAD, but reject GET ↔ DELETE, CREATE ↔ LIST, etc.
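The rule table and the verb-compatibility check can be sketched roughly as below. This is a simplified illustration, not the shipped matcher: it assumes the toolkit name upper-cases to the slug prefix, that the verb is the second `_`-separated token, and that "version suffix added" means a trailing `2`. It would not, for example, recover `HEYGEN_ADD_NEW_ASSET` → `HEYGEN_UPLOAD_ASSET`, where the token counts differ.

```python
import re

# Assumed equivalence groups, built from the verb pairs listed above.
VERB_GROUPS = [{"GET", "FETCH"}, {"UPDATE", "PATCH", "MODIFY"}, {"ADD", "UPLOAD"}]
ARTICLES = {"A", "AN", "THE"}


def verbs_compatible(a, b):
    """True if two verbs are identical or share an equivalence group."""
    return a == b or any(a in group and b in group for group in VERB_GROUPS)


def stem(noun):
    """Crude singularisation: drop one trailing S from longer tokens."""
    return noun[:-1] if len(noun) > 3 and noun.endswith("S") else noun


def structural_candidates(stale, toolkit):
    """Yield (rule, candidate) for each structural rule in the table."""
    parts = stale.split("_")
    collapsed = [p for i, p in enumerate(parts) if i == 0 or p != parts[i - 1]]
    yield "prefix added", f"{toolkit.upper()}_{stale}"
    yield "version suffix added", f"{stale}2"
    yield "version suffix stripped", re.sub(r"(_V\d+|\d+)$", "", stale)
    yield "duplicate-segment collapsed", "_".join(collapsed)
    yield "article removed", "_".join(p for p in parts if p not in ARTICLES)


def fuzzy_candidate(stale, live):
    """Fallback: same toolkit token, compatible verb, same stemmed nouns."""
    s = stale.split("_")
    if len(s) < 3:
        return None
    for cand in sorted(live):
        c = cand.split("_")
        if len(c) != len(s) or c[0] != s[0]:
            continue
        if not verbs_compatible(s[1], c[1]):
            continue
        if [stem(x) for x in s[2:]] == [stem(x) for x in c[2:]]:
            return cand
    return None


def match_slug(stale, toolkit, live):
    """Return (rule, live_slug) for the first rule that hits, else None."""
    for rule, cand in structural_candidates(stale, toolkit):
        if cand != stale and cand in live:
            return rule, cand
    cand = fuzzy_candidate(stale, live)
    return ("verb+noun preserving fuzzy", cand) if cand else None
```

Trying the structural rules before the fuzzy fallback means a match always carries the most specific rule name available, which is what makes the disposition rationale auditable.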

Patches applied (in-place against int-1's worktree, originals archived under int-4/bug_logs/backups/):

dashboard_backup_v6/action_level_analysis.json — 83 substitutions
dashboard_backup_v6/log_id_only_analysis.json — 39 substitutions
dashboard_backup_v6/remarks_bugs_analysis.json — 132 substitutions
dashboard_backup_v6/bug_analysis.html — 96 substitutions
dashboard_backup_v6/fixer_dive.html — 7 substitutions + run-status injection
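One way to make such substitutions safe to re-run is to match slugs only as whole tokens. The sketch below is an assumption about the approach, not a copy of apply_slug_patches.py; `patch_text` is a hypothetical name, and file I/O and backup handling are left out.

```python
from __future__ import annotations

import re


def patch_text(text: str, mapping: dict[str, str]) -> tuple[str, int]:
    """Replace each stale slug with its live slug; return (text, count)."""
    total = 0
    for stale, live in mapping.items():
        # Match the slug only when it is not embedded in a longer slug, so
        # TODOIST_DELETE_SECTION never matches inside TODOIST_DELETE_SECTION2.
        pattern = re.compile(rf"(?<![A-Z0-9_]){re.escape(stale)}(?![A-Z0-9_])")
        text, n = pattern.subn(live, text)
        total += n
    return text, total
```

With the boundary guards, a second pass over already-patched text finds nothing to replace, even for mappings where the live slug contains the stale one (suffix added, prefix added). A chained mapping (A → B and B → C in the same run) would still need separate handling.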

Stream 4 class-completeness audit

check	result
Doppler "orphan" diagnosis	Reclassified as `unprobeable_no_connected_account`. The toolkit exists in the live catalog (29 tools, created 2025-10-17) but has no connected account in our prod/stg envs, so the validator short-circuits with HTTP 404 before reaching action validation. There is also a newer `doppler` toolkit (62 tools) with a `DOPPLER_ENVIRONMENTS_CREATE` candidate. Both are recorded.
Full live-catalog comparison	Fetched the entire prod catalog (1,024 toolkits via `dashboard/postman-dashboard/list-toolkits`, paginated). All 85 toolkits referenced by the bug analysis are present in the live catalog (0 missing). The remaining 939 live toolkits aren't in the bug data — out of scope for this audit, in scope for the weekly drift check below.
26 needs-review entries dispositioned	All 26 entries now have explicit verdicts: 13 human-verified (added to auto-accepted, confidence 1.0), 4 mis-attributed in source (`BOTPRESS_*` slugs incorrectly attributed to `api_sports`; they're real, current actions under the `botpress` toolkit), 9 needs-human with a specific category (`action_removed`, `action_split`, `renamed_uncertain`).

Stream 4 prevention layer

bug_logs/check_slug_drift.py re-probes every toolkit in the committed baseline and surfaces three drift categories (new_actions, removed_actions, unprobeable). Reuses the same 2xx-abort safety guard as the audit.

.github/workflows/slug-drift-check.yml wires this into a weekly cron (Sunday 06:00 UTC). On drift, the workflow opens a bug-triage issue with the diff. On a safety abort (exit 2 — validator returned 2xx, baseline missing, etc.), it opens a separate priority-high issue. The final step fails on any non-zero exit code so neither drift nor a safety abort can pass silently.
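The three drift categories reduce to a per-toolkit set difference against the baseline. A minimal sketch follows, assuming the baseline and the fresh probe are both dicts of toolkit → action list and that an unprobeable toolkit is represented as `None`. The exit code of 1 for drift is also an assumption; the description above only fixes exit 2 for a safety abort.

```python
def compute_drift(baseline, live):
    """Compare live action lists to the committed baseline, per toolkit."""
    drift = {"new_actions": {}, "removed_actions": {}, "unprobeable": []}
    for toolkit, known in sorted(baseline.items()):
        current = live.get(toolkit)
        if current is None:
            # Probe could not reach action validation for this toolkit.
            drift["unprobeable"].append(toolkit)
            continue
        added = sorted(set(current) - set(known))
        removed = sorted(set(known) - set(current))
        if added:
            drift["new_actions"][toolkit] = added
        if removed:
            drift["removed_actions"][toolkit] = removed
    return drift


def exit_code(drift, safety_abort=False):
    """0 = clean, 1 = drift found (assumed), 2 = safety abort."""
    if safety_abort:
        return 2
    return 1 if any(drift.values()) else 0
```

Keeping the safety abort on its own exit code is what lets the workflow route it to a separate priority-high issue instead of the regular drift issue.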

Stream 5 — Fix-action workflow lifecycle

Phase 1 — fireflies monitoring

Workflow 2ukwrpfk already reached terminal state before handoff:

FIREFLIES_FETCH_AI_APP_OUTPUTS2 — FIX_REJECTED (reviewer_score=reject)
Reviewer feedback: the current query contains neither ai_filters nor sentences — the instruction described a non-existent bug. Likely meant get_transcript_by_id instead.

Phase 2 — Triggered the 4 remaining RUN_FIXER bugs

Used the Stream 4 mapping to substitute HEYGEN_ADD_NEW_ASSET → HEYGEN_UPLOAD_ASSET. All instruction payloads follow the fireflies template (Core issue / Code pointers / Evidence / Our understanding / Open questions).

bug	tool	action	workflow	state	PR
#2	fireflies	FIREFLIES_FETCH_AI_APP_OUTPUTS2	2ukwrpfk	FIX_REJECTED	—
#3	zendesk	ZENDESK_CREATE_ZENDESK_USER	dtn3qmpb	FIX_REJECTED	—
#4	zoho_books	ZOHO_BOOKS_CREATE_USER	pcgrymzh	FAILED_AUTH_ISSUE	—
#6	googlesheets	GOOGLESHEETS_ADD_SHEET	3wt9r8l7	STALLED_INDEFINITELY (audit-side; execution_state still STARTED)	—
#9	heygen	HEYGEN_UPLOAD_ASSET	83mxzcdu	SUCCESS	#20702 (reviewer score: high)

The "questionable" heygen one (test sent a .txt as a video) actually succeeded with a high reviewer score, vindicating the slug rewrite from Stream 4.

Stream 5 class-completeness audit

check	result
RUN_FIXER coverage	All 5 `RUN_FIXER` bugs in `fixer_dive.html` (idx {2, 3, 4, 6, 9}) were triggered. 0 missing, 0 extra.
Schema completeness	`fix_action_runs.json` normalised to schema 1.1: every entry now has `workflow_id` (as a field), `original_action`, `slug_rewritten` boolean, `triggered_at`, `terminal_state_reached_at`, `mercury_branch_name` (parsed from run_log), `audit_state`, `tc_id`, `log_id`, `file`, `task_arn`. Fireflies metadata backfilled from the brief.
googlesheets re-poll	Re-polled `3wt9r8l7` once. Still STARTED with `updated_at` unchanged at `2026-04-09T15:56:33Z` (8h+ silent at the Reviewer step). Marked `audit_state: STALLED_INDEFINITELY` with `stalled_since` timestamp. The actual `execution_state` field still reads `STARTED` because we never received a state-change event from the DB; the new `audit_state` is the int-4 audit's authoritative classification. The workflow will hit the 36-hour `TIMED_OUT` budget by `2026-04-11T03:47Z` if it doesn't recover.

Stream 5 prevention layer

bug_logs/check_stalled_runs.py walks fix_action_runs.json and flags any non-terminal workflow whose updated_at hasn't moved in --threshold-hours (default 6). Already-classified STALLED_INDEFINITELY entries are acknowledged separately so a stuck run doesn't keep alerting forever. Exits 1 on any stall — intended to be wired into either the existing poller loop or a small standalone alert script.
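The stall check can be sketched as follows. The field names (`workflow_id`, `execution_state`, `updated_at`, `audit_state`) come from the schema described in this PR; the set of terminal states and the function name are assumptions, and the exit-code policy is left to the caller.

```python
from datetime import datetime, timedelta, timezone

# Assumed terminal states, taken from the states reported in this PR.
TERMINAL_STATES = {"SUCCESS", "FIX_REJECTED", "FAILED_AUTH_ISSUE", "TIMED_OUT"}


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as 2026-04-09T15:56:33Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_stalls(runs, now, threshold_hours=6.0):
    """Split runs into newly stalled and already-acknowledged workflow ids."""
    new_stalls, acknowledged = [], []
    cutoff = timedelta(hours=threshold_hours)
    for run in runs:
        if run.get("audit_state") == "STALLED_INDEFINITELY":
            # Already classified by the audit: report separately, do not re-alert.
            acknowledged.append(run["workflow_id"])
        elif (
            run["execution_state"] not in TERMINAL_STATES
            and now - parse_ts(run["updated_at"]) > cutoff
        ):
            new_stalls.append(run["workflow_id"])
    return new_stalls, acknowledged
```

A caller would pass `datetime.now(timezone.utc)` as `now`; taking it as a parameter keeps the check deterministic in tests.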

Tooling shipped under `bug_logs/`

script	purpose
`audit_slugs.py`	Stream 4 probe sweep (raw fuzzy matcher)
`refine_stale_mapping.py`	Stream 4 second-pass smarter matcher
`finalize_stale_mapping.py`	Stream 4 third-pass: human-verified verdicts for the 26 needs-review entries
`apply_slug_patches.py`	In-place dashboard/JSON patcher with backups
`check_slug_drift.py`	Weekly drift detector (prevention layer)
`trigger_stream5_phase2.py`	Builds + POSTs the 4 fixer payloads
`poll_fix_action_runs.py`	Polls all 5 workflows to terminal state
`check_stalled_runs.py`	Stall detector (prevention layer)
`update_fixer_dive_status.py`	Injects "Run Status" cards into `fixer_dive.html`

Out of scope (followups)

Stale toolkit names are a separate class from stale action slugs. Both doppler_secretops (apparently superseded by the newer doppler toolkit) and the BOTPRESS-in-api_sports rows are examples: the slug audit only compares per-toolkit action lists and does not detect toolkit-level renames. Recorded as a discovery in the int-4 inbox for either a Stream 8 expansion or a separate follow-up task.
The remaining 9 needs-human entries (action_removed, action_split, renamed_uncertain) need product-side decisions, not pattern matching.
The 939 live toolkits outside the bug analysis scope are now monitored by the weekly drift cron but not actively audited — they'll surface as drift alerts when they actually change.

Hard rules honoured

✅ No mcp__rube__* tools — stdlib urllib only.
✅ Validation sweep never permitted a 2xx from fix-action/run (200/201/202 all treated as fatal abort).
✅ All destructive edits backed up to bug_logs/backups/ (not committed; 4.6 MB) before any modification.
✅ Coordination logged to int-1 bug_logs/orchestrator_log.md.
✅ Cleanup-class principle applied: every fix half ships with its prevention half (slug-drift cron for Stream 4, stall detector for Stream 5).
✅ Evidence-section principle applied: every cleanup PR links to specific log IDs, TC IDs, spreadsheet rows, and per-log HTML pages.

Test plan

Stream 4 patcher idempotent: re-running apply_slug_patches.py after the probe sweep is a no-op
All 4 Phase-2 fix-action POSTs returned HTTP 200 with valid workflow_ids
Poller correctly transitions PENDING → STARTED → terminal states for fireflies, zendesk, zoho_books, heygen
All 26 needs-review entries dispositioned with explicit verdicts
Prevention layers smoke-tested: check_stalled_runs.py correctly detects the googlesheets stall; check_slug_drift.py parses + executes
CI pre-flight: make chk is a no-op for this PR (no app_tester/ files touched)
Final googlesheets terminal state captured in fix_action_runs.json (deferred — re-poll required after 36h TIMED_OUT budget)

🤖 Generated with Claude Code
