PR Custody

fix(create_app): handle Tagger/ScopeMapper failures gracefully

@AgentWrapperchecks n/achecks…feat/INT-1439 → next1 files · +48 −3updated 3mo ago

▸Description · 1 noise

Why

Production create-app workflows (e.g., 8jacktro/microsoft_teams, 1fpk8sx7/outlook) are failing with "Control request timeout: initialize" despite 91-95% of actions passing test-and-fix. No PRs are created because the Tagger/ScopeMapper exception propagates and fails the entire workflow.

The root cause is that ThreadPoolExecutor(max_workers=2) runs Tagger and ScopeMapper in parallel, and ScopeMapper spawns up to 40 parallel Claude agents for scope discovery. The concurrent Claude SDK client initializations cause timeout errors, and the unprotected .result() calls propagate the exception.

What

Wrap tagger_future.result() and scope_future.result() in try/except blocks
On failure, log the error, record it in run_summary, and default to empty tags/scopes
Add imports for TagClassifierResponse and MapActionScopesResponse to construct fallback objects
PR creation proceeds normally since empty metadata is a no-op

How to Test

make fmt && make chk — ruff passes (pyrefly failure is pre-existing)
uv run pytest cortex/tests/test_workflows/ -v — 2/2 pass
The fix is safe because:
- Empty tags_by_action / scopes_by_action → empty metadata dict → batch_update_action_metadata() is a no-op
- success = len(passed_actions) > 0 is unaffected
- PR creation proceeds normally

Pre-Review Checklist

I have self-reviewed this PR

Notes

Tagger and ScopeMapper are non-critical metadata enrichment (tags + OAuth scopes). Actions are already built, tested, and passing at this point.
A follow-up could add retry logic or stagger the parallel agents, but graceful degradation is the immediate priority.

🤖 Generated with Claude Code

▸ 1 bot/status comment hidden

@linear[bot]3mo ago

Context

Samvit reported in Slack thread that PRs are not getting created for create-app runs on AWS.

Workflow IDs to investigate: 8jacktro, 1fpk8sx7

Error: "Control request timeout initialise" — happening across both workflows.

What to do

Query staging DB for these workflow IDs to get execution state and error details
Check Datadog logs for these specific runs
Check ECS task logs — could be task timeout, OOM, or connection init failure
Determine if this is infra config (timeout too short, memory limits) or a code bug
Fix the root cause

Slack screenshot: file F0AGK8PC86P in channel C08MANG1LAW
This is separate from the auth/naming issues fixed in PRs #1248, #1249, #1250

loading diff…

fix(create_app): handle Tagger/ScopeMapper failures gracefully

Why

What

How to Test

Pre-Review Checklist

Notes

Context

What to do

Related

Why

What

How to Test

Pre-Review Checklist

Notes

Context

What to do

Related