
chore(integration-tests): add setup_mercury diagnostics + bump workflow timeout

@AgentWrapper · checks n/a · feat/investigate-integration-test-timeouts-in-tests-integration-t → next · 2 files · +300 −28 · updated 2mo ago


Description

Why

The simple-tracing integration tests have been failing in CI for ~1 week with:

Failed: Workflow did not complete within 120s

The 120s timeout is misleading. The runner-side logs from a failing run (example) show the runner container exiting with code 1 after ~2s, inside infra/setup_mercury.py:

WORKFLOW_CONFIG provided - using environment variable
No INTEGRATOR_BRANCH specified, skipping integrator setup
Found pre-built Mercury at /tmp/mercury
Already on branch master, pulling latest changes...
Error: Failed to pull latest changes from master

The test then waits the full 120s for a DB state transition that will never come — cortex.workflow_dispatcher (which writes terminal state to cortex_execution) is never reached.
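To make the failure mode concrete, here is a minimal sketch of a poll-with-deadline wait of the kind the test presumably uses. The function name, the state names, and the fetch_state callback are hypothetical and not taken from the test suite; only the error message mirrors the one quoted above. If the runner dies before anything writes a terminal state, the loop can only end by hitting the deadline.

```python
import time
from typing import Callable, Optional

# Hypothetical terminal state names; the real values live in cortex_execution.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_terminal_state(
    fetch_state: Callable[[], Optional[str]],
    timeout_s: float,
    poll_interval_s: float = 1.0,
) -> str:
    """Poll fetch_state() until it returns a terminal state or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval_s)
    # Reached only when no terminal state was ever observed, for example
    # because the runner exited before the dispatcher could write one.
    raise TimeoutError(f"Workflow did not complete within {timeout_s:g}s")
```

With a fetch_state that never returns a terminal state, the call always costs the full timeout_s, regardless of how quickly the runner itself failed.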

The captured stderr from the failing git pull was being silently dropped by the wrapper try/except, so we can't tell which subprocess call failed or why. This PR doesn't fix the underlying bug — it adds enough diagnostics to root-cause it from the next CI run.

Why only 3 of 23 integration tests are failing

Tests that don't spawn the runner (8× test_agent_traces.py, 9× test_docker_entrypoint.py, test_health_endpoint) → pass
Tests that spawn the runner but don't wait for terminal DB state (test_api_creates_execution_record, test_poller_service_starts_without_import_errors) → pass
Tests that do wait for terminal DB state (test_full_workflow_execution_and_state_transitions, test_workflow_with_custom_branch, test_workflow_updates_state_transitions) → fail

All of these tests run with the same skip_agent_execution=True config; the difference is whether the test depends on the runner getting past setup_mercury.py.

What

infra/setup_mercury.py — diagnostic-only:

Log captured stdout/stderr/exit-code on every git pull (was silently captured + discarded).
Leave a [setup_mercury] step: <name> breadcrumb before each subprocess call so we know which one fails even if its output is lost.
New _diagnostic_dump() helper prints sanitised runtime context: GITHUB_ACCESS_TOKEN presence and length (never the value), git version, mercury HEAD, submodule config, and protos remote URL. Called before the pull and again after a failure.
Print full traceback from main() so we see the underlying CalledProcessError chain, not just the wrapper RuntimeError message.
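The breadcrumb, output logging, and traceback items above could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: run_step, pull_latest, and the exact log wording are hypothetical, and only the [setup_mercury] prefix, the /tmp/mercury path, and the error message come from the description and the log excerpt.

```python
import subprocess
import sys
import traceback
from typing import Optional, Sequence

PREFIX = "[setup_mercury]"


def run_step(
    name: str, cmd: Sequence[str], cwd: Optional[str] = None
) -> subprocess.CompletedProcess:
    """Run one subprocess call, with a breadcrumb before and full output after."""
    # The breadcrumb is printed first so the failing step is identifiable
    # even if the process output never makes it into the logs.
    print(f"{PREFIX} step: {name}", flush=True)
    result = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
    print(f"{PREFIX} {name}: exit code {result.returncode}", flush=True)
    if result.stdout:
        print(f"{PREFIX} {name} stdout:\n{result.stdout.rstrip()}", flush=True)
    if result.stderr:
        print(
            f"{PREFIX} {name} stderr:\n{result.stderr.rstrip()}",
            file=sys.stderr,
            flush=True,
        )
    # Raises CalledProcessError (carrying stdout and stderr) on non-zero exit.
    result.check_returncode()
    return result


def pull_latest(repo_dir: str) -> None:
    """Hypothetical wrapper; 'from exc' keeps the original error in the chain."""
    try:
        run_step("git pull", ["git", "pull"], cwd=repo_dir)
    except subprocess.CalledProcessError as exc:
        raise RuntimeError("Failed to pull latest changes from master") from exc


def main() -> int:
    try:
        pull_latest("/tmp/mercury")  # path taken from the log excerpt above
    except Exception:
        # Prints the RuntimeError and the CalledProcessError that caused it.
        traceback.print_exc()
        return 1
    return 0
```

Chaining with "from exc" is what lets traceback.print_exc() show the underlying CalledProcessError alongside the wrapper's RuntimeError instead of the wrapper message alone.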

tests/integration/test_simple_tracing_workflow.py — small behavioral change:

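Bump WORKFLOW_TIMEOUT 120s → 600s so legitimately slow runs (cold caches, custom-branch checkout of 47K files) have headroom. This does not help the fast-failing case: while the runner still crashes, each attempt waits out the full timeout, so the 6 reruns that burned 12 minutes of CI wall time at 120s will take longer at 600s until the underlying bug is fixed.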

What this PR does NOT do

This is diagnostic-only. The actual root-cause fix (whatever's wrong with git pull/submodule setup at runtime) is intentionally deferred to a follow-up PR once the next CI run shows us:

Which step in the master-pull path fails (_configure_submodule_auth, _set_submodule_remote, git pull, _update_submodules, …)
The actual git stderr explaining why
Whether GITHUB_ACCESS_TOKEN is present at runtime and well-formed
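For the token question in particular, one way to report presence and length without leaking the value, and to scrub credentials from remote URLs before logging them, is sketched below. Both helper names are hypothetical and this is not the PR's _diagnostic_dump() implementation; treating surrounding whitespace as a sign of a malformed token is an assumption made for illustration.

```python
import os
import re

# Matches the userinfo part of an http(s) URL, e.g. https://user:secret@host/...
_CREDENTIALS_RE = re.compile(r"(https?://)[^/@\s]+@")


def sanitise_url(text: str) -> str:
    """Replace credentials embedded in http(s) URLs with a placeholder."""
    return _CREDENTIALS_RE.sub(r"\1***@", text)


def describe_token(env_var: str = "GITHUB_ACCESS_TOKEN") -> str:
    """Report whether the token is set and how long it is, never its value."""
    value = os.environ.get(env_var)
    if value is None:
        return f"{env_var}: not set"
    # Assumption: stray whitespace (e.g. a trailing newline from a secret
    # file) is one plausible way for a token to be present but malformed.
    note = "" if value == value.strip() else " (has surrounding whitespace)"
    return f"{env_var}: present, length {len(value)}{note}"
```

A remote such as https://x-access-token:abc123@github.com/org/protos.git would be logged as https://***@github.com/org/protos.git, while URLs without embedded credentials pass through unchanged.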

Test plan

CI run completes (the timeout bump alone won't make tests pass — the runner still crashes — but the runner-side logs in the run output should now contain enough info to root-cause the failure)
Follow-up PR with the actual fix once we see the diagnostics

🤖 Generated with Claude Code
