PR Custody

fix: reduce agent context bloat from oversized tool results

@venkat82 · checks n/a · fix/reduce-agent-context-bloat → next · 9 files · +64 −27 · updated 3mo ago

▸Description

Why

Analysis of a comprehensive_test_fix_agent Datadog trace (exec 95feb68, workflow 5mdz2dp7, cost $1.39 across 33 turns) revealed that 76% of context tokens came from oversized tool results. 42% of the total cost was cache reads of accumulated context — meaning the agent was repeatedly re-reading bloated tool results turn after turn.

Three specific patterns accounted for the bulk of wasted context:

1. `execute_parent_action` returned full raw API responses

A single execute_parent_action call returned 26.6K chars when only ~1.5K was useful. The return value included the full ExecuteParentActionResponse object (with raw SharePoint API payloads, download URLs, auth tokens) plus all historical successful responses from the test-and-fix phase. The agent only ever uses the result_summary (already capped at 2000 chars and LLM-generated to contain extracted IDs and status).

2. `execute_parent_action_with_params` had the same bloat

Both tool_provider wrappers returned {"response": ..., "additional_responses": [...]} where additional_responses contained raw execution logs from every successful test-and-fix attempt. This data is never re-read after being sent to Claude — state tracking happens separately via state.successful_responses.
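As a rough illustration of the size difference between the old wrapper return and a slimmed one, the sketch below builds a stand-in for the old shape and reduces it to the two fields the agent reads. The helper name `slim_tool_result`, the defensive truncation, and the placeholder payload contents are assumptions made for the example; only the key names `response`, `additional_responses`, `status`, and `result_summary` come from this description.

```python
import json
from typing import Any

MAX_SUMMARY_CHARS = 2000  # cap quoted in this PR description


def slim_tool_result(status: str, result_summary: str) -> dict[str, Any]:
    """Hypothetical helper: keep only the fields the agent actually reads."""
    # The summary is reportedly capped upstream; truncating again is a defensive assumption.
    return {"status": status, "result_summary": result_summary[:MAX_SUMMARY_CHARS]}


# Stand-in for the old return shape. The payload contents are placeholders,
# not the real ExecuteParentActionResponse fields.
old_result = {
    "response": {"raw_payload": "x" * 20_000, "result_summary": "Listed 3 drives."},
    "additional_responses": [{"log": "y" * 3_000} for _ in range(2)],
}

new_result = slim_tool_result("success", old_result["response"]["result_summary"])

# Compare serialized sizes, which approximate what lands in the agent's context.
print(len(json.dumps(old_result)), "->", len(json.dumps(new_result)))
```

Because every later turn re-reads the accumulated context, the saving applies on each subsequent turn rather than once.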

3. "Action not found" errors dumped entire slug lists

When an action name wasn't found, the error messages included the complete list of all action slugs for the app (e.g., one_drive has 30K+ chars of action names). This happened twice in the analyzed trace, adding ~60K chars of dead context. The agent doesn't parse these lists — it just needs a hint about similar names.

What

Slim execute_parent_action return to {"status": ..., "result_summary": ...} — drops raw API payloads and historical responses
Drop additional_responses from both execute_parent_action_with_params wrapper returns in test_and_fix_agent/tool_provider.py and execute_parent_action/tool_provider.py
Add suggest_similar_slugs() helper to cortex/common/mercury_utils.py using difflib.get_close_matches (stdlib, already used in codebase)
Replace full slug dumps with top-5 fuzzy-match suggestions in all 5 error locations across run_action, run_trigger, test_and_fix_agent/tool.py, and runner_helper.py
Update test assertions to match new error message format
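
A minimal sketch of what the slug-suggestion helper could look like, assuming a signature of `suggest_similar_slugs(name, slugs, n=5, cutoff=0.4)`. The actual signature in cortex/common/mercury_utils.py is not shown in this description, and both the `format_not_found` helper and the error message wording are hypothetical. The slug list is illustrative.

```python
import difflib
from typing import Iterable


def suggest_similar_slugs(
    name: str, slugs: Iterable[str], n: int = 5, cutoff: float = 0.4
) -> list[str]:
    """Return up to n slugs resembling name, best match first (assumed signature)."""
    return difflib.get_close_matches(name, list(slugs), n=n, cutoff=cutoff)


def format_not_found(name: str, slugs: Iterable[str]) -> str:
    """Hypothetical error formatter: a short hint instead of the full slug list."""
    suggestions = suggest_similar_slugs(name, slugs)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    return f"Action '{name}' not found.{hint}"


# Illustrative slugs only; a real app may expose hundreds of these.
SLUGS = [
    "ONE_DRIVE_LIST_DRIVES",
    "ONE_DRIVE_LIST_FILES",
    "ONE_DRIVE_UPLOAD_FILE",
    "ONE_DRIVE_DELETE_ITEM",
    "ONE_DRIVE_GET_ITEM",
    "ONE_DRIVE_CREATE_FOLDER",
    "ONE_DRIVE_SEARCH_ITEMS",
]

print(format_not_found("LIST_DRIVES", SLUGS))
```

The message length is bounded by `n` slugs regardless of how many actions the app defines, which is the property that keeps these errors out of the context budget.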

How to Test

make fmt && make chk
uv run pytest cortex/tests/test_tools/test_run_action.py cortex/tests/test_tools/test_run_trigger.py -v

All 28 tests pass (1 skipped), lint and type checks pass.

Pre-Review Checklist

I have self-reviewed this PR

Notes

All return values changed are fire-and-forget to Claude — no downstream code reads them after the tool result is sent. Verified by tracing all callers.
The result_summary field is already capped at 2000 chars and LLM-generated by the ExecuteParentActionRunner to contain all essential information (IDs, parameters used, execution outcome).
The fuzzy matching uses cutoff=0.4 which is intentionally loose to catch prefix mismatches like LIST_DRIVES → ONE_DRIVE_LIST_DRIVES.
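
To make the cutoff concrete, the sketch below computes the similarity score that `difflib.get_close_matches` uses (`SequenceMatcher.ratio()`, i.e. twice the matched characters divided by the combined length) for the example above. It also shows a longer-prefix case using `MICROSOFT_TEAMS_LIST_DRIVES`, which is a hypothetical slug invented for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

typed = "LIST_DRIVES"
actual = "ONE_DRIVE_LIST_DRIVES"

# All 11 typed characters match as one block: 2 * 11 / (21 + 11).
score = SequenceMatcher(None, actual, typed).ratio()
print(f"{score:.4f}")  # 0.6875

# Hypothetical slug with a 16-character prefix: 2 * 11 / (27 + 11), about 0.579.
long_slug = "MICROSOFT_TEAMS_LIST_DRIVES"
long_score = SequenceMatcher(None, long_slug, typed).ratio()

# difflib's default cutoff is 0.6, so the longer prefix is rejected there
# but still accepted at cutoff=0.4.
print(get_close_matches(typed, [long_slug]))              # []
print(get_close_matches(typed, [long_slug], cutoff=0.4))  # ['MICROSOFT_TEAMS_LIST_DRIVES']
```

The score falls as the missing prefix grows, so a looser cutoff matters most for apps with long slug prefixes.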

🤖 Generated with Claude Code
