Analysis of a comprehensive_test_fix_agent Datadog trace (exec 95feb68, workflow 5mdz2dp7, cost $1.39 across 33 turns) revealed that 76% of context tokens came from oversized tool results. 42% of the total cost was cache reads of accumulated context — meaning the agent was repeatedly re-reading bloated tool results turn after turn.
Three specific patterns accounted for the bulk of wasted context:
execute_parent_action returned full raw API responses

A single execute_parent_action call returned 26.6K chars when only ~1.5K was useful. The return value included the full ExecuteParentActionResponse object (with raw SharePoint API payloads, download URLs, auth tokens) plus all historical successful responses from the test-and-fix phase. The agent only ever uses the result_summary (already capped at 2000 chars and LLM-generated to contain extracted IDs and status).
execute_parent_action_with_params had the same bloat

Both tool_provider wrappers returned {"response": ..., "additional_responses": [...]}, where additional_responses contained raw execution logs from every successful test-and-fix attempt. This data is never re-read after being sent to Claude; state tracking happens separately via state.successful_responses.
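The trimmed return shape can be sketched as follows. This is a minimal illustration, not the code in this PR: `trim_tool_result` is a hypothetical name, and the assumption that `status` and `result_summary` sit directly under `"response"` is made only for the example.

```python
from typing import Any

MAX_SUMMARY_CHARS = 2000  # cap on result_summary stated in this PR


def trim_tool_result(raw: dict[str, Any]) -> dict[str, str]:
    """Keep only the fields the agent reads (hypothetical helper).

    Assumes the wrapper result looks like
    {"response": {...}, "additional_responses": [...]}.
    """
    response = raw.get("response", {})
    summary = str(response.get("result_summary", ""))
    return {
        "status": str(response.get("status", "unknown")),
        # Defensive re-cap; the summary should already be within the limit.
        "result_summary": summary[:MAX_SUMMARY_CHARS],
    }


# Made-up payload standing in for an oversized tool result.
raw_result = {
    "response": {
        "status": "success",
        "result_summary": "Created folder id=abc123 in drive d1",
        "raw_payload": {"body": "x" * 20_000},
    },
    "additional_responses": [{"log": "y" * 5_000}],
}

trimmed = trim_tool_result(raw_result)
print(len(str(raw_result)), "->", len(str(trimmed)))
```

Building the result from an allow-list of fields, rather than deleting known-bad keys, means a new field in the upstream response cannot leak into the agent's context by default.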
Action-not-found errors listed every action slug

When an action name wasn't found, the error message included the complete list of action slugs for the app (e.g., one_drive has 30K+ chars of action names). This happened twice in the analyzed trace, adding ~60K chars of dead context. The agent doesn't parse these lists; it just needs a hint about similar names.
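A hint-only error message along these lines can be built on `difflib.get_close_matches`. The sketch below is an assumption-laden illustration rather than the helper added in this PR: the signature, the `limit` default, the case normalization, and the message wording are all hypothetical. Only the function name and `cutoff=0.4` come from this description.

```python
import difflib


def suggest_similar_slugs(
    name: str,
    known_slugs: list[str],
    limit: int = 5,       # hypothetical default
    cutoff: float = 0.4,  # loose threshold mentioned in this PR
) -> list[str]:
    """Return up to `limit` known slugs that resemble `name`, best first."""
    # Compare case-insensitively, but return slugs in their original form.
    by_upper = {slug.upper(): slug for slug in known_slugs}
    matches = difflib.get_close_matches(
        name.upper(), list(by_upper), n=limit, cutoff=cutoff
    )
    return [by_upper[m] for m in matches]


def not_found_message(name: str, known_slugs: list[str]) -> str:
    """Build a short error with suggestions instead of the full slug list."""
    suggestions = suggest_similar_slugs(name, known_slugs)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    return f"Action '{name}' not found.{hint}"


# Made-up slugs for illustration.
slugs = ["ONE_DRIVE_LIST_DRIVES", "ONE_DRIVE_UPLOAD_FILE", "ONE_DRIVE_DELETE_ITEM"]
print(not_found_message("LIST_DRIVES", slugs))
```

Capping the number of suggestions keeps the error size bounded regardless of how many actions an app exposes.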
- Trim the execute_parent_action return to {"status": ..., "result_summary": ...}, dropping raw API payloads and historical responses.
- Remove additional_responses from both execute_parent_action_with_params wrapper returns, in test_and_fix_agent/tool_provider.py and execute_parent_action/tool_provider.py.
- Add a suggest_similar_slugs() helper to cortex/common/mercury_utils.py using difflib.get_close_matches (stdlib, already used in the codebase).
- Use the helper for action-not-found errors in run_action, run_trigger, test_and_fix_agent/tool.py, and runner_helper.py.

make fmt && make chk
uv run pytest cortex/tests/test_tools/test_run_action.py cortex/tests/test_tools/test_run_trigger.py -v
All 28 tests pass (1 skipped); lint and type checks pass.
- The result_summary field is already capped at 2000 chars and LLM-generated by the ExecuteParentActionRunner to contain all essential information (IDs, parameters used, execution outcome).
- The slug suggestions use cutoff=0.4, which is intentionally loose to catch prefix mismatches like LIST_DRIVES → ONE_DRIVE_LIST_DRIVES.

🤖 Generated with Claude Code