fix(serverless): scrub internal infra details from customer-facing errors

@zen-agent · checks n/a · zen/sanitize-internal-errors-8g88r4 → master · 7 files · +1283 −19 · updated 1mo ago

Description

Customer tool-execution responses were leaking the production ElastiCache hostname through a Redis ConnectionError:

{
  "data": {
    "message": "Error -2 connecting to prod-thermos-elasticache-tl43sv.serverless.use1.cache.amazonaws.com:6379. Name or service not known.",
    "status_code": null
  },
  "successful": false,
  "error": "Error -2 connecting to prod-thermos-elasticache-tl43sv.serverless.use1.cache.amazonaws.com:6379. Name or service not known.",
  "log_id": "log_kz6egqY9f0zd"
}

The leak was a direct str(exception) forward in APITool.execute's generic Exception branch (mercury/tools/api/tool.py). The same pattern existed in poll / setup / refresh / fetch_authz_entities, and in the serverless-layer traceback paths (ToolFunction.invoke, RecipeFunction.invoke, handler.run). The captured log_recorder entries forwarded by Thermos as response.Logs were a second leak channel.

What changed

New module mercury/utils/sanitization.py:

sanitize_external_error(message, *, context, log_original): detects internal-infrastructure markers and replaces the entire message with "An internal error occurred. Please try again later."
sanitize_external_payload(payload): recursive scrub of arbitrary nested structures (dict / list / tuple) — string values are passed through sanitize_external_error, non-strings are preserved.
scrub_log_entries(entries): walks log_recorder-shaped entry lists; scrubs message strings and recursively scrubs extras (which is Any).
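
The three helpers above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the marker regex is a tiny hypothetical subset, the defaults for `context` and `log_original` are assumptions, the standard `logging` module stands in for loguru, and the `"message"` / `"extras"` entry keys are assumed from the description.

```python
import logging
import re
from typing import Any, Dict, List, Optional

GENERIC_MESSAGE = "An internal error occurred. Please try again later."

# Hypothetical, deliberately tiny marker set, for illustration only.
_MARKER_RE = re.compile(r"\.cache\.amazonaws\.com|\brediss?://", re.IGNORECASE)

logger = logging.getLogger("sanitization_sketch")


def sanitize_external_error(
    message: Optional[str], *, context: str = "", log_original: bool = True
) -> Optional[str]:
    """Replace the whole message if it contains an internal-infra marker."""
    if not message or not _MARKER_RE.search(message):
        return message  # None, empty, and clean messages pass through
    if log_original:
        # Keep the raw text available server-side for debugging.
        logger.error("[%s] sanitized external error: %s", context, message)
    return GENERIC_MESSAGE


def sanitize_external_payload(payload: Any) -> Any:
    """Recursively scrub strings inside dict / list / tuple containers."""
    if isinstance(payload, str):
        return sanitize_external_error(payload, context="payload")
    if isinstance(payload, dict):
        return {key: sanitize_external_payload(val) for key, val in payload.items()}
    if isinstance(payload, list):
        return [sanitize_external_payload(item) for item in payload]
    if isinstance(payload, tuple):
        return tuple(sanitize_external_payload(item) for item in payload)
    return payload  # non-strings are preserved as-is


def scrub_log_entries(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Scrub the message string and the free-form extras of each entry."""
    scrubbed = []
    for entry in entries:
        clean = dict(entry)  # shallow copy so the caller's entry is untouched
        if isinstance(clean.get("message"), str):
            clean["message"] = sanitize_external_error(clean["message"], context="log")
        if "extras" in clean:
            clean["extras"] = sanitize_external_payload(clean["extras"])
        scrubbed.append(clean)
    return scrubbed
```

Replacing the whole message, rather than redacting only the matched substring, avoids leaking surrounding context such as ports or error codes, at the cost of losing any useful text that happened to share the message.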

Patterns matched (deliberately narrow):

Pattern	What it catches
Internal URI schemes	`redis://`, `rediss://`, `memcached://`, `postgres(ql)?://`, `mysql://`, `mariadb://`, `mongodb(+srv)?://`, `amqp(s)?://`, `kafka://`
AWS managed-service hostnames	`.cache.amazonaws.com` (ElastiCache), `.rds.amazonaws.com` (RDS), `.redshift.amazonaws.com`, `.compute.internal`, `.ec2.internal`, and only `internal-*.elb.amazonaws.com` / `internal-*.elasticloadbalancing.amazonaws.com` (AWS naming convention for non-internet-facing ELBs)
Private DNS suffixes	`.svc.cluster.local`, `.cluster.local`, `.composio.internal`, `.internal.composio.dev`
RFC 1918 / loopback / link-local IPs	`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, `127.0.0.0/8`, `169.254.0.0/16` (incl. AWS metadata `169.254.169.254`)
Composio service-host patterns	`prod-thermos-…`, `apollo-internal`, `mercury-worker-…`, etc.

Explicitly NOT matched (per codex review iter 1, #23476/pulls/comments): public internet-facing AWS endpoints — <name>.<region>.elb.amazonaws.com, *.elasticbeanstalk.com, ec2-*.compute-1.amazonaws.com. Customers' upstream services may sit on those; a 5xx from one of them is a legitimate third-party error and must remain readable.
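
To make the match / no-match boundary concrete, here is a sketch of what narrow patterns of this kind could look like. The regexes and the `is_internal` helper are assumptions written for illustration; the real definitions in mercury/utils/sanitization.py are not reproduced in this description and may differ.

```python
import re

# Assumed approximations of the pattern families in the table above.
INTERNAL_URI_SCHEME_RE = re.compile(
    r"\b(?:rediss?|memcached|postgres(?:ql)?|mysql|mariadb"
    r"|mongodb(?:\+srv)?|amqps?|kafka)://",
    re.IGNORECASE,
)

AWS_INTERNAL_HOST_RE = re.compile(
    r"\.(?:cache|rds|redshift)\.amazonaws\.com\b"
    r"|\.(?:compute|ec2)\.internal\b"
    # Only ELB names that START with "internal-" count as internal.
    r"|(?<![\w.-])internal-[\w.-]+\.(?:elb|elasticloadbalancing)\.amazonaws\.com\b",
    re.IGNORECASE,
)

PRIVATE_IP_RE = re.compile(
    r"(?<![\d.])(?:"
    r"10(?:\.\d{1,3}){3}"                           # 10.0.0.0/8
    r"|172\.(?:1[6-9]|2\d|3[01])(?:\.\d{1,3}){2}"   # 172.16.0.0/12
    r"|192\.168(?:\.\d{1,3}){2}"                    # 192.168.0.0/16
    r"|127(?:\.\d{1,3}){3}"                         # loopback
    r"|169\.254(?:\.\d{1,3}){2}"                    # link-local, AWS metadata
    r")(?!\d)"
)


def is_internal(text: str) -> bool:
    """Return True if any internal-infrastructure marker appears in text."""
    return any(
        pattern.search(text)
        for pattern in (INTERNAL_URI_SCHEME_RE, AWS_INTERNAL_HOST_RE, PRIVATE_IP_RE)
    )
```

Under these assumed patterns, `internal-my-lb-123.us-east-1.elb.amazonaws.com` matches while `my-lb-123.us-east-1.elb.amazonaws.com` and `ec2-1-2-3-4.compute-1.amazonaws.com` do not, and `172.16.0.1` matches while `172.15.0.1` and `172.32.0.1` do not.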

Applied at every error-response site:

File	Sites
`mercury/tools/api/tool.py`	`execute` (HTTPError + ExecutionFailed + Exception, plus a recursive scrub of `ExecutionFailed.extra` that also filters out reserved keys), `poll` (HTTPError + Exception), `setup` (HTTPError + Exception), `refresh` (HTTPError + Exception), `fetch_authz_entities` (Exception)
`mercury/serverless/base.py`	`ToolFunction.invoke` and `RecipeFunction.invoke` catch-all traceback branches
`mercury/serverless/handler.py`	Top-level `run()` `Exception` and proto-serialization-error branches; defense-in-depth final-guard pass over `result["error"]`, nested `result["result"]["error"]`, `data.message`, `data.http_error` (gated on `successful is not True` and `isinstance(..., str)` so that successful responses with structured `data.message` payloads, e.g. Gmail send/draft, are not mutated); and `scrub_log_entries(result["logs"])` to close the `log_recorder` side-channel
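
The gating of the final-guard pass can be sketched as below. The function names and the assumption that `data` and `result` sit at the top level of the response dict are illustrative; the log-entry scrub is omitted, and the scrub callable is injected so the sketch stays self-contained.

```python
from typing import Any, Callable, Dict


def _scrub_field(container: Any, key: str, scrub: Callable[[str], str]) -> None:
    """Scrub container[key] in place, but only when it is a plain string."""
    if isinstance(container, dict) and isinstance(container.get(key), str):
        container[key] = scrub(container[key])


def final_guard(result: Dict[str, Any], scrub: Callable[[str], str]) -> Dict[str, Any]:
    """Last-chance scrub of the known error-carrying fields."""
    # Successful responses are never touched, so a structured or free-text
    # data.message (e.g. an email body) is returned verbatim.
    if result.get("successful") is True:
        return result
    _scrub_field(result, "error", scrub)
    _scrub_field(result.get("result"), "error", scrub)
    _scrub_field(result.get("data"), "message", scrub)
    _scrub_field(result.get("data"), "http_error", scrub)
    return result
```

The `isinstance(..., str)` check means a non-string `error` (for example a dict) is left alone rather than coerced, which matches the handler-boundary tests described later.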

The original message is still logged at error level server-side via the loguru sinks (Datadog/stdout) — internal-infra details remain visible for debugging — but sanitize_external_error temporarily suppresses the log_recorder ContextVar around its own log call so the diagnostic line never lands in the customer-facing result["logs"] list.
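
The recorder-suppression idea can be illustrated with a plain `contextvars` sketch. Everything here is a hypothetical stand-in: `log_recorder` is modelled as a ContextVar holding the list that later becomes `result["logs"]`, and `log_error` imitates a sink that both logs server-side and appends to that list. The real loguru wiring is not shown in this PR text.

```python
import contextlib
import contextvars
import logging
from typing import Dict, Iterator, List

# Hypothetical stand-in: holds the customer-visible log list, or None when
# nothing is being recorded.
log_recorder = contextvars.ContextVar("log_recorder", default=None)

server_logger = logging.getLogger("server_side_sketch")


def log_error(message: str) -> None:
    """Log server-side and, if a recorder is active, capture the line too."""
    server_logger.error(message)
    recorder = log_recorder.get()
    if recorder is not None:
        recorder.append({"level": "ERROR", "message": message})


@contextlib.contextmanager
def recorder_suppressed() -> Iterator[None]:
    """Temporarily detach the recorder so log lines stay server-side only."""
    token = log_recorder.set(None)
    try:
        yield
    finally:
        log_recorder.reset(token)  # restore whatever recorder was active


def demo() -> List[Dict[str, str]]:
    captured: List[Dict[str, str]] = []
    token = log_recorder.set(captured)
    try:
        log_error("tool started")
        with recorder_suppressed():
            log_error("raw error: redis://internal-host:6379 unreachable")
        log_error("tool finished")
    finally:
        log_recorder.reset(token)
    return captured
```

Calling `demo()` returns only the "tool started" and "tool finished" entries; the raw diagnostic line still reaches the server-side logger but never enters the captured list.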

Reviewer feedback addressed

Iter	Source	Finding	Fix commit
1	codex	AWS pattern over-matches public ELB / Elastic Beanstalk / EC2-public-DNS	`ccbe921fbc`
2	codex / cursor-bugbot	Final-guard corrupts `data.message` on successful responses	`90f5890869`
3	codex	`sanitize_external_error`'s own log call leaks raw text into `result["logs"]` (recorder side-channel); pre-existing log lines also leak	`7023a665fe`
4	codex	`ExecutionFailed.extra` returned verbatim; `scrub_log_entries` doesn't recurse	`8f03a0d97f`

How did I test this PR

Scoped pytest (754 passed, 1 pre-existing deselected):

$ pytest tests/test_tools/ tests/test_utils/test_sanitization.py \
         tests/test_utils/test_http.py tests/test_serverless/ \
         --deselect tests/test_serverless/test_execute_recipe.py::TestExecuteRecipe::test_execute_recipe_with_run_composio_tool
===== 754 passed, 6 skipped, 1 deselected in 75.44s =====

Deselected test is a pre-existing master failure (Weathermap config.json missing) — verified by stashing my changes and re-running.

New tests (tests/test_utils/test_sanitization.py, tests/test_serverless/test_handler_final_guard.py, tests/test_tools/test_api_tool.py):

Pattern coverage: ElastiCache / RDS / EC2-internal / k8s svc.cluster.local / Redis/Postgres/MongoDB/AMQP URIs / RFC 1918 / loopback / link-local / AWS metadata 169.254.169.254 / Composio service-host patterns.
Passthrough coverage: public IPs (8.8.8.8, 1.1.1.1), public AWS customer endpoints (*.elb.amazonaws.com, *.elasticbeanstalk.com, ec2-*.compute-1.amazonaws.com), third-party HTTPS URLs, validation errors, edge of RFC 1918 (172.15.x, 172.32.x), None / empty input.
Handler boundary: successful response with structured data.message (Gmail-style) passes through verbatim; failure response with internal-infra leak is scrubbed end-to-end; outer-level error string is scrubbed; non-string outer error is left alone; log-recorder entries with nested internal-infra extras are scrubbed; sanitize_external_error's own log call does not land in result["logs"].
End-to-end APITool: a ConnectionError raising the exact bug-report message produces a response where the leaked hostname appears NOWHERE in the serialized output; ExecutionFailed.extra with leaky nested values is recursively scrubbed and extra["error"] cannot override the official error field.
Regression: TestRegressionBugReport::test_full_bug_report_message pins the exact response shape from the original report.
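
The "hostname appears nowhere in the serialized output" style of assertion, together with the reserved-key filtering on `extra`, can be sketched as a self-contained test. The response builder, the `RESERVED_KEYS` set, the toy `scrub`, and the way `extra` is merged into the response are all assumptions for illustration, not the actual APITool code.

```python
import json
from typing import Any, Dict

GENERIC_MESSAGE = "An internal error occurred. Please try again later."
LEAKED_HOST = "prod-thermos-elasticache-tl43sv.serverless.use1.cache.amazonaws.com"
RESERVED_KEYS = frozenset({"data", "successful", "error"})  # assumed set


def scrub(value: Any) -> Any:
    """Toy recursive scrub: replaces any string mentioning amazonaws.com."""
    if isinstance(value, str):
        return GENERIC_MESSAGE if "amazonaws.com" in value else value
    if isinstance(value, dict):
        return {key: scrub(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(scrub(item) for item in value)
    return value


def build_failure_response(message: str, extra: Dict[str, Any]) -> Dict[str, Any]:
    """Merge scrubbed extras, never letting them override official fields."""
    safe_extra = {key: scrub(val) for key, val in extra.items() if key not in RESERVED_KEYS}
    clean = scrub(message)
    return {
        **safe_extra,
        "data": {"message": clean, "status_code": None},
        "successful": False,
        "error": clean,
    }


def test_leaked_host_absent_everywhere() -> None:
    raw = f"Error -2 connecting to {LEAKED_HOST}:6379. Name or service not known."
    extra = {"error": raw, "debug": {"hosts": [f"redis://{LEAKED_HOST}"]}, "attempt": 2}
    response = build_failure_response(raw, extra)
    # Serializing the whole response catches leaks in any nested field.
    assert LEAKED_HOST not in json.dumps(response)
    assert response["error"] == GENERIC_MESSAGE
    assert response["attempt"] == 2
```

Asserting against the serialized JSON, rather than individual fields, means a newly added field that carries the raw message would fail the test without anyone having to remember to check it.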

Lint / typecheck:

$ make chk           # ruff format + lint + mypy: PASSED
$ make snt           # all CI sanity checks: PASSED

E2E note: the local Apollo + Thermos stack routes to staging/prod Mercury Lambda; internal-infra connection errors are not injectable from the public API path. The unit + handler-boundary tests cover the actual code surface.

Triggered by: Srujan A srujan@composio.dev | Source: slack | Session: https://zen-api-production-4c98.up.railway.app/dashboard/#/chat/zen-8ec724feef3b

🤖 Generated with Claude Code

@venkat82 · 1mo ago

Verified false-positive regressions before merge

Took a closer look at how the patterns interact with existing production action code. Three concrete cases where the whole-message replacement will blank legitimate customer-facing errors:

1. apps/prisma/actions/execute_sql_command.py:123-129 — explicit helpful ValueError:

raise ValueError(
    "Prisma Accelerate URLs (prisma+postgres://accelerate.prisma-data.net/...) are not supported. "
    "Use a direct PostgreSQL connection string instead. "
    "...format as 'postgresql://USER:PASS@HOST/postgres?sslmode=require'..."
)

Both postgres:// and postgresql:// match _INTERNAL_URI_SCHEME_RE. This propagates to APITool.execute's generic Exception branch and gets replaced with the generic message — the customer loses the actionable guidance about which URL format to use.

2. apps/supabase/actions/patch_network_restrictions.py:206-212 — the action's purpose is to manage RFC 1918 CIDR ranges:

raise ExecutionFailed(
    message=(f"Failed to update network restrictions. "
             f"Status code: {response.status_code}. Response: {response.text}"),
    ...
)

Customer's own input is ["192.168.1.0/24", "10.0.0.0/8"] (per the action's own examples=). When Supabase rejects it, response.text echoes the customer's input, _INTERNAL_IP_RE fires, and the entire error message is wiped. The customer can't see why their CIDR was rejected.

3. apps/ipinfo_io/actions/ipinfo_*.py: these actions exist explicitly to look up private/bogon IPs (per their own docstring: "Detect bogon/private IPs (192.168.x.x, 10.x.x.x, etc.)"). ExecutionFailed(message=f"Failed to parse response as JSON: {response.text}") will echo a private IP from the upstream response, and the whole message will be sanitized away.

FYI: Pydantic validation errors are not a concern. parse_pydantic_error (mercury/utils/pydantic.py:31-45) only includes error["msg"] and the parameter name, not the schema's examples=/description=. So the many pattern-matching strings in those schema fields across sendgrid/servicenow/vercel won't leak through that path.
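
The first two false positives can be reproduced with a small sketch. The regexes below are assumed approximations of `_INTERNAL_URI_SCHEME_RE` and `_INTERNAL_IP_RE` (their real definitions are not quoted in this thread), and the Supabase status code and response body are hypothetical sample values.

```python
import re

GENERIC_MESSAGE = "An internal error occurred. Please try again later."

# Assumed approximations of the PR's patterns, reduced to the relevant cases.
URI_SCHEME_RE = re.compile(r"\bpostgres(?:ql)?://", re.IGNORECASE)
PRIVATE_IP_RE = re.compile(
    r"(?<![\d.])(?:10(?:\.\d{1,3}){3}|192\.168(?:\.\d{1,3}){2})(?!\d)"
)


def whole_message_sanitize(message: str) -> str:
    """Whole-message replacement, as described in the PR."""
    if URI_SCHEME_RE.search(message) or PRIVATE_IP_RE.search(message):
        return GENERIC_MESSAGE
    return message


# Case 1: actionable guidance that merely mentions a URL format.
PRISMA_HINT = (
    "Prisma Accelerate URLs (prisma+postgres://accelerate.prisma-data.net/...) "
    "are not supported. Use a direct PostgreSQL connection string instead."
)

# Case 2: an upstream rejection echoing the customer's own CIDR input
# (hypothetical status code and body).
SUPABASE_ECHO = (
    "Failed to update network restrictions. Status code: 400. "
    'Response: {"rejected": ["192.168.1.0/24", "10.0.0.0/8"]}'
)

blanked = [whole_message_sanitize(PRISMA_HINT), whole_message_sanitize(SUPABASE_ECHO)]
```

Both entries in `blanked` come back as the generic message, so the customer loses the guidance in case 1 and the rejection reason in case 2, even though neither string contains any internal infrastructure detail.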
