fix(serverless): cut Lambda memory + 401 burn on poll_trigger

@zen-agent · checks: n/a · …zen/lambda-mem-auth-oshaf1 → master · 7 files · +798 −24 · updated 1mo ago

Description

Two production fixes to Mercury's poll_trigger path, plus a test-infrastructure fix for breakage the SSRF-guard rollout left behind.

1. Lambda memory spikes (1.5–2 GB) from failed S3 offload

The large-payload offload in mercury/serverless/large_payload.py was holding the same poll response in memory three times during upload — the original result["result"] dict, the json.dumps(...).encode() bytes, and a copy written to /tmp (which on Lambda is in-memory tmpfs) — plus boto3's multipart send buffer on top. When the upload to large-lambda-payloads-prd failed, the bare except Exception caught the error, logged only str(e) (which truncates BotoError's dict repr), and silently returned the unmodified result with the full payload still in RAM. Multiplied across ~40 concurrent invocations, this is what produces the 1938/1956/2068 MB memory spikes.

Changes:

New S3.put_object_bytes(body, key, bucket) in mercury/storage/s3.py — uploads in-memory bytes via client.put_object directly, skipping the tempfile + boto3 file-reread.
Rewrite _upload in large_payload.py to call put_object_bytes, halving peak RAM during offload.
Drop the result["result"] reference before issuing the upload so we never hold both the dict and its serialized bytes simultaneously. On failure, restore from raw_bytes only when the payload still fits Lambda's 6 MB limit; otherwise leave it None and surface a structured error.
Bounded retry (3 attempts, 0.2 s → 0.4 s → 0.8 s) for transient errors only (ServiceUnavailable, SlowDown, Throttling*, RequestTimeout, 5xx, EndpointConnectionError, *TimeoutError). Terminal errors (AccessDenied, NoSuchBucket, EntityTooLarge, …) fail fast — retrying just burns Lambda time.
Failure log now includes error_code, http_status, request_id, bucket, key, size — enough to diagnose the actual upload failure in prod logs.
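
The retry and restore behaviour listed above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in large_payload.py: `UploadError`, `is_transient`, `offload`, the `offloaded` key, and the injected `put_bytes` callable are hypothetical names, the transient-code set is abbreviated, and the backoff is assumed to sleep only between attempts (so three attempts use the first two delays).

```python
import json
import time
from typing import Any, Callable

LAMBDA_RESPONSE_LIMIT = 6 * 1024 * 1024  # the 6 MB limit named in the description
TRANSIENT_CODES = {"ServiceUnavailable", "SlowDown", "RequestTimeout"}  # abbreviated


class UploadError(Exception):
    """Hypothetical stand-in for the storage client's error type."""

    def __init__(self, code: str, http_status: int) -> None:
        super().__init__(f"{code} ({http_status})")
        self.code = code
        self.http_status = http_status


def is_transient(err: UploadError) -> bool:
    # Throttling* codes and any 5xx status are treated as retryable.
    return (
        err.code in TRANSIENT_CODES
        or err.code.startswith("Throttling")
        or 500 <= err.http_status < 600
    )


def offload(
    result: dict[str, Any],
    put_bytes: Callable[[bytes], None],
    delays: tuple[float, ...] = (0.2, 0.4, 0.8),
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    raw_bytes = json.dumps(result["result"]).encode()
    result["result"] = None  # drop our reference to the dict before uploading
    last_error = None
    for attempt in range(attempts):
        try:
            put_bytes(raw_bytes)
            result["offloaded"] = True
            return result
        except UploadError as err:
            last_error = err
            if not is_transient(err) or attempt == attempts - 1:
                break  # terminal error, or retries exhausted
            sleep(delays[attempt])
    # Failure: restore only if the payload still fits the response limit.
    if len(raw_bytes) <= LAMBDA_RESPONSE_LIMIT:
        result["result"] = json.loads(raw_bytes)
    result["error"] = {
        "error_code": last_error.code,
        "http_status": last_error.http_status,
        "size": len(raw_bytes),
    }
    return result
```

Injecting `put_bytes` and `sleep` keeps the sketch testable without S3 or real delays. With botocore, the error code and status would come from `ClientError.response["Error"]["Code"]` and `ClientError.response["ResponseMetadata"]["HTTPStatusCode"]`.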

2. Wasted compute from 401s on stale/missing credentials

APITool.poll() previously instantiated and ran the trigger no matter what was in the auth dict. When Apollo scheduled a poll for a connection whose credentials had been cleared or deleted, Mercury still made the HTTP call, GitHub/Gmail returned 401, and we paid for real GB-seconds on a doomed request.

Adds an _is_auth_payload_empty(auth) guard at the top of poll() (after the trigger-type check) that catches:

auth is None or {}
auth has no headers / params / query / body
Authorization header is present but the value is empty or just the scheme word ("", "Bearer", "token", "Bearer ", …)

When matched, returns auth_refresh_required: True immediately with error: "auth credentials missing or empty" — no trigger instantiation, no HTTP call. The check is intentionally narrow: stale-but-present tokens (real OAuth expiry) still flow through and are handled by the existing 401/403 detection in _parse_http_error.
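
A guard of this shape might look like the sketch below. It is an approximation built only from the three conditions listed above, not the actual `_is_auth_payload_empty`: the function names, the assumption that `headers` is a plain mapping, and the case-insensitive matching of scheme words are all illustrative.

```python
from typing import Any, Mapping, Optional

# Scheme words that carry no credential on their own (assumed case-insensitive).
_BARE_SCHEMES = {"bearer", "token"}
_AUTH_SECTIONS = ("headers", "params", "query", "body")


def is_auth_payload_empty(auth: Optional[Mapping[str, Any]]) -> bool:
    # Case 1: auth is None or {}.
    if not auth:
        return True
    # Case 2: none of the credential-carrying sections has any content.
    if not any(auth.get(section) for section in _AUTH_SECTIONS):
        return True
    # Case 3: Authorization header present but empty or only a scheme word.
    headers = auth.get("headers") or {}
    for name, value in headers.items():
        if name.lower() == "authorization":
            text = str(value or "").strip()
            if not text or text.lower() in _BARE_SCHEMES:
                return True
    return False


def guard_poll(auth: Optional[Mapping[str, Any]]) -> Optional[dict]:
    """Return the short-circuit response, or None to let the poll proceed."""
    if is_auth_payload_empty(auth):
        return {
            "auth_refresh_required": True,
            "error": "auth credentials missing or empty",
        }
    return None
```

A token that is present but expired, such as `{"headers": {"Authorization": "Bearer abc123"}}`, passes this guard by design and is left to the downstream 401/403 handling.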

Out of scope (Thermos follow-up)

The 401 volume from stale-but-present OAuth tokens cannot be fixed in Mercury alone — there is no token-validation call cheaper than the API call itself. Mercury already signals auth_refresh_required: True correctly for those. The remaining work is on the Apollo/Thermos side: act on that signal (refresh, retry, disable trigger after N consecutive failures) rather than re-scheduling another poll with the same dead token. Filing this as a follow-up.

3. CI unblock: SSRF guard breaks legacy http test patches

While iterating on this PR, CI started failing 23 tests in tests/test_utils/test_http.py that aren't touched by this branch. RCA: the mercurySsrfGuardEnforced LaunchDarkly flag was flipped to True at the LD level after master's last green run. The legacy-path tests there patch mercury.utils.http.requests.request, but the SSRF-guarded path issues calls via session.request on a freshly-built requests.Session — the patch never fires, real requests hit https://example.com, and assertions fail with <Response [200]> == <MagicMock>. Master would now fail identically; this PR just happened to be the first to hit it.

Fix shipped here as a separate commit: tests/test_utils/conftest.py autouse fixture forcing _is_ssrf_guard_enforced to return False for the whole test_utils/ package. The dedicated tests/test_utils/test_http_ssrf.py suite still opts the flag back on per-test, so SSRF-guard coverage is preserved.
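
The patch miss and the effect of forcing the flag off can be reproduced with stand-ins. Nothing below is Mercury code: `fake_requests`, `Session`, `flags`, and `fetch` are hypothetical substitutes for the `requests` module, `requests.Session`, the LaunchDarkly flag lookup, and the HTTP helper, kept stdlib-only so the sketch runs anywhere.

```python
from types import SimpleNamespace
from unittest import mock


class Session:
    """Stand-in for requests.Session."""

    def request(self, method: str, url: str) -> str:
        return f"real {method} {url}"


def _module_request(method: str, url: str) -> str:
    return f"real {method} {url}"


# Stand-in for the `requests` module as seen from the HTTP helper.
fake_requests = SimpleNamespace(request=_module_request, Session=Session)

# Stand-in for the feature flag; True mirrors the flag being flipped on.
flags = {"ssrf_guard": True}


def fetch(url: str) -> str:
    if flags["ssrf_guard"]:
        # Guarded path: goes through a fresh Session, not the module function.
        return fake_requests.Session().request("GET", url)
    # Legacy path: the call the old tests patch.
    return fake_requests.request("GET", url)


def legacy_test_sees_mock() -> bool:
    # Mirrors a legacy test that patches only the module-level request().
    with mock.patch.object(fake_requests, "request", return_value="mocked"):
        return fetch("https://example.com") == "mocked"
```

With the flag on, `legacy_test_sees_mock()` returns False because the patched function is never called. Wrapping the call in `mock.patch.dict(flags, {"ssrf_guard": False})` plays the role of the autouse fixture and makes it return True.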

How did I test this PR

Added test files / cases:

tests/test_serverless/test_large_payload.py (new, 8 tests): happy-path round-trip via moto, no-tempfile invariant, transient-retry-then-success, terminal-error-no-retry, small-payload restore on failure, oversized-payload null on failure, missing-bucket handling.
tests/test_storage/test_s3.py: 2 new cases covering put_object_bytes round-trip + BotoError propagation.
tests/test_tools/test_api_tool.py: 3 new cases on APITool.poll (short-circuits on {}, short-circuits on Bearer with no token, lets through real Authorization value) plus a TestIsAuthPayloadEmpty parametrized class with 16 positive/negative cases.
tests/test_utils/conftest.py (new): SSRF-guard-disabling autouse fixture for legacy-path http tests.

CI on the latest commit (9a1e4e2348):

Check	Result
Tests	passed
Lint and type checks	passed
Checks	passed
Trigger Config	passed
Endpoint Duplicate CI	passed
Google Docs Markdown renderer	passed
Analyze (actions)	passed
Analyze (javascript-typescript)	passed
Analyze (python)	still in flight at writing
`secrets-detection.yml`	workflow-startup failure, not a finding — fails identically on every recent PR (cortex/fix/googlesheets, cortex/fix/notion). Not introduced by this branch.

Local verification:

$ pytest tests/test_serverless/test_large_payload.py tests/test_storage/test_s3.py tests/test_tools/test_api_tool.py tests/test_utils/test_http.py
232 passed
