fix(rube/learning): retry on full 5xx range in flex_chat_completions_parse

@zen-agent · checks n/a · checks…zen/retry-cf-5xx-yr1i2u → master · 2 files · +91 −1 · updated 1mo ago

Description

flex_chat_completions_parse only retries status in (500, 502, 503, 504) — Cloudflare-specific 5xx codes (520-527, 530) returned by the OpenAI edge when origin is unreachable fall through to the generic error branch and return None after a single attempt.

Today's 24h scan caught one such failure on the workflow analysis path:

[2026-05-05 14:33:43,266][ERROR] OpenAI API error (status=520) on attempt 1: <!DOCTYPE html>
[2026-05-05 14:33:43,266][ERROR] Workflow analysis returned no output

Status 520 is Cloudflare's "origin returned empty/invalid response" code. From our perspective it is equivalent to a transient 502 and should be retried like the other 5xx codes already are. The same backoff path also handles 521-527 and 530 in case the edge surfaces them in a future incident.

The fix is a one-line range check (500 <= status < 600) plus a parametrized test covering both the originally-handled codes (500, 502, 503, 504) and the formerly-unretried Cloudflare codes (520-527, 530), and a regression test that asserts non-5xx (e.g. 401) is still NOT retried.
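For readers without the diff open, the range check can be sketched roughly as follows. This is a minimal illustration, not the actual body of flex_chat_completions_parse: `FakeAPIStatusError`, `is_retryable_status`, `call_with_retry`, the attempt count, and the backoff values are all hypothetical stand-ins chosen for the example.

```python
import time
from typing import Callable, Optional


class FakeAPIStatusError(Exception):
    """Hypothetical stand-in for the client's status error; carries status_code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message)
        self.status_code = status_code


def is_retryable_status(status: int) -> bool:
    # Full 5xx range: covers 500/502/503/504 as before, plus the
    # Cloudflare-specific 520-527 and 530.
    return 500 <= status < 600


def call_with_retry(
    call: Callable[[], str],
    max_attempts: int = 3,  # assumed value, not taken from the PR
    base_delay: float = 0.5,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `call`, retrying with exponential backoff on retryable statuses."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except FakeAPIStatusError as exc:
            if not is_retryable_status(exc.status_code):
                return None  # e.g. 401: give up after a single attempt
            if attempt == max_attempts:
                return None  # retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Expressing the check as a range rather than an explicit tuple means any 5xx code the edge may surface later is retried without another code change, while 4xx responses still fail fast.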

Origin: cron-5d55c321e47a / zen-cron-e587efbf1491

How did I test this PR

$ source .venv/bin/activate && python -m pytest \
    app_tester/rube_learning/tests/test_flex_llm_5xx_retry.py -v
============================== 14 passed in 6.98s ==============================

All 14 tests pass:

test_5xx_retried_then_succeeds[500, 502, 503, 504, 520..527, 530]: each of the 13 parametrized 5xx codes is retried and recovers on attempt 2
test_non_5xx_not_retried: 401 is logged once and returns None (negative regression test)
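The shape of these two tests can be sketched as below. Only the test names come from this PR; the test bodies, `FakeStatusError`, and the tiny `parse_with_retry` function under test are hypothetical, and backoff sleeps are omitted to keep the sketch fast.

```python
from unittest.mock import Mock

import pytest


class FakeStatusError(Exception):
    """Hypothetical stand-in for a mocked APIStatusError(status_code=...)."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"status={status_code}")
        self.status_code = status_code


def parse_with_retry(client_call, max_attempts: int = 2):
    """Hypothetical function under test: retry any 5xx, give up otherwise."""
    for _ in range(max_attempts):
        try:
            return client_call()
        except FakeStatusError as exc:
            if not 500 <= exc.status_code < 600:
                return None  # non-5xx is not retried
    return None  # retries exhausted


# 4 originally-handled codes + 9 Cloudflare codes = 13 parametrized cases.
RETRYABLE = [500, 502, 503, 504, 520, 521, 522, 523, 524, 525, 526, 527, 530]


@pytest.mark.parametrize("status", RETRYABLE)
def test_5xx_retried_then_succeeds(status):
    # First call raises the 5xx error, second call returns a result.
    call = Mock(side_effect=[FakeStatusError(status), "parsed"])
    assert parse_with_retry(call) == "parsed"
    assert call.call_count == 2


def test_non_5xx_not_retried():
    # A 401 must end the loop after exactly one attempt.
    call = Mock(side_effect=[FakeStatusError(401), "parsed"])
    assert parse_with_retry(call) is None
    assert call.call_count == 1
```

Asserting on `call_count` is what distinguishes "retried and recovered" from "succeeded by accident", and it is also what makes the 401 case a real negative test.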

ruff check + ruff format --check are clean on both touched files.

Pre-existing CI failures

Test - Learning Pipeline ✅ passes.

Lint - Integrator ❌ fails on this PR, but the same check is already failing on master (not introduced by this PR; verified with gh api repos/ComposioHQ/integrator/actions/runs?branch=master, where the last 10 runs all fail). The errors are in unrelated files (workflow_analysis/runner.py, pii_mask.py, untrusted_input.py, test_workflow_search_no_truncate.py, test_search_turbopuffer_optimized.py, test_tool_consolidation*.py), none of which are touched by this PR. Per repo CI conventions, pre-existing failures on master are noted and not blocking.

Triggered by: dhawal@composio.dev | Source: cron | Session: https://zen-api-production-4c98.up.railway.app/dashboard/#/chat/zen-cron-e587efbf1491

@zen-agent · 1mo ago

PR status update

Build checks (CI lint commands run locally on changed files only):

Check	Result
`black --check app_tester/rube_learning/flex_llm.py app_tester/rube_learning/tests/test_flex_llm_5xx_retry.py`	✅ 2 files would be left unchanged
`isort --check` (same files)	✅ no errors
`flake8 --config tox.ini` (same files)	✅ no errors
`mypy --check-untyped-defs --config-file tox.ini` (same files)	✅ no errors in the touched code (only transitive complaints from unrelated files like `pii_mask.py`, `clickhouse_service.py`, `usecase_ingestion_service.py`)
`pytest` on the new test file	✅ 14 passed in 6.77s

Tests: 14 cases in total. 13 parametrized cases cover the originally-handled codes (500/502/503/504) and the newly covered Cloudflare codes (520/521/522/523/524/525/526/527/530), and one 401 negative regression test confirms that non-5xx is still NOT retried.

E2E/runtime check: Not applicable — flex_llm.flex_chat_completions_parse is OpenAI client-side retry logic. Reproducing a real Cloudflare 520 would require controlling the OpenAI edge. The unit tests with a mocked APIStatusError(status_code=520) are the highest-fidelity test possible without an actual outage.

CI status:

✅ Test - Learning Pipeline: passes
❌ Lint - Integrator: pre-existing failure on master, NOT introduced by this PR. Verified by comparing this PR's lint log against master's last successful-build attempt at 51ef296 (run 24939883584): the same 4 sub-tasks (isort-check, black-check, flake8, mypy) fail in the same files on both runs. None of the failing files (workflow_analysis/runner.py, pii_mask.py, and the other unrelated files listed in the description) are touched by this PR. Per repo CI conventions, pre-existing master failures are noted and not blocking.
