Description
The protect/restore regex pass in _protect_technical_patterns calls
re.Pattern.finditer(text), which raises
TypeError: expected string or bytes-like object, got 'list' whenever a caller
passes a list/dict to mask_pii. The fail-closed except branch then catches
the error, drops the entire field's content, and emits a misleading
traceback in the logs.
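The failure mode can be reproduced outside the service with a few lines. This is a minimal sketch, not the project's code: the `EMAIL` pattern and the helper names `find_matches` and `try_find` are hypothetical stand-ins for the patterns used in _protect_technical_patterns.

```python
import re

# Hypothetical pattern; the real pass uses its own set of technical patterns.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_matches(value):
    """Run the regex pass the way a str-only implementation would."""
    return [m.group(0) for m in EMAIL.finditer(value)]


def try_find(value):
    """Return the matches, or a description of the TypeError that was raised."""
    try:
        return find_matches(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_find("contact a@b.co today"))  # str input: the pattern is applied
print(try_find(["a@b.co", "name"]))      # list input: finditer rejects it
```

The exact wording of the TypeError message depends on the Python version, so the sketch prints it rather than asserting on it.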
Concrete trigger paths:
- safe_search_str(q.get("known_fields", "")) in common/search_format.py: agents sometimes supply known_fields as a list rather than the documented comma-separated string.
- mask_pii(sanitize_for_llm(plan.get("known_fields", ""))) in ingestion_pipeline/plan_consolidator.py (same data shape).
Production evidence (last 7 days, service:learning-pipeline):
| Event | Count |
|---|---|
| [WARNING] PII masking failed; dropping content | 17 |
| TypeError: expected string or bytes-like object, got 'list' surfaced via exc_info | 1 (the rest were swallowed by the try/except) |
| Cycle/session impact | Affected sessions ran, but the affected field rendered as empty "" in the LLM prompt |
Fix
Coerce non-string inputs to a JSON string at the top of mask_pii, before the
regex pass. This preserves the structured content for real Presidio masking
(emails, phones, SSNs, and credit cards in nested dicts get detected) instead
of silently losing it via the fail-closed branch.
- Widens the parameter type from str to Any to reflect actual call sites.
- Coerces non-str inputs via json.dumps(value, default=str, ensure_ascii=False), falling back to str(value) if that fails.
- Tightens the empty-input branch to always return "" (so callers that f-string the result aren't fed back a list/dict/None).
- Follow-up commit (02f253d6): Codex review + Cursor Bugbot both flagged that the coercion path sat outside the fail-closed try, so a hostile object whose __str__ raises (e.g. RuntimeError, RecursionError from deeply nested input) would escape uncaught. Moved coercion inside the existing try/except Exception so all failures funnel through the same logger.warning(..., exc_info=True); return "" exit.
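The shape of the change can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `_mask_text` is a hypothetical stand-in for the real protect/restore pass plus Presidio masking, the `<EMAIL>` placeholder is invented, and the choice of `(TypeError, ValueError)` as the trigger for the str(value) fallback is a guess at what "if that fails" covers.

```python
import json
import logging
import re
from typing import Any

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the Presidio-backed masking step.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def _mask_text(text: str) -> str:
    return _EMAIL.sub("<EMAIL>", text)


def mask_pii(value: Any) -> str:
    # Every falsy input (None, "", [], {}, 0) funnels to the empty string.
    if not value:
        return ""
    try:
        # Coercion lives inside the try so a raising __str__ also fails closed.
        if isinstance(value, str):
            text = value
        else:
            try:
                text = json.dumps(value, default=str, ensure_ascii=False)
            except (TypeError, ValueError):  # assumed fallback trigger
                text = str(value)
        return _mask_text(text)
    except Exception:
        logger.warning("PII masking failed; dropping content", exc_info=True)
        return ""


print(mask_pii(["a@b.co", "name"]))
print(mask_pii({"email": "x@y.io"}))
```

With this layout a list or dict reaches the masking step as a JSON string, while any exception raised during coercion or masking still ends at the single warning-and-return-"" exit.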
How did I test this PR
Ran the scoped unit-test file in the integrator venv with presidio_analyzer +
en_core_web_sm installed:
$ source .venv/bin/activate
$ python -m pytest app_tester/rube_learning/tests/test_pii_mask.py -v
=== 21 passed in 6.74s ===
New cases added in this PR (all passing):
- test_mask_pii_falsy_inputs_return_empty_string[None|""|[]|{}|0]: every falsy input funnels to "".
- test_mask_pii_list_input_is_masked_via_json_coercion: regression test for the exact production traceback. Verifies the list-shaped input no longer crashes, emails inside are masked, and non-PII content survives.
- test_mask_pii_dict_input_is_masked_via_json_coercion: same for dict input.
- test_mask_pii_arbitrary_python_object_does_not_crash: defensive coverage for the default=str JSON fallback.
- test_mask_pii_object_with_raising_str_fails_closed: locks down the fail-closed contract for hostile objects whose __str__ raises (Bugbot/Codex hardening from the follow-up commit).
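The last two cases rest on how json.dumps treats objects it cannot serialize. The sketch below illustrates that behavior in isolation; the classes `Plain` and `Hostile` and the helpers `coerce` and `coerce_outcome` are hypothetical names used only for this example and are not taken from the test file.

```python
import json


class Plain:
    """Not JSON-serializable, but str() works."""

    def __str__(self) -> str:
        return "plain-object"


class Hostile:
    """str() itself raises, so default=str cannot rescue it."""

    def __str__(self) -> str:
        raise RuntimeError("refusing to stringify")


def coerce(value) -> str:
    # Same call shape as described in the Fix section.
    return json.dumps(value, default=str, ensure_ascii=False)


def coerce_outcome(value) -> str:
    """Return the JSON text, or the name of the exception that escaped."""
    try:
        return coerce(value)
    except Exception as exc:
        return type(exc).__name__


print(coerce_outcome([Plain()]))    # default=str turns the object into a string
print(coerce_outcome([Hostile()]))  # the error from __str__ propagates
```

Because the exception from a raising __str__ propagates out of json.dumps, the coercion call has to sit inside the fail-closed try for the second case to end in "" rather than an uncaught error.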
Lint pipeline locally: ruff check, ruff format --check, isort --check,
black --check, flake8 — all clean on the two changed files. mypy reports
5 pre-existing errors on master files unrelated to this change. Pre-existing
isort/black drift on untrusted_input.py and a handful of test files is
intentionally left untouched so this PR stays scoped to the bug.
Origin: cron-5d55c321e47a / zen-cron-87e92f88c8e2
Triggered by: dhawal@composio.dev | Source: cron