fix(rube/learning): coerce non-str inputs in mask_pii to JSON

@zen-agentchecks n/achecks…zen/mask-pii-coerce-non-str-yzjqvq → master2 files · +128 −10updated 3w ago

▸Description· 2 comments

Description

The protect/restore regex pass in _protect_technical_patterns calls re.Pattern.finditer(text), which raises TypeError: expected string or bytes-like object, got 'list' whenever a caller passes a list/dict to mask_pii. The fail-closed except branch then catches the error, drops the entire field's content, and emits a misleading traceback in the logs.

Concrete trigger paths:

safe_search_str(q.get("known_fields", "")) in common/search_format.py — agents sometimes supply known_fields as a list rather than the documented comma-separated string.
mask_pii(sanitize_for_llm(plan.get("known_fields", ""))) in ingestion_pipeline/plan_consolidator.py (same data shape).

Production evidence (last 7 days, service:learning-pipeline):

Event	Count
`[WARNING] PII masking failed; dropping content`	17
`TypeError: expected string or bytes-like object, got 'list'` surfaced via `exc_info`	1 (rest masked by `try` swallow)
Cycle/session impact	Affected sessions ran but the affected field rendered as empty `""` in the LLM prompt

Fix

Coerce non-string inputs to a JSON string at the top of mask_pii before the regex pass. Preserves the structured content for real Presidio masking (emails, phones, SSNs, credit cards in nested dicts get detected) instead of silently losing it via the fail-closed branch.

Widens the parameter type from str to Any to reflect actual call sites.
Coerces non-str inputs via json.dumps(value, default=str, ensure_ascii=False), falling back to str(value) if that fails.
Tightens the empty-input branch to always return "" (so callers that f-string the result aren't fed back a list/dict/None).
Follow-up commit (02f253d6): Codex review + Cursor Bugbot both flagged that the coercion path sat outside the fail-closed try, so a hostile object whose __str__ raises (e.g. RuntimeError, RecursionError from deeply nested input) would escape uncaught. Moved coercion inside the existing try/except Exception so all failures funnel through the same logger.warning(..., exc_info=True); return "" exit.

How did I test this PR

Ran the scoped unit-test file in the integrator venv with presidio_analyzer + en_core_web_sm installed:

$ source .venv/bin/activate
$ python -m pytest app_tester/rube_learning/tests/test_pii_mask.py -v
=== 21 passed in 6.74s ===

New cases added in this PR (all passing):

test_mask_pii_falsy_inputs_return_empty_string[None|""|[]|{}|0] — every falsy input funnels to "".
test_mask_pii_list_input_is_masked_via_json_coercion — regression test for the exact production traceback. Verifies the list-shaped input no longer crashes, emails inside are masked, and non-PII content survives.
test_mask_pii_dict_input_is_masked_via_json_coercion — same for dict input.
test_mask_pii_arbitrary_python_object_does_not_crash — defensive coverage for the default=str JSON fallback.
test_mask_pii_object_with_raising_str_fails_closed — locks down the fail-closed contract for hostile objects whose __str__ raises (Bugbot/ Codex hardening from the follow-up commit).

Lint pipeline locally: ruff check, ruff format --check, isort --check, black --check, flake8 — all clean on the two changed files. mypy reports 5 pre-existing errors on master files unrelated to this change. Pre-existing isort/black drift on untrusted_input.py and a handful of test files is intentionally left untouched so this PR stays scoped to the bug.

Origin: cron-5d55c321e47a / zen-cron-87e92f88c8e2 Triggered by: dhawal@composio.dev | Source: cron

@zen-agent1mo ago

Status update

Codex review (foreground) + Cursor Bugbot (asynchronous) both reviewed the PR and surfaced two fail-closed corner cases. Both are now fixed:
- 02f253d6 — moved JSON/str coercion inside the fail-closed try (catches RecursionError from deep nesting, custom __str__ raising RuntimeError, etc.).
- 6ea3b8fc — moved the if not text truthiness guard inside the same try (catches numpy multi-element arrays / hostile __bool__ raising ValueError).
All 22 unit tests pass locally. ruff/isort/black/flake8 clean on the two changed files.
test-learning CI: PASS ✅
run-lint CI: FAIL ❌ — this check is pre-existing on master (verified via master run #24939883584). All listed errors are in files this PR does not touch (untrusted_input.py isort, 6 unrelated files black, workflow_analysis/runner.py:17 flake8 F401, 11 mypy errors across 5 unrelated files). Not actionable on this PR.

@zen-agent1mo ago

Final CI status (commit 6ea3b8fc):

✅ test-learning: PASS
❌ run-lint: FAIL — identical to master. Specific failures (none in files this PR touches):
- isort-check: untrusted_input.py imports out of order
- black-check: 6 unrelated files need reformatting (untrusted_input.py, tool_consolidator.py, 4 test files)
- flake8: F401 imported but unused

loading diff…

Description

Concrete trigger paths:

safe_search_str(q.get("known_fields", "")) in common/search_format.py — agents sometimes supply known_fields as a list rather than the documented comma-separated string.

mask_pii(sanitize_for_llm(plan.get("known_fields", ""))) in ingestion_pipeline/plan_consolidator.py (same data shape).

Production evidence (last 7 days, service:learning-pipeline):

Event

Count

[WARNING] PII masking failed; dropping content

TypeError: expected string or bytes-like object, got 'list' surfaced via exc_info

1 (rest masked by try swallow)

Cycle/session impact

Affected sessions ran but the affected field rendered as empty "" in the LLM prompt

Fix

Widens the parameter type from str to Any to reflect actual call sites.

Coerces non-str inputs via json.dumps(value, default=str, ensure_ascii=False), falling back to str(value) if that fails.

Tightens the empty-input branch to always return "" (so callers that f-string the result aren't fed back a list/dict/None).

Follow-up commit (02f253d6): Codex review + Cursor Bugbot both flagged that the coercion path sat outside the fail-closed try, so a hostile object whose __str__ raises (e.g. RuntimeError, RecursionError from deeply nested input) would escape uncaught. Moved coercion inside the existing try/except Exception so all failures funnel through the same logger.warning(..., exc_info=True); return "" exit.

How did I test this PR

Ran the scoped unit-test file in the integrator venv with presidio_analyzer + en_core_web_sm installed:

$ source .venv/bin/activate $ python -m pytest app_tester/rube_learning/tests/test_pii_mask.py -v === 21 passed in 6.74s ===

New cases added in this PR (all passing):

test_mask_pii_falsy_inputs_return_empty_string[None|""|[]|{}|0] — every falsy input funnels to "".

test_mask_pii_list_input_is_masked_via_json_coercion — regression test for the exact production traceback. Verifies the list-shaped input no longer crashes, emails inside are masked, and non-PII content survives.

test_mask_pii_dict_input_is_masked_via_json_coercion — same for dict input.

test_mask_pii_arbitrary_python_object_does_not_crash — defensive coverage for the default=str JSON fallback.

test_mask_pii_object_with_raising_str_fails_closed — locks down the fail-closed contract for hostile objects whose __str__ raises (Bugbot/ Codex hardening from the follow-up commit).