Post-PR status update
Codex review loop: 4 iterations, 5 findings, all addressed:
- Toolkit-group collision → include toolkit_key in synthetic_id
- Known-fields collision → include representative_known_fields
- Empty/"None"/"null" canonicalization for no-known-fields clusters
- Test decoupled from heavy ingestion deps (sha256[:16] re-derived in test)
- Resume mode: _load_resume_data now preserves toolkit_key and representative_known_fields; processor falls back to toolkits_str derived from cluster_plans when toolkit_key is absent
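For reviewers, the ID contract described by these findings can be sketched roughly as follows. This is a minimal illustration, not the code in unmatched_cluster_processor.py: the function names, the `base_key` argument, the joiner character, and the exact field order are all assumptions. Only the sha256[:16] truncation, the inclusion of toolkit_key and representative_known_fields, and the empty/"None"/"null" collapse come from the findings above.

```python
import hashlib

# Assumption: these spellings (case-insensitive) all mean "no known fields".
_EMPTY_MARKERS = {"", "none", "null"}


def canonical_known_fields(value):
    """Collapse None, "", "None" and "null" to one canonical empty string."""
    if value is None:
        return ""
    text = str(value).strip()
    return "" if text.lower() in _EMPTY_MARKERS else text


def synthetic_id(base_key, toolkit_key, representative_known_fields):
    """Hypothetical deterministic ID: first 16 hex chars of a SHA-256 digest.

    base_key stands in for whatever the real code hashed before this PR.
    """
    parts = [
        base_key,
        toolkit_key,
        canonical_known_fields(representative_known_fields),
    ]
    # A unit-separator joiner keeps ("a", "bc") distinct from ("ab", "c").
    payload = "\x1f".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Different toolkit groups or known fields give different IDs;
# empty-ish known fields collapse to the same ID.
assert synthetic_id("c1", "gmail", "email") != synthetic_id("c1", "slack", "email")
assert synthetic_id("c1", "gmail", "email") != synthetic_id("c1", "gmail", "phone")
assert synthetic_id("c1", "gmail", None) == synthetic_id("c1", "gmail", "null")
```

Because the ID is a pure function of its inputs, a test can re-derive it with hashlib alone, which is presumably what lets the test file avoid importing the ingestion dependencies.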
Tests: 4 standalone tests all pass (test_synthetic_id_distinguishes_toolkit_groups, ..._distinguishes_known_fields, ..._persists_synthetic_id_on_returned_plan, ..._empty_none_null_known_fields_collapse_to_same_id). Source-level guards lock in the contract.
Local lint on changed files (black 23.3.0 + isort 5.12.0 + flake8 6.0.0): all clean.
CI:
- ✅ test-learning (Test - Learning Pipeline): passing
- ⚠️ run-lint (Lint - Integrator): pre-existing failure on master since 2026-04-21 (last master green run was commit e2c0971f on 2026-04-20). Pre-existing issues are in unrelated files: app_tester/rube_learning/common/pii_mask.py, ..common/untrusted_input.py, ..ingestion_pipeline/tests/test_search_vector_cache_raises.py, ..ingestion_pipeline/tests/test_search_turbopuffer_optimized.py, ..ingestion_pipeline/tests/test_tool_consolidation_skip_on_llm_failure.py, ..ingestion_pipeline/tool_consolidator.py, ..tests/test_untrusted_input.py, plus mypy errors throughout. My changed files (unmatched_cluster_processor.py, the new test file, and the 1-block _load_resume_data edit in usecase_ingestion_service.py) pass black/isort/flake8 locally.
Manual cleanup follow-up (not in this PR): the 1,987 existing dup rows in TurboPuffer were created under the old uuid4 fallback and won't be reaped by this fix on their own — they need a one-time backfill that re-keys each dup row's content to its new deterministic ID. Happy to file a separate PR or one-off script for that on request.
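As a rough idea of what the planning half of that backfill could look like, here is an in-memory sketch. It makes no TurboPuffer calls, and the row shape (`id`, `base_key`, `toolkit_key`, `representative_known_fields`) and both function names are assumptions for illustration only. The real script would have to read rows from the store and use the same ID derivation as the processor.

```python
import hashlib
from collections import defaultdict


def deterministic_id(row):
    """Hypothetical stand-in for the processor's sha256[:16] ID derivation."""
    known = row.get("representative_known_fields")
    known = "" if known is None else str(known).strip()
    if known.lower() in {"", "none", "null"}:
        known = ""  # collapse empty-ish values, as in the fix
    payload = "\x1f".join([row["base_key"], row["toolkit_key"], known])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def plan_backfill(rows):
    """Group old (uuid4-keyed) row IDs under the deterministic ID they map to.

    Any group with more than one old ID is a set of duplicates that would
    collapse into a single row once re-keyed.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[deterministic_id(row)].append(row["id"])
    return dict(groups)
```

The plan only says which old rows map to which new ID. Choosing which duplicate's content to keep and deleting the rest would still need a decision in the follow-up.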