fix(rube/ingestion): persist deterministic plan_id on unmatched clusters

@zen-agent · checks n/a · …zen/fix-unmatched-cluster-deterministic-id-drr599 → master · 4 files · +512 −4 · updated 1mo ago

Description

unmatched_cluster_processor.process_unmatched_cluster set best_plan_copy.cached_plan_id = None immediately before returning, which forced build_turbopuffer_rows (at usecase_ingestion_service.py:1696-1700) down the uuid.uuid4().hex fallback. As a result, the same logical cluster re-ingested across runs minted a fresh TurboPuffer row every time.
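
A minimal sketch of why the cleared id produces duplicates, assuming the id selection reduces to "use cached_plan_id if set, otherwise mint a UUID". The name resolve_row_id is hypothetical and only stands in for the logic inside build_turbopuffer_rows; the real code may be structured differently.

```python
import uuid
from typing import Optional


def resolve_row_id(cached_plan_id: Optional[str]) -> str:
    """Hypothetical stand-in for the id selection in build_turbopuffer_rows."""
    # With no cached id, every call mints a fresh random 32-char hex id.
    return cached_plan_id or uuid.uuid4().hex


# Two ingestion runs of the same cluster, both with cached_plan_id cleared:
first_run = resolve_row_id(None)
second_run = resolve_row_id(None)

# With a persisted id, both runs would target the same row instead:
stable = resolve_row_id("0123456789abcdef")
```

Because the two fallback ids differ, an upsert keyed on them can never land on the earlier row.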

Production trace (driving evidence): in usecase_plans_prod_new_v1 today there are 29 byte-identical rows for

usecase: search web for stock market news and analysis for a specific date. known: date. apps: composio_search

Creation timestamps spread across 5+ months (2025-11-26 → 2026-05-08), with a median gap of ~3-4 days between duplicates. None of the 29 ids match generate_plan_id(usecase); they are all 32-char UUIDs. The same pattern shows up for the read-google-doc cluster (50 rows / 3 wording variants → 1 intent) and many others. Corpus-wide: 15,364 unique usecases / 16,641 rows → ~1,810 wasted index slots.

generate_plan_id itself is deterministic (sha256(usecase.lower().strip())[:16]); the duplicates accumulate because the deterministic id was computed but used only in the LLM prompt and a log line, and was never persisted. The recent c1ccc2453 fix closed the transient-search-error duplicate path (success=False → skip); this PR closes the legitimate-no-match path that runs on every successful "no acceptable match" result.
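
For reference, a sketch of the hashing described above. This is an illustrative re-implementation, not the repository's generate_plan_id: it assumes the 16 characters are taken from the hex digest and that the specific variant appends a "_specific" suffix.

```python
import hashlib


def generate_plan_id_sketch(usecase: str, is_specific: bool = False) -> str:
    """Illustrative only; the real generate_plan_id may differ in detail."""
    normalized = usecase.lower().strip()
    # Assumption: the id is the first 16 hex characters of the SHA-256 digest.
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    # Assumption: the specific variant is marked with a "_specific" suffix.
    return f"{digest}_specific" if is_specific else digest


id_a = generate_plan_id_sketch("Search web for stock market news")
id_b = generate_plan_id_sketch("  search web for stock market news ")
```

Case and surrounding whitespace are normalized away, so id_a and id_b are the same 16-char id on every run.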

Fix (surgical)

After consolidation finishes (so we know the post-consolidation toolkits), compute a deterministic synthetic id from the three pieces that together identify the row:

is_specific_plan = bool(
    (best_plan_search_id := getattr(best_plan, "search_id", None))
    and best_plan_search_id.endswith("_specific")
)
synthetic_id = generate_plan_id(
    f"{cluster['representative_usecase']}::{toolkit_str}",
    is_specific=is_specific_plan,
)
# ...
best_plan_copy.cached_plan_id = synthetic_id   # was None

Why those three components:

- representative_usecase: same intent text → same id base, so an identical cluster re-ingested on a later run upserts onto the prior row.
- post-consolidation toolkit_str: guards the case where the same usecase text genuinely maps to a different final toolkit scope on a later run (e.g. composio_search vs composio_search,exa). Mirrors the matched path's split-on-toolkit-mismatch behaviour at cached_plan_processor.py:387-396. Without this, a single deterministic id would let a divergent cluster overwrite an existing row of different shape.
- is_specific: the _specific variant (best_plan.search_id ends with _specific) keeps its _specific suffix from generate_plan_id(..., is_specific=True) and never collides with the base variant.
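
The effect of the three components can be sketched as follows. synthetic_id_sketch is a hypothetical helper that inlines an assumed version of generate_plan_id (normalized SHA-256, first 16 hex characters, "_specific" suffix); only the "usecase::toolkit_str" key format is taken from the diff above.

```python
import hashlib


def synthetic_id_sketch(representative_usecase: str, toolkit_str: str, is_specific: bool) -> str:
    """Hypothetical helper mirroring the key format used in the fix."""
    key = f"{representative_usecase}::{toolkit_str}"
    # Assumed normalization and truncation, matching the description of generate_plan_id.
    digest = hashlib.sha256(key.lower().strip().encode("utf-8")).hexdigest()[:16]
    return f"{digest}_specific" if is_specific else digest


usecase = "search web for stock market news"
base = synthetic_id_sketch(usecase, "composio_search", False)
rerun = synthetic_id_sketch(usecase, "composio_search", False)   # same cluster, later run
wider = synthetic_id_sketch(usecase, "composio_search,exa", False)  # different toolkit scope
specific = synthetic_id_sketch(usecase, "composio_search", True)    # _specific variant
```

Here base and rerun are equal, so the later run upserts onto the same row, while wider and specific each get their own id.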

Existing paths that intentionally clear cached_plan_id (most notably usecase_ingestion_service.py:2273, which reroutes plans whose cached id no longer exists in TurboPuffer back to clustering) are untouched. The matched-cluster path's behaviour at cached_plan_processor.py:387-396 (uuid4 when stored toolkits differ from post-consolidation toolkits) is preserved — it operates on a different code path and isn't reached here.

The fix is 2 line-edits + a comment in app_tester/rube_learning/ingestion_pipeline/unmatched_cluster_processor.py.

How did I test this PR

New focused test file app_tester/rube_learning/ingestion_pipeline/tests/test_unmatched_cluster_deterministic_id.py with 4 cases, all green:
- test_identical_cluster_across_runs_produces_same_cached_plan_id — the direct regression for the 29-plan dup cluster.
- test_specific_plan_gets_specific_suffix — exercises the _specific edge case explicitly.
- test_same_usecase_different_consolidated_toolkits_separates_ids — verifies that genuinely-different toolkit scopes stay on separate ids (the safety the user flagged).
- test_cached_plan_id_is_persisted_not_cleared — bare regression for the line-289 bug.
Verified all 4 tests fail on master (stash + re-run) and pass on the branch — confirms they actually exercise the fix.
Ran the whole app_tester/rube_learning/ingestion_pipeline/tests/ directory: 61 pass; the 5 pre-existing failures (test_cached_plan_*, test_plan_consolidation, test_plan_extraction, test_tool_consolidation) also fail on plain master — they require live ClickHouse / OpenAI env vars and are unrelated to this change.
ruff check clean on both files, ruff format --check clean.

Follow-up (out of scope here)

After this lands, the ~1,810 already-duplicated rows still exist in TurboPuffer. A small one-time migration job could re-key the prior 32-char UUIDs to their deterministic 16-char hash and consolidate the cluster contents — keeping the same set of upserts the consolidator would do organically, just doing them up front. Happy to send that as a separate PR once this contract change is in.
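
A rough sketch of the planning half of such a migration, working on in-memory rows only. The row shape (id, usecase, toolkits keys), the helper names, and the assumption that the target id is the hash of "usecase::toolkits" are all hypothetical; no TurboPuffer calls are shown, and the _specific variant is ignored for brevity.

```python
import hashlib
from collections import defaultdict


def deterministic_id(usecase: str, toolkit_str: str) -> str:
    """Assumed target id: first 16 hex chars of SHA-256 over the normalized key."""
    key = f"{usecase}::{toolkit_str}".lower().strip()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def plan_rekey(rows: list[dict]) -> dict[str, list[str]]:
    """Map each deterministic id to the legacy row ids that would fold into it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        groups[deterministic_id(row["usecase"], row["toolkits"])].append(row["id"])
    return dict(groups)


# Made-up rows standing in for legacy 32-char UUID entries.
rows = [
    {"id": "a" * 32, "usecase": "read google doc", "toolkits": "googledocs"},
    {"id": "b" * 32, "usecase": "read google doc", "toolkits": "googledocs"},
    {"id": "c" * 32, "usecase": "read google doc", "toolkits": "googledocs,exa"},
]
plan = plan_rekey(rows)
```

Any target id with more than one legacy id is a duplicate group to consolidate; single-entry groups only need re-keying.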
The matched-path filter at apollo/.../planSearch/vector_search.ts:50 uses Eq on the comma-joined toolkit string, which is the upstream reason the matcher returns "no match" for clusters whose pre-consolidation toolkit list differs from the stored row's post-consolidation one. Switching that to a subset/contains filter on toolkits_array would let the matched path catch most of these clusters before they ever reach the unmatched path. Will send as a separate Apollo PR once we agree on the filter semantics.
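
To illustrate the difference between the two filter styles, here is a Python sketch (the actual filter lives in TypeScript). Both function names are hypothetical, and the subset direction shown, stored toolkits contained in the query's toolkits, is only one possible semantics; the direction is exactly what still needs agreement.

```python
def eq_match(query_toolkits: list[str], stored_toolkit_str: str) -> bool:
    """Equality on the comma-joined string, as the current filter is described."""
    return ",".join(query_toolkits) == stored_toolkit_str


def subset_match(query_toolkits: list[str], stored_toolkits_array: list[str]) -> bool:
    """One possible subset semantics: every stored toolkit appears in the query."""
    return set(stored_toolkits_array) <= set(query_toolkits)


# Pre-consolidation query lists two toolkits; the stored row kept only one.
query = ["composio_search", "exa"]
strict = eq_match(query, "composio_search")
relaxed = subset_match(query, ["composio_search"])
```

Under string equality the stored row is missed (strict is False), while the subset check would match it (relaxed is True).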

Triggered by: dhawal@composio.dev | Source: cron-fe3a7d9c5460 Session: https://zen-api-production-4c98.up.railway.app/dashboard/#/chat/zen-cron-3f646de023f5 Origin: cron-fe3a7d9c5460 / zen-cron-3f646de023f5

@zen-agent · 1mo ago

CI lint status

Pushed 1e192c8c2 fixing one real mypy --check-untyped-defs finding in my new test file (test_unmatched_cluster_deterministic_id.py:381 — MagicMock .return_value attribute access requires # type: ignore[attr-defined]).

The remaining run-lint failures on this PR are pre-existing on master: the Lint - Integrator workflow has been red on master since at least 2026-04-25 (51ef2967 and 1cb5e073). The failing files / errors don't intersect with this PR's diff:

pii_mask.py:195 — Presidio version mismatch
usecase_ingestion_service.py:30 — types-redis not installed
untrusted_input.py, test_search_turbopuffer_optimized.py, test_search_vector_cache_raises.py, test_tool_consolidation_skip_on_llm_failure.py, tool_consolidator.py, test_untrusted_input.py — all flagged by black --check for formatting drift that exists on master

Verified locally on the 4 files this PR touches:

flake8  → clean
black   → 4 files would be left unchanged
isort   → clean
mypy    → all errors are in transitively imported pre-existing files

So the lint job will still be red on this PR after the push (because of the pre-existing failures), but no new lint failures are introduced by these changes.
