Description
unmatched_cluster_processor.process_unmatched_cluster set
best_plan_copy.cached_plan_id = None immediately before returning, which forced
build_turbopuffer_rows (at usecase_ingestion_service.py:1696-1700) down the
uuid.uuid4().hex fallback. The same logical cluster re-ingested across runs
therefore minted a fresh TurboPuffer (TPuf) row every time.
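The failure mode can be sketched in a few lines. The helper name `resolve_row_id` is hypothetical and only stands in for the id fallback inside build_turbopuffer_rows; the one detail taken from the description above is the `uuid.uuid4().hex` fallback when no cached id is present.

```python
import uuid
from typing import Optional


def resolve_row_id(cached_plan_id: Optional[str]) -> str:
    """Hypothetical stand-in for the id fallback in build_turbopuffer_rows."""
    # With cached_plan_id cleared to None, every call takes the random branch.
    return cached_plan_id if cached_plan_id else uuid.uuid4().hex


# Two ingestion runs of the same logical cluster, both with the id cleared.
first_run_id = resolve_row_id(None)
second_run_id = resolve_row_id(None)
```

Because the two ids are random and unrelated, an upsert keyed on them can never land on the earlier row, which is consistent with the 32-char UUID ids seen in the production trace.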
Production trace (driving evidence): in usecase_plans_prod_new_v1 today
there are 29 byte-identical rows for
usecase: search web for stock market news and analysis for a specific date. known: date. apps: composio_search
Creation timestamps spread across 5+ months (2025-11-26 → 2026-05-08),
median gap ~3-4 days between dups. None of the 29 ids match
generate_plan_id(usecase) — they're all 32-char UUIDs. Same pattern shows up
for the read-google-doc cluster (50 rows / 3 wording variants → 1 intent) and
many others. Corpus-wide: 15,364 unique usecases / 16,641 rows → ~1,810
wasted index slots.
generate_plan_id itself is deterministic (sha256(usecase.lower().strip())[:16]);
the duplicates accumulate because the deterministic id was computed but only
used in the LLM prompt and log line, never persisted. The recent c1ccc2453 fix
closed the transient-search-error dup path (success=False → skip); this PR
closes the legitimate-no-match path that runs on every successful "no
acceptable match" result.
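For reference, a minimal sketch of what a deterministic id of this shape looks like. This is not the repository's implementation: the UTF-8 encoding, the hex truncation, and the literal `_specific` suffix are assumptions inferred from the shorthand `sha256(usecase.lower().strip())[:16]` and from the `is_specific` keyword used in the fix.

```python
import hashlib


def generate_plan_id(usecase: str, is_specific: bool = False) -> str:
    """Sketch only: the real helper may differ in encoding and suffix handling."""
    normalized = usecase.lower().strip()
    # Assumption: the hash is taken over UTF-8 bytes and truncated as hex.
    base = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    # Assumption: the specific variant appends a literal "_specific" suffix.
    return f"{base}_specific" if is_specific else base


plan_id = generate_plan_id("Search web for stock market news ")
same_plan_id = generate_plan_id("search web for stock market news")
```

Under these assumptions, case and surrounding whitespace do not change the id, so the same usecase text always maps to the same 16-char key.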
Fix (surgical)
After consolidation finishes (so we know the post-consolidation
toolkits), compute a deterministic synthetic id from the three pieces that
together identify the row:
is_specific_plan = bool(
    (best_plan_search_id := getattr(best_plan, "search_id", None))
    and best_plan_search_id.endswith("_specific")
)
synthetic_id = generate_plan_id(
    f"{cluster['representative_usecase']}::{toolkit_str}",
    is_specific=is_specific_plan,
)
# ...
best_plan_copy.cached_plan_id = synthetic_id  # was None
Why those three components:
- representative_usecase: same intent text → same id base, so an
  identical cluster re-ingested on a later run upserts onto the prior row.
- post-consolidation toolkit_str: guards the case where the same
  usecase text genuinely maps to a different final toolkit scope on a later
  run (e.g. composio_search vs composio_search,exa). Mirrors the
  matched path's split-on-toolkit-mismatch behaviour at
  cached_plan_processor.py:387-396. Without this, a single deterministic
  id would let a divergent cluster overwrite an existing row of different
  shape.
- is_specific: the _specific variant (best_plan.search_id ends with
  _specific) keeps its _specific suffix from
  generate_plan_id(..., is_specific=True) and never collides with the base
  variant.
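The three properties above can be checked with a small self-contained sketch. `synthetic_cluster_id` is a hypothetical wrapper, the `generate_plan_id` body is a stand-in rather than the real helper, and building toolkit_str as a sorted comma-joined list is an assumption; only the `usecase::toolkit_str` key shape and the `is_specific` rule come from the fix itself.

```python
import hashlib
from typing import Optional, Sequence


def generate_plan_id(usecase: str, is_specific: bool = False) -> str:
    # Stand-in for the real helper; encoding and suffix format are assumptions.
    base = hashlib.sha256(usecase.lower().strip().encode("utf-8")).hexdigest()[:16]
    return f"{base}_specific" if is_specific else base


def synthetic_cluster_id(
    representative_usecase: str,
    toolkits: Sequence[str],
    search_id: Optional[str],
) -> str:
    # Assumption: toolkit_str is the sorted, comma-joined toolkit list.
    toolkit_str = ",".join(sorted(toolkits))
    is_specific = bool(search_id and search_id.endswith("_specific"))
    return generate_plan_id(
        f"{representative_usecase}::{toolkit_str}", is_specific=is_specific
    )


usecase = "search web for stock market news"
same_a = synthetic_cluster_id(usecase, ["composio_search"], None)
same_b = synthetic_cluster_id(usecase, ["composio_search"], None)
wider = synthetic_cluster_id(usecase, ["composio_search", "exa"], None)
specific = synthetic_cluster_id(usecase, ["composio_search"], "abc_specific")
```

Here `same_a` and `same_b` are equal (a re-ingest upserts), `wider` differs (a different toolkit scope gets its own row), and `specific` carries the suffix so it cannot collide with the base variant.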
Existing paths that intentionally clear cached_plan_id (most notably
usecase_ingestion_service.py:2273, which reroutes plans whose cached id no
longer exists in TurboPuffer back to clustering) are untouched. The
matched-cluster path's behaviour at
cached_plan_processor.py:387-396 (uuid4 when stored toolkits differ from
post-consolidation toolkits) is preserved — it operates on a different code
path and isn't reached here.
The fix is 2 line-edits + a comment in
app_tester/rube_learning/ingestion_pipeline/unmatched_cluster_processor.py.
How did I test this PR
- New focused test file
  app_tester/rube_learning/ingestion_pipeline/tests/test_unmatched_cluster_deterministic_id.py
  with 4 cases, all green:
  - test_identical_cluster_across_runs_produces_same_cached_plan_id:
    the direct regression for the 29-plan dup cluster.
  - test_specific_plan_gets_specific_suffix: exercises the _specific
    edge case explicitly.
  - test_same_usecase_different_consolidated_toolkits_separates_ids:
    verifies that genuinely-different toolkit scopes stay on separate ids
    (the safety the user flagged).
  - test_cached_plan_id_is_persisted_not_cleared: bare regression for the
    line-289 bug.
- Verified all 4 tests fail on master (stash + re-run) and pass on the
branch — confirms they actually exercise the fix.
- Ran the whole
app_tester/rube_learning/ingestion_pipeline/tests/
directory: 61 pass; the 5 pre-existing failures (test_cached_plan_*,
test_plan_consolidation, test_plan_extraction, test_tool_consolidation)
also fail on plain master — they require live ClickHouse / OpenAI env
vars and are unrelated to this change.
- ruff check clean on both files; ruff format --check clean.
Follow-up (out of scope here)
- After this lands, the ~1,810 already-duplicated rows still exist in
TurboPuffer. A small one-time migration job could re-key the prior 32-char
UUIDs to their deterministic 16-char hash and consolidate the cluster
contents — keeping the same set of upserts the consolidator would do
organically, just doing them up front. Happy to send that as a separate
PR once this contract change is in.
- The matched-path filter at
apollo/.../planSearch/vector_search.ts:50
uses Eq on the comma-joined toolkit string, which is the upstream reason
the matcher returns "no match" for clusters whose pre-consolidation
toolkit list differs from the stored row's post-consolidation one.
Switching that to a subset/contains filter on toolkits_array would let
the matched path catch most of these clusters before they ever reach the
unmatched path. Will send as a separate Apollo PR once we agree on the
filter semantics.
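To make the second follow-up concrete, here is a rough Python illustration of the difference between the two filter styles. The real filter lives in the Apollo TypeScript query, so both function names are hypothetical, and the subset direction shown (stored toolkits contained in the query's toolkits) is only one candidate for the semantics still to be agreed.

```python
from typing import Iterable


def eq_filter(stored_toolkits: str, query_toolkits: Iterable[str]) -> bool:
    # Mirrors an Eq comparison on the comma-joined string.
    # Assumption: both sides join their toolkits in sorted order.
    return stored_toolkits == ",".join(sorted(query_toolkits))


def subset_filter(
    stored_toolkits_array: Iterable[str], query_toolkits: Iterable[str]
) -> bool:
    # Candidate semantics: every stored toolkit also appears in the query.
    return set(stored_toolkits_array) <= set(query_toolkits)


# Stored row was consolidated down to one toolkit; the incoming cluster
# still carries its wider pre-consolidation list.
stored = ["composio_search"]
incoming = ["composio_search", "exa"]

eq_hit = eq_filter(",".join(stored), incoming)
subset_hit = subset_filter(stored, incoming)
```

With exact string equality the stored row is filtered out before similarity is even considered, while the subset check keeps it as a candidate; whether that direction is the right one is part of the open question.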
Triggered by: dhawal@composio.dev | Source: cron-fe3a7d9c5460
Session: https://zen-api-production-4c98.up.railway.app/dashboard/#/chat/zen-cron-3f646de023f5
Origin: cron-fe3a7d9c5460 / zen-cron-3f646de023f5