feat(rube/learning): per-user tool-learnings extractor (POC, observe-only)

@zen-agent · checks n/a · …zen/user-learnings-poc-ddcbq9 → master · 26 files · +3440 −102 · updated 1mo ago


Description

Add a per-(project_id, user_id, tool_slug) learnings layer on top of the existing global plans pipeline. Captures three things the global layer misses by construction:

A single user's recurring mistakes that never reach the global cluster threshold.
Custom-tool quirks that have no global corpus to learn from.
Connection-shaped facts (e.g. one customer's specific schema or workspace quirks).

Observe-only POC. The extractor writes to Redis but nothing is surfaced to agents in this PR — the goal is to observe what the system would learn before deciding whether to surface it.

Design (one screen)

Module: app_tester/rube_learning/user_learnings/
One new LLM call per session, after WorkflowAnalysisRunner. Reuses:
- flex_chat_completions_parse() — same gpt-5.2 + flex tier + retry policy as the existing four call sites
- common.pii_mask.mask_pii (Presidio, fail-closed) — same redaction as everywhere else
- common.untrusted_input.UNTRUSTED_INPUT_GUARD + wrap_untrusted + sanitize_for_llm — same prompt-injection defenses
- The "rename to neutral field" anchoring-mitigation trick from the global consolidator (pitfalls → observed_pitfalls); we use prior_observations_for_review
- The verbatim "absence may mean the guidance was followed successfully — do not vote delete on silence" clause from plan_consolidator.py
Operations (LLM-emitted, all evidence-tethered): create / increment / delete_vote. No confirm_applied — that mechanism waits for v2 along with opportunity_count and the surfacing path (see design doc for the rationale).
Counters: failure_repro_count, delete_votes. Soft-retire at delete_votes >= 3.
Storage: Redis hashes (TTL 60d, refreshed on positive ops). Append-only ops-log ring buffer for audit. Never HDEL — soft-retire keeps history for review.
Master gate: USER_LEARNINGS_ENABLED env var. Per-project allowlist: Redis set user_learnings:v1:enabled_projects (flip projects on without redeploy).
Backfill mode: new --user-learnings-only CLI flag — skips workflow + error analysis, runs only the new stage. Bypasses the env flag (the CLI flag itself is the master switch in this mode), still honours the project allowlist.
Analytics script: python -m app_tester.rube_learning.user_learnings.scripts.inspect_user_learnings prints top-level counts, per-tool and per-(project, user) histograms with sample texts, distribution of failure_repro_count and delete_votes, and the most recent ops from the log. Read-only.
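The operations and counters in the design above can be sketched in memory as follows. This is only an illustration under stated assumptions: `Store`, `Learning`, `apply_op`, and the `learning_id` field are hypothetical names, a plain dict stands in for the Redis hash, and rejecting ops against retired entries is inferred from the `delete_retired` test name rather than confirmed by the PR.

```python
from dataclasses import dataclass, field

RETIRE_AT_DELETE_VOTES = 3  # "Soft-retire at delete_votes >= 3" from the design


@dataclass
class Learning:
    text: str
    failure_repro_count: int = 1
    delete_votes: int = 0
    retired: bool = False


@dataclass
class Store:
    # Hypothetical in-memory stand-in for the Redis hash, keyed by learning id.
    learnings: dict[str, Learning] = field(default_factory=dict)
    # Append-only audit log of every op that was applied.
    ops_log: list[dict] = field(default_factory=list)


def apply_op(store: Store, op: dict) -> bool:
    """Apply one op; return True if state changed, False if it was rejected."""
    kind = op.get("op")
    lid = op.get("learning_id")
    if kind == "create":
        if not lid or lid in store.learnings:
            return False
        store.learnings[lid] = Learning(text=op.get("text", ""))
    elif kind in ("increment", "delete_vote"):
        learning = store.learnings.get(lid)
        # Assumption: unknown ids and already-retired entries are rejected.
        if learning is None or learning.retired:
            return False
        if kind == "increment":
            learning.failure_repro_count += 1
        else:
            learning.delete_votes += 1
            # Soft-retire: flag the entry but keep it, so history survives.
            if learning.delete_votes >= RETIRE_AT_DELETE_VOTES:
                learning.retired = True
    else:
        return False
    store.ops_log.append(dict(op))
    return True
```

Note that retirement only flips a flag; the entry stays in the store, which mirrors the "Never HDEL" rule.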

Self-reinforcement defenses (the integrity rules)

Failure mode	Defense
One-off transient becomes permanent rule	`create` rejects HTTP >=500 / 429 / network / timeout; evidence required
Idiosyncratic becomes universal	Strict `(project, user, tool_slug)` scoping; never promotes outside the user/project
LLM anchored on existing learnings	Renamed-field trick + explicit no-op default + "absence != deletion" clause
Hallucinated evidence	Applier silently drops ops whose `evidence_call_id` is not in the session — every op type, not just `create`
Vague text creep	Specificity rule: text must mention a specific param, error string, or tool name
Stored text as injection vector	Sanitize on write (length cap + control-char strip); on any future surfacing, wrap as untrusted
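Two of the defenses in the table, the evidence check and sanitize-on-write, might look roughly like the sketch below. The function names and the 500-character cap are assumptions for illustration; the PR does not state the actual cap or helper names.

```python
import re

MAX_TEXT_LEN = 500  # hypothetical cap; the real value is not given in the PR

# C0 control characters plus DEL and the C1 range.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def sanitize_on_write(text: str, max_len: int = MAX_TEXT_LEN) -> str:
    """Strip control characters, collapse whitespace, then cap the length."""
    cleaned = _CONTROL_CHARS.sub(" ", text)
    cleaned = " ".join(cleaned.split())
    return cleaned[:max_len]


def drop_untethered_ops(ops: list[dict], session_call_ids: set[str]) -> list[dict]:
    """Keep only ops whose evidence_call_id refers to a call in this session.

    Applies to every op type, so a missing or invented id drops the op.
    """
    return [op for op in ops if op.get("evidence_call_id") in session_call_ids]
```

Because the filter does not look at the op type, `increment` and `delete_vote` get the same treatment as `create`.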

How did I test this PR

19 unit tests in app_tester/rube_learning/user_learnings/tests/test_applier.py covering every applier rejection path:
- no_evidence, create_transient (parametrized over all 5 transient HTTP codes), create_transient (parametrized over timeout and network error classes), create_vague, write-time text sanitization, increment refresh + count, delete_vote at threshold (retires), delete_vote below threshold (active), increment_unknown_id, increment_no_id, delete_retired, mixed-batch counter aggregation.
- All 19 pass: 19 passed in 25.57s.
Lint clean: ruff check on the entire module + touched files passes; ruff format --check passes.
Compile clean: python -m py_compile on all 12 new + 2 modified files.
Existing test suite: the 8 failing tests in app_tester/rube_learning/tests/ are pre-existing on master (verified via git stash + git checkout master + same test) — unrelated to this PR.
Runtime: the extractor is gated by is_master_enabled() and, additionally, by is_project_enabled(); with USER_LEARNINGS_ENABLED=false (the default) the new stage is a zero-cost no-op. With it on but no project in the allowlist, every session short-circuits in the project check before any LLM call. Both paths are checked in unit tests.
Service safety: every Redis call is wrapped in try/except (logged + swallowed); every extractor exception in pipeline.py is caught at the wrapper. The upstream pipeline cannot fail because of this PR.
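The gating order and the catch-at-the-wrapper behaviour described under Runtime and Service safety can be sketched as below, for the regular (non-backfill) path only. `run_stage_safely` and its parameters are hypothetical; a plain set stands in for the Redis allowlist and a boolean for the env flag.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def run_stage_safely(
    project_id: str,
    master_enabled: bool,
    enabled_projects: set[str],
    extractor: Callable[[str], None],
) -> str:
    """Run the extractor only if both gates pass; never raise to the caller."""
    # Gate 1: master flag. Checked first so a disabled deployment does no work.
    if not master_enabled:
        return "skipped_master_flag"
    # Gate 2: per-project allowlist, checked before any LLM call is attempted.
    if project_id not in enabled_projects:
        return "skipped_project"
    try:
        extractor(project_id)
    except Exception:
        # Logged and swallowed so the upstream pipeline is unaffected.
        logger.exception("user-learnings extractor failed for %s", project_id)
        return "error_swallowed"
    return "ran"
```

Returning a status string is a choice made for this sketch so the paths are easy to assert on; the PR's own helper may return nothing at all.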

Follow-ups (not in this PR)

Surfacing path in Apollo's COMPOSIO_SEARCH_TOOLS + opportunity_count denominator + ratio-based retirement.
Tool-version invalidation.
Challenge probes (suppress 5–10% to A/B test lessons).
Promotion path to org / global learnings.
edit / merge / unretire ops.
Deterministic repair-pair detector as a second extraction source.

@zen-agent · 1mo ago

Post-PR status

Reviews + fixes shipped

Codex review loop — 4 iterations, 13 issues found, 13 fixed:

Iteration	Issues	Severity	Status
1	3	3 P2	All fixed (commit `30822af`)
2	3	1 P1 + 2 P2	All fixed (commit `b7bb454`)
3	3	1 P1 + 2 P2	All fixed (commit `299b11b`)
4	4	2 P1 + 2 P2	All fixed (commit `2d0d5cf`)

Highlights:

Tool/evidence-call coherence check across all op types (no more cross-tool corruption)
Evidence-outcome rules (create/increment need a failure, delete_vote needs a success, transient HTTP/timeout/network rejected)
COMPOSIO_MULTI_EXECUTE_TOOL payload expansion into per-tool synthetic events with unique {log_id}__{tool_slug}__{index} ids — handles params.tools envelope, nested result.response.{successful,error} outcome, repeated tools in one call
UserLearningsStoreError raises on Redis failures so backfill mode reports success=False instead of silently dropping batches
Backfill mode (--user-learnings-only) bypasses MIN_CALLS / meta-tool-cap / no-exec / sampling filters
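As a rough illustration of the multi-execute expansion highlighted above, the sketch below fans one envelope out into per-tool synthetic events with `{log_id}__{tool_slug}__{index}` ids. The exact payload layout is an assumption: the PR names `params.tools` and `result.response.{successful,error}`, but the `result.results` list used here to pair each tool with its outcome, and the function name itself, are hypothetical.

```python
def expand_multi_execute(event: dict) -> list[dict]:
    """Expand one multi-execute event into one synthetic event per tool."""
    log_id = event["log_id"]
    tools = (event.get("params") or {}).get("tools") or []
    # ASSUMPTION: per-tool outcomes arrive as a list parallel to params.tools.
    results = (event.get("result") or {}).get("results") or []

    synthetic = []
    for index, tool in enumerate(tools):
        slug = tool.get("tool_slug", "UNKNOWN_TOOL")
        entry = results[index] if index < len(results) else {}
        response = (entry or {}).get("response") or {}
        synthetic.append(
            {
                # The index suffix keeps ids unique when a tool repeats.
                "call_id": f"{log_id}__{slug}__{index}",
                "tool_slug": slug,
                "successful": response.get("successful"),
                "error": response.get("error"),
            }
        )
    return synthetic
```

A tool with no matching outcome gets `successful=None`, which leaves it to downstream rules to decide whether the event can serve as evidence.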

Cursor Bugbot — 5 inline comments, 4 fixed (1 was outdated by codex iter3):

Retire op-log entries now carry evidence_call_id + reason
Synthetic call_id includes index suffix (no collisions on repeated tools)
_print_json honors --top-tools / --top-users / --samples-per-tool
_classify_event_error skips body string-matching when status_code is non-transient (no false-positive 400 → "transient")


#	Severity	Finding	Fix
1	Medium	`_classify_event_error` could still tag a 4xx with body like `"Invalid connection_id"` as transient, because body-substring matching ran whenever `status_code` was None	Rewrote as 3 explicit rules: non-transient HTTP code wins immediately; transient HTTP code returns directly; body-matching only when status_code is None and on strict phrases (`"timed out"`, `"deadline exceeded"`, `"connection refused"`, `"connection reset"`, `"network is unreachable"`, `"reset by peer"`). Bare "connection"/"network"/"timeout" no longer match.
2	Low	`_run_user_learnings_stage` returned `False` on `redis_client is None` regardless of `force` — but in non-force mode this is a deliberate config skip, not a retryable failure	Returns `not force` so backfill correctly surfaces the failure and normal mode reports a clean skip.
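The three classification rules from finding 1 in the table above can be written out as a small function. This is a sketch, not the PR's `_classify_event_error`: the name `is_transient` is hypothetical, and the transient-code rule (429 or >= 500) is taken from the defenses table rather than from the actual list of codes.

```python
from typing import Optional

# Strict phrases quoted in the fix; bare "connection", "network", and
# "timeout" are intentionally absent.
STRICT_TRANSIENT_PHRASES = (
    "timed out",
    "deadline exceeded",
    "connection refused",
    "connection reset",
    "network is unreachable",
    "reset by peer",
)


def is_transient(status_code: Optional[int], body: str = "") -> bool:
    """Classify an error as transient using three ordered rules."""
    if status_code is not None:
        transient_code = status_code == 429 or status_code >= 500
        if not transient_code:
            # Rule 1: a non-transient HTTP code wins immediately.
            return False
        # Rule 2: a transient HTTP code returns directly.
        return True
    # Rule 3: body matching only when there is no status code at all.
    lowered = body.lower()
    return any(phrase in lowered for phrase in STRICT_TRANSIENT_PHRASES)
```

With this ordering, a 400 whose body happens to mention a timeout stays non-transient, and a code-less body such as "Invalid connection_id" no longer matches.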

@zen-agent · 1mo ago

Simplification pass + prod no-op tests

Per review feedback that the surface area was too big for an observe-only POC, ~1,100 lines were removed (+818 / −1,786) without losing the integrity properties Codex/Bugbot validated. Commit aefd2bafc.

What got cut

Dropped UserLearningsStoreError and its raise/swallow contract. The master flag means prod is a no-op, so distinguishing "deliberate skip" from "transient Redis outage" was not worth the extra branching for the POC. All Redis errors now log + swallow at the store layer; the extractor doesn't try to translate them. Stale test_redis_store_errors.py removed.
Collapsed the ~20 rejection counters in the applier into broader buckets (rejected_create / rejected_increment / rejected_evidence_tool_mismatch / etc.). Tests now assert on outcomes (created? incremented? retired?) rather than specific counter names, so they survive future bucket churn.
inspect_user_learnings.py slimmed from 373 → 150 lines: dropped JSON output mode and most CLI flags. POC analytics doesn't need that surface yet.
Dropped fallback paths in the multi-execute parsing that weren't carrying their own weight (duplicate slug/index fan-out loops, the _extract_per_tool_status_code candidate cascade).
Pipeline hook simplified back to a single helper that lazy-imports the user_learnings package, runs the extractor, and swallows exceptions. No bool return, no error-message round-trip.

Bugbot Low addressed (`0bc438cf` follow-up)

_extract_per_tool_status_code previously accepted any int as a status code, including "code": 0 ("no error code"). That made _classify treat it as a definitive non-transient HTTP status, bypassing the body-substring fallback. Now only ints in 100–599 count as HTTP status codes.
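The range check described above could look like this minimal sketch. `as_http_status` is a hypothetical helper, and excluding `bool` (which is an `int` subclass in Python) is an extra precaution added here, not something the PR mentions.

```python
from typing import Any, Optional


def as_http_status(value: Any) -> Optional[int]:
    """Return value if it is a plausible HTTP status code, else None."""
    # bool is a subclass of int, so rule it out explicitly (assumption).
    if isinstance(value, bool) or not isinstance(value, int):
        return None
    # Only 100-599 counts; "code": 0 ("no error code") falls through to None.
    return value if 100 <= value <= 599 else None
```

Returning `None` for out-of-range values lets the caller fall back to body-substring matching instead of treating the value as a definitive status.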

Truncation comparison (response to "are we sending session data to LLM, what truncation?")

Field	This PR	Existing pipeline (workflow + error analysis)
Per-call `params`	5,000 chars	`SESSION_LOG_REQUEST=5000` / `WORKFLOW_LOG_REQUEST=8000`
Per-call `error_body`	5,000 chars	`SESSION_LOG_ERROR=5000` / `WORKFLOW_LOG_ERROR=12000`
Strings inside JSON	`max_str_len=400 / risky=200` via `sanitize_for_llm`	Workflow uses `2000`; session_log uses `200`
`prior_observations_for_review`	8 per tool (cap)	n/a

Same mask_pii + wrap_untrusted + sanitize_for_llm chain as the rest of rube_learning.

Prod no-op tests added (response to "in prod path is fully no-op right?")

New test_pipeline_integration.py, 6 cases, each verifying that with USER_LEARNINGS_ENABLED unset:

No LLM call is made
No Redis traffic is generated
The extractor module's expensive imports (Presidio, OpenAI, etc.) aren't paid
Force mode (--user-learnings-only) bypasses the env gate
Exceptions in the extractor stay caught at the boundary

test_stage_no_op_for_falsy_env_flag_values parametrizes over ["false", "0", "no", "", "off"]; test_stage_runs_extractor_when_env_flag_truthy parametrizes over ["true", "1", "yes"].
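One way to get the behaviour those two parametrized tests describe is an allowlist of truthy strings, sketched below. The signature, the injectable `env` mapping, and the case/whitespace normalisation are assumptions; only the function name and the listed values come from this PR, and how the real code treats a value such as "on" is not stated.

```python
import os
from typing import Mapping, Optional

# Truthy values taken from the parametrized test above.
TRUTHY = frozenset({"true", "1", "yes"})


def is_master_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when USER_LEARNINGS_ENABLED holds a known truthy value."""
    env = os.environ if env is None else env
    raw = env.get("USER_LEARNINGS_ENABLED", "")
    # Anything not explicitly truthy, including unset or empty, means disabled.
    return raw.strip().lower() in TRUTHY
```

Defaulting to disabled for unrecognised values keeps the production path a no-op unless the flag is set deliberately.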

CI

Check	Status
`Lint - Integrator`	✅ success
`Test - Learning Pipeline`	✅ success

43 unit tests all pass.

Field	Before	After (from `rube_learning.config`)
`params` per call	5,000	`WORKFLOW_LOG_REQUEST_MAX_LENGTH` = 8,000
`error_body` per call	5,000	`WORKFLOW_LOG_ERROR_MAX_LENGTH` = 12,000
`sanitize_for_llm` per-string	400 / 200 (risky)	2,000 / 2,000 (matching `workflow_analysis/runner.py`)

Mode	env flag	allowlist
Regular pipeline	required	required
`--user-learnings-only` (backfill)	bypassed	bypassed

Tests

CI

Runtime safety

Bugbot follow-up + CI status

Two more Bugbot findings on `bb222f7d` — fixed in `661fa91`

CI

Bugbot follow-up #2 — fixed in `7de8a18`

CI green ✅

Simplification pass + prod no-op tests

What got cut

Bugbot Low addressed (`0bc438cf` follow-up)

Truncation comparison (response to "are we sending session data to LLM, what truncation?")

Prod no-op tests added (response to "in prod path is fully no-op right?")

CI

Truncation alignment + fresh codex review

Truncation now consistent with workflow_analysis (commit `f8813294`)

Plus fixed Bugbot Medium on workflow-failure path

Re-ran codex review on the latest commit (`f8813294`); 3 new findings, all fixed in `570454ea`

CI

One Bugbot Low addressed (commit `6f5894a4`)

Follow-up commit `31fc1f6b` — backfill mode bypasses per-project allowlist

#	Severity	Finding	Fix
1	P1	Backfill mode marked sessions `success=True` even when extraction skipped (no Redis, extractor crashed)	`_run_user_learnings_stage` now returns a bool; backfill mode surfaces it as `PipelineResult.success`. Normal mode discards the return so prod path is unchanged.
2	P2	Existing learnings truncated to 8 in arbitrary Redis HGETALL order — same session could see different prior observations across runs	Sort by `ts_last_confirmed` desc then `failure_repro_count` desc before slicing
3	P2	`AsyncOpenAI` client allocated per session and never closed — leaks httpx pools in long-running polling loops	Wrap in `async with` so the temporary client is closed at the end of the request. Callers can still pass a shared client via `openai_client=` to amortise.
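The ordering fix from finding 2 in the table above can be sketched as a sort followed by a slice. The function name and the dict-shaped records are hypothetical, and `ts_last_confirmed` is assumed to be numeric (for example, epoch seconds).

```python
def select_prior_observations(learnings: list[dict], cap: int = 8) -> list[dict]:
    """Pick the prior observations to show, independent of Redis field order."""
    ordered = sorted(
        learnings,
        key=lambda item: (
            item.get("ts_last_confirmed", 0),
            item.get("failure_repro_count", 0),
        ),
        reverse=True,  # newest first, then most-reproduced first
    )
    return ordered[:cap]
```

Entries that tie on both keys still come out in input order, so a final tiebreak on a stable id would be needed for full determinism.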

Check	Status
`Lint - Integrator`	✅ success
`Test - Learning Pipeline`	✅ success
