The MetaBot RCA report was surfacing false positives. A tool would appear as "Completely Broken" even when its known fixable bug was dormant and the high error rate was entirely from unrelated causes (e.g. free-tier rate limits returning HTTP 200).
Root cause: get_broken_tools had a structural mismatch between its two criteria:
- currently_fixable: 14-day lookback, "did this tool ever have a fixable bug?"
- error_rate: 24h lookback from raw execution logs, counts all errors regardless of type

These are computed independently with no link between them. A tool with a real fixable bug from 13 days ago could show 92.6% error rate today if unrelated errors (rate limits, user mistakes) happen to spike, and the report would say it's broken due to the fixable bug when it isn't.
Concrete example confirmed in prod data: ALPHA_VANTAGE_TIME_SERIES_DAILY showed 92.6% error rate (63/68 calls) in the MetaBot report. The fixable bug ("Invalid API call") had 0 occurrences in the last 24h — the 63 errors were entirely free-tier rate limits (HTTP 200 with rate-limit JSON body). The fixable bug was last seen Feb 12, 7 days prior.
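The mismatch can be reproduced with a minimal Python sketch. The record layout, helper names, and the year are hypothetical and are not taken from get_broken_tools; only the 63/68 split, the two lookback windows, and the Feb 12 last-seen date come from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the year is arbitrary; Feb 19 is 7 days after the Feb 12 last-seen date.
NOW = datetime(2025, 2, 19, 12, 0)


@dataclass
class Execution:
    at: datetime
    error: Optional[str]  # None means the call succeeded


def error_rate_24h(executions, now):
    """Share of calls in the last 24h that errored, regardless of error type."""
    recent = [e for e in executions if now - e.at <= timedelta(hours=24)]
    if not recent:
        return 0.0
    return sum(e.error is not None for e in recent) / len(recent)


def had_fixable_bug_14d(last_seen, now):
    """True if a fixable bug was recorded at any point in the last 14 days."""
    return last_seen is not None and now - last_seen <= timedelta(days=14)


def count_errors_24h(executions, now, message):
    """How many calls in the last 24h failed with this specific error."""
    window = timedelta(hours=24)
    return sum(1 for e in executions if e.error == message and now - e.at <= window)


one_hour_ago = NOW - timedelta(hours=1)
# 63 rate-limit errors and 5 successes, all inside the 24h window.
executions = [Execution(one_hour_ago, "rate limit")] * 63 + [Execution(one_hour_ago, None)] * 5
fixable_last_seen = datetime(2025, 2, 12, 12, 0)

rate = error_rate_24h(executions, NOW)
fixable_flag = had_fixable_bug_14d(fixable_last_seen, NOW)
fixable_hits = count_errors_24h(executions, NOW, "Invalid API call")

print(f"error rate: {rate:.1%}, fixable flag: {fixable_flag}, fixable hits in 24h: {fixable_hits}")
```

Both criteria pass independently, yet none of the errors counted in the 24h window belong to the fixable bug.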
Changes:
- Added a recently_active_fixable CTE to get_broken_tools that confirms the specific fixable error_hash was seen in watchdog_analysis_logs within the last 7 days.
- The 7-day window matches the watchdog dedup window (ErrorProcessor.dedup_days=7) and the Cortex PR dedup window (THRESHOLD_DAYS=7). During those 7 days, the fixer PR is still "live" in Cortex and watchdog writes dedup-skip records each time the bug recurs.
- Added fixable_occurrences_7d to the SELECT as additional context alongside the raw error_rate.
- Removed the soft rate-limit filter (SOFT_RATE_LIMIT_PATTERNS) from queries.py: the LLM already correctly classifies rate-limit 200 errors as not_a_bug; the fix belongs in the RCA query, not upstream filtering.
- Updated service.py (watchdog runs every 30min, not 6h).

Test plan:
- ALPHA_VANTAGE_TIME_SERIES_DAILY no longer appears in the next MetaBot RCA reply for "Completely Broken" alerts (fixable hash last seen >7 days ago).
- METAADS_CREATE_AD_SET still appears (fixable hash seen Feb 13, within 7 days, bug confirmed still occurring with same error_subcode 33 today).

Note: the recently_active_fixable gate joins on error_hash, not is_fixable_bug. This is intentional: dedup-skip records written by record_occurrence have is_fixable_bug=0 by default (the column is not in the INSERT), so filtering on is_fixable_bug=1 would miss all recurrences of the fixable bug after the initial analysis. Joining on error_hash = fixable_error_hash correctly captures both the initial analysis record and all subsequent dedup-skip records.
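The gate and the join choice can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the actual query: the fixable_tools table, the tool_name and created_at columns, the hash values, and the timestamps are all hypothetical, and SQLite stands in for whatever database get_broken_tools really runs against. Only the CTE name, watchdog_analysis_logs, error_hash, fixable_error_hash, is_fixable_bug, and fixable_occurrences_7d come from the description above.

```python
import sqlite3
from datetime import datetime, timedelta

NOW = datetime(2025, 2, 19, 12, 0, 0)  # arbitrary illustrative timestamp


def ts(days_ago: float) -> str:
    """Format a timestamp the same way SQLite's datetime() does."""
    return (NOW - timedelta(days=days_ago)).strftime("%Y-%m-%d %H:%M:%S")


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fixable_tools (tool_name TEXT, fixable_error_hash TEXT);
CREATE TABLE watchdog_analysis_logs (
    tool_name TEXT,
    error_hash TEXT,
    is_fixable_bug INTEGER DEFAULT 0,
    created_at TEXT
);
""")
conn.executemany("INSERT INTO fixable_tools VALUES (?, ?)", [
    ("ALPHA_VANTAGE_TIME_SERIES_DAILY", "h_alpha"),
    ("METAADS_CREATE_AD_SET", "h_meta"),
])
conn.executemany("INSERT INTO watchdog_analysis_logs VALUES (?, ?, ?, ?)", [
    # Fixable bug last seen outside the 7-day window; only rate limits since.
    ("ALPHA_VANTAGE_TIME_SERIES_DAILY", "h_alpha", 1, ts(8)),
    ("ALPHA_VANTAGE_TIME_SERIES_DAILY", "h_ratelimit", 0, ts(1 / 24)),
    # Initial analysis is old, but dedup-skip records (flag 0) keep arriving.
    ("METAADS_CREATE_AD_SET", "h_meta", 1, ts(9)),
    ("METAADS_CREATE_AD_SET", "h_meta", 0, ts(6)),
    ("METAADS_CREATE_AD_SET", "h_meta", 0, ts(1 / 24)),
])

GATE_SQL = """
WITH recently_active_fixable AS (
    SELECT f.tool_name, COUNT(*) AS fixable_occurrences_7d
    FROM fixable_tools AS f
    JOIN watchdog_analysis_logs AS w
      ON w.tool_name = f.tool_name
     AND w.error_hash = f.fixable_error_hash
    WHERE w.created_at >= datetime(:now, '-7 days')
    {extra_filter}
    GROUP BY f.tool_name
)
SELECT tool_name, fixable_occurrences_7d
FROM recently_active_fixable
ORDER BY tool_name
"""


def recently_active(extra_filter: str = "") -> dict:
    """Run the gate, optionally with an extra predicate, and return {tool: count}."""
    sql = GATE_SQL.format(extra_filter=extra_filter)
    return dict(conn.execute(sql, {"now": ts(0)}).fetchall())


by_hash = recently_active()                            # join on error_hash only
by_flag = recently_active("AND w.is_fixable_bug = 1")  # the rejected alternative

print("join on error_hash:", by_hash)
print("filter on is_fixable_bug=1:", by_flag)
```

In this sample data, joining on the hash keeps METAADS_CREATE_AD_SET (two dedup-skip records inside the window) and drops ALPHA_VANTAGE_TIME_SERIES_DAILY, while adding the is_fixable_bug=1 filter drops both tools because every recent recurrence carries the default flag of 0.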
The Metabase dashboard showing the same query will need a manual SQL update to match.