PR Custody

feat: add verify-bug agent for EagleEyes scanner validation

@sjd9021 · checks n/a · feat/verify-bug-agent → next · 24 files · +1859 −2 · updated 3mo ago

Description

Why

The EagleEyes scanner has a 69% false positive rate (verified across 72 bugs in 8 apps). We need an automated way to verify scanner-flagged bugs via live API execution and, critically, to explain why false positives occur so we can improve the scanner.

What

- New verify_bug_agent under cortex/agents/: a read-only agent that reproduces scanner-reported bugs via real API calls
- Reuses TestAndFixToolProvider (same tools as the fix agent: execute_current_action, execute_parent_action, execute_curl_action), so parent dependency resolution works out of the box
- Returns a structured VerifyBugResponse with:
  - verdict: REAL_BUG / FALSE_POSITIVE / INCONCLUSIVE
  - root_cause_class: CONFIRMED_BUG / FRAMEWORK_ABSORBS / API_LENIENT / CLIENT_HANDLES / SCANNER_WRONG / CODE_FLOW_MISSED
  - evidence: actual API response proving the verdict
  - explanation: why static analysis and runtime behavior disagree
  - scanner_feedback: one-line actionable fix for the scanner
- New VerifyBugWorkflow registered as WorkflowKind.VerifyBug
- Read-only: enable_file_edit=False, no git/PR logic
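
For reviewers who want the response shape at a glance, here is a minimal sketch of VerifyBugResponse using only the standard library. The field names and enum values come from the list above; the field types, the class names Verdict and RootCauseClass, and the use of dataclasses are assumptions for illustration, and the actual model in the diff may be defined differently.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    """Outcome of reproducing the scanner-reported bug (hypothetical class name)."""
    REAL_BUG = "REAL_BUG"
    FALSE_POSITIVE = "FALSE_POSITIVE"
    INCONCLUSIVE = "INCONCLUSIVE"


class RootCauseClass(str, Enum):
    """Why the scanner and the runtime agree or disagree (hypothetical class name)."""
    CONFIRMED_BUG = "CONFIRMED_BUG"
    FRAMEWORK_ABSORBS = "FRAMEWORK_ABSORBS"
    API_LENIENT = "API_LENIENT"
    CLIENT_HANDLES = "CLIENT_HANDLES"
    SCANNER_WRONG = "SCANNER_WRONG"
    CODE_FLOW_MISSED = "CODE_FLOW_MISSED"


@dataclass(frozen=True)
class VerifyBugResponse:
    """Sketch only: field types are assumed, not taken from the diff."""
    verdict: Verdict
    root_cause_class: RootCauseClass
    evidence: str          # actual API response proving the verdict
    explanation: str       # why static analysis and runtime behavior disagree
    scanner_feedback: str  # one-line actionable fix for the scanner
```

Because both enums subclass str, a raw string such as "FALSE_POSITIVE" can be coerced with Verdict("FALSE_POSITIVE"), and an unknown value raises ValueError instead of passing through silently.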

How to Test

Run the workflow with a known false positive from the session digest:

{
  "workflow": "verify-bug",
  "app_name": "gmail",
  "action_name": "GMAIL_FETCH_EMAILS",
  "bug_description": "URL encoding: label_ids containing @ character will fail",
  "scanner_category": "URL Encoding"
}

Verify that the agent returns FALSE_POSITIVE with root_cause_class API_LENIENT and evidence showing that Gmail accepts an unencoded @.
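
The manual check above could be scripted roughly as follows. This is a sketch that assumes the workflow result is available as a plain dict keyed by the VerifyBugResponse field names; the helper name is_expected_false_positive is hypothetical, and how the result is actually fetched is not shown here.

```python
def is_expected_false_positive(response: dict) -> bool:
    """Return True when the result matches the expected outcome for the Gmail case.

    Assumes `response` is a dict keyed by the VerifyBugResponse field names.
    """
    return (
        response.get("verdict") == "FALSE_POSITIVE"
        and response.get("root_cause_class") == "API_LENIENT"
        # Evidence must be present and non-empty; its content still needs a human read.
        and bool(response.get("evidence"))
    )
```

The helper only checks the shape of the result. Whether the evidence really shows Gmail accepting an unencoded @ still has to be confirmed by reading it.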

Pre-Review Checklist

I have self-reviewed this PR

Notes

VerifyBugConfig extends TestAndFixWithFixInstructionConfig so TestAndFixToolProvider accepts it without type changes. fix_instruction holds the bug description, with a bug_description property alias.
Prompt embeds Mercury-specific FP knowledge (_validate_response leniency, model_dump forwarding, HTTP client auto-encoding) so the agent knows what to look for.
45-min timeout / 200 max turns (shorter than fix agent since no edit-lint-retry loop).
DB migration needed for WorkflowKind.VerifyBug enum value in CortexExecution.api_type.
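
The first note (VerifyBugConfig reusing fix_instruction with a bug_description alias) can be illustrated with the sketch below. The base class here is a stand-in that models only fix_instruction; the real TestAndFixWithFixInstructionConfig has other fields not shown, and whether the real classes are dataclasses is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TestAndFixWithFixInstructionConfig:
    """Stand-in for the real base config; only fix_instruction is modeled."""
    fix_instruction: str


@dataclass
class VerifyBugConfig(TestAndFixWithFixInstructionConfig):
    """Adds no new fields, so code typed against the base class accepts it as-is."""

    @property
    def bug_description(self) -> str:
        # Read-only alias: the bug description is stored in fix_instruction.
        return self.fix_instruction
```

Keeping the storage in fix_instruction means anything that expects the base config keeps working, while verify-bug code can read the more accurately named bug_description.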

🤖 Generated with Claude Code
