feat: add tool router eval framework

@wjayesh · checks: n/a · feat/tool-router-eval → master · 83 files · +21072 −9164 · updated 1mo ago

Description

Overview

This PR introduces a benchmark framework for evaluating Tool Router configurations and features. The framework enables systematic comparison of different Tool Router paths (configurations) across a fixed set of usecases, using the existing Learning Pipeline to assess session quality. The design doc has been updated with the latest changes and can be used as a reference.

Key Features

🎯 Benchmark Framework

Usecase Registry: Define single-prompt tasks with difficulty labels (EASY/MEDIUM/HARD) and connected account metadata
Path Registry: Define Tool Router configurations (project ID, model, feature flags)
Benchmark Runs: Execute Cartesian product of (paths × usecases) with configurable concurrency
Auto-generated Run IDs: Benchmark run IDs are automatically generated from config filename and timestamp
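
As a rough illustration of the matrix expansion and run ID scheme described above, here is a minimal sketch. The function names `make_run_id` and `build_matrix` and the path name `planning_disabled` are hypothetical; the ID format is inferred from the example run ID in this description (`planning_comparison_20251208_140000`) and may differ from what `runner.py` actually does.

```python
from datetime import datetime
from itertools import product
from pathlib import Path


def make_run_id(config_path: str, now: datetime) -> str:
    """Build a run ID from the config filename stem and a timestamp.

    Assumed format: <config stem>_<YYYYMMDD>_<HHMMSS>.
    """
    return f"{Path(config_path).stem}_{now.strftime('%Y%m%d_%H%M%S')}"


def build_matrix(paths: list[str], usecases: list[str]) -> list[tuple[str, str]]:
    """Return every (path, usecase) pair, i.e. the Cartesian product."""
    return list(product(paths, usecases))


run_id = make_run_id(
    "eval_configs/benchmark_configs/planning_comparison.yaml",
    datetime(2025, 12, 8, 14, 0, 0),
)
# "planning_disabled" is a made-up second path, used only to show the expansion.
matrix = build_matrix(["planning_enabled", "planning_disabled"], ["send_email"])
```

With two paths and one usecase, `matrix` holds two sessions to execute; adding a usecase adds one session per path.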

🚀 LaunchDarkly Integration

Automatic feature flag toggling for vectorSearchEnabled flag per project
Graceful flag state restoration after each path execution
Environment-based configuration with graceful degradation for local testing
Sequential path execution to avoid global flag conflicts
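
The toggle-then-restore behavior can be sketched with a context manager. This is only an illustration of the pattern: `InMemoryFlagStore` and `flag_override` are hypothetical stand-ins and do not use the LaunchDarkly SDK or reflect the actual code in `ld_client.py`.

```python
from contextlib import contextmanager


class InMemoryFlagStore:
    """Hypothetical stand-in for a feature flag client (not the LaunchDarkly SDK)."""

    def __init__(self) -> None:
        self._flags: dict[tuple[str, str], bool] = {}

    def get(self, project_id: str, flag: str) -> bool:
        # Unknown flags default to False in this sketch.
        return self._flags.get((project_id, flag), False)

    def set(self, project_id: str, flag: str, value: bool) -> None:
        self._flags[(project_id, flag)] = value


@contextmanager
def flag_override(store: InMemoryFlagStore, project_id: str, flag: str, value: bool):
    """Set a flag for the duration of the block, then restore its previous value."""
    previous = store.get(project_id, flag)
    store.set(project_id, flag, value)
    try:
        yield
    finally:
        # Restore even if the path execution raised.
        store.set(project_id, flag, previous)


store = InMemoryFlagStore()
observed = []
# Paths run one after another, so only one override is active at a time.
for project_id, enabled in [("proj_xxx", True), ("proj_xxx", False)]:
    with flag_override(store, project_id, "vectorSearchEnabled", enabled):
        observed.append(store.get(project_id, "vectorSearchEnabled"))
```

Running paths sequentially matters here because the flag is global per project: two paths toggling the same flag concurrently would overwrite each other's state.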

📊 HTML Report Generation

Summary Cards: Per-path success rates with color-coded metrics
Comparison Table: Side-by-side comparison of paths across usecases
Drill-down Modals: Detailed session analysis with:
- Overview: Error weights, workflow summary, failure reasons
- Trace: Full tool execution logs (expandable)
- Metrics: Top token consumers, search queries
- Queries: Pre-formatted SQL with syntax highlighting and copy functionality
ClickHouse Integration: Direct links to table views and pre-formatted queries
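
For reference, the per-path success rate shown on a summary card could be computed along these lines. This is a sketch under assumptions: the record shape (`path`, `usecase`, `success`) and the sample records are invented for illustration and are not taken from `report.py` or from real benchmark results.

```python
from collections import defaultdict


def success_rates(results: list[dict]) -> dict[str, float]:
    """Return the fraction of successful sessions for each path."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["path"]] += 1
        if record["success"]:
            passed[record["path"]] += 1
    return {path: passed[path] / totals[path] for path in totals}


# Illustrative records only; not real benchmark output.
sample = [
    {"path": "planning_enabled", "usecase": "send_email", "success": True},
    {"path": "planning_enabled", "usecase": "create_issue", "success": False},
    {"path": "planning_disabled", "usecase": "send_email", "success": True},
]
rates = success_rates(sample)
```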

🔧 Learning Pipeline Integration

Benchmark mode support (mode="benchmark")
Separate ClickHouse tables for benchmark data (isolated from production)
Session building from MCP log IDs with fallback to session ID search
Parallel pipeline execution with configurable concurrency
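
One common way to get parallel execution with a configurable concurrency limit is an `asyncio.Semaphore`. The sketch below shows that pattern; `run_bounded` and `fake_pipeline` are hypothetical names, and the actual pipeline may use a different mechanism.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int,
) -> list[str]:
    """Run worker over all items with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_pipeline(session_id: str) -> str:
    """Placeholder for a Learning Pipeline call on one session."""
    await asyncio.sleep(0)
    return f"processed:{session_id}"


results = asyncio.run(run_bounded(["s1", "s2", "s3"], fake_pipeline, limit=2))
```

In this sketch the `limit` argument plays the role of the `pipeline` value under `concurrency` in the benchmark config.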

🛠️ CLI Tooling

Run benchmarks from YAML configs: python -m tool_router_benchmark run --config path/to/config.yaml
Batch execution: python -m tool_router_benchmark run --config-dir
Dry-run mode for validation
Standalone report generation: python -m tool_router_benchmark report --run-id <id>
Auto-generated reports at end of benchmark runs

Technical Changes

New Components

tool_router_benchmark/ - Complete benchmark framework module
- __main__.py - CLI entry point
- runner.py - Benchmark orchestration
- agent_runner.py - OpenAI Agents SDK integration with Composio MCP
- session_builder.py - Session construction from logs
- report.py - HTML report generation (1131 lines)
- config.py - Configuration models and validation
- models.py - Data models for benchmarks
- loaders.py - YAML config loading
- ld_client.py - LaunchDarkly client integration
- eval_configs/ - Usecase and path definitions

Database Updates

Enhanced Learning Pipeline schema with benchmark support
Benchmark-specific ClickHouse tables
Repository updates for benchmark data handling

Learning Pipeline Updates

Benchmark mode parameter support
Response handling improvements for both string and dict formats
Connected accounts support in usecase metadata and session creation

Configuration

Benchmark Config Format

paths:
  - name: "planning_enabled"
    tool_router_project_id: "proj_xxx"
    model: "gpt-4o"
    description: "Planning enabled via vector search"

usecases:
  - name: "send_email"
    difficulty: "EASY"
    prompt: "Send an email to..."
    connected_accounts:
      gmail: "acc_123"

concurrency:
  agents: 5
  pipeline: 10
timeout_seconds_per_session: 600
enabled: true
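
A minimal sketch of pre-run validation for a config of this shape follows. It works on an already-parsed dict to stay free of third-party dependencies. Which fields are required, and the `validate_config` name itself, are assumptions for illustration; the real rules live in `config.py`.

```python
VALID_DIFFICULTIES = {"EASY", "MEDIUM", "HARD"}


def validate_config(raw: dict) -> list[str]:
    """Return a list of problems; an empty list means the config looks valid."""
    errors: list[str] = []

    paths = raw.get("paths") or []
    if not paths:
        errors.append("at least one path is required")
    for i, path in enumerate(paths):
        # Assumed required keys, based on the example config.
        for key in ("name", "tool_router_project_id", "model"):
            if not path.get(key):
                errors.append(f"paths[{i}] is missing '{key}'")

    usecases = raw.get("usecases") or []
    if not usecases:
        errors.append("at least one usecase is required")
    for i, usecase in enumerate(usecases):
        if not usecase.get("prompt"):
            errors.append(f"usecases[{i}] is missing 'prompt'")
        if usecase.get("difficulty") not in VALID_DIFFICULTIES:
            errors.append(f"usecases[{i}] has an invalid 'difficulty'")

    concurrency = raw.get("concurrency") or {}
    for key in ("agents", "pipeline"):
        value = concurrency.get(key)
        if not isinstance(value, int) or value < 1:
            errors.append(f"concurrency.{key} must be a positive integer")

    return errors


example = {
    "paths": [
        {"name": "planning_enabled", "tool_router_project_id": "proj_xxx", "model": "gpt-4o"}
    ],
    "usecases": [
        {"name": "send_email", "difficulty": "EASY", "prompt": "Send an email to..."}
    ],
    "concurrency": {"agents": 5, "pipeline": 10},
}
problems = validate_config(example)
```

Collecting all problems into a list, rather than raising on the first one, lets a dry run report every issue in a config at once.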

Usage Examples

# Run single benchmark config
python -m tool_router_benchmark run --config eval_configs/benchmark_configs/planning_comparison.yaml

# Run all enabled configs
python -m tool_router_benchmark run --config-dir

# Dry run (validate without executing)
python -m tool_router_benchmark run --config config.yaml --dry-run

# Generate report from existing results
python -m tool_router_benchmark report --run-id planning_comparison_20251208_140000

Testing

Config validation before execution
Dry-run mode for matrix preview
Graceful error handling and recovery
Comprehensive logging throughout execution

Dependencies

Updated to latest Composio PyPI package
LaunchDarkly SDK for feature flag management
OpenAI Agents SDK for agent execution
Updated uv.lock with new dependencies

Next Steps

Move usecase connected account IDs to partnership account
