Overview
This PR introduces a benchmark framework for evaluating Tool Router configurations and features. The framework systematically compares different Tool Router paths (configurations) across a fixed set of usecases and uses the existing Learning Pipeline to assess session quality. The design doc (tool_router_benchmark/DESIGN_DOC.md) has been updated with the latest changes.
Key Features
🎯 Benchmark Framework
- Usecase Registry: Define single-prompt tasks with difficulty labels (EASY/MEDIUM/HARD) and connected account metadata
- Path Registry: Define Tool Router configurations (project ID, model, feature flags)
- Benchmark Runs: Execute Cartesian product of (paths × usecases) with configurable concurrency
- Auto-generated Run IDs: Benchmark run IDs are automatically generated from config filename and timestamp
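The matrix expansion and run ID generation described above can be sketched roughly as follows. This is a minimal illustration, not the code in runner.py: the function names `make_run_id` and `build_matrix` and the path name `planning_disabled` are hypothetical, and the timestamp format is inferred from the example run ID `planning_comparison_20251208_140000` shown under Usage Examples.

```python
from datetime import datetime
from itertools import product
from pathlib import Path


def make_run_id(config_path: str, now: datetime) -> str:
    """Build a run ID from the config filename stem and a timestamp.

    Assumed format: <config stem>_<YYYYMMDD>_<HHMMSS>.
    """
    return f"{Path(config_path).stem}_{now.strftime('%Y%m%d_%H%M%S')}"


def build_matrix(paths: list[str], usecases: list[str]) -> list[tuple[str, str]]:
    """Return every (path, usecase) pair, i.e. the Cartesian product."""
    return list(product(paths, usecases))


run_id = make_run_id(
    "eval_configs/benchmark_configs/planning_comparison.yaml",
    datetime(2025, 12, 8, 14, 0, 0),
)
# "planning_disabled" is a made-up second path used only for illustration.
matrix = build_matrix(["planning_enabled", "planning_disabled"], ["send_email"])
```

With two paths and one usecase the matrix holds two sessions; the number of sessions grows as len(paths) × len(usecases), which is why concurrency is configurable.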
🚀 LaunchDarkly Integration
- Automatic toggling of the vectorSearchEnabled feature flag per project
- Graceful flag state restoration after each path execution
- Environment-based configuration with graceful degradation for local testing
- Sequential path execution to avoid global flag conflicts
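The toggle-then-restore pattern and the reason for sequential execution can be illustrated with a small sketch. It deliberately does not call the LaunchDarkly SDK: `FLAGS` is a hypothetical in-memory stand-in for the remote flag state, and `flag_override` and `run_paths_sequentially` are illustrative names, not functions from ld_client.py.

```python
from contextlib import contextmanager
from typing import Iterator

# Hypothetical in-memory stand-in for remote flag state, keyed by (project, flag).
FLAGS: dict[tuple[str, str], bool] = {}


@contextmanager
def flag_override(project_id: str, flag: str, value: bool) -> Iterator[None]:
    """Set a flag for one project, then restore its previous state on exit."""
    key = (project_id, flag)
    had_value = key in FLAGS
    previous = FLAGS.get(key)
    FLAGS[key] = value
    try:
        yield
    finally:
        # Restore even when the path run raises.
        if had_value:
            FLAGS[key] = previous
        else:
            FLAGS.pop(key, None)


def run_paths_sequentially(paths: list[dict]) -> list[tuple[str, bool]]:
    """Run one path at a time so overrides of a shared flag never overlap."""
    observed = []
    for path in paths:
        key = (path["project_id"], "vectorSearchEnabled")
        with flag_override(path["project_id"], "vectorSearchEnabled", path["vector_search"]):
            # A real run would execute the path's sessions here.
            observed.append((path["name"], FLAGS[key]))
    return observed
```

Because the flag is global to a project, two paths that target the same project with different values would overwrite each other if they ran in parallel; running paths one after another avoids that at the cost of wall-clock time.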
📊 HTML Report Generation
- Summary Cards: Per-path success rates with color-coded metrics
- Comparison Table: Side-by-side comparison of paths across usecases
- Drill-down Modals: Detailed session analysis with:
  - Overview: Error weights, workflow summary, failure reasons
  - Trace: Full tool execution logs (expandable)
  - Metrics: Top token consumers, search queries
  - Queries: Pre-formatted SQL with syntax highlighting and copy functionality
- ClickHouse Integration: Direct links to table views and pre-formatted queries
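As a rough illustration of how the per-path summary cards could be derived, the sketch below aggregates session outcomes into success rates and maps each rate to a color. The row shape (`path`, `success`), the function names, and the color thresholds are all assumptions made for this example; report.py may compute and color these differently.

```python
from collections import defaultdict


def success_rates(results: list[dict]) -> dict[str, float]:
    """Compute the fraction of successful sessions for each path."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for row in results:
        totals[row["path"]] += 1
        if row["success"]:
            passed[row["path"]] += 1
    return {path: passed[path] / totals[path] for path in totals}


def color_for(rate: float) -> str:
    """Map a success rate to a card color (thresholds are hypothetical)."""
    if rate >= 0.8:
        return "green"
    if rate >= 0.5:
        return "orange"
    return "red"


sample = [
    {"path": "planning_enabled", "usecase": "send_email", "success": True},
    {"path": "planning_enabled", "usecase": "create_event", "success": False},
]
rates = success_rates(sample)
```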
🔧 Learning Pipeline Integration
- Benchmark mode support (mode="benchmark")
- Separate ClickHouse tables for benchmark data (isolated from production)
- Session building from MCP log IDs with fallback to session ID search
- Parallel pipeline execution with configurable concurrency
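Parallel execution with a configurable concurrency limit is commonly done with a semaphore. The sketch below shows that general technique using asyncio; it is not necessarily how the Learning Pipeline is invoked here, and `run_bounded` plus the placeholder `process` coroutine are hypothetical.

```python
import asyncio


async def run_bounded(session_ids: list[str], limit: int) -> list[str]:
    """Process sessions concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def process(session_id: str) -> str:
        async with semaphore:
            # Placeholder for the real pipeline call on one session.
            await asyncio.sleep(0)
            return f"{session_id}:done"

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(process(s) for s in session_ids))


results = asyncio.run(run_bounded(["s1", "s2", "s3"], limit=2))
```

The `limit` argument plays the role of the `pipeline` value under `concurrency` in the benchmark config.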
🛠️ CLI Tooling
- Run benchmarks from YAML configs: python -m tool_router_benchmark run --config path/to/config.yaml
- Batch execution: python -m tool_router_benchmark run --config-dir
- Dry-run mode for validation
- Standalone report generation: python -m tool_router_benchmark report --run-id <id>
- Auto-generated reports at end of benchmark runs
Technical Changes
New Components
tool_router_benchmark/ - Complete benchmark framework module
- __main__.py - CLI entry point
- runner.py - Benchmark orchestration
- agent_runner.py - OpenAI Agents SDK integration with Composio MCP
- session_builder.py - Session construction from logs
- report.py - HTML report generation (1131 lines)
- config.py - Configuration models and validation
- models.py - Data models for benchmarks
- loaders.py - YAML config loading
- ld_client.py - LaunchDarkly client integration
- eval_configs/ - Usecase and path definitions
Database Updates
- Enhanced Learning Pipeline schema with benchmark support
- Benchmark-specific ClickHouse tables
- Repository updates for benchmark data handling
Learning Pipeline Updates
- Benchmark mode parameter support
- Response handling improvements for both string and dict formats
- Connected accounts support in usecase metadata and session creation
Configuration
Benchmark Config Format
paths:
  - name: "planning_enabled"
    tool_router_project_id: "proj_xxx"
    model: "gpt-4o"
    description: "Planning enabled via vector search"
usecases:
  - name: "send_email"
    difficulty: "EASY"
    prompt: "Send an email to..."
    connected_accounts:
      gmail: "acc_123"
concurrency:
  agents: 5
  pipeline: 10
timeout_seconds_per_session: 600
enabled: true
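To make the shape of this config concrete, here is a minimal validation sketch over an already-parsed config dictionary. The set of required keys and the rules themselves are assumptions drawn from the example above and from the EASY/MEDIUM/HARD difficulty labels; the actual checks in config.py may differ.

```python
VALID_DIFFICULTIES = {"EASY", "MEDIUM", "HARD"}
# Assumed required keys, based only on the example config in this description.
REQUIRED_PATH_KEYS = ("name", "tool_router_project_id", "model")
REQUIRED_USECASE_KEYS = ("name", "difficulty", "prompt")


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: list[str] = []
    for section, required in (
        ("paths", REQUIRED_PATH_KEYS),
        ("usecases", REQUIRED_USECASE_KEYS),
    ):
        entries = config.get(section) or []
        if not entries:
            errors.append(f"{section}: at least one entry is required")
        for index, entry in enumerate(entries):
            for key in required:
                if not entry.get(key):
                    errors.append(f"{section}[{index}]: missing '{key}'")
    for index, usecase in enumerate(config.get("usecases") or []):
        difficulty = usecase.get("difficulty")
        if difficulty and difficulty not in VALID_DIFFICULTIES:
            errors.append(f"usecases[{index}]: unknown difficulty '{difficulty}'")
    for key, value in (config.get("concurrency") or {}).items():
        if not isinstance(value, int) or value < 1:
            errors.append(f"concurrency.{key}: must be a positive integer")
    return errors
```

Collecting all problems into a list, rather than raising on the first one, suits a dry run where the goal is to report everything wrong with a config in a single pass.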
Usage Examples
# Run single benchmark config
python -m tool_router_benchmark run --config eval_configs/benchmark_configs/planning_comparison.yaml
# Run all enabled configs
python -m tool_router_benchmark run --config-dir
# Dry run (validate without executing)
python -m tool_router_benchmark run --config config.yaml --dry-run
# Generate report from existing results
python -m tool_router_benchmark report --run-id planning_comparison_20251208_140000
Testing
- Config validation before execution
- Dry-run mode for matrix preview
- Graceful error handling and recovery
- Comprehensive logging throughout execution
Dependencies
- Updated to latest Composio PyPI package
- LaunchDarkly SDK for feature flag management
- OpenAI Agents SDK for agent execution
- Updated uv.lock with new dependencies
Next Steps
- Move usecase connected account IDs to partnership account
Related Documentation
See tool_router_benchmark/DESIGN_DOC.md for comprehensive architecture and design details.