Agent Framework Lab - τ²-bench
τ²-bench implements a simulation framework for evaluating customer service agents across various domains.
Note: This module is part of the consolidated agent-framework-lab package. Install the package with the tau2 extra to use this module.
The framework orchestrates conversations between two AI agents:
- Customer Service Agent: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
- User Simulator: Simulates realistic customer behavior with specific goals and scenarios
Each evaluation runs a multi-turn conversation where the user simulator presents a customer service scenario, and the agent must resolve it following the domain policy while using available tools appropriately. The results are evaluated using τ²'s comprehensive evaluation system.
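The alternating conversation described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: the `simulate` function, the `Conversation` dataclass, the `STOP_TOKEN` marker, and the rule that one step equals one assistant/user exchange are all assumptions made for the example, and the two agents are stand-in callables rather than real LLM clients.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical marker the user simulator emits when it considers the scenario finished.
STOP_TOKEN = "###STOP###"


@dataclass
class Conversation:
    """Recorded turns as (role, text) pairs plus the reason the loop ended."""

    turns: list = field(default_factory=list)
    termination_reason: str = ""


def simulate(
    assistant: Callable[[str], str],
    user: Callable[[str], str],
    opening: str,
    max_steps: int = 50,
) -> Conversation:
    """Alternate assistant and user turns until the user stops or max_steps is reached."""
    convo = Conversation()
    message = opening
    convo.turns.append(("user", message))
    for _ in range(max_steps):
        # The assistant answers the latest user message.
        reply = assistant(message)
        convo.turns.append(("assistant", reply))
        # The user simulator reacts to the assistant's reply.
        message = user(reply)
        convo.turns.append(("user", message))
        if STOP_TOKEN in message:
            convo.termination_reason = "user_stop"
            return convo
    convo.termination_reason = "max_steps"
    return convo
```

Recording a termination reason alongside the turns lets a later scoring step distinguish a conversation the user ended from one that was cut off by the step limit.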
Supported Domains
| Domain | Status | Description |
|---|---|---|
| airline | ✅ Supported | Customer service for airline booking, changes, and support |
| retail | 🚧 In Development | E-commerce customer support scenarios |
| telecom | 🚧 In Development | Telecommunications service support |
Note: Currently only the airline domain is fully supported.
Installation
Install the agent-framework-lab package with TAU2 dependencies:
pip install "agent-framework-lab[tau2]"
Important: You must also install the tau2-bench package from source:
pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"
Download data from Tau2-Bench:
git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench
Set the TAU2_DATA_DIR environment variable to the data directory:
export TAU2_DATA_DIR="data"
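Because "data" is a relative path, it only resolves correctly when commands run from the directory that contains it. The following sketch checks the variable before a run; `resolve_data_dir` is a hypothetical helper written for this example, and how tau2 itself interprets a relative path is not covered here.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Return the absolute data directory named by TAU2_DATA_DIR, or raise if unusable."""
    env = os.environ if env is None else env
    raw = env.get("TAU2_DATA_DIR")
    if not raw:
        raise RuntimeError("TAU2_DATA_DIR is not set")
    # Relative values are resolved against the current working directory.
    path = Path(raw).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"TAU2_DATA_DIR points to a missing directory: {path}")
    return path
```

Exporting the absolute path printed by this helper avoids surprises when the benchmark is launched from a different working directory.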
Quick Start
Running a Single Task
import asyncio

from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks


async def run_single_task():
    # Initialize the task runner
    runner = TaskRunner(max_steps=50)

    # Set up your LLM clients
    assistant_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o",
    )
    user_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o-mini",
    )

    # Get a task and run it
    tasks = get_tasks()
    task = tasks[0]  # Run the first task
    conversation = await runner.run(task, assistant_client, user_client)
    reward = runner.evaluate(task, conversation, runner.termination_reason)
    print(f"Task completed with reward: {reward}")


# Run the example
asyncio.run(run_single_task())
Running the Full Benchmark
Use the provided script to run the complete benchmark:
# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py
# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini
# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o
# Limit conversation length
python samples/run_benchmark.py --max-steps 20
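To compare several model pairings, the commands above can be generated programmatically. The sketch below only builds and prints the command lines, using the flags shown in this section; `benchmark_commands` is a hypothetical helper, not part of the package.

```python
import itertools
import shlex


def benchmark_commands(assistants, users, max_steps=None):
    """Build one run_benchmark.py command per (assistant, user) model pairing."""
    commands = []
    for assistant, user in itertools.product(assistants, users):
        cmd = [
            "python",
            "samples/run_benchmark.py",
            "--assistant", assistant,
            "--user", user,
        ]
        if max_steps is not None:
            cmd += ["--max-steps", str(max_steps)]
        commands.append(cmd)
    return commands


# Print the commands for review instead of executing them.
for cmd in benchmark_commands(["gpt-4.1", "gpt-4o"], ["gpt-4.1"], max_steps=20):
    print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right.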
Results (on Airline Domain)
The following results were obtained with our implementation of τ²-bench using samples/run_benchmark.py. Each value is the average success rate over the dataset of 50 tasks.
| Agent Model | User Model | Success Rate |
|---|---|---|
| gpt-5 | gpt-4.1 | 62.0% |
| gpt-5-mini | gpt-4.1 | 52.0% |
| gpt-4.1 | gpt-4.1 | 60.0% |
| gpt-4.1-mini | gpt-4.1 | 50.0% |
| gpt-4.1 | gpt-4o-mini | 42.0% |
| gpt-4o | gpt-4.1 | 42.0% |
| gpt-4o-mini | gpt-4.1 | 26.0% |
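As a worked example of how a success rate over 50 tasks maps to the percentages in the table, the sketch below averages per-task rewards. It assumes a task counts as successful only when its reward is 1.0, and the reward list is illustrative arithmetic (31 successes out of 50), not the actual per-task output of any run.

```python
def success_rate(rewards, threshold=1.0):
    """Percentage of tasks whose reward meets the success threshold."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    successes = sum(1 for reward in rewards if reward >= threshold)
    return 100.0 * successes / len(rewards)


# Illustrative only: 31 successful tasks and 19 failed tasks out of 50.
rewards = [1.0] * 31 + [0.0] * 19
print(f"{success_rate(rewards):.1f}%")  # prints 62.0%
```

With 50 tasks, each task contributes 2 percentage points, which is why every value in the table is an even number.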
Advanced Usage
Environment Configuration
Set required environment variables:
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"
# Optional: override the base URL above to use a custom endpoint
export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"
Custom Agent Implementation
from agent_framework.lab.tau2 import TaskRunner
from agent_framework import ChatAgent


class CustomTaskRunner(TaskRunner):
    def assistant_agent(self, assistant_chat_client):
        # Override to customize the assistant agent
        return ChatAgent(
            chat_client=assistant_chat_client,
            instructions="Your custom system prompt here",
            # Add custom tools, temperature, etc.
        )

    def user_simulator(self, user_chat_client, task):
        # Override to customize the user simulator
        return ChatAgent(
            chat_client=user_chat_client,
            instructions="Custom user simulator prompt",
        )
Custom Workflow Integration
from agent_framework import WorkflowBuilder, AgentExecutor
from agent_framework.lab.tau2 import TaskRunner


class WorkflowTaskRunner(TaskRunner):
    def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
        # Create agent executors
        assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
        user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")

        # Build a custom workflow with start executor
        builder = WorkflowBuilder(start_executor=assistant_executor)
        builder.add_edge(assistant_executor, user_executor)
        builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)
        return builder.build()
Utility Functions
from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state
# Enable compatibility patches for τ²-bench integration
patch_env_set_state()
# Disable patches when done
unpatch_env_set_state()
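If code between the patch and unpatch calls raises, the patch stays active. One way to guard against that is a small context manager; `patched` below is a hypothetical wrapper written for this example and is not provided by the package. It takes the two functions as arguments so the sketch stays self-contained.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def patched(patch: Callable[[], None], unpatch: Callable[[], None]) -> Iterator[None]:
    """Apply a patch for the duration of a with-block and always undo it afterwards."""
    patch()
    try:
        yield
    finally:
        # Runs on normal exit and when the block raises.
        unpatch()
```

Usage would look like `with patched(patch_env_set_state, unpatch_env_set_state): ...`, with the benchmark code placed inside the block.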
Contributing
This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.