* [BREAKING] Remove deprecated kwargs compatibility paths Remove the deprecated kwargs compatibility shims across core agents, clients, tools, middleware, and telemetry. Keep workflow kwargs behavior intact in this branch and follow up separately in #4850. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix PR CI fallout for kwargs removal Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Address PR review feedback Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * updates * Fix Azure AI CI fallout Remove the stale _get_current_conversation_id override from the Azure AI client after the OpenAI base helper was deleted. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fixed new classes * Fix Assistants deprecated import gating Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix integration replay regressions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Switch multi-agent hosting samples to Azure chat completions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Simplify Azure multi-agent sample config Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent Framework Lab - τ²-bench
τ²-bench implements a simulation framework for evaluating customer service agents across various domains.
Note
: This module is part of the consolidated
agent-framework-labpackage. Install the package with thetau2extra to use this module.
The framework orchestrates conversations between two AI agents:
- Customer Service Agent: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
- User Simulator: Simulates realistic customer behavior with specific goals and scenarios
Each evaluation runs a multi-turn conversation where the user simulator presents a customer service scenario, and the agent must resolve it following the domain policy while using available tools appropriately. The results are evaluated using τ²'s comprehensive evaluation system.
Supported Domains
| Domain | Status | Description |
|---|---|---|
| airline | ✅ Supported | Customer service for airline booking, changes, and support |
| retail | 🚧 In Development | E-commerce customer support scenarios |
| telecom | 🚧 In Development | Telecommunications service support |
Note: Currently only the airline domain is fully supported.
Installation
Install the agent-framework-lab package with TAU2 dependencies:
pip install "agent-framework-lab[tau2]"
Important: You must also install the tau2-bench package from source:
pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"
Download data from Tau2-Bench:
git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench
Export the data directory to TAU2_DATA_DIR environment variable:
export TAU2_DATA_DIR="data"
Quick Start
Running a Single Task
import asyncio
from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks
async def run_single_task():
# Initialize the task runner
runner = TaskRunner(max_steps=50)
# Set up your LLM clients
assistant_client = OpenAIChatClient(
base_url="https://api.openai.com/v1",
api_key="your-api-key",
model_id="gpt-4o"
)
user_client = OpenAIChatClient(
base_url="https://api.openai.com/v1",
api_key="your-api-key",
model_id="gpt-4o-mini"
)
# Get a task and run it
tasks = get_tasks()
task = tasks[0] # Run the first task
conversation = await runner.run(task, assistant_client, user_client)
reward = runner.evaluate(task, conversation, runner.termination_reason)
print(f"Task completed with reward: {reward}")
# Run the example
asyncio.run(run_single_task())
Running the Full Benchmark
Use the provided script to run the complete benchmark:
# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py
# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini
# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o
# Limit conversation length
python samples/run_benchmark.py --max-steps 20
Results (on Airline Domain)
The following results are reproduced from our implementation of τ²-bench with samples/run_benchmark.py. It shows the average success rate over the dataset of 50 tasks.
| Agent Model | User Model | Success Rate |
|---|---|---|
| gpt-5 | gpt-4.1 | 62.0% |
| gpt-5-mini | gpt-4.1 | 52.0% |
| gpt-4.1 | gpt-4.1 | 60.0% |
| gpt-4.1-mini | gpt-4.1 | 50.0% |
| gpt-4.1 | gpt-4o-mini | 42.0% |
| gpt-4o | gpt-4.1 | 42.0% |
| gpt-4o-mini | gpt-4.1 | 26.0% |
Advanced Usage
Environment Configuration
Set required environment variables:
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"
# Optional: for custom endpoints
export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"
Custom Agent Implementation
from agent_framework.lab.tau2 import TaskRunner
from agent_framework import Agent
class CustomTaskRunner(TaskRunner):
def assistant_agent(self, assistant_chat_client):
# Override to customize the assistant agent
return Agent(
client=assistant_chat_client,
instructions="Your custom system prompt here",
# Add custom tools, temperature, etc.
)
def user_simulator(self, user_chat_client, task):
# Override to customize the user simulator
return Agent(
client=user_chat_client,
instructions="Custom user simulator prompt",
)
Custom Workflow Integration
from agent_framework import WorkflowBuilder, AgentExecutor
from agent_framework.lab.tau2 import TaskRunner
class WorkflowTaskRunner(TaskRunner):
def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
# Create agent executors
assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")
# Build a custom workflow with start executor
builder = WorkflowBuilder(start_executor=assistant_executor)
builder.add_edge(assistant_executor, user_executor)
builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)
return builder.build()
Utility Functions
from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state
# Enable compatibility patches for τ²-bench integration
patch_env_set_state()
# Disable patches when done
unpatch_env_set_state()
Contributing
This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.