# Agent Framework Lab - τ²-bench
τ²-bench is a simulation framework for evaluating customer service agents across various domains.

> **Note**: This module is part of the consolidated `agent-framework-lab` package. Install the package with the `tau2` extra to use this module.

The framework orchestrates conversations between two AI agents:

- **Customer Service Agent**: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
- **User Simulator**: Simulates realistic customer behavior with specific goals and scenarios

Each evaluation runs a multi-turn conversation in which the user simulator presents a customer service scenario and the agent must resolve it by following the domain policy and using the available tools appropriately. The resulting conversation is then scored by τ²-bench's evaluation system.
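
The turn-taking described above can be sketched as a plain loop. This is only an illustration with scripted stand-ins instead of LLM-backed agents; `run_conversation`, `STOP_MARKER`, and the two scripted callables are hypothetical names chosen for this sketch and are not part of the package's API.

```python
from typing import Callable

# Hypothetical marker the simulated user emits once the scenario is finished.
STOP_MARKER = "<stop>"


def run_conversation(
    assistant: Callable[[str], str],
    user: Callable[[str], str],
    opening: str,
    max_steps: int = 10,
) -> list[tuple[str, str]]:
    """Alternate assistant and user turns until the user stops or max_steps is reached."""
    transcript = [("user", opening)]
    message = opening
    for _ in range(max_steps):
        reply = assistant(message)
        transcript.append(("assistant", reply))
        message = user(reply)
        transcript.append(("user", message))
        if STOP_MARKER in message:
            break
    return transcript


# Scripted stand-ins so the sketch runs without any model access.
def scripted_assistant(message: str) -> str:
    return "Your flight is rebooked." if "rebook" in message else "How can I help?"


def scripted_user(reply: str) -> str:
    return f"Thanks! {STOP_MARKER}" if "rebooked" in reply else "Please rebook my flight."


for role, text in run_conversation(scripted_assistant, scripted_user, "Hi"):
    print(f"{role}: {text}")
```

The `max_steps` cap plays the same role as the `max_steps` argument of `TaskRunner`: it bounds the conversation even if the user simulator never signals completion.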
## Supported Domains
| Domain | Status | Description |
| ----------- | ----------------- | ---------------------------------------------------------- |
| **airline** | ✅ Supported | Customer service for airline booking, changes, and support |
| **retail** | 🚧 In Development | E-commerce customer support scenarios |
| **telecom** | 🚧 In Development | Telecommunications service support |

_Note: Currently only the airline domain is fully supported._
## Installation
Install the agent-framework-lab package with TAU2 dependencies:
```bash
pip install "agent-framework-lab[tau2]"
```
**Important:** You must also install the tau2-bench package from source:
```bash
pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"
```
Download data from [Tau2-Bench](https://github.com/sierra-research/tau2-bench):
```bash
git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench
```
Set the `TAU2_DATA_DIR` environment variable to the data directory:
```bash
export TAU2_DATA_DIR="data"
```
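
Before a long benchmark run it can help to confirm that the variable is set and points at a real directory. The helper below is a minimal sketch; `resolve_data_dir` is a hypothetical name, and how the package itself reads `TAU2_DATA_DIR` is not shown here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory named by TAU2_DATA_DIR, or raise a descriptive error."""
    env = os.environ if env is None else env
    value = env.get("TAU2_DATA_DIR")
    if not value:
        raise RuntimeError("TAU2_DATA_DIR is not set")
    path = Path(value).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"TAU2_DATA_DIR points to a missing directory: {path}")
    return path
```

Calling `resolve_data_dir()` with no arguments reads the current process environment; passing a mapping makes the check easy to test.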
## Quick Start
### Running a Single Task
```python
import asyncio

from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks


async def run_single_task():
    # Initialize the task runner
    runner = TaskRunner(max_steps=50)

    # Set up your LLM clients
    assistant_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o",
    )
    user_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o-mini",
    )

    # Get a task and run it
    tasks = get_tasks()
    task = tasks[0]  # Run the first task
    conversation = await runner.run(task, assistant_client, user_client)
    reward = runner.evaluate(task, conversation, runner.termination_reason)
    print(f"Task completed with reward: {reward}")


# Run the example
asyncio.run(run_single_task())
```
### Running the Full Benchmark
Use the provided script to run the complete benchmark:
```bash
# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py
# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini
# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o
# Limit conversation length
python samples/run_benchmark.py --max-steps 20
```
## Results (on Airline Domain)
The following results were reproduced with our implementation of τ²-bench using `samples/run_benchmark.py`. They show the average success rate over the dataset of 50 tasks.
| Agent Model | User Model | Success Rate |
| ------------ | ----------- | ------------ |
| gpt-5 | gpt-4.1 | 62.0% |
| gpt-5-mini | gpt-4.1 | 52.0% |
| gpt-4.1 | gpt-4.1 | 60.0% |
| gpt-4.1-mini | gpt-4.1 | 50.0% |
| gpt-4.1 | gpt-4o-mini | 42.0% |
| gpt-4o | gpt-4.1 | 42.0% |
| gpt-4o-mini | gpt-4.1 | 26.0% |
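
As a worked example of how such a percentage can be derived from per-task rewards, the sketch below averages a list of rewards. It assumes a task counts as a success when its reward reaches 1.0; that threshold, the `success_rate` helper, and the sample data are illustrative assumptions, not taken from the benchmark script.

```python
def success_rate(rewards: list[float], threshold: float = 1.0) -> float:
    """Percentage of tasks whose reward reaches the threshold."""
    if not rewards:
        raise ValueError("no rewards to aggregate")
    successes = sum(1 for reward in rewards if reward >= threshold)
    return 100.0 * successes / len(rewards)


# Illustrative data only: 31 successful tasks out of 50.
rewards = [1.0] * 31 + [0.0] * 19
print(f"{success_rate(rewards):.1f}%")  # 62.0%
```

With 50 tasks, each task contributes two percentage points, which is why every figure in the table is an even number.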
## Advanced Usage
### Environment Configuration
Set required environment variables:
```bash
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"
# Optional: for a custom endpoint, set OPENAI_BASE_URL to its URL instead
# export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"
```
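
If you construct clients yourself, the same variables can be read explicitly. The following is a minimal sketch; `load_openai_settings` and `DEFAULT_BASE_URL` are hypothetical names, and falling back to the public OpenAI URL when `OPENAI_BASE_URL` is unset is an assumption of this example.

```python
import os
from typing import Mapping, Optional

# Assumed fallback used only by this sketch.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Collect the API key and base URL, failing early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

The returned keys match the `api_key` and `base_url` keyword arguments passed to `OpenAIChatClient` in the Quick Start, so the dictionary can be unpacked into that constructor alongside `model_id`.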
### Custom Agent Implementation
```python
from agent_framework import Agent
from agent_framework.lab.tau2 import TaskRunner


class CustomTaskRunner(TaskRunner):
    def assistant_agent(self, assistant_chat_client):
        # Override to customize the assistant agent
        return Agent(
            client=assistant_chat_client,
            instructions="Your custom system prompt here",
            # Add custom tools, temperature, etc.
        )

    def user_simulator(self, user_chat_client, task):
        # Override to customize the user simulator
        return Agent(
            client=user_chat_client,
            instructions="Custom user simulator prompt",
        )
```
### Custom Workflow Integration
```python
from agent_framework import AgentExecutor, WorkflowBuilder
from agent_framework.lab.tau2 import TaskRunner


class WorkflowTaskRunner(TaskRunner):
    def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
        # Create agent executors
        assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
        user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")

        # Build a custom workflow with start executor
        builder = WorkflowBuilder(start_executor=assistant_executor)
        builder.add_edge(assistant_executor, user_executor)
        builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)
        return builder.build()
```
### Utility Functions
```python
from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state
# Enable compatibility patches for τ²-bench integration
patch_env_set_state()
# Disable patches when done
unpatch_env_set_state()
```
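
Because the patch has to be undone afterwards, one option is to wrap the pair in a context manager so the cleanup runs even if an error occurs in between. This is a generic sketch: `applied_patch` is a hypothetical helper, and the lambdas are stand-ins so the example runs without the package installed.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def applied_patch(patch: Callable[[], None], unpatch: Callable[[], None]) -> Iterator[None]:
    """Apply a patch for the duration of a with-block and always undo it."""
    patch()
    try:
        yield
    finally:
        unpatch()


# Stand-in functions that record the call order.
calls: list[str] = []

with applied_patch(lambda: calls.append("patch"), lambda: calls.append("unpatch")):
    calls.append("work")

print(calls)  # ['patch', 'work', 'unpatch']
```

With the real helpers this would read `with applied_patch(patch_env_set_state, unpatch_env_set_state): ...`, assuming both are called without arguments as in the snippet above.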
## Contributing
This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.
## License
This project is licensed under the MIT License - see the LICENSE file for details.