* Replace Role and FinishReason classes with NewType + Literal
- Remove EnumLike metaclass from _types.py
- Replace Role class with NewType('Role', str) + RoleLiteral
- Replace FinishReason class with NewType('FinishReason', str) + FinishReasonLiteral
- Update all usages across codebase to use string literals
- Remove .value access patterns (direct string comparison now works)
- Add backward compatibility for legacy dict serialization format
- Update tests to reflect new string-based types
Addresses #3591, #3615
* Simplify ChatResponse and AgentResponse type hints (#3592)
- Remove overloads from ChatResponse.__init__
- Remove text parameter from ChatResponse.__init__
- Remove | dict[str, Any] from finish_reason and usage_details params
- Remove **kwargs from AgentResponse.__init__
- Both now accept ChatMessage | Sequence[ChatMessage] | None for messages
- Update docstrings and examples to reflect changes
- Fix tests that were using removed kwargs
- Fix Role type hint usage in ag-ui utils
* Remove text parameter from ChatResponseUpdate and AgentResponseUpdate (#3597)
- Remove text parameter from ChatResponseUpdate.__init__
- Remove text parameter from AgentResponseUpdate.__init__
- Remove **kwargs from both update classes
- Simplify contents parameter type to Sequence[Content] | None
- Update all usages to use contents=[Content.from_text(...)] pattern
- Fix imports in test files
- Update docstrings and examples
* Rename from_chat_response_updates to from_updates (#3593)
- ChatResponse.from_chat_response_updates → ChatResponse.from_updates
- ChatResponse.from_chat_response_generator → ChatResponse.from_update_generator
- AgentResponse.from_agent_run_response_updates → AgentResponse.from_updates
* Remove try_parse_value method from ChatResponse and AgentResponse (#3595)
- Remove try_parse_value method from ChatResponse
- Remove try_parse_value method from AgentResponse
- Remove try_parse_value calls from from_updates and from_update_generator methods
- Update samples to use try/except with response.value instead
- Update tests to use response.value pattern
- Users should now use response.value with try/except for safe parsing
* Add agent_id to AgentResponse and clarify author_name documentation (#3596)
- Add agent_id parameter to AgentResponse class
- Document that author_name is on ChatMessage objects, not responses
- Update ChatResponse docstring with author_name note
- Update AgentResponse docstring with author_name note
* Simplify ChatMessage.__init__ signature (#3618)
- Make contents a positional argument accepting Sequence[Content | str]
- Auto-convert strings in contents to TextContent
- Remove overloads, keep text kwarg for backward compatibility with serialization
- Update _parse_content_list to handle string items
- Update all usages across codebase to use new format: ChatMessage("role", ["text"])
* Allow Content as input on run and get_response
- Update prepare_messages and normalize_messages to accept Content
- Update type signatures in _agents.py and _clients.py
- Add tests for Content input handling
* Fix ChatMessage usage across packages and samples
Update all remaining ChatMessage(role=..., text=...) to use new
ChatMessage('role', ['text']) signature.
* Fix Role string usage and response format parsing
- Fix redis provider: remove .value access on string literals
- Fix durabletask ensure_response_format: set _response_format before accessing .value
* Fix ollama .value and ai_model_id issues, handle None in content list
- Fix ollama _chat_client: remove .value on string literals
- Fix ollama _chat_client: rename ai_model_id to model_id
- Fix _parse_content_list: skip None values gracefully
* Fix A2AAgent type signature to include Content
* Fix Role/FinishReason NewType dict annotations and improve test coverage to 95%
* Fix mypy errors for Role/FinishReason NewType usage
* Fix Role.TOOL and Role.ASSISTANT usage in _orchestrator_helpers.py
* Fix Role NewType usage in durabletask _models.py
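The NewType + Literal pattern named in the first entry above can be sketched with the standard library alone. The literal values and the `is_role` helper below are assumptions for illustration; the framework's actual definitions may differ.

```python
from typing import Literal, NewType, get_args

# Assumed literal values; the real set in the framework may differ.
RoleLiteral = Literal["system", "user", "assistant", "tool"]

# NewType gives type checkers a distinct name; at runtime Role("user") is the plain str "user".
Role = NewType("Role", str)


def is_role(value: str) -> bool:
    """Hypothetical helper: check a string against the allowed literals."""
    return value in get_args(RoleLiteral)


role = Role("assistant")

# Direct string comparison works, so no `.value` access is needed.
print(role == "assistant")
print(is_role(role), is_role("moderator"))
```

Because the value is an ordinary string at runtime, it serializes to JSON without any custom handling.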
Agent Framework Lab - τ²-bench
τ²-bench implements a simulation framework for evaluating customer service agents across various domains.
Note: This module is part of the consolidated agent-framework-lab package. Install the package with the tau2 extra to use this module.
The framework orchestrates conversations between two AI agents:
- Customer Service Agent: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
- User Simulator: Simulates realistic customer behavior with specific goals and scenarios
Each evaluation runs a multi-turn conversation where the user simulator presents a customer service scenario, and the agent must resolve it following the domain policy while using available tools appropriately. The results are evaluated using τ²'s comprehensive evaluation system.
Supported Domains
| Domain | Status | Description |
|---|---|---|
| airline | ✅ Supported | Customer service for airline booking, changes, and support |
| retail | 🚧 In Development | E-commerce customer support scenarios |
| telecom | 🚧 In Development | Telecommunications service support |
Note: Currently only the airline domain is fully supported.
Installation
Install the agent-framework-lab package with TAU2 dependencies:
pip install "agent-framework-lab[tau2]"
Important: You must also install the tau2-bench package from source:
pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"
Download data from Tau2-Bench:
git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench
Set the TAU2_DATA_DIR environment variable to the data directory:
export TAU2_DATA_DIR="data"
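Before starting a run, it can help to confirm that the variable is set and points at a real directory. The following is a minimal sketch; `resolve_data_dir` is a hypothetical helper, not part of the package.

```python
import os
from pathlib import Path


def resolve_data_dir(env=None) -> Path:
    """Hypothetical helper: return TAU2_DATA_DIR as a path, or raise if unusable."""
    env = os.environ if env is None else env
    raw = env.get("TAU2_DATA_DIR")
    if not raw:
        raise RuntimeError("TAU2_DATA_DIR is not set")
    path = Path(raw).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"TAU2_DATA_DIR does not point to a directory: {path}")
    return path
```

Calling `resolve_data_dir()` with no arguments reads the current process environment; passing a dict makes the check easy to test.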
Quick Start
Running a Single Task
import asyncio

from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks


async def run_single_task():
    # Initialize the task runner
    runner = TaskRunner(max_steps=50)

    # Set up your LLM clients
    assistant_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o",
    )
    user_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o-mini",
    )

    # Get a task and run it
    tasks = get_tasks()
    task = tasks[0]  # Run the first task
    conversation = await runner.run(task, assistant_client, user_client)
    reward = runner.evaluate(task, conversation, runner.termination_reason)
    print(f"Task completed with reward: {reward}")


# Run the example
asyncio.run(run_single_task())
Running the Full Benchmark
Use the provided script to run the complete benchmark:
# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py
# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini
# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o
# Limit conversation length
python samples/run_benchmark.py --max-steps 20
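The flags used above could be parsed along the following lines. This is a hypothetical sketch, not the contents of samples/run_benchmark.py: the flag names and the gpt-4.1 defaults come from the commands above, while the help strings and the `None` default for the step limit are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Run the airline benchmark (sketch)")
    parser.add_argument("--assistant", default="gpt-4.1", help="Model for the agent")
    parser.add_argument("--user", default="gpt-4.1", help="Model for the user simulator")
    parser.add_argument("--debug-task-id", default=None, help="Run only this task")
    # None here means "use the runner's own limit"; the script's real default is not stated.
    parser.add_argument("--max-steps", type=int, default=None, help="Cap on conversation length")
    return parser


args = build_parser().parse_args(["--assistant", "gpt-4o", "--max-steps", "20"])
print(args.assistant, args.user, args.debug_task_id, args.max_steps)
```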
Results (on Airline Domain)
The following results were obtained from our implementation of τ²-bench using samples/run_benchmark.py. Each value is the average success rate over the dataset of 50 tasks.
| Agent Model | User Model | Success Rate |
|---|---|---|
| gpt-5 | gpt-4.1 | 62.0% |
| gpt-5-mini | gpt-4.1 | 52.0% |
| gpt-4.1 | gpt-4.1 | 60.0% |
| gpt-4.1-mini | gpt-4.1 | 50.0% |
| gpt-4.1 | gpt-4o-mini | 42.0% |
| gpt-4o | gpt-4.1 | 42.0% |
| gpt-4o-mini | gpt-4.1 | 26.0% |
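A success rate of this kind can be computed from per-task rewards as sketched below. Treating a reward of 1.0 as a success is an assumption about how the script aggregates results, and the reward list is synthetic: 31 successes is simply the count that yields 62.0% over 50 tasks.

```python
def success_rate(rewards, threshold=1.0):
    """Return the percentage of tasks whose reward reaches the threshold."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    successes = sum(1 for r in rewards if r >= threshold)
    return 100.0 * successes / len(rewards)


# Synthetic data: 31 successes out of 50 tasks.
rewards = [1.0] * 31 + [0.0] * 19
print(f"{success_rate(rewards):.1f}%")  # prints "62.0%"
```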
Advanced Usage
Environment Configuration
Set required environment variables:
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"
# Optional: for custom endpoints
export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"
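These variables can also be read explicitly in Python. The sketch below uses a hypothetical helper, `load_openai_settings`; falling back to the public endpoint when OPENAI_BASE_URL is unset is an assumption, not documented behavior of the client.

```python
import os


def load_openai_settings(env=None) -> dict:
    """Hypothetical helper: collect endpoint settings from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed fallback: the public endpoint used in the single-task example.
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    return {"base_url": base_url, "api_key": api_key}
```

The returned keys match the `base_url` and `api_key` keyword arguments passed to OpenAIChatClient in the single-task example, so the dict can be unpacked into that constructor.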
Custom Agent Implementation
from agent_framework.lab.tau2 import TaskRunner
from agent_framework import ChatAgent


class CustomTaskRunner(TaskRunner):
    def assistant_agent(self, assistant_chat_client):
        # Override to customize the assistant agent
        return ChatAgent(
            chat_client=assistant_chat_client,
            instructions="Your custom system prompt here",
            # Add custom tools, temperature, etc.
        )

    def user_simulator(self, user_chat_client, task):
        # Override to customize the user simulator
        return ChatAgent(
            chat_client=user_chat_client,
            instructions="Custom user simulator prompt",
        )
Custom Workflow Integration
from agent_framework import WorkflowBuilder, AgentExecutor
from agent_framework.lab.tau2 import TaskRunner


class WorkflowTaskRunner(TaskRunner):
    def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
        # Build a custom workflow
        builder = WorkflowBuilder()

        # Create agent executors
        assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
        user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")

        # Add workflow edges and conditions
        builder.set_start_executor(assistant_executor)
        builder.add_edge(assistant_executor, user_executor)
        builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)
        return builder.build()
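The control flow these edges describe (assistant to user unconditionally, user back to assistant only while the conversation should continue) can be illustrated without the framework. The sketch below is a simplified stand-in: `STOP_MARKER`, `run_conversation`, and the way `max_steps` is counted are all hypothetical and do not reflect the framework's actual stop signal or step accounting.

```python
STOP_MARKER = "<stop>"  # hypothetical marker, not the framework's actual signal


def run_conversation(assistant, user, opening, max_steps=50):
    """Alternate assistant and user turns until the user stops or the cap is hit."""
    transcript = []
    message = opening
    for _ in range(max_steps):
        reply = assistant(message)
        transcript.append(("assistant", reply))
        # The assistant -> user edge is unconditional.
        message = user(reply)
        transcript.append(("user", message))
        # The user -> assistant edge is followed only while the user has not stopped.
        if STOP_MARKER in message:
            break
    return transcript


def echo_assistant(message):
    return f"Handled: {message}"


def make_user(turns):
    remaining = iter(turns)
    # Once the scripted turns run out, the simulated user ends the conversation.
    return lambda _reply: next(remaining, STOP_MARKER)


transcript = run_conversation(
    echo_assistant,
    make_user(["Change my flight", "Thanks " + STOP_MARKER]),
    "Hello",
    max_steps=10,
)
print(len(transcript))  # prints 4
```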
Utility Functions
from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state
# Enable compatibility patches for τ²-bench integration
patch_env_set_state()
# Disable patches when done
unpatch_env_set_state()
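To make sure the patch is always undone, even when the code in between raises, the two calls can be wrapped in a context manager. The `patched` helper below is a hypothetical convenience, not part of the package; it takes the patch and unpatch functions as arguments so the sketch runs without the framework installed.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def patched(patch: Callable[[], None], unpatch: Callable[[], None]) -> Iterator[None]:
    """Apply a patch for the duration of a `with` block and always undo it."""
    patch()
    try:
        yield
    finally:
        unpatch()


# Stand-in functions that record the order of calls.
calls = []
with patched(lambda: calls.append("patch"), lambda: calls.append("unpatch")):
    calls.append("work")
print(calls)  # prints ['patch', 'work', 'unpatch']
```

With the real functions this would read `with patched(patch_env_set_state, unpatch_env_set_state): ...`.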
Contributing
This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.