Agent Framework Lab - τ²-bench

τ²-bench implements a simulation framework for evaluating customer service agents across various domains.

Note: This module is part of the consolidated agent-framework-lab package. Install the package with the tau2 extra to use this module.

The framework orchestrates conversations between two AI agents:

  • Customer Service Agent: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
  • User Simulator: Simulates realistic customer behavior with specific goals and scenarios

Each evaluation runs a multi-turn conversation: the user simulator presents a customer service scenario, and the agent must resolve it by following the domain policy and using the available tools appropriately. The resulting conversation is then scored with τ²-bench's own evaluation system.
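
The turn-taking described above can be pictured with a minimal, self-contained sketch. It is not the package's implementation: `simulate_conversation`, `canned_assistant`, `scripted_user`, and the `STOP_MARKER` value are hypothetical names chosen for illustration, and plain functions stand in for the two LLM-backed agents. It assumes the assistant speaks first and the conversation ends when the user signals it is done or a step limit is reached.

```python
from typing import Callable, List, Tuple

# Hypothetical stop signal for this sketch; the real framework defines its own.
STOP_MARKER = "###STOP###"

Message = Tuple[str, str]  # (speaker, text)
ReplyFn = Callable[[List[Message]], str]


def simulate_conversation(assistant: ReplyFn, user: ReplyFn, max_steps: int = 10) -> List[Message]:
    """Alternate assistant and user turns until the user stops or max_steps is hit."""
    history: List[Message] = []
    for step in range(max_steps):
        # Even steps belong to the assistant, odd steps to the user.
        speaker, reply_fn = ("assistant", assistant) if step % 2 == 0 else ("user", user)
        text = reply_fn(history)
        history.append((speaker, text))
        if speaker == "user" and STOP_MARKER in text:
            break
    return history


def canned_assistant(history: List[Message]) -> str:
    # Stand-in for the customer service agent: greet, then acknowledge the request.
    if not history:
        return "Hi, how can I help you today?"
    return f"Happy to help with: {history[-1][1]}"


def scripted_user(history: List[Message]) -> str:
    # Stand-in for the user simulator: ask once, then end the conversation.
    already_asked = any(speaker == "user" for speaker, _ in history)
    if already_asked:
        return f"Thanks, that is all. {STOP_MARKER}"
    return "I need to change my flight."


transcript = simulate_conversation(canned_assistant, scripted_user)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

In the real runner the step limit corresponds to the `max_steps` argument of `TaskRunner`, and the replies come from the configured chat clients rather than from canned functions.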

Supported Domains

| Domain | Status | Description |
| --- | --- | --- |
| airline | Supported | Customer service for airline booking, changes, and support |
| retail | 🚧 In Development | E-commerce customer support scenarios |
| telecom | 🚧 In Development | Telecommunications service support |

Note: Currently only the airline domain is fully supported.

Installation

Install the agent-framework-lab package with TAU2 dependencies:

pip install "agent-framework-lab[tau2]"

Important: You must also install the tau2-bench package from source:

pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"

Download the benchmark data from the tau2-bench repository:

git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench

Set the TAU2_DATA_DIR environment variable to the data directory:

export TAU2_DATA_DIR="data"
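
A quick sanity check can catch a missing or mistyped data path before a long benchmark run. The helper below is a sketch only: `resolve_data_dir` is a hypothetical name, not part of agent-framework-lab or tau2, and it checks nothing beyond the variable being set and pointing at an existing directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the absolute path in TAU2_DATA_DIR, or raise if it is unusable."""
    env = os.environ if env is None else env
    raw = env.get("TAU2_DATA_DIR")
    if not raw:
        raise RuntimeError("TAU2_DATA_DIR is not set")
    path = Path(raw).expanduser().resolve()
    if not path.is_dir():
        raise RuntimeError(f"TAU2_DATA_DIR does not point to a directory: {path}")
    return path
```

Calling `resolve_data_dir()` at the top of a script fails fast with a readable message instead of surfacing later as a file-not-found error deep inside task loading.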

Quick Start

Running a Single Task

import asyncio
from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks

async def run_single_task():
    # Initialize the task runner
    runner = TaskRunner(max_steps=50)

    # Set up your LLM clients
    assistant_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o"
    )
    user_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o-mini"
    )

    # Get a task and run it
    tasks = get_tasks()
    task = tasks[0]  # Run the first task

    conversation = await runner.run(task, assistant_client, user_client)
    reward = runner.evaluate(task, conversation, runner.termination_reason)

    print(f"Task completed with reward: {reward}")

# Run the example
asyncio.run(run_single_task())

Running the Full Benchmark

Use the provided script to run the complete benchmark:

# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py

# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini

# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o

# Limit conversation length
python samples/run_benchmark.py --max-steps 20

Results (on Airline Domain)

The following results were obtained by running our implementation of τ²-bench with samples/run_benchmark.py. Each row shows the average success rate over the 50 tasks in the dataset.

| Agent Model | User Model | Success Rate |
| --- | --- | --- |
| gpt-5 | gpt-4.1 | 62.0% |
| gpt-5-mini | gpt-4.1 | 52.0% |
| gpt-4.1 | gpt-4.1 | 60.0% |
| gpt-4.1-mini | gpt-4.1 | 50.0% |
| gpt-4.1 | gpt-4o-mini | 42.0% |
| gpt-4o | gpt-4.1 | 42.0% |
| gpt-4o-mini | gpt-4.1 | 26.0% |
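
To make the success-rate figures concrete, here is a small sketch of the arithmetic. It assumes each task yields a reward of 1.0 on success and a lower value otherwise, which is an assumption about the scoring rather than something stated here. `success_rate` is a hypothetical helper, and the reward list is synthetic, built only to show that 31 successes out of 50 tasks gives 62.0%.

```python
from typing import Sequence


def success_rate(rewards: Sequence[float], threshold: float = 1.0) -> float:
    """Percentage of tasks whose reward reaches the success threshold."""
    if not rewards:
        raise ValueError("no rewards to average")
    successes = sum(1 for reward in rewards if reward >= threshold)
    return 100.0 * successes / len(rewards)


# Synthetic example: 31 successful tasks and 19 failed ones out of 50.
example_rewards = [1.0] * 31 + [0.0] * 19
print(f"{success_rate(example_rewards):.1f}%")  # prints "62.0%"
```

With 50 tasks, each task accounts for two percentage points, so small differences between rows correspond to only a few tasks.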

Advanced Usage

Environment Configuration

Set required environment variables:

export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"

# Optional: override the base URL to use a custom endpoint
export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"
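
If you build clients in your own script rather than through the benchmark script, the same variables can be read explicitly. This is a sketch under assumptions: `load_openai_settings` is a hypothetical helper, and falling back to the public OpenAI URL when OPENAI_BASE_URL is unset is a choice made here for illustration, not documented behavior of the package.

```python
import os
from typing import Dict, Mapping, Optional

# Default taken from the export example above.
DEFAULT_BASE_URL = "https://api.openai.com/v1"


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect the API key and base URL, letting OPENAI_BASE_URL override the default."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumption for this sketch: an unset or empty base URL means "use the default".
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    return {"api_key": api_key, "base_url": base_url}
```

The returned values map onto the `api_key` and `base_url` arguments used in the Quick Start example.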

Custom Agent Implementation

from agent_framework.lab.tau2 import TaskRunner
from agent_framework import Agent

class CustomTaskRunner(TaskRunner):
    def assistant_agent(self, assistant_chat_client):
        # Override to customize the assistant agent
        return Agent(
            client=assistant_chat_client,
            instructions="Your custom system prompt here",
            # Add custom tools, temperature, etc.
        )

    def user_simulator(self, user_chat_client, task):
        # Override to customize the user simulator
        return Agent(
            client=user_chat_client,
            instructions="Custom user simulator prompt",
        )

Custom Workflow Integration

from agent_framework import WorkflowBuilder, AgentExecutor
from agent_framework.lab.tau2 import TaskRunner

class WorkflowTaskRunner(TaskRunner):
    def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
        # Create agent executors
        assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
        user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")

        # Build a custom workflow with start executor
        builder = WorkflowBuilder(start_executor=assistant_executor)
        builder.add_edge(assistant_executor, user_executor)
        builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)

        return builder.build()

Utility Functions

from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state

# Enable compatibility patches for τ²-bench integration
patch_env_set_state()

# Disable patches when done
unpatch_env_set_state()

Contributing

This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.