Files
Eduard van Valkenburg 50fdcbaf57 Python: chore(python): improve dependency range automation (#4343)
* chore(python): improve dependency range automation

- tighten dependency bounds and coding standards guidance\n- add dependency range validation workflow, reporting, and issue automation\n- update related tests and dependency pins for compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* updated text and pyarrow

* new lock

* fixed workflow

* updated deps

* fix tiktoken

* chore(python): refine dependency validation workflows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(python): add high-level dependency validation comments

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* WIP

* added additional comments and excludes

* added dev dependency handling and workflow and updates to package ranges

* added readme and simplified commands

* fix markers

* chore(python): address dependency review feedback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Tighten dependency bounds, remove stale overrides, restore Python 3.10 support

- Apply dependency bound policy across all packages: stable >=1.0 deps use
  >=floor,<next_major; pre-1.0/prerelease deps use validated hard-bounded ranges
- Remove stale root tool.uv.override-dependencies (uvicorn, websockets, grpcio)
- Lower github_copilot requires-python to >=3.10 with github-copilot-sdk gated
  behind python_version >= 3.11 marker; import raises ImportError on 3.10
- Skip github_copilot pyright/mypy/test tasks on Python <3.11
- Use version-conditional pyrightconfig for samples on Python 3.10
- Add compatibility fix in core responses client for older openai typed dicts
- Normalize uv.lock prerelease mode and refresh dev dependencies
- Update CODING_STANDARD.md, DEV_SETUP.md, and package management skill docs

Closes #902

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* small tweaks

* add note in workflow

* fix workflows and several versions

* fix duplicate

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
50fdcbaf57 · 2026-03-13 12:32:37 +00:00
History
..

Agent Framework Lab - τ²-bench

τ²-bench implements a simulation framework for evaluating customer service agents across various domains.

Note

: This module is part of the consolidated agent-framework-lab package. Install the package with the tau2 extra to use this module.

The framework orchestrates conversations between two AI agents:

  • Customer Service Agent: Follows domain-specific policies and has access to tools (e.g., booking systems, databases)
  • User Simulator: Simulates realistic customer behavior with specific goals and scenarios

Each evaluation runs a multi-turn conversation where the user simulator presents a customer service scenario, and the agent must resolve it following the domain policy while using available tools appropriately. The results are evaluated using τ²'s comprehensive evaluation system.

Supported Domains

Domain Status Description
airline Supported Customer service for airline booking, changes, and support
retail 🚧 In Development E-commerce customer support scenarios
telecom 🚧 In Development Telecommunications service support

Note: Currently only the airline domain is fully supported.

Installation

Install the agent-framework-lab package with TAU2 dependencies:

pip install "agent-framework-lab[tau2]"

Important: You must also install the tau2-bench package from source:

pip install "tau2 @ git+https://github.com/sierra-research/tau2-bench@5ba9e3e56db57c5e4114bf7f901291f09b2c5619"

Download data from Tau2-Bench:

git clone https://github.com/sierra-research/tau2-bench.git
mv tau2-bench/data/ .
rm -rf tau2-bench

Export the data directory to TAU2_DATA_DIR environment variable:

export TAU2_DATA_DIR="data"

Quick Start

Running a Single Task

import asyncio
from agent_framework.openai import OpenAIChatClient
from agent_framework.lab.tau2 import TaskRunner
from tau2.domains.airline.environment import get_tasks

async def run_single_task():
    # Initialize the task runner
    runner = TaskRunner(max_steps=50)

    # Set up your LLM clients
    assistant_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o"
    )
    user_client = OpenAIChatClient(
        base_url="https://api.openai.com/v1",
        api_key="your-api-key",
        model_id="gpt-4o-mini"
    )

    # Get a task and run it
    tasks = get_tasks()
    task = tasks[0]  # Run the first task

    conversation = await runner.run(task, assistant_client, user_client)
    reward = runner.evaluate(task, conversation, runner.termination_reason)

    print(f"Task completed with reward: {reward}")

# Run the example
asyncio.run(run_single_task())

Running the Full Benchmark

Use the provided script to run the complete benchmark:

# Run with default models (gpt-4.1 for both agent and user)
python samples/run_benchmark.py

# Use custom models
python samples/run_benchmark.py --assistant gpt-4o --user gpt-4o-mini

# Debug a specific task
python samples/run_benchmark.py --debug-task-id task_001 --assistant gpt-4o

# Limit conversation length
python samples/run_benchmark.py --max-steps 20

Results (on Airline Domain)

The following results are reproduced from our implementation of τ²-bench with samples/run_benchmark.py. It shows the average success rate over the dataset of 50 tasks.

Agent Model User Model Success Rate
gpt-5 gpt-4.1 62.0%
gpt-5-mini gpt-4.1 52.0%
gpt-4.1 gpt-4.1 60.0%
gpt-4.1-mini gpt-4.1 50.0%
gpt-4.1 gpt-4o-mini 42.0%
gpt-4o gpt-4.1 42.0%
gpt-4o-mini gpt-4.1 26.0%

Advanced Usage

Environment Configuration

Set required environment variables:

export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_API_KEY="your-api-key"

# Optional: for custom endpoints
export OPENAI_BASE_URL="https://your-custom-endpoint.com/v1"

Custom Agent Implementation

from agent_framework.lab.tau2 import TaskRunner
from agent_framework import Agent

class CustomTaskRunner(TaskRunner):
    def assistant_agent(self, assistant_chat_client):
        # Override to customize the assistant agent
        return Agent(
            client=assistant_chat_client,
            instructions="Your custom system prompt here",
            # Add custom tools, temperature, etc.
        )

    def user_simulator(self, user_chat_client, task):
        # Override to customize the user simulator
        return Agent(
            client=user_chat_client,
            instructions="Custom user simulator prompt",
        )

Custom Workflow Integration

from agent_framework import WorkflowBuilder, AgentExecutor
from agent_framework.lab.tau2 import TaskRunner

class WorkflowTaskRunner(TaskRunner):
    def build_conversation_workflow(self, assistant_agent, user_simulator_agent):
        # Create agent executors
        assistant_executor = AgentExecutor(assistant_agent, id="assistant_agent")
        user_executor = AgentExecutor(user_simulator_agent, id="user_simulator")

        # Build a custom workflow with start executor
        builder = WorkflowBuilder(start_executor=assistant_executor)
        builder.add_edge(assistant_executor, user_executor)
        builder.add_edge(user_executor, assistant_executor, condition=self.should_not_stop)

        return builder.build()

Utility Functions

from agent_framework.lab.tau2 import patch_env_set_state, unpatch_env_set_state

# Enable compatibility patches for τ²-bench integration
patch_env_set_state()

# Disable patches when done
unpatch_env_set_state()

Contributing

This package is part of the Microsoft Agent Framework Lab. Please see the main repository for contribution guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.