alliscode 45527eed29 Foundry Evals integration for Python
Merged and refactored eval module per Eduard's PR review:

- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
45527eed29 · 2026-03-20 14:24:21 -07:00
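
The "unified async pattern with isawaitable" and "asyncio.gather" items in the commit message above can be illustrated with a small, self-contained sketch. This is not the code in _evaluation.py: the check functions, run_check, and run_all are hypothetical names. The strict boolean test is only one possible way to avoid a truthy-coercion bug such as bool("False") evaluating to True.

```python
import asyncio
import inspect


def sync_check(response: str) -> bool:
    # Hypothetical synchronous check: passes when the response is non-empty.
    return bool(response.strip())


async def async_check(response: str) -> bool:
    # Hypothetical asynchronous check, standing in for a remote grader call.
    await asyncio.sleep(0)
    return "paris" in response.lower()


async def run_check(check, response: str) -> bool:
    # Call the check, then await the result only if it is awaitable.
    # This lets sync and async checks share one code path.
    result = check(response)
    if inspect.isawaitable(result):
        result = await result
    # Accept only real booleans instead of coercing with bool(...),
    # because bool("False") would be True.
    if not isinstance(result, bool):
        raise TypeError(f"check returned {type(result).__name__}, expected bool")
    return result


async def run_all(checks, response: str) -> list[bool]:
    # Run every check concurrently; results keep the order of `checks`.
    return await asyncio.gather(*(run_check(c, response) for c in checks))


results = asyncio.run(run_all([sync_check, async_check], "Paris is the capital."))
print(results)  # [True, True]
```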

Foundry Evals Integration Samples

These samples show how to evaluate agent-framework agents with Azure AI Foundry's built-in evaluators.

Available Evaluators

Category       | Evaluators
Agent behavior | intent_resolution, task_adherence, task_completion, task_navigation_efficiency
Tool usage     | tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization, tool_call_success
Quality        | coherence, fluency, relevance, groundedness, response_completeness, similarity
Safety         | violence, sexual, self_harm, hate_unfairness
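
For scripts that need to validate a selection of evaluators, the table above can be kept as a plain mapping. The sketch below copies the names from the table. The EVALUATORS_BY_CATEGORY constant and the category_of helper are hypothetical and not part of agent-framework.

```python
# Evaluator names copied from the table above, grouped by category.
EVALUATORS_BY_CATEGORY = {
    "Agent behavior": [
        "intent_resolution",
        "task_adherence",
        "task_completion",
        "task_navigation_efficiency",
    ],
    "Tool usage": [
        "tool_call_accuracy",
        "tool_selection",
        "tool_input_accuracy",
        "tool_output_utilization",
        "tool_call_success",
    ],
    "Quality": [
        "coherence",
        "fluency",
        "relevance",
        "groundedness",
        "response_completeness",
        "similarity",
    ],
    "Safety": ["violence", "sexual", "self_harm", "hate_unfairness"],
}


def category_of(evaluator: str) -> str:
    # Hypothetical helper: return the category that lists this evaluator.
    for category, names in EVALUATORS_BY_CATEGORY.items():
        if evaluator in names:
            return category
    raise KeyError(f"unknown evaluator: {evaluator}")


print(category_of("groundedness"))  # Quality
```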

Samples

evaluate_agent_sample.py — Dataset Evaluation (Path 3)

Use this sample for the development inner loop. It shows two patterns, ordered from simplest to most control:

  1. evaluate_agent() — One call: runs agent → converts → evaluates
  2. evaluate_dataset() — Run agent yourself, convert with AgentEvalConverter, inspect/modify, then evaluate

Run the sample:

    uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py
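
The difference between the two patterns can be sketched with plain-Python stand-ins. Nothing below is the agent-framework API: EvalRecord, run_agent, convert, evaluate, and run_convert_evaluate are hypothetical placeholders. They only mirror the run → convert → evaluate shape described above; see evaluate_agent_sample.py for the real calls.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical stand-in for one converted evaluation item.
    query: str
    response: str


def run_agent(query: str) -> str:
    # Stand-in agent: a real agent would call a model here.
    return f"echo: {query}"


def convert(query: str, response: str) -> EvalRecord:
    # Stand-in for the conversion step.
    return EvalRecord(query=query, response=response)


def evaluate(records: list[EvalRecord]) -> dict[str, bool]:
    # Stand-in evaluator: passes when the response mentions the query.
    return {r.query: r.query in r.response for r in records}


def run_convert_evaluate(queries: list[str]) -> dict[str, bool]:
    # Pattern 1 shape: a single call performs run -> convert -> evaluate.
    return evaluate([convert(q, run_agent(q)) for q in queries])


queries = ["What is 2 + 2?", "Name a prime number."]

# Pattern 1: one call, no access to the intermediate records.
one_call = run_convert_evaluate(queries)

# Pattern 2: the same steps by hand, with a chance to inspect or modify
# the records before they are evaluated.
records = [convert(q, run_agent(q)) for q in queries]
records = [r for r in records if "prime" not in r.query]  # drop one item
step_by_step = evaluate(records)

print(one_call)
print(step_by_step)
```

The second pattern costs a few more lines but exposes the intermediate records, which is the point at which items can be filtered or corrected before submission.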

evaluate_traces_sample.py — Trace & Response Evaluation (Path 1)

Evaluate what already happened — zero changes to agent code:

  1. evaluate_responses() — Evaluate Responses API responses by ID
  2. evaluate_traces() — Evaluate from OTel traces in App Insights

Run the sample:

    uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py

Setup

Create a .env file in this folder, using the .env.example file as a template for the required configuration.
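
Before running a sample, it can help to confirm that every key from .env.example has a value in .env. The sketch below is a minimal standard-library check. The key names EXAMPLE_ENDPOINT and EXAMPLE_MODEL are made up for illustration, since the real names are whatever .env.example lists. parse_env and missing_keys are hypothetical helpers, not part of the samples.

```python
def parse_env(text: str) -> dict[str, str]:
    # Parse simple KEY=VALUE lines; skip blank lines, comments, and
    # lines without "=". Surrounding quotes on values are removed.
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(example_text: str, env_text: str) -> list[str]:
    # Return keys present in the example that are absent or empty in .env.
    example = parse_env(example_text)
    actual = parse_env(env_text)
    return [key for key in example if not actual.get(key)]


# Illustrative contents only; these key names are assumptions.
example_text = """
# sample configuration
EXAMPLE_ENDPOINT=
EXAMPLE_MODEL=
"""
env_text = """
EXAMPLE_ENDPOINT="https://example.invalid"
"""

print(missing_keys(example_text, env_text))  # ['EXAMPLE_MODEL']
```

With real files, the two strings would come from reading .env.example and .env, for example with pathlib.Path.read_text().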

Which sample should I start with?

  • "I want to test my agent during development" → evaluate_agent_sample.py, Pattern 1
  • "I want to evaluate past agent runs" → evaluate_traces_sample.py
  • "I want to inspect/modify eval data before submitting" → evaluate_agent_sample.py, Pattern 2