mirror of
https://github.com/microsoft/agent-framework.git
synced 2026-06-16 21:04:09 +08:00
Foundry Evals Integration Samples
These samples demonstrate evaluating agent-framework agents using Azure AI Foundry's built-in evaluators.
Available Evaluators
| Category | Evaluators |
|---|---|
| Agent behavior | intent_resolution, task_adherence, task_completion, task_navigation_efficiency |
| Tool usage | tool_call_accuracy, tool_selection, tool_input_accuracy, tool_output_utilization, tool_call_success |
| Quality | coherence, fluency, relevance, groundedness, response_completeness, similarity |
| Safety | violence, sexual, self_harm, hate_unfairness |
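As a convenience for picking evaluators by category, the table can be mirrored as a plain Python mapping. This is only a sketch: the evaluator names are copied from the table, but the category keys and the `select_evaluators` helper are hypothetical and not part of the framework, and how the names are passed to the evaluation calls is not shown here.

```python
# Evaluator names copied from the table above, grouped by category.
# The category keys and this helper are illustrative, not framework API.
EVALUATORS: dict[str, list[str]] = {
    "agent_behavior": [
        "intent_resolution",
        "task_adherence",
        "task_completion",
        "task_navigation_efficiency",
    ],
    "tool_usage": [
        "tool_call_accuracy",
        "tool_selection",
        "tool_input_accuracy",
        "tool_output_utilization",
        "tool_call_success",
    ],
    "quality": [
        "coherence",
        "fluency",
        "relevance",
        "groundedness",
        "response_completeness",
        "similarity",
    ],
    "safety": ["violence", "sexual", "self_harm", "hate_unfairness"],
}


def select_evaluators(*categories: str) -> list[str]:
    """Return evaluator names for the requested categories, in table order."""
    unknown = [c for c in categories if c not in EVALUATORS]
    if unknown:
        raise KeyError(f"unknown categories: {unknown}")
    return [name for c in categories for name in EVALUATORS[c]]
```

For example, `select_evaluators("quality", "safety")` returns the six quality evaluators followed by the four safety evaluators.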
Samples
evaluate_agent_sample.py — Dataset Evaluation (Path 3)
This is the development inner loop. The sample shows two patterns, ordered from simplest to most control:
- `evaluate_agent()`: one call that runs the agent → converts → evaluates
- `FoundryEvals.evaluate()`: run the agent yourself, convert with `AgentEvalConverter`, inspect/modify, then evaluate
uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_agent_sample.py
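The difference between the two patterns is mostly about who owns the intermediate data. The sketch below illustrates that shape only. Every name in it (`EvalItem`, `run_agent`, `convert`, `evaluate`, and the two wrapper functions) is a hypothetical local stand-in, not the framework's API; see `evaluate_agent_sample.py` for the real calls and signatures.

```python
from dataclasses import dataclass


@dataclass
class EvalItem:
    """Hypothetical stand-in for one converted evaluation record."""

    query: str
    response: str


def run_agent(query: str) -> str:
    # Stand-in for invoking a real agent; returns a canned reply.
    return f"echo: {query}"


def convert(query: str, response: str) -> EvalItem:
    # Stand-in for the conversion step between agent output and eval input.
    return EvalItem(query=query, response=response)


def evaluate(items: list[EvalItem]) -> dict[str, float]:
    # Stand-in scorer: fraction of items with a non-empty response.
    passed = sum(1 for item in items if item.response.strip())
    return {"non_empty_rate": passed / len(items) if items else 0.0}


def evaluate_in_one_call(queries: list[str]) -> dict[str, float]:
    # Simplest pattern: run, convert, and evaluate behind a single call.
    return evaluate([convert(q, run_agent(q)) for q in queries])


def evaluate_with_inspection(queries: list[str]) -> dict[str, float]:
    # More control: same steps, but the caller can inspect or edit the
    # converted items before they are submitted for evaluation.
    items = [convert(q, run_agent(q)) for q in queries]
    items = [item for item in items if len(item.query) > 3]  # example filter
    return evaluate(items)
```

The second function trades brevity for a point where records can be filtered, corrected, or logged before evaluation, which is the reason to reach for the more manual pattern.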
evaluate_traces_sample.py — Trace & Response Evaluation (Path 1)
Evaluate what already happened — zero changes to agent code:
- `evaluate_traces(response_ids=...)`: evaluate Responses API responses by ID
- `evaluate_traces(agent_id=...)`: evaluate agent behavior from OTel traces in App Insights
uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py
Setup
Create a `.env` file in this folder, using the `.env.example` file here as a template for the required configuration.
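Before running a sample, it can help to confirm that every key listed in `.env.example` has a value in `.env`. The following is a minimal standard-library sketch of such a check, assuming simple `KEY=value` lines. The helper names are hypothetical, and the samples themselves may load the file differently.

```python
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(example_path: str, env_path: str) -> list[str]:
    """Return keys from the example file that are absent or empty in .env."""
    example = read_env_file(example_path)
    actual = read_env_file(env_path) if Path(env_path).exists() else {}
    return [key for key in example if not actual.get(key)]
```

Calling `missing_keys(".env.example", ".env")` from this folder returns an empty list once every key has been filled in.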
Which sample should I start with?
- "I want to test my agent during development" → `evaluate_agent_sample.py`, Pattern 1
- "I want to evaluate past agent runs" → `evaluate_traces_sample.py`
- "I want to inspect/modify eval data before submitting" → `evaluate_agent_sample.py`, Pattern 2