---
status: accepted
contact: bentho
date: 2026-02-27
deciders: bentho, markwallace-microsoft, westey-m
consulted: Pratyush Mishra, Shivam Shrivastava, Manni Arora (Centrica eval scenario)
informed: Agent Framework team, Foundry Evals team
---

# Agent Evaluation Architecture with Azure AI Foundry Integration

## Context and Problem Statement

Azure AI Foundry provides a rich evaluation service for AI agents, with built-in evaluators for agent behavior (task adherence, intent resolution), tool usage (tool call accuracy, tool selection), quality (coherence, fluency, relevance), and safety (violence, self-harm, prohibited actions). Results are viewable in the Foundry portal with dashboards and comparison views.

However, using Foundry Evals with an agent-framework agent today requires significant manual effort. Developers must:

1. Transform agent-framework's `Message`/`Content` types into the OpenAI-style agent message schema that Foundry evaluators expect
2. Map tool definitions from agent-framework's `FunctionTool` format to evaluator-compatible schemas
3. Manually wire up the correct Foundry data source type (`azure_ai_traces`, `jsonl`, `azure_ai_target_completions`, etc.) depending on their scenario
4. Handle App Insights trace ID queries, response ID collection, and eval polling

Additionally, evaluation is a concern that extends beyond any single provider. Developers may want to use local evaluators (LLM-as-judge, regex, keyword matching), third-party evaluation libraries, or multiple providers in combination. The architecture must support this without creating Foundry-specific lock-in at the API level.

### Functional Requirements for Agent Evaluation

- **Single agents and workflows.** Evaluate both individual agent responses and multi-agent workflow results, with per-agent breakdown to pinpoint underperformance.
- **One-shot and multi-turn conversations.** Capture full conversation trajectories, including tool calls and results, not just final query/response pairs.
- **Conversation factoring.** Support splitting conversations into query/response in multiple ways (last turn, full trajectory, per-turn) because different factorings measure different things.
- **Multiple providers, mix and match.** Run Foundry LLM-as-judge evaluators alongside fast local checks and custom evaluators on the same data, without restructuring code.
- **Third-party extensibility.** Any evaluation library can participate by implementing the `Evaluator` protocol (Python) or `IAgentEvaluator` interface (.NET). There is no predetermined list of supported libraries; the protocol is intentionally simple (`evaluate(items) → results`) so that wrappers for libraries like DeepEval, RAGAS, or Promptfoo are straightforward to write.
- **Bring your own evaluator.** Creating a custom evaluator should be as simple as writing a function.
- **Evaluate without re-running.** Evaluate existing responses from logs or previous runs without invoking the agent again.

## Decision Drivers

- **Zero-friction evaluation**: Developers should go from "I have an agent" to "I have eval results" with minimal code.
- **Provider-agnostic API**: Core evaluation capabilities must not be tied to any specific provider. Provider configuration should be separate from the evaluation call.
- **Lowest concept count**: Introduce the fewest possible new types, abstractions, and APIs for developers to learn.
- **Leverage existing knowledge**: The framework already knows which agents exist, what tools they have, and what conversations occurred. Evals should use this automatically rather than requiring the developer to re-specify it.
- **Foundry-native results**: When using Foundry, results should be viewable in the Foundry portal with dashboards and comparison views.
- **Progressive disclosure**: Simple scenarios should be near-zero code. Advanced scenarios should build on the same primitives.
- **Cross-language parity**: Design must be implementable in both Python and .NET.

## Considered Options

1. **Provider-specific functions**: Build Foundry-specific helper functions (`evaluate_agent()`, etc.) directly in the Azure package. All eval functions take Foundry connection parameters.
2. **Evaluator protocol with shared orchestration**: Define a provider-agnostic `Evaluator` protocol in the base agent library (`agent_framework` in Python, `Microsoft.Agents.AI` in .NET). Orchestration functions live alongside it. Providers implement the protocol.
3. **Full eval framework**: Build comprehensive eval infrastructure including custom evaluator definitions, scoring profiles, and reporting inside agent-framework.

## Decision Outcome

Chosen option: "Evaluator protocol with shared orchestration", because it delivers the low-friction developer experience, supports multiple providers without API changes, and keeps the concept count low.

### Usage Examples

#### Evaluate an agent

The agent is invoked once per query by default. For statistically meaningful evaluation, provide multiple diverse queries. For measuring **consistency** (does the same query produce reliable results?), use `num_repetitions` to run each query N times independently.

**Python:**

```python
evals = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=my_agent,
    queries=[
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    ],
    evaluators=evals,
)

for r in results:
    r.assert_passed()
```

**C#:**

```csharp
var evals = new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence);

AgentEvaluationResults results = await agent.EvaluateAsync(
    new[]
    {
        "What's the weather in Seattle?",
        "Plan a weekend trip to Portland",
        "What restaurants are near Pike Place?",
    },
    evals);

results.AssertAllPassed();
```

`evaluate_agent` returns one `EvalResults` per evaluator.
Each result contains per-item scores with the evaluated response for auditing:

```
# results[0] (FoundryEvals)
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: EvalItemResult(
    query="What's the weather in Seattle?",
    response="It's currently 72°F and sunny in Seattle.",
    scores={"relevance": 5, "coherence": 5})
  items[1]: EvalItemResult(
    query="Plan a weekend trip to Portland",
    response="Here's a 2-day Portland itinerary...",
    scores={"relevance": 4, "coherence": 5})
  items[2]: EvalItemResult(
    query="What restaurants are near Pike Place?",
    response="Top restaurants near Pike Place Market: ...",
    scores={"relevance": 5, "coherence": 4})
```

#### Measure consistency with repetitions

Run each query multiple times to detect non-deterministic behavior:

**Python:**

```python
results = await evaluate_agent(
    agent=my_agent,
    queries=["What's the weather in Seattle?"],
    evaluators=evals,
    num_repetitions=3,  # each query runs 3 times independently
)
# results contain 3 items (1 query × 3 repetitions)
```

**C#:**

```csharp
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "What's the weather in Seattle?" },
    evals,
    numRepetitions: 3); // each query runs 3 times independently
// results contain 3 items (1 query × 3 repetitions)
```

#### Evaluate a response you already have

When you already have agent responses, pass them directly to skip re-running the agent. Each query is paired with its corresponding response:

**Python:**

```python
queries = ["What's the weather?", "What's the capital of France?"]
responses = [await agent.run([Message("user", [q])]) for q in queries]

results = await evaluate_agent(
    responses=responses,
    evaluators=evals,
)
```

**C#:**

```csharp
var queries = new[] { "What's the weather?" };
var responses = new List<AgentResponse>();
foreach (var q in queries)
    responses.Add(await agent.RunAsync(new[] { new ChatMessage(ChatRole.User, q) }));

AgentEvaluationResults results = await agent.EvaluateAsync(
    responses: responses,
    evals);
```

Each `AgentResponse` already contains the conversation (query + response), so the evaluator extracts query/response from the conversation. When you pass `responses` without `queries`, the conversation is the source of truth.

#### Evaluate with conversation split strategies

By default, evaluators see only the last turn (final user message → final assistant response). For multi-turn conversations, you can control how the conversation is factored for evaluation:

**Python:**

```python
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.FULL,  # evaluate entire trajectory
)

# Or per-turn: each user→assistant exchange scored independently
results = await evaluate_agent(
    agent=agent,
    queries=["Plan a 3-day trip to Paris"],
    evaluators=evals,
    conversation_split=ConversationSplit.PER_TURN,
)
```

**C#:**

```csharp
// Full conversation as context
AgentEvaluationResults results = await agent.EvaluateAsync(
    new[] { "Plan a 3-day trip to Paris" },
    evals,
    splitter: ConversationSplitters.Full);

// Per-turn splitting
var items = EvalItem.PerTurnItems(conversation); // one EvalItem per user turn
var perTurnResults = await evals.EvaluateAsync(items);
```

With `PER_TURN`, a 3-turn conversation produces 3 scored items:

```
EvalResults(status="completed", passed=3, failed=0, total=3)
  items[0]: query="Plan a 3-day trip to Paris"  scores={"relevance": 5}
  items[1]: query="What about restaurants?"     scores={"relevance": 4}
  items[2]: query="Make it budget-friendly"     scores={"relevance": 5}
```

#### Evaluate a multi-agent workflow

**Python:**

```python
result = await workflow.run("Plan a trip to Paris")

eval_results = await evaluate_workflow(
    workflow=workflow,
    workflow_result=result,
    evaluators=evals,
)

for r in eval_results:
    print(f" overall: {r.passed}/{r.total}")
    for name, sub in r.sub_results.items():
        print(f" {name}: {sub.passed}/{sub.total}")
```

**C#:**

```csharp
WorkflowRunResult result = await workflow.RunAsync("Plan a trip to Paris");

IReadOnlyList<AgentEvaluationResults> evalResults = await result.EvaluateAsync(evals);

foreach (var r in evalResults)
{
    Console.WriteLine($" overall: {r.Passed}/{r.Total}");
    foreach (var (name, sub) in r.SubResults)
        Console.WriteLine($" {name}: {sub.Passed}/{sub.Total}");
}
```

Workflows return one result per evaluator, with sub-results per agent in the workflow:

```
EvalResults(status="completed", passed=2, failed=0, total=2)
  sub_results:
    "planner":    EvalResults(passed=1, total=1)
    "researcher": EvalResults(passed=1, total=1)
```

#### Mix multiple providers

**Python:**

```python
@evaluator
def is_helpful(response: str) -> bool:
    return len(response.split()) > 10

foundry = FoundryEvals(
    project_client=client,
    model_deployment="gpt-4o",
    evaluators=[FoundryEvals.RELEVANCE, FoundryEvals.COHERENCE],
)

results = await evaluate_agent(
    agent=agent,
    queries=queries,
    evaluators=[is_helpful, keyword_check("weather"), foundry],
)
```

**C#:**

```csharp
IReadOnlyList<AgentEvaluationResults> results = await agent.EvaluateAsync(
    queries,
    evaluators: new IAgentEvaluator[]
    {
        new LocalEvaluator(
            EvalChecks.KeywordCheck("weather"),
            FunctionEvaluator.Create("is_helpful", (string r) => r.Split(' ').Length > 10)),
        new FoundryEvals(chatConfiguration, FoundryEvals.Relevance, FoundryEvals.Coherence),
    });
```

Each evaluator returns its own result: `results[0]` is the local evaluator and `results[1]` is Foundry.
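
One way to get "one result per evaluator, in the order supplied" is to fan the same data out to every evaluator and collect the results positionally. The sketch below is illustrative only: `Outcome`, `WordCountEvaluator`, `SlowJudgeEvaluator`, and `run_all` are hypothetical stand-ins rather than framework types, and running evaluators concurrently is an assumption this document does not state.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical, simplified stand-in for EvalResults."""
    provider: str
    passed: int
    total: int


class WordCountEvaluator:
    """Local check: a response passes when it has more than three words."""
    name = "local"

    async def evaluate(self, responses):
        passed = sum(1 for r in responses if len(r.split()) > 3)
        return Outcome(self.name, passed, len(responses))


class SlowJudgeEvaluator:
    """Pretends to be a remote judge; the sleep simulates network latency."""
    name = "remote-judge"

    async def evaluate(self, responses):
        await asyncio.sleep(0.01)
        return Outcome(self.name, len(responses), len(responses))


async def run_all(evaluators, responses):
    # asyncio.gather returns results in the order the awaitables were passed,
    # regardless of which evaluator finishes first.
    return await asyncio.gather(*(e.evaluate(responses) for e in evaluators))


responses = ["It's 72°F and sunny in Seattle.", "Sunny."]
outcomes = asyncio.run(run_all([SlowJudgeEvaluator(), WordCountEvaluator()], responses))
for outcome in outcomes:
    print(f"{outcome.provider}: {outcome.passed}/{outcome.total}")
```

Because results are positional, callers can index them by the position of the evaluator they passed in, even when a slower provider finishes last.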

#### Custom function evaluators

**Python:**

```python
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(mentions_city, used_tools)
```

`@evaluator` uses **parameter name injection**: the function's parameter names determine what data it receives from the `EvalItem`. Supported names: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`. Any combination is valid.

**C#:**

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("mentions_city",
        (EvalItem item) => item.ExpectedOutput != null
            && item.Response.Contains(item.ExpectedOutput, StringComparison.OrdinalIgnoreCase)),
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split(' ').Length < 500));
```

## What To Build

### Core: Evaluator Protocol

A runtime-checkable protocol that any evaluation provider implements:

```python
@runtime_checkable
class Evaluator(Protocol):
    name: str

    async def evaluate(
        self, items: Sequence[EvalItem], *, eval_name: str = "Agent Framework Eval"
    ) -> EvalResults: ...
```

The protocol is minimal: just `name` and `evaluate()`.
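
To illustrate how little a provider has to implement, here is a minimal sketch of a custom evaluator that satisfies a protocol of this shape structurally, without inheriting from it. `SimpleItem`, `SimpleResults`, and `RegexEvaluator` are hypothetical, simplified stand-ins for illustration; the real `EvalItem` and `EvalResults` carry more fields.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Protocol, Sequence, runtime_checkable


@dataclass
class SimpleItem:
    """Hypothetical, reduced stand-in for EvalItem."""
    query: str
    response: str


@dataclass
class SimpleResults:
    """Hypothetical, reduced stand-in for EvalResults."""
    passed: int
    failed: int

    @property
    def total(self) -> int:
        return self.passed + self.failed


@runtime_checkable
class Evaluator(Protocol):
    name: str

    async def evaluate(
        self, items: Sequence[SimpleItem], *, eval_name: str = "Agent Framework Eval"
    ) -> SimpleResults: ...


class RegexEvaluator:
    """Passes an item when its response matches the given pattern."""

    def __init__(self, pattern: str) -> None:
        self.name = f"regex:{pattern}"
        self._pattern = re.compile(pattern, re.IGNORECASE)

    async def evaluate(self, items, *, eval_name="Agent Framework Eval"):
        passed = sum(1 for item in items if self._pattern.search(item.response))
        return SimpleResults(passed=passed, failed=len(items) - passed)


items = [
    SimpleItem("What's the weather in Seattle?", "It's 72°F and sunny in Seattle."),
    SimpleItem("Plan a weekend trip to Portland", "Here's a 2-day itinerary."),
]
checker = RegexEvaluator(r"\d+\s*°F")
demo_results = asyncio.run(checker.evaluate(items))
print(isinstance(checker, Evaluator), demo_results.passed, demo_results.total)
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the `name` and `evaluate` members exist; it does not validate signatures or return types.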

### Core: EvalItem

Provider-agnostic data format for items to evaluate:

```python
@dataclass
class ExpectedToolCall:
    name: str                                  # Tool/function name
    arguments: dict[str, Any] | None = None    # None = don't check args

@dataclass
class EvalItem:
    conversation: list[Message]                # Single source of truth
    tools: list[FunctionTool] | None = None    # Agent's available tools
    context: str | None = None
    expected_output: str | None = None         # Ground truth for comparison
    expected_tool_calls: list[ExpectedToolCall] | None = None
    split_strategy: ConversationSplitter | None = None

    @property
    def query(self) -> str: ...                # derived from conversation split

    @property
    def response(self) -> str: ...             # derived from conversation split
```

`conversation` is the single source of truth. `query` and `response` are derived properties: the conversation is split at the last user message (default) and text is extracted from each side. Changing the `split_strategy` consistently changes all derived values.

`tools` provides typed `FunctionTool` objects, including MCP tools, which are automatically extracted after agent runs.

### Internal: AgentEvalConverter

Internal class that converts agent-framework types to `EvalItem`.
Used by `evaluate_agent()` and `evaluate_workflow()`; not part of the public API:

| Agent Framework | Eval Format |
|---|---|
| `Content.function_call` | `tool_call` in OpenAI chat format |
| `Content.function_result` | `tool_result` in OpenAI chat format |
| `FunctionTool` | `{name, description, parameters}` schema |
| `Message` history | `conversation` list + `query`/`response` extraction |

### Core: EvalResults

Rich result type with convenience properties for CI integration:

```python
results.all_passed       # bool: no failures or errors (recursive for workflow)
results.passed           # int: passing count
results.failed           # int: failure count
results.total            # int: total = passed + failed + errored
results.items            # list[EvalItemResult]: per-item detail with query, response, and scores
results.error            # str | None: error details on failure
results.sub_results      # dict: per-agent breakdown (workflow evals)
results.report_url       # str | None: portal link (Foundry)
results.assert_passed()  # raises AssertionError with details
```

### Core: Orchestration Functions

Provider-agnostic functions that extract data and delegate to evaluators:

| Function | What it does |
|---|---|
| `evaluate_agent()` | Runs agent against test queries (or evaluates pre-existing `responses=`), converts to `EvalItem`s, passes to evaluator. Accepts optional `expected_output=` for ground-truth comparison, `expected_tool_calls=` for tool-correctness evaluation, and `num_repetitions=` for consistency measurement |
| `evaluate_workflow()` | Extracts per-agent data from `WorkflowRunResult`, evaluates each agent and overall output. Per-agent breakdown in `sub_results`. Also accepts `num_repetitions=` |

### Core: Conversation Split Strategies

Multi-turn conversations must be split into query (input) and response (output) halves for evaluation. How you split determines *what you're evaluating*:

**Last-turn split**: split at the last user message.
Everything up to and including it is the query context; the agent's subsequent actions are the response:

```
conversation:      user1 → assistant1 → user2 → assistant2(tool) → tool_result → assistant3
query_messages:    [user1, assistant1, user2]
response_messages: [assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given all the context so far, did the agent answer the latest question well?" Best for response quality at a specific point in the conversation.

**Full-conversation split**: the first user message is the query; everything after is the response:

```
query_messages:    [user1]
response_messages: [assistant1, user2, assistant2(tool), tool_result, assistant3]
```

This evaluates: "Given the original request, did the entire conversation trajectory serve the user?" Best for task completion and overall conversation quality.

**Per-turn split**: produces N eval items from an N-turn conversation. Each turn is evaluated with its cumulative context:

```
item 1: query = [user1],                    response = [assistant1]
item 2: query = [user1, assistant1, user2], response = [assistant2(tool), tool_result, assistant3]
```

This evaluates each response independently. Best for fine-grained analysis and pinpointing where a conversation goes wrong.

These factorings produce different scores for the same conversation. The framework ships all three as built-in strategies, defaulting to last-turn. Developers can also provide a custom splitter, either a function (Python) or an `IConversationSplitter` implementation (.NET), and override the strategy at the call site or per evaluator.

### Azure AI: FoundryEvals

`Evaluator` implementation backed by Azure AI Foundry:

```python
class FoundryEvals:
    def __init__(self, *, project_client=None, openai_client=None,
                 model_deployment: str, evaluators=None, ...)
    async def evaluate(self, items, *, eval_name) -> EvalResults
```

**Smart auto-detection in `evaluate()`:**

- Default evaluators: relevance, coherence, task_adherence
- Auto-adds `tool_call_accuracy` when items have tools/`tool_definitions`
- Filters out tool evaluators for items without tools

### Azure AI: FoundryEvals Constants

```python
from agent_framework.foundry import FoundryEvals

evaluators = [FoundryEvals.RELEVANCE, FoundryEvals.TOOL_CALL_ACCURACY]
```

Categories: Agent behavior, Tool usage, Quality, Safety.

### Azure AI: Foundry-Specific Functions

| Function | What it does |
|---|---|
| `evaluate_traces()` | Evaluate from stored response IDs or OTel traces |
| `evaluate_foundry_target()` | Evaluate a Foundry-registered agent or deployment |

### Core: LocalEvaluator and Function Evaluators

`LocalEvaluator` implements the `Evaluator` protocol for fast, API-free evaluation. It runs check functions locally, which is useful for inner-loop development, CI smoke tests, and combining with cloud-based evaluators.

Built-in checks:

- `keyword_check(*keywords)`: response must contain specified keywords
- `tool_called_check(*tool_names)`: agent must have called specified tools
- `tool_calls_present`: all `expected_tool_calls` names appear in conversation (unordered, extras OK)
- `tool_call_args_match`: expected tool calls match on name + arguments (subset match on args)

Custom function evaluators use `@evaluator` to wrap plain Python functions.
The function's **parameter names** determine what data it receives from the `EvalItem`:

```python
from agent_framework import evaluator, LocalEvaluator

# Tier 1: Simple check, just the response
@evaluator
def is_concise(response: str) -> bool:
    return len(response.split()) < 500

# Tier 2: Ground truth, compare against expected output
@evaluator
def mentions_city(response: str, expected_output: str) -> bool:
    return expected_output.lower() in response.lower()

# Tier 3: Full context, inspect conversation and tools
@evaluator
def used_tools(conversation: list, tools: list) -> float:
    # ... scoring logic
    return score

local = LocalEvaluator(is_concise, mentions_city, used_tools)
```

Supported parameters: `query`, `response`, `expected_output`, `expected_tool_calls`, `conversation`, `tools`, `context`.

Return types: `bool`, `float` (≥0.5 = pass), `dict` with `score` or `passed` key, or `CheckResult`. Async functions are handled automatically: `@evaluator` detects `async def` and produces the right wrapper.

### Example: GAIA Benchmark

[GAIA](https://huggingface.co/gaia-benchmark) tests real-world multi-step tasks with known expected answers. Each task has a question and a ground-truth answer, with optional file attachments.
The framework accommodates GAIA's knobs (difficulty levels, file inputs, multi-step tool use) through the existing `EvalItem` fields:

```python
from datasets import load_dataset
from agent_framework import evaluate_agent, evaluator, LocalEvaluator

# The validation split is used here because GAIA withholds answers for its test split.
gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")

@evaluator
def exact_match(response: str, expected_output: str) -> bool:
    return expected_output.strip().lower() in response.strip().lower()

# Simple path: evaluate_agent handles running + expected_output stamping
results = await evaluate_agent(
    agent=agent,
    queries=[task["Question"] for task in gaia],
    expected_output=[task["Final answer"] for task in gaia],
    evaluators=LocalEvaluator(exact_match),
)
```

### Package Location

- Core types and orchestration: `agent_framework._eval`, `agent_framework._local_eval` (Python), `Microsoft.Agents.AI` (.NET)
- Foundry provider: `agent_framework_azure_ai._foundry_evals` (Python), `Microsoft.Agents.AI.AzureAI` (.NET)
- Azure-AI re-exports core types for convenience (Python)

## Known Limitations

1. **Tool evaluators require query + agent**: Tool evaluators need tool definition schemas. When using these evaluators with `evaluate_agent(responses=...)`, provide `queries=` and pass an agent with tool definitions.
2. **`model_deployment` always required**: Could potentially be inferred from the Foundry project configuration.

## Open Questions

1. **Red teaming non-registered agents**: Requires Foundry API support for callback-based flows.
2. **Datasets with expected outputs**: A dataset abstraction for pre-populating `expected_output` values across eval runs is a natural next step but not yet designed.
3. **Multi-modal evaluation**: The `conversation` field on `EvalItem` already stores full `Message`/`Content` (Python) and `ChatMessage` (.NET) objects, which can represent multi-modal content (images, audio, structured data).
   Evaluators that accept the full `EvalItem` or `conversation` parameter can access this content today. However, the convenience shortcuts (the `query`/`response` string projections and the `FunctionEvaluator` string overloads) are text-only. Multi-modal-aware evaluators should use the full-item path (`Func<EvalItem, bool>` in .NET, `conversation: list` parameter in Python).

## .NET Implementation Design

### Key Difference: MEAI Ecosystem

Unlike Python, the .NET ecosystem already has `Microsoft.Extensions.AI.Evaluation` (v10.3.0) providing:

- `IEvaluator`: per-item evaluation of `(messages, chatResponse) → EvaluationResult`
- `CompositeEvaluator`: combines multiple evaluators
- Quality evaluators: `RelevanceEvaluator`, `CoherenceEvaluator`, `GroundednessEvaluator`
- Safety evaluators: `ContentHarmEvaluator`, `ProtectedMaterialEvaluator`
- Metric types: `NumericMetric`, `BooleanMetric`, `StringMetric`

The .NET integration uses MEAI's `IEvaluator` directly, with no new evaluator interface. Our contribution is the **orchestration layer**: extension methods that run agents, extract data, call `IEvaluator` per item, and aggregate results.

### Architecture

```
┌──────────────────────────────────────────────────────────────┐
│  Developer Code                                              │
│    agent.EvaluateAsync(queries, evaluator)                   │
│    run.EvaluateAsync(evaluator)                              │
└────────────────┬─────────────────────────────────────────────┘
                 │
┌────────────────▼─────────────────────────────────────────────┐
│  Orchestration Layer (Microsoft.Agents.AI)                   │
│    AgentEvaluationExtensions: runs agents, extracts data,    │
│    calls IEvaluator per item, aggregates into                │
│    AgentEvaluationResults                                    │
└────────────────┬─────────────────────────────────────────────┘
                 │ IEvaluator (MEAI)
                 │
     ┌───────────┼────────────┐
     │           │            │
 ┌───▼────┐  ┌───▼────┐  ┌────▼──────────┐
 │ MEAI   │  │ Local  │  │ Foundry       │
 │ Quality│  │ Checks │  │ (cloud batch) │
 │ Safety │  │ Lambdas│  │               │
 └────────┘  └────────┘  └───────────────┘
```

All evaluators implement MEAI's `IEvaluator`.
The orchestration layer doesn't need to know which kind it is dealing with; it calls `EvaluateAsync(messages, chatResponse)` per item on all of them. `FoundryEvals` handles batching internally (buffers items, submits once, returns per-item results).

### .NET Core Types

**No new evaluator interface.** Use MEAI's `IEvaluator` directly.

**`AgentEvaluationResults`** is the only new type. It aggregates per-item MEAI `EvaluationResult`s across a batch of queries:

```csharp
public class AgentEvaluationResults
{
    public string Provider { get; init; }
    public string? ReportUrl { get; init; }

    // Per-item: standard MEAI EvaluationResult, unchanged
    public IReadOnlyList<EvaluationResult> Items { get; init; }

    // Aggregate pass/fail derived from metric interpretations
    public int Passed { get; }
    public int Failed { get; }
    public int Total { get; }
    public bool AllPassed { get; }

    // Workflow: per-agent breakdown
    public IReadOnlyDictionary<string, AgentEvaluationResults>? SubResults { get; init; }

    public void AssertAllPassed(string? message = null);
}
```

### .NET Evaluator Implementations

All implement MEAI's `IEvaluator`:

**`LocalEvaluator`**: runs lambda checks locally and returns a `BooleanMetric` per check:

```csharp
var local = new LocalEvaluator(
    FunctionEvaluator.Create("is_concise",
        (string response) => response.Split().Length < 500),
    EvalChecks.KeywordCheck("weather"),
    EvalChecks.ToolCalledCheck("get_weather"));
```

**MEAI evaluators**: used directly, no adapter needed:

```csharp
var quality = new CompositeEvaluator(
    new RelevanceEvaluator(),
    new CoherenceEvaluator());
```

**`FoundryEvals`**: implements `IEvaluator` but batches internally. On first call, buffers the item.
On the last item (or when explicitly flushed), submits the batch to Foundry and distributes per-item results:

```csharp
var foundry = new FoundryEvals(projectClient, "gpt-4o");
```

### .NET Orchestration: Extension Methods

```csharp
public static class AgentEvaluationExtensions
{
    // Evaluate an agent against test queries
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate pre-existing responses (without re-running the agent)
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this AIAgent agent,
        AgentResponse responses,
        IEvaluator evaluator,
        IEnumerable<string>? queries = null,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate with multiple evaluators (one result per evaluator)
    public static Task<IReadOnlyList<AgentEvaluationResults>> EvaluateAsync(
        this AIAgent agent,
        IEnumerable<string> queries,
        IEnumerable<IEvaluator> evaluators,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<string>? expectedOutput = null,
        CancellationToken cancellationToken = default);

    // Evaluate a workflow run with per-agent breakdown
    public static Task<AgentEvaluationResults> EvaluateAsync(
        this Run run,
        IEvaluator evaluator,
        ChatConfiguration? chatConfiguration = null,
        bool includeOverall = true,
        bool includePerAgent = true,
        CancellationToken cancellationToken = default);
}
```

**Usage:**

```csharp
// MEAI evaluators: just works
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new RelevanceEvaluator(),
    chatConfiguration: new ChatConfiguration(evalClient));

// Local checks
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new LocalEvaluator(
        EvalChecks.KeywordCheck("weather")));

// Foundry cloud
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Evaluate existing response (without re-running the agent)
var response = await agent.RunAsync("What's the weather?");
var results = await agent.EvaluateAsync(
    responses: response,
    queries: ["What's the weather?"],
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));

// Mixed: one result per evaluator
var results = await agent.EvaluateAsync(
    queries: ["What's the weather?"],
    evaluators: [
        new LocalEvaluator(EvalChecks.KeywordCheck("weather")),
        new RelevanceEvaluator(),
        new FoundryEvals(projectClient, "gpt-4o")
    ],
    chatConfiguration: new ChatConfiguration(evalClient));

// Workflow with per-agent breakdown
Run run = await workflowRunner.RunAsync(workflow, "Plan a trip");
var results = await run.EvaluateAsync(
    evaluator: new FoundryEvals(projectClient, "gpt-4o"));
```

### .NET Function Evaluators

Typed factory overloads (C# equivalent of Python's `@evaluator`):

```csharp
public static class FunctionEvaluator
{
    public static EvalCheck Create(string name, Func<string, bool> check);   // response only
    public static EvalCheck Create(string name, Func check);                 // expectedOutput
    public static EvalCheck Create(string name, Func<EvalItem, bool> check); // full item
    public static EvalCheck Create(string name, Func check);                 // full control
    public static EvalCheck Create(string name, Func> check);                // async
}
```

`EvalItem` is a lightweight class used only by `FunctionEvaluator` and `LocalEvaluator` to pass context to check functions. It is not part of the `IEvaluator` interface:

```csharp
public record ExpectedToolCall(string Name, IReadOnlyDictionary? Arguments = null);

public sealed class EvalItem
{
    public EvalItem(string query, string response, IReadOnlyList<ChatMessage> conversation);

    public string Query { get; }
    public string Response { get; }
    public IReadOnlyList<ChatMessage> Conversation { get; }
    public IReadOnlyList? Tools { get; set; }
    public string? ExpectedOutput { get; set; }
    public IReadOnlyList<ExpectedToolCall>? ExpectedToolCalls { get; set; }
    public string? Context { get; set; }
    public IConversationSplitter? Splitter { get; set; }
}
```

### Workflow Data Extraction (.NET)

`run.EvaluateAsync()` walks `Run.OutgoingEvents` via LINQ:

1. Pair `ExecutorInvokedEvent` / `ExecutorCompletedEvent` by `ExecutorId`
2. Extract `AgentResponseEvent` for per-agent `ChatResponse`
3. Call `evaluator.EvaluateAsync()` per invocation
4. Group by `ExecutorId` for per-agent `SubResults`
5. Use final workflow output for overall eval

### .NET Package Structure

| Package | Contents |
|---------|----------|
| `Microsoft.Agents.AI` | `IAgentEvaluator`, `AgentEvaluationResults`, `LocalEvaluator`, `FunctionEvaluator`, `EvalChecks`, `EvalItem`, `ExpectedToolCall`, `AgentEvaluationExtensions` |
| `Microsoft.Agents.AI.AzureAI` | `FoundryEvals` (provider + constants) |

### Python ↔ .NET Mapping

| Python | .NET |
|--------|------|
| `Evaluator` protocol | `IAgentEvaluator` (our interface; MEAI provides `IEvaluator` for per-item scoring) |
| `EvalItem` dataclass | `EvalItem` class |
| `EvalResults` | `AgentEvaluationResults` |
| `EvalItemResult` / `EvalScoreResult` | MEAI `EvaluationResult` / `EvaluationMetric` (reused) |
| `LocalEvaluator` | `LocalEvaluator` (implements `IAgentEvaluator`) |
| `@evaluator` | `FunctionEvaluator.Create()` overloads |
| `keyword_check()` / `tool_called_check()` | `EvalChecks.KeywordCheck()` / `EvalChecks.ToolCalledCheck()` |
| `tool_calls_present` / `tool_call_args_match` | (custom `FunctionEvaluator`, same pattern) |
| `ExpectedToolCall` dataclass | `ExpectedToolCall` record |
| `FoundryEvals` | `FoundryEvals` (implements `IAgentEvaluator`, includes evaluator name constants) |
| `evaluate_agent()` | `agent.EvaluateAsync(queries, evaluator)` extension method |
| `evaluate_agent(responses=)` | `agent.EvaluateAsync(responses, evaluator)` extension method |
| `evaluate_workflow()` | `run.EvaluateAsync()` extension method |

## More Information

- [Foundry Evals documentation](https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-approach-gen-ai): Azure AI Foundry evaluation overview