mirror of
https://github.com/microsoft/agent-framework.git
synced 2026-06-16 21:04:09 +08:00
aee1acbf8b
* Foundry Evals integration for .NET

- Core evaluation framework: EvalItem, LocalEvaluator, FunctionEvaluator, EvalChecks
- IAgentEvaluator interface with MeaiEvaluatorAdapter bridge
- AgentEvaluationExtensions for agent.EvaluateAsync() overloads
- FoundryEvals wrapping MEAI quality/safety evaluators
- ConversationSplitters (LastTurn, Full) and IConversationSplitter
- EvalItem.PerTurnItems() for multi-turn decomposition
- HasImageContent for multimodal content detection
- WorkflowEvaluationExtensions for per-agent workflow evaluation
- 7 eval samples mirroring Python parity:
  02-agents/Evaluation: SimpleEval, ExpectedOutputs, Multimodal
  03-workflows/Evaluation: WorkflowEval
  05-end-to-end/Evaluation: FoundryQuality, MixedProviders, ConversationSplits
- Comprehensive unit tests (1958 passing)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rewrite FoundryEvals to use real Foundry Evals API

Replace MEAI evaluator shim with actual OpenAI EvaluationClient protocol methods. FoundryEvals now creates eval definitions, submits runs, polls for completion, and fetches per-item results server-side.

- New constructor: FoundryEvals(AIProjectClient, model, evaluators)
- Add FoundryEvalConverter for MEAI ChatMessage -> Foundry JSON format
- Add EvalId, RunId, ReportUrl to AgentEvaluationResults
- All 20 built-in evaluator constants now work (agent, tool, quality, safety)
- Remove Microsoft.Extensions.AI.Evaluation.Quality/Safety dependencies
- Update all samples for new constructor (no more ChatConfiguration)
- Replace BuildEvaluators tests with ResolveEvaluator tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add response output to CustomEvals and ExpectedOutputs samples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review: pagination, validation, error handling, tests

FoundryEvals fixes:
- Add pagination for output items (has_more/after cursor)
- Add guard clauses for pollIntervalSeconds/timeoutSeconds <= 0
- Fix double TryGetProperty for passed field parsing
- Throw on all-tool-evaluators with no tool definitions
- Fix XML doc (default 300s, not 180s)

New tests (30 added, 1989 total):
- EvalChecks: NonEmpty, ContainsExpected (pass/fail/skip/case), HasImageContent, ToolCallsPresent
- FoundryEvalConverter: ConvertMessage (text, image, function call, function results fan-out, empty fallback, mixed content), ConvertEvalItem, BuildTestingCriteria (quality/agent/tool/groundedness data mappings), BuildItemSchema

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix review: null-refs, Data.ToString() bug, ContainsExpected, add tests

- Fix NullReferenceException in sample Response display (pattern matching)
- Fix WorkflowEvaluationExtensions Data?.ToString() producing type names instead of message text (pattern-match ChatMessage/AgentResponse/list)
- Change EvalChecks.ContainsExpected to return Passed=false when no ExpectedOutput (was silently passing, masking misconfiguration)
- Add EvalItem constructor tests with LastTurn/Full/null splitters
- Add FoundryEvalConverter.ConvertMessage DataContent (base64 image) test
- Add ExtractAgentData tests with ChatMessage, list, and AgentResponse data

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix review: conversation fidelity, eval caching, fallback tests

- WorkflowEvaluationExtensions: preserve full response messages (tool calls, intermediate) instead of synthetic 2-message conversation. Cast completed Data to AgentResponse and use Messages when available, fallback to text.
- FoundryEvals: cache evalId per schema shape (hasContext, hasTools) so subsequent EvaluateAsync calls create runs under the same eval definition.
- MeaiEvaluatorAdapter: code already correctly passes queryMessages (not full conversation) to IEvaluator; no change needed, verified by inspection.
- Add tests: AgentResponse full messages preservation, unknown object ToString() fallback for ExtractAgentData.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename AzureAI→Foundry: move eval files, update references

- Move FoundryEvals.cs and FoundryEvalConverter.cs from Microsoft.Agents.AI.AzureAI to Microsoft.Agents.AI.Foundry
- Update namespace from AzureAI to Foundry in both files
- Add explicit usings required by Foundry project (no implicit usings)
- Move FoundryEvalConverter tests to Foundry.UnitTests project (avoids ReplacingRedactor type conflict from dual project refs)
- Update all sample csproj references and using statements
- Remove Foundry project reference from AI UnitTests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* PR review round 4: wire up tool extraction, remove eval cache, fix null safety

- BuildEvalItem: extract tools from agent via GetService<ChatOptions>() into EvalItem.Tools (Python parity)
- FoundryEvals: remove eval ID cache - each call creates fresh definition (matches Python behavior)
- FoundryEvals: replace null-forgiving operators with descriptive InvalidOperationException
- MixedProviders sample: remove unnecessary explicit PackageReferences (transitively provided)
- FoundryEvalConverter: document that tool results take precedence over text content
- Add LocalEvaluator zero-checks test documenting 0 metrics = failed behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python-dotnet parity: 9 feature gaps filled

New checks:
- ToolCallArgsMatch(): verify tool call names + argument subset match
- ToolCalledCheck(ToolCalledMode.Any, ...): match any of the specified tools
- ToolCalledMode enum (All/Any)

FoundryEvals enhancements:
- Default evaluators now [Relevance, Coherence, TaskAdherence] (was Relevance, Coherence)
- Auto-add ToolCallAccuracy when items have tool definitions
- EvaluateTracesAsync: evaluate by response_ids, trace_ids, or agent_id
- EvaluateFoundryTargetAsync: evaluate deployed Foundry targets

Result type enrichment:
- AgentEvaluationResults: added Status, Error, PerEvaluator, DetailedItems
- New EvalItemResult/EvalScoreResult/PerEvaluatorResult types
- FoundryEvals populates all new fields from API responses

Workflow fix:
- Skip internal executors (_*, input-conversation, end-conversation, end)

Tests: 8 new tests covering ToolCallArgsMatch, ToolCalledMode.Any, internal executor filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add MeaiEvaluatorAdapter and PerTurnItems edge case tests

- 3 tests for MeaiEvaluatorAdapter: query message forwarding, synthetic response fallback, multiple items aggregation
- 3 tests for EvalItem.PerTurnItems: empty conversation, no user messages, system+assistant only
- StubEvaluator and StubChatClient test helpers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Blocking link check for outdated package in DevUI.

* Replace Dictionary<string, object> payloads with typed wire models

Introduce internal FoundryEvalWireModels.cs with compile-time-safe types for the OpenAI Evals API wire format. The OpenAI .NET SDK (2.9.1) only provides protocol-level methods with BinaryContent/ClientResult and no typed request models. These internal models replace scattered dictionary literals with [JsonPropertyName]-annotated classes, giving:

- Compile-time safety (typos become build errors)
- Single point of change when the API evolves
- IntelliSense discoverability
- Cleaner serialization via JsonPolymorphic for content items

Models: WireContentItem hierarchy (text, image, tool_call, tool_result), WireMessage, WireEvalItemPayload, WireTestingCriterion, WireItemSchema, WireCreateEvalRequest, WireCreateRunRequest, and data source variants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Skip metric when Foundry returns neither score nor passed

When an evaluator returns no score and no passed value, the previous code created BooleanMetric(name, false), which falsely failed items via ItemPassed. Now we skip the MEAI metric entirely for indeterminate results; the raw data remains available in DetailedItems for diagnostics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #4914 review comments: fix tool evaluator bug and add tests

- Fix duplicate ToolCallAccuracy: resolve evaluator names before checking against ToolEvaluators set (Comment 2)
- Make FilterToolEvaluators internal for testability; add tests for the ArgumentException edge case when all evaluators are tool-type (Comment 3)
- Add CancellationToken test for LocalEvaluator (Comment 4)
- Add EvaluateAsync integration test on Run with sequential workflow and per-agent SubResults verification (Comment 5)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address Peter's review comments on PR #4914

- Add trailing newline to Evaluation_FoundryQuality.csproj (Comment 6)
- Make evaluator name lookups case-insensitive: switch BuiltinEvaluators, ToolEvaluators, AgentEvaluators, and ResolveEvaluator's StartsWith check from Ordinal to OrdinalIgnoreCase (Comment 7)
- Add Trace.TraceWarning when Foundry returns fewer results than submitted items, indicating expected vs actual count before padding (Comment 8)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add Microsoft.Extensions.AI.Evaluation packages to Directory.Packages.props

These were removed in #5269 as unused, but are needed by the Foundry and core evaluation integration added in this PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: alliscode <bentho@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
327 lines
12 KiB
C#
// Copyright (c) Microsoft. All rights reserved.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

namespace Microsoft.Agents.AI.Workflows.UnitTests;

/// <summary>
/// Tests for <see cref="WorkflowEvaluationExtensions.ExtractAgentData"/>.
/// </summary>
public sealed class WorkflowEvaluationTests
{
    [Fact]
    public void ExtractAgentData_EmptyEvents_ReturnsEmpty()
    {
        var result = WorkflowEvaluationExtensions.ExtractAgentData(new List<WorkflowEvent>(), splitter: null);

        Assert.Empty(result);
    }

    [Fact]
    public void ExtractAgentData_MatchedPair_ReturnsItem()
    {
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "What is the weather?"),
            new ExecutorCompletedEvent("agent-1", "It's sunny."),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.True(result.ContainsKey("agent-1"));
        Assert.Single(result["agent-1"]);
        Assert.Equal("What is the weather?", result["agent-1"][0].Query);
        Assert.Equal("It's sunny.", result["agent-1"][0].Response);
        Assert.Equal(2, result["agent-1"][0].Conversation.Count);
    }

    [Fact]
    public void ExtractAgentData_UnmatchedInvocation_NotIncluded()
    {
        // An invocation without a matching completion should not appear in results
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "Hello"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Empty(result);
    }

    [Fact]
    public void ExtractAgentData_CompletionWithoutInvocation_NotIncluded()
    {
        // A completion without a prior invocation should not appear in results
        var events = new List<WorkflowEvent>
        {
            new ExecutorCompletedEvent("agent-1", "Response"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Empty(result);
    }

    [Fact]
    public void ExtractAgentData_MultipleAgents_SeparatedByExecutorId()
    {
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "Q1"),
            new ExecutorInvokedEvent("agent-2", "Q2"),
            new ExecutorCompletedEvent("agent-1", "A1"),
            new ExecutorCompletedEvent("agent-2", "A2"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Equal(2, result.Count);
        Assert.Equal("Q1", result["agent-1"][0].Query);
        Assert.Equal("A1", result["agent-1"][0].Response);
        Assert.Equal("Q2", result["agent-2"][0].Query);
        Assert.Equal("A2", result["agent-2"][0].Response);
    }

    [Fact]
    public void ExtractAgentData_DuplicateExecutorId_LastInvocationUsed()
    {
        // If the same executor is invoked twice before completing,
        // the second invocation overwrites the first
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "First question"),
            new ExecutorInvokedEvent("agent-1", "Second question"),
            new ExecutorCompletedEvent("agent-1", "Answer"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Single(result["agent-1"]);
        Assert.Equal("Second question", result["agent-1"][0].Query);
    }

    [Fact]
    public void ExtractAgentData_MultipleRoundsForSameExecutor_AllCaptured()
    {
        // Same executor invoked→completed twice (sequential rounds)
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "Q1"),
            new ExecutorCompletedEvent("agent-1", "A1"),
            new ExecutorInvokedEvent("agent-1", "Q2"),
            new ExecutorCompletedEvent("agent-1", "A2"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result); // one executor
        Assert.Equal(2, result["agent-1"].Count); // two items
        Assert.Equal("Q1", result["agent-1"][0].Query);
        Assert.Equal("Q2", result["agent-1"][1].Query);
    }

    [Fact]
    public void ExtractAgentData_NullData_UsesEmptyString()
    {
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", null!),
            new ExecutorCompletedEvent("agent-1", null),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Equal(string.Empty, result["agent-1"][0].Query);
        Assert.Equal(string.Empty, result["agent-1"][0].Response);
    }

    [Fact]
    public void ExtractAgentData_WithSplitter_SetOnItems()
    {
        var splitter = ConversationSplitters.LastTurn;
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "Q"),
            new ExecutorCompletedEvent("agent-1", "A"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter);

        Assert.Equal(splitter, result["agent-1"][0].Splitter);
    }

    [Fact]
    public void ExtractAgentData_ChatMessageData_ExtractsText()
    {
        // When Data is a ChatMessage, the fix should extract .Text instead of type name
        var queryMsg = new ChatMessage(ChatRole.User, "What is the weather?");
        var responseMsg = new ChatMessage(ChatRole.Assistant, "It's sunny.");
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", queryMsg),
            new ExecutorCompletedEvent("agent-1", responseMsg),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Equal("What is the weather?", result["agent-1"][0].Query);
        Assert.Equal("It's sunny.", result["agent-1"][0].Response);
    }

    [Fact]
    public void ExtractAgentData_ChatMessageListData_ExtractsLastUserText()
    {
        // When Data is IReadOnlyList<ChatMessage>, extract last user message text
        IReadOnlyList<ChatMessage> messages = new List<ChatMessage>
        {
            new(ChatRole.User, "First question"),
            new(ChatRole.Assistant, "First answer"),
            new(ChatRole.User, "Follow-up question"),
        };

        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", messages),
            new ExecutorCompletedEvent("agent-1", "Response text"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Equal("Follow-up question", result["agent-1"][0].Query);
    }

    [Fact]
    public void ExtractAgentData_AgentResponseData_ExtractsText()
    {
        // When completed Data is an AgentResponse, extract .Text
        var agentResponse = new AgentResponse(new ChatMessage(ChatRole.Assistant, "Agent says hello"));
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "Hi there"),
            new ExecutorCompletedEvent("agent-1", agentResponse),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Equal("Hi there", result["agent-1"][0].Query);
        Assert.Equal("Agent says hello", result["agent-1"][0].Response);
    }

    [Fact]
    public void ExtractAgentData_AgentResponseData_PreservesFullMessages()
    {
        // When completed Data is an AgentResponse, the conversation should include
        // all response messages (tool calls, intermediate, etc.) not just a text summary
        var toolCallMsg = new ChatMessage(ChatRole.Assistant, [new FunctionCallContent("call_1", "get_weather", new Dictionary<string, object?> { ["city"] = "Seattle" })]);
        var toolResultMsg = new ChatMessage(ChatRole.Tool, [new FunctionResultContent("call_1", "Sunny, 72°F")]);
        var finalMsg = new ChatMessage(ChatRole.Assistant, "It's sunny and 72°F in Seattle.");
        var agentResponse = new AgentResponse
        {
            Messages = [toolCallMsg, toolResultMsg, finalMsg],
        };

        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", "What's the weather?"),
            new ExecutorCompletedEvent("agent-1", agentResponse),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        // Should have user query + all 3 response messages
        Assert.Equal(4, result["agent-1"][0].Conversation.Count);
        Assert.Equal(ChatRole.User, result["agent-1"][0].Conversation[0].Role);
        Assert.Equal(ChatRole.Assistant, result["agent-1"][0].Conversation[1].Role);
        Assert.Equal(ChatRole.Tool, result["agent-1"][0].Conversation[2].Role);
        Assert.Equal(ChatRole.Assistant, result["agent-1"][0].Conversation[3].Role);
    }

    [Fact]
    public void ExtractAgentData_UnknownObjectData_UsesToString()
    {
        // When Data is an unknown object type, the ToString() fallback should produce
        // the string representation (not a type name for known types)
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("agent-1", 42),
            new ExecutorCompletedEvent("agent-1", 3.14),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.Equal("42", result["agent-1"][0].Query);
        Assert.Equal("3.14", result["agent-1"][0].Response);
    }

    [Fact]
    public void ExtractAgentData_SkipsInternalExecutors()
    {
        var events = new List<WorkflowEvent>
        {
            new ExecutorInvokedEvent("_internal", "internal query"),
            new ExecutorCompletedEvent("_internal", "internal response"),
            new ExecutorInvokedEvent("input-conversation", "start"),
            new ExecutorCompletedEvent("input-conversation", "done"),
            new ExecutorInvokedEvent("end-conversation", "end query"),
            new ExecutorCompletedEvent("end-conversation", "end response"),
            new ExecutorInvokedEvent("end", "end query"),
            new ExecutorCompletedEvent("end", "end response"),
            new ExecutorInvokedEvent("real-agent", "real query"),
            new ExecutorCompletedEvent("real-agent", "real response"),
        };

        var result = WorkflowEvaluationExtensions.ExtractAgentData(events, splitter: null);

        Assert.Single(result);
        Assert.True(result.ContainsKey("real-agent"));
        Assert.DoesNotContain("_internal", result.Keys);
        Assert.DoesNotContain("input-conversation", result.Keys);
        Assert.DoesNotContain("end-conversation", result.Keys);
        Assert.DoesNotContain("end", result.Keys);
    }

    // ---------------------------------------------------------------
    // EvaluateAsync integration test
    // ---------------------------------------------------------------

    [Fact]
    public async Task EvaluateAsync_WithSequentialWorkflow_ReturnsPerAgentSubResultsAsync()
    {
        // Arrange: two agents in a sequential workflow
        var agent1 = new TestEchoAgent(name: "agent-one");
        var agent2 = new TestEchoAgent(name: "agent-two");
        var workflow = AgentWorkflowBuilder.BuildSequential(agent1, agent2);
        var input = new List<ChatMessage> { new(ChatRole.User, "Hello world") };

        var evaluator = new LocalEvaluator(
            FunctionEvaluator.Create("has_content", (EvalItem item) => item.Conversation.Count > 0));

        // Act
        await using var run = await InProcessExecution.RunAsync(workflow, input);
        var results = await run.EvaluateAsync(evaluator, includeOverall: false, includePerAgent: true);

        // Assert: results returned
        Assert.NotNull(results);

        // Assert: per-agent sub-results are populated
        Assert.NotNull(results.SubResults);
        Assert.True(results.SubResults.Count >= 2, $"Expected at least 2 agent sub-results, got {results.SubResults.Count}");

        // Each sub-result should have evaluated items
        foreach (var (agentId, subResult) in results.SubResults)
        {
            Assert.True(subResult.Total > 0, $"Agent '{agentId}' should have at least one evaluated item");
        }
    }
}