3 Commits

  • .NET: Align Foundry sample environment variables and credentials. (#6422)
    * dotnet: refresh Foundry sample guidance
    
    Carry forward the still-relevant sample guidance and Foundry-specific documentation fixes from the old stacked sample migration work, adapted to the current repo layout and policy.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * dotnet: rename Foundry sample env vars
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * dotnet: remove persistent provider sample
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * dotnet: drop SAMPLE_GUIDELINES.md from this PR
    
    Defer the guidelines doc and its cross-link to a follow-on PR to avoid broken-link failures in CI.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * dotnet: add DefaultAzureCredential warning to remaining samples
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * dotnet: address PR review feedback
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  • .NET: feat(evals): add ground_truth/expected_output support for workflow evaluation (#5755)
    * .NET: feat(evals): add ground_truth/expected_output support for workflow eval
    
    Brings .NET to parity with Python PR #5234 for issue #5135:
    
    - Add expectedOutput parameter to Run.EvaluateAsync (workflow) and stamp on the overall EvalItem.ExpectedOutput.
    - Map EvalItem.ExpectedOutput -> ground_truth in the Foundry JSONL payload, item_schema, and data_mapping for similarity.
    - Add GroundTruthEvaluators set (currently builtin.similarity) and a FindMissingGroundTruthEvaluators helper.
    - Fail fast with InvalidOperationException when a ground-truth evaluator is selected but no item provides an ExpectedOutput, instead of surfacing a remote provider error.
    - Add tests in FoundryEvalConverterTests and WorkflowEvaluationTests.
    - Add Evaluation_WorkflowExpectedOutputs sample (workflow + Foundry similarity).
    
    Fixes microsoft/agent-framework#5135 (.NET side).
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Address review: relax BuildOverallItem events to IReadOnlyList<WorkflowEvent>
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Sample: disable per-agent breakdown when using reference-based evaluator
    
    Per-agent EvalItems are intentionally left without ExpectedOutput, so the new fail-fast validation in FoundryEvals would throw when Similarity is invoked for per-agent items. Pass includePerAgent: false in the workflow + similarity sample, and document this gotcha in the EvaluateAsync XML doc.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix BuildOverallItem: fall back to last ExecutorCompletedEvent
    
    AgentResponseEvent is only emitted when AIAgentHostOptions.EmitAgentResponseEvents is enabled, which is not the default for WorkflowBuilder(agent).AddEdge(...). When it is absent, fall back to the last non-internal ExecutorCompletedEvent whose Data is an AgentResponse / ChatMessage / string so the overall EvalItem (and any expectedOutput) is produced. Without this, samples wired up the standard way returned 0 evaluation items.
    
    Update test to cover the fallback path.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Sample: enable EmitAgentResponseEvents; eval throws clear error when no overall response found
    
    Root cause of '0 results': AIAgentHostExecutor only emits AgentResponseEvent when AIAgentHostOptions.EmitAgentResponseEvents is true (default false). For ordinary AIAgent executors the runtime's ExecutorCompletedEvent.Data is null, so the prior fallback couldn't find a final response either.
    
    Sample now builds executors with EmitAgentResponseEvents=true via BindAsExecutor(hostOptions). EvaluateAsync now throws InvalidOperationException with a remediation hint when the user supplies expectedOutput but no overall final response can be located, instead of silently returning 0/0.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Guard against null sample/error/usage/datasource_item in ParseDetailedItem
    
    Foundry eval responses can have these properties present with JSON null
    or non-object values, which caused JsonElement.TryGetProperty to throw
    'requires Object, has Null'. Check ValueKind == Object before drilling in.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Address PR review: reorder expectedOutput, tighten ground-truth check, add fail-fast test
    
    * WorkflowEvaluationExtensions.EvaluateAsync: move 'expectedOutput' to
      after 'splitter' so the original positional contract of (splitter,
      cancellationToken) is preserved for existing callers.
    * FoundryEvals: require ALL items to carry ExpectedOutput when a
      ground-truth evaluator is selected (e.g. similarity), not just any.
      Reference-based evaluators score per-item, so a single missing GT
      would still surface as a provider-side validation error. Updated
      fail-fast message accordingly.
    * WorkflowEvaluationTests: add EvaluateAsync_WithExpectedOutputButNoFinalResponse_ThrowsAsync
      to verify the InvalidOperationException is thrown (and that the
      message mentions EmitAgentResponseEvents).
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fail-fast on missing overall item regardless of expectedOutput; harden BuildOverallItem default
    
    * EvaluateAsync now throws InvalidOperationException whenever 'includeOverall'
      is requested but BuildOverallItem cannot produce an item, instead of only
      when 'expectedOutput' is supplied. Same misconfiguration (agents not bound
      with EmitAgentResponseEvents) used to silently return empty results — now
      it surfaces a clear, actionable error in both cases.
    * BuildOverallItem switch default now throws instead of returning null. The
      preceding for-loop already constrains Data to AgentResponse/ChatMessage/
      string, so reaching default would indicate a contract drift; throw to make
      the bug visible.
    * Test renamed and broadened to verify the throw fires without expectedOutput.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: alliscode <25218250+alliscode@users.noreply.github.com>
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  • .NET: Foundry Evals integration for .NET (#4914)
    * Foundry Evals integration for .NET
    
    - Core evaluation framework: EvalItem, LocalEvaluator, FunctionEvaluator, EvalChecks
    - IAgentEvaluator interface with MeaiEvaluatorAdapter bridge
    - AgentEvaluationExtensions for agent.EvaluateAsync() overloads
    - FoundryEvals wrapping MEAI quality/safety evaluators
    - ConversationSplitters (LastTurn, Full) and IConversationSplitter
    - EvalItem.PerTurnItems() for multi-turn decomposition
    - HasImageContent for multimodal content detection
    - WorkflowEvaluationExtensions for per-agent workflow evaluation
    - 7 eval samples mirroring Python parity:
      02-agents/Evaluation: SimpleEval, ExpectedOutputs, Multimodal
      03-workflows/Evaluation: WorkflowEval
      05-end-to-end/Evaluation: FoundryQuality, MixedProviders, ConversationSplits
    - Comprehensive unit tests (1958 passing)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Rewrite FoundryEvals to use real Foundry Evals API
    
    Replace MEAI evaluator shim with actual OpenAI EvaluationClient protocol
    methods. FoundryEvals now creates eval definitions, submits runs, polls
    for completion, and fetches per-item results server-side.
    
    - New constructor: FoundryEvals(AIProjectClient, model, evaluators)
    - Add FoundryEvalConverter for MEAI ChatMessage -> Foundry JSON format
    - Add EvalId, RunId, ReportUrl to AgentEvaluationResults
    - All 20 built-in evaluator constants now work (agent, tool, quality, safety)
    - Remove Microsoft.Extensions.AI.Evaluation.Quality/Safety dependencies
    - Update all samples for new constructor (no more ChatConfiguration)
    - Replace BuildEvaluators tests with ResolveEvaluator tests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Add response output to CustomEvals and ExpectedOutputs samples
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Address review: pagination, validation, error handling, tests
    
    FoundryEvals fixes:
    - Add pagination for output items (has_more/after cursor)
    - Add guard clauses for pollIntervalSeconds/timeoutSeconds <= 0
    - Fix double TryGetProperty for passed field parsing
    - Throw on all-tool-evaluators with no tool definitions
    - Fix XML doc (default 300s, not 180s)
    
    New tests (30 added, 1989 total):
    - EvalChecks: NonEmpty, ContainsExpected (pass/fail/skip/case),
      HasImageContent, ToolCallsPresent
    - FoundryEvalConverter: ConvertMessage (text, image, function call,
      function results fan-out, empty fallback, mixed content),
      ConvertEvalItem, BuildTestingCriteria (quality/agent/tool/groundedness
      data mappings), BuildItemSchema
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix review: null-refs, Data.ToString() bug, ContainsExpected, add tests
    
    - Fix NullReferenceException in sample Response display (pattern matching)
    - Fix WorkflowEvaluationExtensions Data?.ToString() producing type names
      instead of message text (pattern-match ChatMessage/AgentResponse/list)
    - Change EvalChecks.ContainsExpected to return Passed=false when no
      ExpectedOutput (was silently passing, masking misconfiguration)
    - Add EvalItem constructor tests with LastTurn/Full/null splitters
    - Add FoundryEvalConverter.ConvertMessage DataContent (base64 image) test
    - Add ExtractAgentData tests with ChatMessage, list, and AgentResponse data
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Fix review: conversation fidelity, eval caching, fallback tests
    
    - WorkflowEvaluationExtensions: preserve full response messages (tool calls,
      intermediate) instead of synthetic 2-message conversation. Cast completed
      Data to AgentResponse and use Messages when available, fallback to text.
    - FoundryEvals: cache evalId per schema shape (hasContext, hasTools) so
      subsequent EvaluateAsync calls create runs under the same eval definition.
    - MeaiEvaluatorAdapter: code already correctly passes queryMessages (not full
      conversation) to IEvaluator — no change needed, verified by inspection.
    - Add tests: AgentResponse full messages preservation, unknown object
      ToString() fallback for ExtractAgentData.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Rename AzureAI→Foundry: move eval files, update references
    
    - Move FoundryEvals.cs and FoundryEvalConverter.cs from
      Microsoft.Agents.AI.AzureAI to Microsoft.Agents.AI.Foundry
    - Update namespace from AzureAI to Foundry in both files
    - Add explicit usings required by Foundry project (no implicit usings)
    - Move FoundryEvalConverter tests to Foundry.UnitTests project
      (avoids ReplacingRedactor type conflict from dual project refs)
    - Update all sample csproj references and using statements
    - Remove Foundry project reference from AI UnitTests
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * PR review round 4: wire up tool extraction, remove eval cache, fix null safety
    
    - BuildEvalItem: extract tools from agent via GetService<ChatOptions>() into EvalItem.Tools (Python parity)
    - FoundryEvals: remove eval ID cache - each call creates fresh definition (matches Python behavior)
    - FoundryEvals: replace null-forgiving operators with descriptive InvalidOperationException
    - MixedProviders sample: remove unnecessary explicit PackageReferences (transitively provided)
    - FoundryEvalConverter: document that tool results take precedence over text content
    - Add LocalEvaluator zero-checks test documenting 0 metrics = failed behavior
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Python-dotnet parity: 9 feature gaps filled
    
    New checks:
    - ToolCallArgsMatch() — verify tool call names + argument subset match
    - ToolCalledCheck(ToolCalledMode.Any, ...) — match any of the specified tools
    - ToolCalledMode enum (All/Any)
    
    FoundryEvals enhancements:
    - Default evaluators now [Relevance, Coherence, TaskAdherence] (was Relevance, Coherence)
    - Auto-add ToolCallAccuracy when items have tool definitions
    - EvaluateTracesAsync — evaluate by response_ids, trace_ids, or agent_id
    - EvaluateFoundryTargetAsync — evaluate deployed Foundry targets
    
    Result type enrichment:
    - AgentEvaluationResults: added Status, Error, PerEvaluator, DetailedItems
    - New EvalItemResult/EvalScoreResult/PerEvaluatorResult types
    - FoundryEvals populates all new fields from API responses
    
    Workflow fix:
    - Skip internal executors (_*, input-conversation, end-conversation, end)
    
    Tests: 8 new tests covering ToolCallArgsMatch, ToolCalledMode.Any, internal executor filtering
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Add MeaiEvaluatorAdapter and PerTurnItems edge case tests
    
    - 3 tests for MeaiEvaluatorAdapter: query message forwarding, synthetic
      response fallback, multiple items aggregation
    - 3 tests for EvalItem.PerTurnItems: empty conversation, no user messages,
      system+assistant only
    - StubEvaluator and StubChatClient test helpers
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Blocking link check for outdated package in DevUI.
    
    * Replace Dictionary<string, object> payloads with typed wire models
    
    Introduce internal FoundryEvalWireModels.cs with compile-time-safe types
    for the OpenAI Evals API wire format. The OpenAI .NET SDK (2.9.1) only
    provides protocol-level methods with BinaryContent/ClientResult — no
    typed request models. These internal models replace scattered dictionary
    literals with [JsonPropertyName]-annotated classes, giving:
    
    - Compile-time safety (typos become build errors)
    - Single point of change when the API evolves
    - IntelliSense discoverability
    - Cleaner serialization via JsonPolymorphic for content items
    
    Models: WireContentItem hierarchy (text, image, tool_call, tool_result),
    WireMessage, WireEvalItemPayload, WireTestingCriterion, WireItemSchema,
    WireCreateEvalRequest, WireCreateRunRequest, and data source variants.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Skip metric when Foundry returns neither score nor passed
    
    When an evaluator returns no score and no passed value, the previous
    code created BooleanMetric(name, false), which falsely failed items
    via ItemPassed. Now we skip the MEAI metric entirely for indeterminate
    results — the raw data remains available in DetailedItems for diagnostics.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Address PR #4914 review comments: fix tool evaluator bug and add tests
    
    - Fix duplicate ToolCallAccuracy: resolve evaluator names before checking
      against ToolEvaluators set (Comment 2)
    - Make FilterToolEvaluators internal for testability; add tests for the
      ArgumentException edge case when all evaluators are tool-type (Comment 3)
    - Add CancellationToken test for LocalEvaluator (Comment 4)
    - Add EvaluateAsync integration test on Run with sequential workflow and
      per-agent SubResults verification (Comment 5)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Address Peter's review comments on PR #4914
    
    - Add trailing newline to Evaluation_FoundryQuality.csproj (Comment 6)
    - Make evaluator name lookups case-insensitive: switch BuiltinEvaluators,
      ToolEvaluators, AgentEvaluators, and ResolveEvaluator's StartsWith check
      from Ordinal to OrdinalIgnoreCase (Comment 7)
    - Add Trace.TraceWarning when Foundry returns fewer results than submitted
      items, indicating expected vs actual count before padding (Comment 8)
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    * Add Microsoft.Extensions.AI.Evaluation packages to Directory.Packages.props
    
    These were removed in #5269 as unused, but are needed by the Foundry
    and core evaluation integration added in this PR.
    
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    
    ---------
    
    Co-authored-by: alliscode <bentho@microsoft.com>
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>