Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)

* Python: feat(evals): RubricScore type + EvalScoreResult.dimensions

* Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators=

* Python: feat(evals): parse rubric_scores from output items + assertion helpers

* Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source

* Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper

* Python: feat(foundry-evals): YAML config loader + sample

* Python: fix(evals): address PR review feedback

  Addresses 4 Copilot review comments on PR #6101:

  1. assert_dimension_score_at_least: drop the (not evaluator or found_any)
     guard so require_applicable=True correctly raises when the named
     evaluator produces no entries for the dimension. Adds
     TestRubricAssertions covering the regression.
  2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour
     (pinning recommended, not required) so it matches the dataclass default
     and FoundryEvals warning path.
  3. _poll_generation_job: switch from asyncio.get_event_loop() to
     get_running_loop() and bound the per-iteration sleep by remaining time,
     matching _poll_eval_run.
  4. generate_rubric: type category as Literal['quality','safety'] and
     validate at the entry point with a ValueError; drop the silent
     'invalid -> quality' rewrite in _generation_job_to_ref. Adds a
     regression test.

* Python: feat(foundry-evals): hosted-agent-aware rubric generation

  * Auto-detect hosted Foundry agents in agent_as_eval_source: when the
    agent's chat_client exposes a string agent_name (the convention used by
    RawFoundryAgentChatClient for PromptAgents/HostedAgents), emit a
    type='agent' EvalGenerationSource so the service fetches instructions
    and tools from the agent registry instead of relying on the local
    wrapper (which holds neither for hosted agents).
  * Add hosted_agent_version kwarg and a new agent_version field on
    EvalGenerationSource so PromptAgent runs can pin to a specific hosted
    version for reproducible rubric generation.
  * Add force_prompt_source escape hatch to bypass auto-detection and always
    emit a rendered prompt dossier - useful when the local wrapper carries
    overrides the service-side agent doesn't see.
  * Fix _to_sdk_source for dataset sources: SDK ctor takes name=/version=,
    not dataset_name=/dataset_version=. The mismatch would raise TypeError
    against the real azure-ai-projects 2.3.0a* SDK; only unmocked
    integration paths were affected.

  Tests cover: auto-detection happy path, versionless hosted agent, explicit
  hosted_agent_version forwarding, force_prompt_source override, non-string
  chat_client attrs (MagicMock test doubles) not mis-detected, agent_version
  forwarded through _to_sdk_source, and the corrected dataset SDK kwarg
  names.

* fix(foundry-evals): accept canonical dimension_scores key per docs

  The published Foundry rubric-evaluator output (Microsoft Learn 'Rubric
  evaluators' reference) places per-dimension breakdowns under
  properties.dimension_scores, not properties.rubric_scores. The parser now
  tries dimension_scores first and falls back to rubric_scores for
  preview-build compatibility, and tolerates non-list payloads (e.g.
  MagicMock auto-attrs) by trying the next candidate when parsing yields
  zero entries.

* feat(foundry-evals): add manual create_rubric_evaluator

  Adds FoundryEvals.create_rubric_evaluator as the agent-framework surface
  over project_client.beta.evaluators.create_version. This is the manual
  counterpart to generate_rubric: callers supply RubricDimension instances
  (authored locally, ported from another framework, or hand-tuned) and we
  POST a RubricBasedEvaluatorDefinition. The service auto-attaches the
  non-editable residual dimension (general_quality for quality,
  general_policy_compliance for safety).

  Per the Microsoft Learn 'Rubric evaluators' reference, the auto-generation
  path (create_generation_job) is primarily a portal/UI feature; external
  SDK clients with rich local agent context are better served by manual
  create_version. This keeps generate_rubric for users who want to
  round-trip through a Foundry-registered agent.

  Validation up front: weight must be in [1,10], ids unique, descriptions
  non-empty, pass_threshold in [0,1]. The returned GeneratedEvaluatorRef is
  identical in shape to one obtained from generate_rubric, so downstream
  evaluators= lists work unchanged.

* samples(foundry-evals): manual rubric sample + namespace re-exports

  Adds evaluate_with_manual_rubric_sample.py demonstrating the end-to-end
  dev scenario for FoundryEvals.create_rubric_evaluator: hand-author a list
  of RubricDimension, register via create_rubric_evaluator, then use the
  pinned GeneratedEvaluatorRef alongside built-in evaluators in an agent
  regression run.

  Also re-exports RubricDimension, GeneratedEvaluatorRef, build_sources, and
  load_evals_config from agent_framework.foundry (both the lazy runtime shim
  and the type stub) so the rubric samples can import everything from a
  single namespace; the auto-generate sample was previously broken because
  the shim was missing build_sources / load_evals_config. Updates the
  foundry-evals README with a chooser entry for the two rubric paths.

* feat(foundry-evals): remove rubric creation flows; keep consumption only

  Reframes agent-framework as a pure consumer of Foundry rubric evaluators:
  scoring against rubrics that already exist (authored in the Foundry portal
  or via the dedicated SDK / REST surface) instead of creating them from the
  SDK.

  Removed creation surface area:
  - FoundryEvals.generate_rubric (auto-generate path) and
    create_rubric_evaluator (manual path), plus all _GenerationSdkTypes /
    _ManualRubricSdkTypes / _to_sdk_dimensions / _coalesce_generation_sources
    / _to_sdk_source / _poll_generation_job / _generation_job_to_ref /
    _evaluator_version_to_ref / _get_beta_evaluators / _import_*_sdk_types
    helpers.
  - EvalGenerationSource (the input source discriminator), RubricDimension
    (the input dimension type), agent_as_eval_source /
    workflow_as_eval_source / _detect_hosted_foundry_agent helpers, and the
    YAML-config loader (_evals_config.py with RubricGenerationSpec /
    RubricSourceSpec / parse_evals_config / load_evals_config /
    build_sources).
  - BaseAgent.as_eval_source / Workflow.as_eval_source plus the
    _render_agent_dossier / _render_workflow_dossier helpers in core. These
    existed only to feed the now-removed generation pipeline.
  - Samples evaluate_with_generated_rubric_sample.py,
    evaluate_with_manual_rubric_sample.py, and evaluators.yaml. Replaced
    with a short README section showing how to reference an existing rubric
    evaluator via GeneratedEvaluatorRef.

  Kept (consumption surface):
  - GeneratedEvaluatorRef, slimmed to (name, version, display_name). Still
    accepted alongside built-in evaluator strings in
    FoundryEvals(evaluators=[...]). Versionless refs still warn.
  - RubricScore on EvalScoreResult.dimensions plus
    EvalResults.assert_dimension_score_at_least for per-dimension CI gates.
  - _parse_dimension_entries / _extract_rubric_scores output parsing (both
    canonical dimension_scores and the legacy rubric_scores key).

  Tests: 160/160 foundry unit tests and 71/71 core local-eval tests pass;
  pyright is clean across changed files. The pre-existing
  tests/core/test_telemetry.py::test_detect_hosted_fallback_import_error
  failure is unrelated and reproduces on the prior commit.

* samples(foundry-evals): add evaluate_with_rubric_sample

  Adds a runnable end-to-end sample showing how to consume a pre-existing
  rubric evaluator created in Foundry: reference it with
  GeneratedEvaluatorRef(name, version), mix it with built-in evaluators in
  FoundryEvals, and gate CI with assert_dimension_score_at_least on a
  specific dimension.

* fix(foundry-evals): satisfy mypy on _fetch_output_items

  mypy infers OutputItemListResponse.sample as dict[str, object] | None
  while pyright correctly infers the typed Sample model. Cast to Any so both
  type checkers accept the attribute access pattern, rename the local to
  avoid shadowing the inner-loop sample binding, and drop the now-stale
  pyright suppressions.

* docs(foundry-evals): drop unpublished rubric-evaluators learn.microsoft.com link

  The Adaptive Evals authoring docs are not yet published on Microsoft
  Learn, so the link 404s. Keep the descriptive text without the broken
  hyperlink; we can re-add it once the docs ship.

* test(foundry-evals): hoist repeated local imports to module top

  Per code review feedback (eavanvalkenburg): the test file repeated
  'from agent_framework_foundry._foundry_evals import ...' inside 22 test
  bodies and 'from agent_framework_foundry import GeneratedEvaluatorRef'
  inside 8 more. Move all of them to the existing top-level imports; the
  symbols are the same across tests and the local imports were redundant.

---------

Co-authored-by: Ben Thomas <25218250+alliscode@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
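
Reviewer note on the dimension_scores fix described in the commit message: the key-fallback behaviour can be pictured with the sketch below. It is an illustration under assumptions, not the code in `_extract_rubric_scores` or `_parse_dimension_entries`. The helper names `parse_entries` and `extract_dimension_scores`, the use of a plain dict for `properties`, and the entry shape (`id`, `score`) are all hypothetical. Only the key order (`dimension_scores` first, `rubric_scores` as the fallback) and the move-on-when-empty rule are taken from the commit message.

```python
from typing import Any

# Candidate keys in priority order: canonical key first, legacy preview key second.
CANDIDATE_KEYS = ("dimension_scores", "rubric_scores")


def parse_entries(payload: Any) -> list[dict[str, Any]]:
    """Return usable entries, or [] when the payload is not a list of dicts.

    The required fields ("id", "score") are an assumption made for this sketch.
    """
    if not isinstance(payload, list):
        return []
    return [e for e in payload if isinstance(e, dict) and "id" in e and "score" in e]


def extract_dimension_scores(properties: dict[str, Any]) -> list[dict[str, Any]]:
    """Try each candidate key; fall through when a key is missing or parses to nothing."""
    for key in CANDIDATE_KEYS:
        entries = parse_entries(properties.get(key))
        if entries:
            return entries
    return []


canonical = {"dimension_scores": [{"id": "tone", "score": 4}]}
legacy = {"rubric_scores": [{"id": "tone", "score": 2}]}
# A non-list value under the canonical key must not block the legacy fallback.
garbage_then_legacy = {
    "dimension_scores": "not-a-list",
    "rubric_scores": [{"id": "tone", "score": 2}],
}

print(extract_dimension_scores(canonical))
print(extract_dimension_scores(legacy))
print(extract_dimension_scores(garbage_then_legacy))
```

Treating "parsed to zero entries" the same as "key missing" is what lets a test double or a malformed payload under the first key fall through to the second, rather than short-circuiting with an empty result.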
2026-06-16 21:04:09 +08:00 · 2026-06-01 16:01:56 -07:00
parent f36096ce1a
commit e0d0ad16a0
11 changed files with 951 additions and 54 deletions
@@ -1,3 +1,12 @@
 FOUNDRY_PROJECT_ENDPOINT="<your-project-endpoint>"
 FOUNDRY_MODEL="<your-model-deployment>"

+# Only needed for evaluate_with_rubric_sample.py — connects to the
+# pre-existing Foundry agent that the rubric evaluator was created against.
+FOUNDRY_AGENT_NAME="<your-agent-name>"
+FOUNDRY_AGENT_VERSION="<your-agent-version>"
+
+# Only needed for evaluate_with_rubric_sample.py — references a rubric
+# evaluator you created in Foundry. Pin the version for reproducible runs.
+FOUNDRY_RUBRIC_NAME="<your-rubric-name>"
+FOUNDRY_RUBRIC_VERSION="<your-rubric-version>"
@@ -35,6 +35,34 @@ Evaluate what already happened — zero changes to agent code:
 uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py
 ```

+### Referencing a rubric evaluator created in Foundry
+
+Foundry users can create rubric evaluators in the Foundry portal (or
+through the dedicated SDK / REST surface). Once an evaluator exists,
+agent-framework consumes it like any other evaluator: pass a
+`GeneratedEvaluatorRef(name=..., version=...)` in the `evaluators=`
+list and pin the version for reproducible runs.
+
+```python
+from agent_framework.foundry import FoundryEvals, GeneratedEvaluatorRef
+
+evals = FoundryEvals(
+    evaluators=[
+        GeneratedEvaluatorRef(name="reservation-policy-rubric", version="3"),
+        "relevance",
+        "coherence",
+    ],
+)
+```
+
+Quality gates on rubric output use the standard `EvalResults` helpers,
+including `assert_dimension_score_at_least(...)` for per-dimension
+thresholds.
+
+See [`evaluate_with_rubric_sample.py`](./evaluate_with_rubric_sample.py)
+for a runnable end-to-end example that combines a rubric evaluator with
+built-in evaluators and gates a per-dimension threshold.
+
 ## Setup

 Create a `.env` file with configuration as in the `.env.example` file in this folder.
@@ -44,3 +72,4 @@ Create a `.env` file with configuration as in the `.env.example` file in this fo
 - **"I want to test my agent during development"** → `evaluate_agent_sample.py`, Pattern 1
 - **"I want to evaluate past agent runs"** → `evaluate_traces_sample.py`
 - **"I want to inspect/modify eval data before submitting"** → `evaluate_agent_sample.py`, Pattern 2
+- **"I want to score against a custom rubric I created in Foundry"** → `evaluate_with_rubric_sample.py`
@@ -0,0 +1,138 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+"""Evaluate a Foundry agent against a rubric evaluator that was created in Foundry.
+
+Rubric evaluators are LLM-as-judge evaluators with custom scoring dimensions
+that you define for your domain. agent-framework consumes pre-existing rubric
+evaluators — they are authored in the Foundry portal (or via the dedicated
+SDK / REST surface) and referenced here by name and version.
+
+See: https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-evaluators/rubric-evaluators
+
+This sample demonstrates:
+1. Connecting to a pre-existing Foundry agent (PromptAgent or HostedAgent).
+2. Referencing a pre-existing rubric evaluator by ``name`` and ``version``.
+3. Mixing the rubric with built-in Foundry evaluators in one run.
+4. Asserting per-dimension thresholds with
+   ``EvalResults.assert_dimension_score_at_least(...)`` for CI quality gates.
+
+Starting condition / prerequisites:
+- An Azure AI Foundry project with a deployed model.
+- A registered Foundry agent (PromptAgent or HostedAgent) in that project.
+  This is the agent the rubric is meant to evaluate.
+- A rubric evaluator already created in the Foundry portal against that
+  agent. Creating rubrics through the portal currently requires picking a
+  Foundry agent as the generation context, so this prerequisite is implied
+  by having a rubric at all.
+- Set the following in .env (see ``.env.example``):
+    - ``FOUNDRY_PROJECT_ENDPOINT``
+    - ``FOUNDRY_AGENT_NAME`` and ``FOUNDRY_AGENT_VERSION`` for the agent
+    - ``FOUNDRY_RUBRIC_NAME`` and ``FOUNDRY_RUBRIC_VERSION`` for the rubric
+    - ``FOUNDRY_MODEL`` for the rubric judge model
+"""
+
+import asyncio
+import os
+
+from agent_framework import EvalNotPassedError, evaluate_agent
+from agent_framework.foundry import FoundryAgent, FoundryChatClient, FoundryEvals, GeneratedEvaluatorRef
+from azure.identity import AzureCliCredential
+from dotenv import load_dotenv
+
+load_dotenv(override=True)
+
+
+async def main() -> None:
+    # 1. Connect to the existing Foundry agent that the rubric was created
+    #    against. PromptAgents and HostedAgents are both supported.
+    credential = AzureCliCredential()
+    project_endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
+
+    agent = FoundryAgent(
+        project_endpoint=project_endpoint,
+        agent_name=os.environ["FOUNDRY_AGENT_NAME"],
+        agent_version=os.environ.get("FOUNDRY_AGENT_VERSION"),
+        credential=credential,
+    )
+
+    # 2. Reference the pre-existing rubric evaluator by name + version.
+    #    Always pin a version for reproducible CI runs; versionless refs
+    #    resolve to "latest" and emit a warning at evaluation time.
+    rubric_name = os.environ["FOUNDRY_RUBRIC_NAME"]
+    rubric_version = os.environ["FOUNDRY_RUBRIC_VERSION"]
+    rubric = GeneratedEvaluatorRef(name=rubric_name, version=rubric_version)
+
+    # 3. Mix the rubric with built-in evaluators in a single FoundryEvals
+    #    config. FoundryEvals talks to Foundry over the project endpoint, so
+    #    we hand it a FoundryChatClient configured with the same credential.
+    eval_client = FoundryChatClient(
+        project_endpoint=project_endpoint,
+        model=os.environ["FOUNDRY_MODEL"],
+        credential=credential,
+    )
+    evals = FoundryEvals(
+        client=eval_client,
+        evaluators=[
+            rubric,
+            FoundryEvals.RELEVANCE,
+            FoundryEvals.COHERENCE,
+        ],
+    )
+
+    # =========================================================================
+    # Run evaluation
+    # =========================================================================
+    print("=" * 60)
+    print(f"Evaluating '{agent.name}' with rubric '{rubric_name}' (version {rubric_version})")
+    print("=" * 60)
+
+    results = await evaluate_agent(
+        agent=agent,
+        queries=[
+            "What's the weather like in Seattle?",
+            "Should I bring an umbrella to London tomorrow?",
+        ],
+        evaluators=evals,
+    )
+
+    for r in results:
+        print(f"Status: {r.status}")
+        print(f"Results: {r.passed}/{r.total} passed")
+        print(f"Portal: {r.report_url}")
+        if r.all_passed:
+            print("[PASS] All passed")
+        else:
+            print(f"[FAIL] {r.failed} failed")
+
+    # =========================================================================
+    # Per-dimension quality gate
+    # =========================================================================
+    # Rubric evaluators emit per-dimension scores (1–5) on top of the overall
+    # weighted score. Use assert_dimension_score_at_least to gate CI on a
+    # specific dimension — e.g., never ship if a critical dimension drops
+    # below 3.
+    #
+    # The dimension_id must match an id defined on your rubric in Foundry.
+    # ``general_quality`` is used here because it's the conventional
+    # ``always_applicable: true`` dimension in the Foundry docs' example
+    # rubric — swap it for whatever dimension id(s) your rubric actually
+    # defines.
+    print()
+    print("=" * 60)
+    print("Per-dimension quality gate")
+    print("=" * 60)
+
+    for r in results:
+        try:
+            r.assert_dimension_score_at_least(
+                "general_quality",
+                min_score=3.0,
+                evaluator=rubric_name,
+            )
+            print(f"[PASS] {r.provider}: general_quality >= 3 on every item")
+        except EvalNotPassedError as exc:
+            print(f"[FAIL] {r.provider}: dimension gate tripped: {exc}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())