Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)

* Python: feat(evals): RubricScore type + EvalScoreResult.dimensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators=

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(evals): parse rubric_scores from output items + assertion helpers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): YAML config loader + sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: fix(evals): address PR review feedback

Addresses 4 Copilot review comments on PR #6101:

1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.

2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.

3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.

4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.
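
The polling change in item 3 can be sketched as below. This is a minimal illustration of the pattern, not the actual `_poll_generation_job`: `poll_until_done`, `fetch_status`, and the status strings are hypothetical names chosen for the example.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],  # hypothetical status getter
    timeout: float,
    poll_interval: float,
) -> str:
    """Poll until a terminal status is seen or the deadline passes."""
    # get_running_loop() raises if no loop is running, instead of
    # implicitly creating one the way get_event_loop() historically could.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        if status in ("succeeded", "failed"):  # assumed terminal states
            return status
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        # Bound the sleep by the time left so we never overshoot the deadline.
        await asyncio.sleep(min(poll_interval, remaining))
```

Bounding the sleep matters when `poll_interval` is larger than the remaining budget: without the `min(...)`, a 30-second interval could push a 5-second timeout out to 30 seconds.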

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): hosted-agent-aware rubric generation

* Auto-detect hosted Foundry agents in agent_as_eval_source: when the
  agent's chat_client exposes a string agent_name (the convention used
  by RawFoundryAgentChatClient for PromptAgents/HostedAgents), emit a
  type='agent' EvalGenerationSource so the service fetches instructions
  and tools from the agent registry instead of relying on the local
  wrapper (which holds neither for hosted agents).
* Add hosted_agent_version kwarg and a new agent_version field on
  EvalGenerationSource so PromptAgent runs can pin to a specific hosted
  version for reproducible rubric generation.
* Add force_prompt_source escape hatch to bypass auto-detection and
  always emit a rendered prompt dossier - useful when the local wrapper
  carries overrides the service-side agent doesn't see.
* Fix _to_sdk_source for dataset sources: SDK ctor takes name=/version=,
  not dataset_name=/dataset_version=. The mismatch would raise TypeError
  against the real azure-ai-projects 2.3.0a* SDK; only unmocked
  integration paths were affected.

Tests cover: auto-detection happy path, versionless hosted agent,
explicit hosted_agent_version forwarding, force_prompt_source override,
non-string chat_client attrs (MagicMock test doubles) not mis-detected,
agent_version forwarded through _to_sdk_source, and the corrected
dataset SDK kwarg names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(foundry-evals): accept canonical dimension_scores key per docs

The published Foundry rubric-evaluator output (Microsoft Learn 'Rubric evaluators' reference) places per-dimension breakdowns under properties.dimension_scores, not properties.rubric_scores. The parser now tries dimension_scores first and falls back to rubric_scores for preview-build compatibility, and tolerates non-list payloads (e.g. MagicMock auto-attrs) by trying the next candidate when parsing yields zero entries.
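
A minimal sketch of the key fallback described above, assuming each entry is a plain dict. `extract_dimension_entries` is a hypothetical name, and the entry fields used in the usage lines (`id`, `score`) are illustrative only; the real parser lives in `_extract_rubric_scores`.

```python
from typing import Any


def extract_dimension_entries(properties: Any) -> list[dict[str, Any]]:
    """Return per-dimension entries, preferring the canonical key."""
    for key in ("dimension_scores", "rubric_scores"):
        if isinstance(properties, dict):
            payload = properties.get(key)
        else:
            payload = getattr(properties, key, None)
        # Non-list payloads (None, strings, mock auto-attributes) yield
        # zero entries, which sends us on to the next candidate key.
        entries = (
            [e for e in payload if isinstance(e, dict)]
            if isinstance(payload, list)
            else []
        )
        if entries:
            return entries
    return []


canonical = {"dimension_scores": [{"id": "tone", "score": 4}]}
legacy = {"dimension_scores": "n/a", "rubric_scores": [{"id": "tone", "score": 2}]}
print(extract_dimension_entries(canonical))  # [{'id': 'tone', 'score': 4}]
print(extract_dimension_entries(legacy))     # [{'id': 'tone', 'score': 2}]
```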

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(foundry-evals): add manual create_rubric_evaluator

Adds FoundryEvals.create_rubric_evaluator as the agent-framework surface over project_client.beta.evaluators.create_version. This is the manual counterpart to generate_rubric: callers supply RubricDimension instances (authored locally, ported from another framework, or hand-tuned) and we POST a RubricBasedEvaluatorDefinition. The service auto-attaches the non-editable residual dimension (general_quality for quality, general_policy_compliance for safety).

Per the Microsoft Learn 'Rubric evaluators' reference, the auto-generation path (create_generation_job) is primarily a portal/UI feature; external SDK clients with rich local agent context are better served by manual create_version. This keeps generate_rubric for users who want to round-trip through a Foundry-registered agent.

Validation up front: weight must be in [1,10], ids unique, descriptions non-empty, pass_threshold in [0,1]. The returned GeneratedEvaluatorRef is identical in shape to one obtained from generate_rubric, so downstream evaluators= lists work unchanged.
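
The up-front checks listed above might look something like this. `DimensionSpec` and `validate_rubric` are hypothetical stand-ins (the real input type is `RubricDimension`, whose exact fields are not shown here), and treating `weight` as a number is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionSpec:
    """Hypothetical stand-in for RubricDimension."""

    id: str
    description: str
    weight: float


def validate_rubric(dimensions: list[DimensionSpec], pass_threshold: float) -> None:
    """Raise ValueError on the first rule that is violated."""
    if not 0.0 <= pass_threshold <= 1.0:
        raise ValueError(f"pass_threshold must be in [0, 1], got {pass_threshold}")
    seen: set[str] = set()
    for dim in dimensions:
        if dim.id in seen:
            raise ValueError(f"duplicate dimension id: {dim.id!r}")
        seen.add(dim.id)
        if not dim.description.strip():
            raise ValueError(f"dimension {dim.id!r} has an empty description")
        if not 1 <= dim.weight <= 10:
            raise ValueError(
                f"dimension {dim.id!r} weight must be in [1, 10], got {dim.weight}"
            )
```

Failing before any network call keeps a malformed rubric from producing a half-registered evaluator version on the service side.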

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* samples(foundry-evals): manual rubric sample + namespace re-exports

Adds evaluate_with_manual_rubric_sample.py demonstrating the end-to-end dev scenario for FoundryEvals.create_rubric_evaluator: hand-author a list of RubricDimension, register via create_rubric_evaluator, then use the pinned GeneratedEvaluatorRef alongside built-in evaluators in an agent regression run.

Also re-exports RubricDimension, GeneratedEvaluatorRef, build_sources, and load_evals_config from agent_framework.foundry (both the lazy runtime shim and the type stub) so the rubric samples can import everything from a single namespace; the auto-generate sample was previously broken because the shim was missing build_sources / load_evals_config.

Updates the foundry-evals README with a chooser entry for the two rubric paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(foundry-evals): remove rubric creation flows; keep consumption only

Reframes agent-framework as a pure consumer of Foundry rubric evaluators: scoring against rubrics that already exist (authored in the Foundry portal or via the dedicated SDK / REST surface) instead of creating them from the SDK.

Removed creation surface area:

- FoundryEvals.generate_rubric (auto-generate path) and create_rubric_evaluator (manual path), plus all _GenerationSdkTypes / _ManualRubricSdkTypes / _to_sdk_dimensions / _coalesce_generation_sources / _to_sdk_source / _poll_generation_job / _generation_job_to_ref / _evaluator_version_to_ref / _get_beta_evaluators / _import_*_sdk_types helpers.

- EvalGenerationSource (the input source discriminator), RubricDimension (the input dimension type), agent_as_eval_source / workflow_as_eval_source / _detect_hosted_foundry_agent helpers, and the YAML-config loader (_evals_config.py with RubricGenerationSpec / RubricSourceSpec / parse_evals_config / load_evals_config / build_sources).

- BaseAgent.as_eval_source / Workflow.as_eval_source plus the _render_agent_dossier / _render_workflow_dossier helpers in core. These existed only to feed the now-removed generation pipeline.

- Samples evaluate_with_generated_rubric_sample.py, evaluate_with_manual_rubric_sample.py, and evaluators.yaml. Replaced with a short README section showing how to reference an existing rubric evaluator via GeneratedEvaluatorRef.

Kept (consumption surface):

- GeneratedEvaluatorRef, slimmed to (name, version, display_name). Still accepted alongside built-in evaluator strings in FoundryEvals(evaluators=[...]). Versionless refs still warn.

- RubricScore on EvalScoreResult.dimensions plus EvalResults.assert_dimension_score_at_least for per-dimension CI gates.

- _parse_dimension_entries / _extract_rubric_scores output parsing (both canonical dimension_scores and the legacy rubric_scores key).
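
To make the per-dimension gate concrete, here is a simplified sketch of the semantics, including the `require_applicable` behaviour fixed earlier in this PR. `DimScore`, `GateFailed`, and `assert_dimension_at_least` are hypothetical stand-ins for `RubricScore`, `EvalNotPassedError`, and the real helper; the `require_applicable=False` default is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class GateFailed(AssertionError):
    """Hypothetical stand-in for EvalNotPassedError."""


@dataclass(frozen=True)
class DimScore:
    """Hypothetical stand-in for one RubricScore entry."""

    evaluator: str
    dimension_id: str
    score: float


def assert_dimension_at_least(
    scores: list[DimScore],
    dimension_id: str,
    min_score: float,
    evaluator: Optional[str] = None,
    require_applicable: bool = False,  # assumed default
) -> None:
    """Raise GateFailed if any matching score is below min_score."""
    matching = [
        s
        for s in scores
        if s.dimension_id == dimension_id and evaluator in (None, s.evaluator)
    ]
    # With require_applicable=True, "no entries at all" is a failure
    # rather than a vacuous pass.
    if require_applicable and not matching:
        raise GateFailed(f"no scores found for dimension {dimension_id!r}")
    low = [s for s in matching if s.score < min_score]
    if low:
        raise GateFailed(
            f"{len(low)} item(s) scored below {min_score} on {dimension_id!r}"
        )
```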

Tests: 160/160 foundry unit tests and 71/71 core local-eval tests pass; pyright is clean across changed files. The pre-existing tests/core/test_telemetry.py::test_detect_hosted_fallback_import_error failure is unrelated and reproduces on the prior commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* samples(foundry-evals): add evaluate_with_rubric_sample

Adds a runnable end-to-end sample showing how to consume a pre-existing rubric evaluator created in Foundry: reference it with GeneratedEvaluatorRef(name, version), mix it with built-in evaluators in FoundryEvals, and gate CI with assert_dimension_score_at_least on a specific dimension.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(foundry-evals): satisfy mypy on _fetch_output_items

mypy infers OutputItemListResponse.sample as dict[str, object] | None while pyright correctly infers the typed Sample model. Cast to Any so both type checkers accept the attribute access pattern, rename the local to avoid shadowing the inner-loop sample binding, and drop the now-stale pyright suppressions.
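
The cast pattern can be illustrated in isolation. `OutputItem` and `sample_field` below are hypothetical stand-ins, not the SDK types: the class simply declares `sample` with the loose type that mypy reports.

```python
from typing import Any, Optional, cast


class OutputItem:
    """Hypothetical item whose declared type for `sample` is too loose."""

    def __init__(self, sample: Optional[dict[str, object]]) -> None:
        self.sample = sample


def sample_field(item: OutputItem, field: str) -> Any:
    """Read an attribute off item.sample regardless of its declared type."""
    # cast() is a no-op at runtime; it only tells the type checker to
    # treat the value as Any so the attribute access below is accepted.
    item_sample = cast(Any, item.sample)
    if item_sample is None:
        return None
    return getattr(item_sample, field, None)
```

Naming the local `item_sample` rather than `sample` mirrors the rename in this commit, which avoids shadowing an inner-loop binding.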

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(foundry-evals): drop unpublished rubric-evaluators learn.microsoft.com link

The Adaptive Evals authoring docs are not yet published on Microsoft Learn, so the link 404s. Keep the descriptive text without the broken hyperlink; we can re-add it once the docs ship.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(foundry-evals): hoist repeated local imports to module top

Per code review feedback (eavanvalkenburg): the test file repeated 'from agent_framework_foundry._foundry_evals import ...' inside 22 test bodies and 'from agent_framework_foundry import GeneratedEvaluatorRef' inside 8 more. Move all of them to the existing top-level imports; the symbols are the same across tests and the local imports were redundant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Ben Thomas <25218250+alliscode@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ben Thomas
2026-06-01 16:01:56 -07:00
committed by GitHub
parent f36096ce1a
commit e0d0ad16a0
11 changed files with 951 additions and 54 deletions
@@ -1,3 +1,12 @@
FOUNDRY_PROJECT_ENDPOINT="<your-project-endpoint>"
FOUNDRY_MODEL="<your-model-deployment>"
# Only needed for evaluate_with_rubric_sample.py — connects to the
# pre-existing Foundry agent that the rubric evaluator was created against.
FOUNDRY_AGENT_NAME="<your-agent-name>"
FOUNDRY_AGENT_VERSION="<your-agent-version>"
# Only needed for evaluate_with_rubric_sample.py — references a rubric
# evaluator you created in Foundry. Pin the version for reproducible runs.
FOUNDRY_RUBRIC_NAME="<your-rubric-name>"
FOUNDRY_RUBRIC_VERSION="<your-rubric-version>"
@@ -35,6 +35,34 @@ Evaluate what already happened — zero changes to agent code:
uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py
```
### Referencing a rubric evaluator created in Foundry
Foundry users can create rubric evaluators in the Foundry portal (or
through the dedicated SDK / REST surface). Once an evaluator exists,
agent-framework consumes it like any other evaluator: pass a
`GeneratedEvaluatorRef(name=..., version=...)` in the `evaluators=`
list and pin the version for reproducible runs.
```python
from agent_framework.foundry import FoundryEvals, GeneratedEvaluatorRef

evals = FoundryEvals(
    evaluators=[
        GeneratedEvaluatorRef(name="reservation-policy-rubric", version="3"),
        "relevance",
        "coherence",
    ],
)
```
Quality gates on rubric output use the standard `EvalResults` helpers,
including `assert_dimension_score_at_least(...)` for per-dimension
thresholds.
See [`evaluate_with_rubric_sample.py`](./evaluate_with_rubric_sample.py)
for a runnable end-to-end example that combines a rubric evaluator with
built-in evaluators and gates a per-dimension threshold.
## Setup
Create a `.env` file with configuration as in the `.env.example` file in this folder.
@@ -44,3 +72,4 @@ Create a `.env` file with configuration as in the `.env.example` file in this fo
- **"I want to test my agent during development"** → `evaluate_agent_sample.py`, Pattern 1
- **"I want to evaluate past agent runs"** → `evaluate_traces_sample.py`
- **"I want to inspect/modify eval data before submitting"** → `evaluate_agent_sample.py`, Pattern 2
- **"I want to score against a custom rubric I created in Foundry"** → `evaluate_with_rubric_sample.py`
@@ -0,0 +1,138 @@
# Copyright (c) Microsoft. All rights reserved.
"""Evaluate a Foundry agent against a rubric evaluator that was created in Foundry.

Rubric evaluators are LLM-as-judge evaluators with custom scoring dimensions
that you define for your domain. agent-framework consumes pre-existing rubric
evaluators; they are authored in the Foundry portal (or via the dedicated
SDK / REST surface) and referenced here by name and version.

See: https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-evaluators/rubric-evaluators

This sample demonstrates:
1. Connecting to a pre-existing Foundry agent (PromptAgent or HostedAgent).
2. Referencing a pre-existing rubric evaluator by ``name`` and ``version``.
3. Mixing the rubric with built-in Foundry evaluators in one run.
4. Asserting per-dimension thresholds with
   ``EvalResults.assert_dimension_score_at_least(...)`` for CI quality gates.

Starting condition / prerequisites:
- An Azure AI Foundry project with a deployed model.
- A registered Foundry agent (PromptAgent or HostedAgent) in that project.
  This is the agent the rubric is meant to evaluate.
- A rubric evaluator already created in the Foundry portal against that
  agent. Creating rubrics through the portal currently requires picking a
  Foundry agent as the generation context, so this prerequisite is implied
  by having a rubric at all.
- Set the following in .env (see ``.env.example``):
    - ``FOUNDRY_PROJECT_ENDPOINT``
    - ``FOUNDRY_AGENT_NAME`` and ``FOUNDRY_AGENT_VERSION`` for the agent
    - ``FOUNDRY_RUBRIC_NAME`` and ``FOUNDRY_RUBRIC_VERSION`` for the rubric
    - ``FOUNDRY_MODEL`` for the rubric judge model
"""

import asyncio
import os

from agent_framework import EvalNotPassedError, evaluate_agent
from agent_framework.foundry import FoundryAgent, FoundryChatClient, FoundryEvals, GeneratedEvaluatorRef
from azure.identity import AzureCliCredential
from dotenv import load_dotenv

load_dotenv(override=True)


async def main() -> None:
    # 1. Connect to the existing Foundry agent that the rubric was created
    #    against. PromptAgents and HostedAgents are both supported.
    credential = AzureCliCredential()
    project_endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]
    agent = FoundryAgent(
        project_endpoint=project_endpoint,
        agent_name=os.environ["FOUNDRY_AGENT_NAME"],
        agent_version=os.environ.get("FOUNDRY_AGENT_VERSION"),
        credential=credential,
    )

    # 2. Reference the pre-existing rubric evaluator by name + version.
    #    Always pin a version for reproducible CI runs; versionless refs
    #    resolve to "latest" and emit a warning at evaluation time.
    rubric_name = os.environ["FOUNDRY_RUBRIC_NAME"]
    rubric_version = os.environ["FOUNDRY_RUBRIC_VERSION"]
    rubric = GeneratedEvaluatorRef(name=rubric_name, version=rubric_version)

    # 3. Mix the rubric with built-in evaluators in a single FoundryEvals
    #    config. FoundryEvals talks to Foundry over the project endpoint, so
    #    we hand it a FoundryChatClient configured with the same credential.
    eval_client = FoundryChatClient(
        project_endpoint=project_endpoint,
        model=os.environ["FOUNDRY_MODEL"],
        credential=credential,
    )
    evals = FoundryEvals(
        client=eval_client,
        evaluators=[
            rubric,
            FoundryEvals.RELEVANCE,
            FoundryEvals.COHERENCE,
        ],
    )

    # =========================================================================
    # Run evaluation
    # =========================================================================
    print("=" * 60)
    print(f"Evaluating '{agent.name}' with rubric '{rubric_name}' (version {rubric_version})")
    print("=" * 60)

    results = await evaluate_agent(
        agent=agent,
        queries=[
            "What's the weather like in Seattle?",
            "Should I bring an umbrella to London tomorrow?",
        ],
        evaluators=evals,
    )

    for r in results:
        print(f"Status: {r.status}")
        print(f"Results: {r.passed}/{r.total} passed")
        print(f"Portal: {r.report_url}")
        if r.all_passed:
            print("[PASS] All passed")
        else:
            print(f"[FAIL] {r.failed} failed")

    # =========================================================================
    # Per-dimension quality gate
    # =========================================================================
    # Rubric evaluators emit per-dimension scores (1-5) on top of the overall
    # weighted score. Use assert_dimension_score_at_least to gate CI on a
    # specific dimension, e.g., never ship if a critical dimension drops
    # below 3.
    #
    # The dimension_id must match an id defined on your rubric in Foundry.
    # ``general_quality`` is used here because it's the conventional
    # ``always_applicable: true`` dimension in the Foundry docs' example
    # rubric; swap it for whatever dimension id(s) your rubric actually
    # defines.
    print()
    print("=" * 60)
    print("Per-dimension quality gate")
    print("=" * 60)

    for r in results:
        try:
            r.assert_dimension_score_at_least(
                "general_quality",
                min_score=3.0,
                evaluator=rubric_name,
            )
            print(f"[PASS] {r.provider}: general_quality >= 3 on every item")
        except EvalNotPassedError as exc:
            print(f"[FAIL] {r.provider}: dimension gate tripped: {exc}")


if __name__ == "__main__":
    asyncio.run(main())