mirror of
https://github.com/microsoft/agent-framework.git
synced 2026-06-16 21:04:09 +08:00
Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)
* Python: feat(evals): RubricScore type + EvalScoreResult.dimensions

* Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators=

* Python: feat(evals): parse rubric_scores from output items + assertion helpers

* Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source

* Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper

* Python: feat(foundry-evals): YAML config loader + sample

* Python: fix(evals): address PR review feedback

  Addresses 4 Copilot review comments on PR #6101:

  1. assert_dimension_score_at_least: drop the (not evaluator or found_any)
     guard so require_applicable=True correctly raises when the named
     evaluator produces no entries for the dimension. Adds
     TestRubricAssertions covering the regression.
  2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour
     (pinning recommended, not required) so it matches the dataclass default
     and FoundryEvals warning path.
  3. _poll_generation_job: switch from asyncio.get_event_loop() to
     get_running_loop() and bound the per-iteration sleep by remaining time,
     matching _poll_eval_run.
  4. generate_rubric: type category as Literal['quality','safety'] and
     validate at the entry point with a ValueError; drop the silent
     'invalid -> quality' rewrite in _generation_job_to_ref. Adds a
     regression test.

* Python: feat(foundry-evals): hosted-agent-aware rubric generation

  * Auto-detect hosted Foundry agents in agent_as_eval_source: when the
    agent's chat_client exposes a string agent_name (the convention used by
    RawFoundryAgentChatClient for PromptAgents/HostedAgents), emit a
    type='agent' EvalGenerationSource so the service fetches instructions and
    tools from the agent registry instead of relying on the local wrapper
    (which holds neither for hosted agents).
  * Add hosted_agent_version kwarg and a new agent_version field on
    EvalGenerationSource so PromptAgent runs can pin to a specific hosted
    version for reproducible rubric generation.
  * Add force_prompt_source escape hatch to bypass auto-detection and always
    emit a rendered prompt dossier - useful when the local wrapper carries
    overrides the service-side agent doesn't see.
  * Fix _to_sdk_source for dataset sources: SDK ctor takes name=/version=,
    not dataset_name=/dataset_version=. The mismatch would raise TypeError
    against the real azure-ai-projects 2.3.0a* SDK; only unmocked
    integration paths were affected.

  Tests cover: auto-detection happy path, versionless hosted agent, explicit
  hosted_agent_version forwarding, force_prompt_source override, non-string
  chat_client attrs (MagicMock test doubles) not mis-detected, agent_version
  forwarded through _to_sdk_source, and the corrected dataset SDK kwarg names.

* fix(foundry-evals): accept canonical dimension_scores key per docs

  The published Foundry rubric-evaluator output (Microsoft Learn 'Rubric
  evaluators' reference) places per-dimension breakdowns under
  properties.dimension_scores, not properties.rubric_scores. The parser now
  tries dimension_scores first and falls back to rubric_scores for
  preview-build compatibility, and tolerates non-list payloads (e.g.
  MagicMock auto-attrs) by trying the next candidate when parsing yields
  zero entries.

* feat(foundry-evals): add manual create_rubric_evaluator

  Adds FoundryEvals.create_rubric_evaluator as the agent-framework surface
  over project_client.beta.evaluators.create_version. This is the manual
  counterpart to generate_rubric: callers supply RubricDimension instances
  (authored locally, ported from another framework, or hand-tuned) and we
  POST a RubricBasedEvaluatorDefinition. The service auto-attaches the
  non-editable residual dimension (general_quality for quality,
  general_policy_compliance for safety).

  Per the Microsoft Learn 'Rubric evaluators' reference, the auto-generation
  path (create_generation_job) is primarily a portal/UI feature; external SDK
  clients with rich local agent context are better served by manual
  create_version. This keeps generate_rubric for users who want to round-trip
  through a Foundry-registered agent.

  Validation up front: weight must be in [1,10], ids unique, descriptions
  non-empty, pass_threshold in [0,1]. The returned GeneratedEvaluatorRef is
  identical in shape to one obtained from generate_rubric, so downstream
  evaluators= lists work unchanged.

* samples(foundry-evals): manual rubric sample + namespace re-exports

  Adds evaluate_with_manual_rubric_sample.py demonstrating the end-to-end dev
  scenario for FoundryEvals.create_rubric_evaluator: hand-author a list of
  RubricDimension, register via create_rubric_evaluator, then use the pinned
  GeneratedEvaluatorRef alongside built-in evaluators in an agent regression
  run.

  Also re-exports RubricDimension, GeneratedEvaluatorRef, build_sources, and
  load_evals_config from agent_framework.foundry (both the lazy runtime shim
  and the type stub) so the rubric samples can import everything from a
  single namespace; the auto-generate sample was previously broken because
  the shim was missing build_sources / load_evals_config. Updates the
  foundry-evals README with a chooser entry for the two rubric paths.

* feat(foundry-evals): remove rubric creation flows; keep consumption only

  Reframes agent-framework as a pure consumer of Foundry rubric evaluators:
  scoring against rubrics that already exist (authored in the Foundry portal
  or via the dedicated SDK / REST surface) instead of creating them from the
  SDK.

  Removed creation surface area:
  - FoundryEvals.generate_rubric (auto-generate path) and
    create_rubric_evaluator (manual path), plus all _GenerationSdkTypes /
    _ManualRubricSdkTypes / _to_sdk_dimensions / _coalesce_generation_sources /
    _to_sdk_source / _poll_generation_job / _generation_job_to_ref /
    _evaluator_version_to_ref / _get_beta_evaluators / _import_*_sdk_types
    helpers.
  - EvalGenerationSource (the input source discriminator), RubricDimension
    (the input dimension type), agent_as_eval_source /
    workflow_as_eval_source / _detect_hosted_foundry_agent helpers, and the
    YAML-config loader (_evals_config.py with RubricGenerationSpec /
    RubricSourceSpec / parse_evals_config / load_evals_config /
    build_sources).
  - BaseAgent.as_eval_source / Workflow.as_eval_source plus the
    _render_agent_dossier / _render_workflow_dossier helpers in core. These
    existed only to feed the now-removed generation pipeline.
  - Samples evaluate_with_generated_rubric_sample.py,
    evaluate_with_manual_rubric_sample.py, and evaluators.yaml. Replaced with
    a short README section showing how to reference an existing rubric
    evaluator via GeneratedEvaluatorRef.

  Kept (consumption surface):
  - GeneratedEvaluatorRef, slimmed to (name, version, display_name). Still
    accepted alongside built-in evaluator strings in
    FoundryEvals(evaluators=[...]). Versionless refs still warn.
  - RubricScore on EvalScoreResult.dimensions plus
    EvalResults.assert_dimension_score_at_least for per-dimension CI gates.
  - _parse_dimension_entries / _extract_rubric_scores output parsing (both
    canonical dimension_scores and the legacy rubric_scores key).

  Tests: 160/160 foundry unit tests and 71/71 core local-eval tests pass;
  pyright is clean across changed files. The pre-existing
  tests/core/test_telemetry.py::test_detect_hosted_fallback_import_error
  failure is unrelated and reproduces on the prior commit.

* samples(foundry-evals): add evaluate_with_rubric_sample

  Adds a runnable end-to-end sample showing how to consume a pre-existing
  rubric evaluator created in Foundry: reference it with
  GeneratedEvaluatorRef(name, version), mix it with built-in evaluators in
  FoundryEvals, and gate CI with assert_dimension_score_at_least on a
  specific dimension.

* fix(foundry-evals): satisfy mypy on _fetch_output_items

  mypy infers OutputItemListResponse.sample as dict[str, object] | None while
  pyright correctly infers the typed Sample model. Cast to Any so both type
  checkers accept the attribute access pattern, rename the local to avoid
  shadowing the inner-loop sample binding, and drop the now-stale pyright
  suppressions.

* docs(foundry-evals): drop unpublished rubric-evaluators learn.microsoft.com link

  The Adaptive Evals authoring docs are not yet published on Microsoft Learn,
  so the link 404s. Keep the descriptive text without the broken hyperlink;
  we can re-add it once the docs ship.

* test(foundry-evals): hoist repeated local imports to module top

  Per code review feedback (eavanvalkenburg): the test file repeated
  'from agent_framework_foundry._foundry_evals import ...' inside 22 test
  bodies and 'from agent_framework_foundry import GeneratedEvaluatorRef'
  inside 8 more. Move all of them to the existing top-level imports; the
  symbols are the same across tests and the local imports were redundant.

---------

Co-authored-by: Ben Thomas <25218250+alliscode@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
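
The key fallback described in the dimension_scores fix can be sketched roughly as follows. This is an illustration only, not the code from `_foundry_evals.py`: the function names mirror the helpers named in the message, but the entry shape (`id` / `score`) and the `CANDIDATE_KEYS` constant are assumptions made for the example.

```python
from typing import Any, Mapping

# Canonical key first, legacy preview-build key second (order taken from the commit message).
CANDIDATE_KEYS = ("dimension_scores", "rubric_scores")


def parse_dimension_entries(payload: Any) -> list[dict[str, Any]]:
    """Return well-formed entries; anything that is not a list yields no entries."""
    if not isinstance(payload, list):
        return []
    entries: list[dict[str, Any]] = []
    for raw in payload:
        if not isinstance(raw, Mapping):
            continue
        # Assumed entry shape: {"id": <str>, "score": <number>}.
        dim_id, score = raw.get("id"), raw.get("score")
        if isinstance(dim_id, str) and isinstance(score, (int, float)):
            entries.append({"id": dim_id, "score": float(score)})
    return entries


def extract_rubric_scores(properties: Mapping[str, Any]) -> list[dict[str, Any]]:
    """Try each candidate key in order; move on when parsing yields zero entries."""
    for key in CANDIDATE_KEYS:
        entries = parse_dimension_entries(properties.get(key))
        if entries:
            return entries
    return []
```

Falling through on an empty parse, rather than on a missing key, is what lets a non-list value under `dimension_scores` (such as a test double's auto-created attribute) defer to `rubric_scores`.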
committed by GitHub
parent f36096ce1a
commit e0d0ad16a0
@@ -1,3 +1,12 @@
FOUNDRY_PROJECT_ENDPOINT="<your-project-endpoint>"
FOUNDRY_MODEL="<your-model-deployment>"

# Only needed for evaluate_with_rubric_sample.py: connects to the
# pre-existing Foundry agent that the rubric evaluator was created against.
FOUNDRY_AGENT_NAME="<your-agent-name>"
FOUNDRY_AGENT_VERSION="<your-agent-version>"

# Only needed for evaluate_with_rubric_sample.py: references a rubric
# evaluator you created in Foundry. Pin the version for reproducible runs.
FOUNDRY_RUBRIC_NAME="<your-rubric-name>"
FOUNDRY_RUBRIC_VERSION="<your-rubric-version>"
@@ -35,6 +35,34 @@ Evaluate what already happened — zero changes to agent code:
uv run samples/05-end-to-end/evaluation/foundry_evals/evaluate_traces_sample.py
```

### Referencing a rubric evaluator created in Foundry

Foundry users can create rubric evaluators in the Foundry portal (or
through the dedicated SDK / REST surface). Once an evaluator exists,
agent-framework consumes it like any other evaluator: pass a
`GeneratedEvaluatorRef(name=..., version=...)` in the `evaluators=`
list and pin the version for reproducible runs.

```python
from agent_framework.foundry import FoundryEvals, GeneratedEvaluatorRef

evals = FoundryEvals(
    evaluators=[
        GeneratedEvaluatorRef(name="reservation-policy-rubric", version="3"),
        "relevance",
        "coherence",
    ],
)
```

Quality gates on rubric output use the standard `EvalResults` helpers,
including `assert_dimension_score_at_least(...)` for per-dimension
thresholds.
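
To make the gate semantics concrete, here is a minimal, framework-free sketch of what a per-dimension threshold check does. It is not the `EvalResults` implementation: `ItemScores`, `DimensionGateError`, and `assert_dimension_at_least` are hypothetical stand-ins, and the data layout (evaluator name mapped to dimension scores) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


class DimensionGateError(AssertionError):
    """Hypothetical stand-in for the framework's EvalNotPassedError."""


@dataclass
class ItemScores:
    """Hypothetical per-item record: evaluator name -> {dimension id: score}."""

    item_id: str
    dimensions: dict[str, dict[str, float]] = field(default_factory=dict)


def assert_dimension_at_least(
    items: list[ItemScores],
    dimension_id: str,
    *,
    min_score: float,
    evaluator: str,
    require_applicable: bool = False,
) -> None:
    """Raise if any scored item falls below min_score on the given dimension."""
    found_any = False
    failures: list[str] = []
    for item in items:
        score = item.dimensions.get(evaluator, {}).get(dimension_id)
        if score is None:
            continue  # This evaluator did not score the dimension for this item.
        found_any = True
        if score < min_score:
            failures.append(f"{item.item_id}: {dimension_id}={score} < {min_score}")
    if require_applicable and not found_any:
        raise DimensionGateError(f"no '{dimension_id}' scores from evaluator '{evaluator}'")
    if failures:
        raise DimensionGateError("; ".join(failures))


items = [
    ItemScores("q1", {"reservation-policy-rubric": {"general_quality": 4.0}}),
    ItemScores("q2", {"reservation-policy-rubric": {"general_quality": 2.0}}),
]
try:
    assert_dimension_at_least(
        items, "general_quality", min_score=3.0, evaluator="reservation-policy-rubric"
    )
except DimensionGateError as exc:
    print(f"[FAIL] {exc}")
```

In this sketch, items that carry no score for the dimension are skipped unless `require_applicable=True`, in which case a run with no scored items at all is treated as a failure.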

See [`evaluate_with_rubric_sample.py`](./evaluate_with_rubric_sample.py)
for a runnable end-to-end example that combines a rubric evaluator with
built-in evaluators and gates a per-dimension threshold.

## Setup

Create a `.env` file with configuration as in the `.env.example` file in this folder.
@@ -44,3 +72,4 @@ Create a `.env` file with configuration as in the `.env.example` file in this fo
- **"I want to test my agent during development"** → `evaluate_agent_sample.py`, Pattern 1
- **"I want to evaluate past agent runs"** → `evaluate_traces_sample.py`
- **"I want to inspect/modify eval data before submitting"** → `evaluate_agent_sample.py`, Pattern 2
- **"I want to score against a custom rubric I created in Foundry"** → `evaluate_with_rubric_sample.py`

@@ -0,0 +1,138 @@
# Copyright (c) Microsoft. All rights reserved.

"""Evaluate a Foundry agent against a rubric evaluator that was created in Foundry.

Rubric evaluators are LLM-as-judge evaluators with custom scoring dimensions
that you define for your domain. agent-framework consumes pre-existing rubric
evaluators: they are authored in the Foundry portal (or via the dedicated
SDK / REST surface) and referenced here by name and version.

See: https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-evaluators/rubric-evaluators

This sample demonstrates:
1. Connecting to a pre-existing Foundry agent (PromptAgent or HostedAgent).
2. Referencing a pre-existing rubric evaluator by ``name`` and ``version``.
3. Mixing the rubric with built-in Foundry evaluators in one run.
4. Asserting per-dimension thresholds with
   ``EvalResults.assert_dimension_score_at_least(...)`` for CI quality gates.

Starting condition / prerequisites:
- An Azure AI Foundry project with a deployed model.
- A registered Foundry agent (PromptAgent or HostedAgent) in that project.
  This is the agent the rubric is meant to evaluate.
- A rubric evaluator already created in the Foundry portal against that
  agent. Creating rubrics through the portal currently requires picking a
  Foundry agent as the generation context, so this prerequisite is implied
  by having a rubric at all.
- Set the following in .env (see ``.env.example``):
    - ``FOUNDRY_PROJECT_ENDPOINT``
    - ``FOUNDRY_AGENT_NAME`` and ``FOUNDRY_AGENT_VERSION`` for the agent
    - ``FOUNDRY_RUBRIC_NAME`` and ``FOUNDRY_RUBRIC_VERSION`` for the rubric
    - ``FOUNDRY_MODEL`` for the rubric judge model
"""

import asyncio
import os

from agent_framework import EvalNotPassedError, evaluate_agent
from agent_framework.foundry import FoundryAgent, FoundryChatClient, FoundryEvals, GeneratedEvaluatorRef
from azure.identity import AzureCliCredential
from dotenv import load_dotenv

load_dotenv(override=True)


async def main() -> None:
    # 1. Connect to the existing Foundry agent that the rubric was created
    #    against. PromptAgents and HostedAgents are both supported.
    credential = AzureCliCredential()
    project_endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]

    agent = FoundryAgent(
        project_endpoint=project_endpoint,
        agent_name=os.environ["FOUNDRY_AGENT_NAME"],
        agent_version=os.environ.get("FOUNDRY_AGENT_VERSION"),
        credential=credential,
    )

    # 2. Reference the pre-existing rubric evaluator by name + version.
    #    Always pin a version for reproducible CI runs; versionless refs
    #    resolve to "latest" and emit a warning at evaluation time.
    rubric_name = os.environ["FOUNDRY_RUBRIC_NAME"]
    rubric_version = os.environ["FOUNDRY_RUBRIC_VERSION"]
    rubric = GeneratedEvaluatorRef(name=rubric_name, version=rubric_version)

    # 3. Mix the rubric with built-in evaluators in a single FoundryEvals
    #    config. FoundryEvals talks to Foundry over the project endpoint, so
    #    we hand it a FoundryChatClient configured with the same credential.
    eval_client = FoundryChatClient(
        project_endpoint=project_endpoint,
        model=os.environ["FOUNDRY_MODEL"],
        credential=credential,
    )
    evals = FoundryEvals(
        client=eval_client,
        evaluators=[
            rubric,
            FoundryEvals.RELEVANCE,
            FoundryEvals.COHERENCE,
        ],
    )

    # =========================================================================
    # Run evaluation
    # =========================================================================
    print("=" * 60)
    print(f"Evaluating '{agent.name}' with rubric '{rubric_name}' (version {rubric_version})")
    print("=" * 60)

    results = await evaluate_agent(
        agent=agent,
        queries=[
            "What's the weather like in Seattle?",
            "Should I bring an umbrella to London tomorrow?",
        ],
        evaluators=evals,
    )

    for r in results:
        print(f"Status: {r.status}")
        print(f"Results: {r.passed}/{r.total} passed")
        print(f"Portal: {r.report_url}")
        if r.all_passed:
            print("[PASS] All passed")
        else:
            print(f"[FAIL] {r.failed} failed")

    # =========================================================================
    # Per-dimension quality gate
    # =========================================================================
    # Rubric evaluators emit per-dimension scores (1–5) on top of the overall
    # weighted score. Use assert_dimension_score_at_least to gate CI on a
    # specific dimension (e.g., never ship if a critical dimension drops
    # below 3).
    #
    # The dimension_id must match an id defined on your rubric in Foundry.
    # ``general_quality`` is used here because it's the conventional
    # ``always_applicable: true`` dimension in the Foundry docs' example
    # rubric; swap it for whatever dimension id(s) your rubric actually
    # defines.
    print()
    print("=" * 60)
    print("Per-dimension quality gate")
    print("=" * 60)

    for r in results:
        try:
            r.assert_dimension_score_at_least(
                "general_quality",
                min_score=3.0,
                evaluator=rubric_name,
            )
            print(f"[PASS] {r.provider}: general_quality >= 3 on every item")
        except EvalNotPassedError as exc:
            print(f"[FAIL] {r.provider}: dimension gate tripped: {exc}")


if __name__ == "__main__":
    asyncio.run(main())