mirror of https://github.com/microsoft/agent-framework.git
synced 2026-06-16 21:04:09 +08:00
Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)
* Python: feat(evals): RubricScore type + EvalScoreResult.dimensions

* Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators=

* Python: feat(evals): parse rubric_scores from output items + assertion helpers

* Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source

* Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper

* Python: feat(foundry-evals): YAML config loader + sample

* Python: fix(evals): address PR review feedback

  Addresses 4 Copilot review comments on PR #6101:

  1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.
  2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.
  3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.
  4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.

* Python: feat(foundry-evals): hosted-agent-aware rubric generation

  * Auto-detect hosted Foundry agents in agent_as_eval_source: when the agent's chat_client exposes a string agent_name (the convention used by RawFoundryAgentChatClient for PromptAgents/HostedAgents), emit a type='agent' EvalGenerationSource so the service fetches instructions and tools from the agent registry instead of relying on the local wrapper (which holds neither for hosted agents).
  * Add hosted_agent_version kwarg and a new agent_version field on EvalGenerationSource so PromptAgent runs can pin to a specific hosted version for reproducible rubric generation.
  * Add force_prompt_source escape hatch to bypass auto-detection and always emit a rendered prompt dossier - useful when the local wrapper carries overrides the service-side agent doesn't see.
  * Fix _to_sdk_source for dataset sources: SDK ctor takes name=/version=, not dataset_name=/dataset_version=. The mismatch would raise TypeError against the real azure-ai-projects 2.3.0a* SDK; only unmocked integration paths were affected.

  Tests cover: auto-detection happy path, versionless hosted agent, explicit hosted_agent_version forwarding, force_prompt_source override, non-string chat_client attrs (MagicMock test doubles) not mis-detected, agent_version forwarded through _to_sdk_source, and the corrected dataset SDK kwarg names.

* fix(foundry-evals): accept canonical dimension_scores key per docs

  The published Foundry rubric-evaluator output (Microsoft Learn 'Rubric evaluators' reference) places per-dimension breakdowns under properties.dimension_scores, not properties.rubric_scores. The parser now tries dimension_scores first and falls back to rubric_scores for preview-build compatibility, and tolerates non-list payloads (e.g. MagicMock auto-attrs) by trying the next candidate when parsing yields zero entries.

* feat(foundry-evals): add manual create_rubric_evaluator

  Adds FoundryEvals.create_rubric_evaluator as the agent-framework surface over project_client.beta.evaluators.create_version. This is the manual counterpart to generate_rubric: callers supply RubricDimension instances (authored locally, ported from another framework, or hand-tuned) and we POST a RubricBasedEvaluatorDefinition. The service auto-attaches the non-editable residual dimension (general_quality for quality, general_policy_compliance for safety).

  Per the Microsoft Learn 'Rubric evaluators' reference, the auto-generation path (create_generation_job) is primarily a portal/UI feature; external SDK clients with rich local agent context are better served by manual create_version. This keeps generate_rubric for users who want to round-trip through a Foundry-registered agent.

  Validation up front: weight must be in [1,10], ids unique, descriptions non-empty, pass_threshold in [0,1]. The returned GeneratedEvaluatorRef is identical in shape to one obtained from generate_rubric, so downstream evaluators= lists work unchanged.

* samples(foundry-evals): manual rubric sample + namespace re-exports

  Adds evaluate_with_manual_rubric_sample.py demonstrating the end-to-end dev scenario for FoundryEvals.create_rubric_evaluator: hand-author a list of RubricDimension, register via create_rubric_evaluator, then use the pinned GeneratedEvaluatorRef alongside built-in evaluators in an agent regression run.

  Also re-exports RubricDimension, GeneratedEvaluatorRef, build_sources, and load_evals_config from agent_framework.foundry (both the lazy runtime shim and the type stub) so the rubric samples can import everything from a single namespace; the auto-generate sample was previously broken because the shim was missing build_sources / load_evals_config. Updates the foundry-evals README with a chooser entry for the two rubric paths.

* feat(foundry-evals): remove rubric creation flows; keep consumption only

  Reframes agent-framework as a pure consumer of Foundry rubric evaluators: scoring against rubrics that already exist (authored in the Foundry portal or via the dedicated SDK / REST surface) instead of creating them from the SDK.

  Removed creation surface area:
  - FoundryEvals.generate_rubric (auto-generate path) and create_rubric_evaluator (manual path), plus all _GenerationSdkTypes / _ManualRubricSdkTypes / _to_sdk_dimensions / _coalesce_generation_sources / _to_sdk_source / _poll_generation_job / _generation_job_to_ref / _evaluator_version_to_ref / _get_beta_evaluators / _import_*_sdk_types helpers.
  - EvalGenerationSource (the input source discriminator), RubricDimension (the input dimension type), agent_as_eval_source / workflow_as_eval_source / _detect_hosted_foundry_agent helpers, and the YAML-config loader (_evals_config.py with RubricGenerationSpec / RubricSourceSpec / parse_evals_config / load_evals_config / build_sources).
  - BaseAgent.as_eval_source / Workflow.as_eval_source plus the _render_agent_dossier / _render_workflow_dossier helpers in core. These existed only to feed the now-removed generation pipeline.
  - Samples evaluate_with_generated_rubric_sample.py, evaluate_with_manual_rubric_sample.py, and evaluators.yaml. Replaced with a short README section showing how to reference an existing rubric evaluator via GeneratedEvaluatorRef.

  Kept (consumption surface):
  - GeneratedEvaluatorRef, slimmed to (name, version, display_name). Still accepted alongside built-in evaluator strings in FoundryEvals(evaluators=[...]). Versionless refs still warn.
  - RubricScore on EvalScoreResult.dimensions plus EvalResults.assert_dimension_score_at_least for per-dimension CI gates.
  - _parse_dimension_entries / _extract_rubric_scores output parsing (both canonical dimension_scores and the legacy rubric_scores key).

  Tests: 160/160 foundry unit tests and 71/71 core local-eval tests pass; pyright is clean across changed files. The pre-existing tests/core/test_telemetry.py::test_detect_hosted_fallback_import_error failure is unrelated and reproduces on the prior commit.

* samples(foundry-evals): add evaluate_with_rubric_sample

  Adds a runnable end-to-end sample showing how to consume a pre-existing rubric evaluator created in Foundry: reference it with GeneratedEvaluatorRef(name, version), mix it with built-in evaluators in FoundryEvals, and gate CI with assert_dimension_score_at_least on a specific dimension.

* fix(foundry-evals): satisfy mypy on _fetch_output_items

  mypy infers OutputItemListResponse.sample as dict[str, object] | None while pyright correctly infers the typed Sample model. Cast to Any so both type checkers accept the attribute access pattern, rename the local to avoid shadowing the inner-loop sample binding, and drop the now-stale pyright suppressions.

* docs(foundry-evals): drop unpublished rubric-evaluators learn.microsoft.com link

  The Adaptive Evals authoring docs are not yet published on Microsoft Learn, so the link 404s. Keep the descriptive text without the broken hyperlink; we can re-add it once the docs ship.

* test(foundry-evals): hoist repeated local imports to module top

  Per code review feedback (eavanvalkenburg): the test file repeated 'from agent_framework_foundry._foundry_evals import ...' inside 22 test bodies and 'from agent_framework_foundry import GeneratedEvaluatorRef' inside 8 more. Move all of them to the existing top-level imports; the symbols are the same across tests and the local imports were redundant.

---------

Co-authored-by: Ben Thomas <25218250+alliscode@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
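
The key fallback described in the commit message (try `dimension_scores` first, then `rubric_scores`, moving on whenever a candidate yields zero parsed entries) can be sketched roughly as follows. This is a standalone illustration, not the framework's code: `parse_entries` and `extract_dimensions` are hypothetical names, and the real `_extract_rubric_scores` helper is not shown in this part of the diff.

```python
from __future__ import annotations

from typing import Any

# Candidate keys in priority order: the documented key first, then the
# legacy preview key. Mirrors the tuple added in this commit.
DIMENSION_KEYS = ("dimension_scores", "rubric_scores")


def parse_entries(raw: Any) -> list[dict[str, Any]]:
    """Hypothetical stand-in parser: keep only dict entries that carry an id."""
    if not isinstance(raw, (list, tuple)):
        return []
    return [e for e in raw if isinstance(e, dict) and e.get("id") is not None]


def extract_dimensions(properties: dict[str, Any]) -> list[dict[str, Any]]:
    """Return entries from the first key that yields at least one parsed entry."""
    for key in DIMENSION_KEYS:
        parsed = parse_entries(properties.get(key))
        if parsed:
            return parsed
    return []


canonical = {"dimension_scores": [{"id": "clarity", "score": 4}]}
legacy = {"rubric_scores": [{"id": "tone", "score": 3}]}
# A non-list payload under the canonical key falls through to the legacy key.
malformed_then_legacy = {"dimension_scores": "oops", "rubric_scores": [{"id": "tone", "score": 3}]}
```

Falling through on "zero entries" rather than on "key missing" is what lets a malformed or mocked payload under the first key be ignored instead of masking valid data under the second.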
committed by GitHub
parent f36096ce1a
commit e0d0ad16a0
@@ -71,6 +71,7 @@ from ._evaluation import (
|
||||
Evaluator,
|
||||
ExpectedToolCall,
|
||||
LocalEvaluator,
|
||||
RubricScore,
|
||||
evaluate_agent,
|
||||
evaluate_workflow,
|
||||
evaluator,
|
||||
@@ -460,6 +461,7 @@ __all__ = [
|
||||
"ResponseStream",
|
||||
"Role",
|
||||
"RoleLiteral",
|
||||
"RubricScore",
|
||||
"RunContext",
|
||||
"Runner",
|
||||
"RunnerContext",
|
||||
|
||||
@@ -311,12 +311,15 @@ class EvalScoreResult:
|
||||
score: Numeric score from the evaluator.
|
||||
passed: Whether the item passed this evaluator's threshold.
|
||||
sample: Optional raw evaluator output (rationale, metadata).
|
||||
dimensions: Per-dimension scores when this evaluator is a rubric
|
||||
evaluator. ``None`` for non-rubric (e.g. built-in) evaluators.
|
||||
"""
|
||||
|
||||
name: str
|
||||
score: float
|
||||
passed: bool | None = None
|
||||
sample: dict[str, Any] | None = None
|
||||
dimensions: list[RubricScore] | None = None
|
||||
|
||||
|
||||
@experimental(feature_id=ExperimentalFeature.EVALS)
|
||||
@@ -496,6 +499,179 @@ class EvalResults:
|
||||
detail += f" Errored items: {', '.join(summaries)}."
|
||||
raise EvalNotPassedError(detail)
|
||||
|
||||
def assert_score_at_least(
|
||||
self,
|
||||
min_score: float,
|
||||
*,
|
||||
evaluator: str | None = None,
|
||||
msg: str | None = None,
|
||||
) -> None:
|
||||
"""Assert every item's score (optionally filtered by evaluator) is ``>= min_score``.
|
||||
|
||||
Designed for CI gates on generated rubric evaluators (e.g.
|
||||
``results.assert_score_at_least(0.80)``). Includes any
|
||||
sub-results from workflow evaluations.
|
||||
|
||||
Args:
|
||||
min_score: Minimum acceptable score (inclusive).
|
||||
evaluator: When set, only check scores from the evaluator
|
||||
whose ``EvalScoreResult.name`` matches.
|
||||
msg: Optional custom failure message.
|
||||
|
||||
Raises:
|
||||
EvalNotPassedError: When any matching score is below the threshold.
|
||||
"""
|
||||
offenders: list[str] = []
|
||||
|
||||
def _check(results: EvalResults) -> None:
|
||||
for item in results.items:
|
||||
for score in item.scores:
|
||||
if evaluator is not None and score.name != evaluator:
|
||||
continue
|
||||
if score.score < min_score:
|
||||
offenders.append(f"{item.item_id}/{score.name}={score.score:.3f}")
|
||||
for sub in results.sub_results.values():
|
||||
_check(sub)
|
||||
|
||||
_check(self)
|
||||
if offenders:
|
||||
detail = msg or (
|
||||
f"{len(offenders)} score(s) below threshold {min_score}"
|
||||
f"{' for ' + evaluator if evaluator else ''}: {', '.join(offenders[:5])}"
|
||||
+ (f" (+{len(offenders) - 5} more)" if len(offenders) > 5 else "")
|
||||
)
|
||||
raise EvalNotPassedError(detail)
|
||||
|
||||
def assert_dimension_score_at_least(
|
||||
self,
|
||||
dimension_id: str,
|
||||
min_score: float,
|
||||
*,
|
||||
evaluator: str | None = None,
|
||||
require_applicable: bool = False,
|
||||
msg: str | None = None,
|
||||
) -> None:
|
||||
"""Assert every item's score for a rubric *dimension* is ``>= min_score``.
|
||||
|
||||
Walks ``EvalScoreResult.dimensions`` looking for the named
|
||||
dimension across all items (and sub-results). Non-applicable
|
||||
dimensions are skipped by default; pass
|
||||
``require_applicable=True`` to fail when no applicable score is
|
||||
produced.
|
||||
|
||||
Args:
|
||||
dimension_id: Dimension id (matches the rubric definition).
|
||||
min_score: Minimum acceptable dimension score (inclusive).
|
||||
evaluator: When set, only consider scores from the evaluator
|
||||
whose ``EvalScoreResult.name`` matches.
|
||||
require_applicable: When ``True``, missing or non-applicable
|
||||
dimension scores raise. Defaults to ``False`` (skip).
|
||||
msg: Optional custom failure message.
|
||||
|
||||
Raises:
|
||||
EvalNotPassedError: When the dimension fails the threshold.
|
||||
"""
|
||||
offenders: list[str] = []
|
||||
missing_items: list[str] = []
|
||||
|
||||
def _check(results: EvalResults) -> None:
|
||||
for item in results.items:
|
||||
found_applicable = False
|
||||
for score in item.scores:
|
||||
if evaluator is not None and score.name != evaluator:
|
||||
continue
|
||||
if not score.dimensions:
|
||||
continue
|
||||
for rs in score.dimensions:
|
||||
if rs.id != dimension_id:
|
||||
continue
|
||||
if not rs.applicable:
|
||||
continue
|
||||
found_applicable = True
|
||||
if rs.score is None or rs.score < min_score:
|
||||
offenders.append(
|
||||
f"{item.item_id}/{score.name}/{dimension_id}="
|
||||
f"{rs.score if rs.score is not None else 'None'}"
|
||||
)
|
||||
if require_applicable and not found_applicable:
|
||||
missing_items.append(item.item_id)
|
||||
for sub in results.sub_results.values():
|
||||
_check(sub)
|
||||
|
||||
_check(self)
|
||||
problems: list[str] = []
|
||||
if offenders:
|
||||
problems.append(
|
||||
f"{len(offenders)} dimension score(s) for '{dimension_id}' below {min_score}: "
|
||||
f"{', '.join(offenders[:5])}" + (f" (+{len(offenders) - 5} more)" if len(offenders) > 5 else "")
|
||||
)
|
||||
if missing_items:
|
||||
problems.append(
|
||||
f"Dimension '{dimension_id}' not applicable on {len(missing_items)} item(s): "
|
||||
f"{', '.join(missing_items[:5])}"
|
||||
)
|
||||
if problems:
|
||||
raise EvalNotPassedError(msg or "; ".join(problems))
|
||||
|
||||
def assert_no_failed_items(self, msg: str | None = None) -> None:
|
||||
"""Assert no item ended in ``fail`` or ``error`` status.
|
||||
|
||||
Includes any sub-results from workflow evaluations.
|
||||
|
||||
Args:
|
||||
msg: Optional custom failure message.
|
||||
|
||||
Raises:
|
||||
EvalNotPassedError: When any item failed or errored.
|
||||
"""
|
||||
bad: list[str] = []
|
||||
|
||||
def _check(results: EvalResults) -> None:
|
||||
for item in results.items:
|
||||
if item.is_failed or item.is_error:
|
||||
bad.append(f"{item.item_id}:{item.status}")
|
||||
for sub in results.sub_results.values():
|
||||
_check(sub)
|
||||
|
||||
_check(self)
|
||||
if bad:
|
||||
detail = msg or (
|
||||
f"{len(bad)} item(s) failed or errored: {', '.join(bad[:5])}"
|
||||
+ (f" (+{len(bad) - 5} more)" if len(bad) > 5 else "")
|
||||
)
|
||||
raise EvalNotPassedError(detail)
|
||||
|
||||
|
||||
# endregion
|
||||
|
||||
# region Generated rubric evaluators
|
||||
|
||||
|
||||
@experimental(feature_id=ExperimentalFeature.EVALS)
|
||||
@dataclass(frozen=True)
|
||||
class RubricScore:
|
||||
"""A single dimension's score from a rubric-based evaluator run.
|
||||
|
||||
Rubric evaluators emit one ``RubricScore`` per dimension per item.
|
||||
Attached to :class:`EvalScoreResult` as a typed view of the raw
|
||||
``properties.rubric_scores`` payload returned by providers such as
|
||||
Foundry's generated rubric evaluators.
|
||||
|
||||
Attributes:
|
||||
id: Dimension id (matches the rubric definition).
|
||||
score: Numeric score, or ``None`` when the dimension was marked
|
||||
non-applicable for this item.
|
||||
applicable: Whether the dimension applied to this item.
|
||||
weight: Dimension weight (mirrors the rubric definition).
|
||||
reason: Short rationale produced by the evaluator.
|
||||
"""
|
||||
|
||||
id: str
|
||||
score: int | None
|
||||
applicable: bool
|
||||
weight: int
|
||||
reason: str
|
||||
|
||||
|
||||
# endregion
|
||||
|
||||
|
||||
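
The per-dimension gate added in this hunk has two behaviours worth seeing side by side: non-applicable dimensions are skipped by default, and `require_applicable=True` turns a missing applicable score into a failure. The sketch below reproduces only that control flow with a hypothetical `Dim` stand-in and a hypothetical `check_dimension` function; it ignores evaluator filtering and sub-results, and it returns problems instead of raising `EvalNotPassedError`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    """Hypothetical stand-in for RubricScore (id, score, applicable only)."""

    id: str
    score: int | None
    applicable: bool


def check_dimension(
    items: dict[str, list[Dim]],
    dimension_id: str,
    min_score: float,
    require_applicable: bool = False,
) -> list[str]:
    """Collect problems for one dimension across all items."""
    problems: list[str] = []
    for item_id, dims in items.items():
        found_applicable = False
        for d in dims:
            # Skip other dimensions and dimensions marked non-applicable.
            if d.id != dimension_id or not d.applicable:
                continue
            found_applicable = True
            if d.score is None or d.score < min_score:
                problems.append(f"{item_id}/{dimension_id}={d.score}")
        if require_applicable and not found_applicable:
            problems.append(f"{item_id}: '{dimension_id}' not applicable")
    return problems


items = {
    "item-0": [Dim("clarity", 4, True)],
    "item-1": [Dim("clarity", 2, True)],
    "item-2": [Dim("clarity", None, False)],
}
lenient = check_dimension(items, "clarity", 3)
strict = check_dimension(items, "clarity", 3, require_applicable=True)
```

With these inputs the lenient call reports only `item-1`, while the strict call also reports `item-2`, whose only `clarity` entry is non-applicable.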
@@ -34,6 +34,7 @@ _IMPORTS: dict[str, tuple[str, str]] = {
|
||||
"FoundryLocalChatOptions": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
|
||||
"FoundryLocalClient": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
|
||||
"FoundryLocalSettings": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
|
||||
"GeneratedEvaluatorRef": ("agent_framework_foundry", "agent-framework-foundry"),
|
||||
"RawAnthropicFoundryClient": ("agent_framework_anthropic", "agent-framework-anthropic"),
|
||||
"RawFoundryAgent": ("agent_framework_foundry", "agent-framework-foundry"),
|
||||
"RawFoundryAgentChatClient": ("agent_framework_foundry", "agent-framework-foundry"),
|
||||
|
||||
@@ -20,6 +20,7 @@ from agent_framework_foundry import (
|
||||
FoundryEmbeddingSettings,
|
||||
FoundryEvals,
|
||||
FoundryMemoryProvider,
|
||||
GeneratedEvaluatorRef,
|
||||
RawFoundryAgent,
|
||||
RawFoundryAgentChatClient,
|
||||
RawFoundryChatClient,
|
||||
@@ -52,6 +53,7 @@ __all__ = [
|
||||
"FoundryLocalClient",
|
||||
"FoundryLocalSettings",
|
||||
"FoundryMemoryProvider",
|
||||
"GeneratedEvaluatorRef",
|
||||
"RawAnthropicFoundryClient",
|
||||
"RawFoundryAgent",
|
||||
"RawFoundryAgentChatClient",
|
||||
|
||||
@@ -11,8 +11,13 @@ import pytest
|
||||
from agent_framework._evaluation import (
|
||||
CheckResult,
|
||||
EvalItem,
|
||||
EvalItemResult,
|
||||
EvalNotPassedError,
|
||||
EvalResults,
|
||||
EvalScoreResult,
|
||||
ExpectedToolCall,
|
||||
LocalEvaluator,
|
||||
RubricScore,
|
||||
_coerce_result,
|
||||
evaluator,
|
||||
keyword_check,
|
||||
@@ -1010,19 +1015,101 @@ class TestAllPassedSubResults:
|
||||
|
||||
|
||||
 # ---------------------------------------------------------------------------
-# r5 review: _build_overall_item with empty outputs
+# Rubric assertions (EvalResults.assert_*)
 # ---------------------------------------------------------------------------
 
 
-class TestBuildOverallItemEmpty:
-    """Test _build_overall_item returns None for empty workflow outputs."""
+def _rubric_results(*scores_per_item: list[EvalScoreResult]) -> EvalResults:
+    items = [
+        EvalItemResult(item_id=f"item-{i}", status="pass", scores=scores) for i, scores in enumerate(scores_per_item)
+    ]
+    return EvalResults(
+        provider="test",
+        eval_id="ev1",
+        run_id="run1",
+        result_counts={"passed": len(items), "failed": 0, "errored": 0, "total": len(items)},
+        items=items,
+    )
 
-    def test_returns_none_for_empty_outputs(self):
-        from unittest.mock import MagicMock
 
-        from agent_framework._evaluation import _build_overall_item
+class TestRubricAssertions:
+    """Tests for EvalResults.assert_dimension_score_at_least."""
 
-        mock_result = MagicMock()
-        mock_result.get_outputs.return_value = []
-        item = _build_overall_item("Hello", mock_result)
-        assert item is None
|
||||
def test_dimension_at_or_above_threshold_passes(self) -> None:
|
||||
results = _rubric_results(
|
||||
[
|
||||
EvalScoreResult(
|
||||
name="policy",
|
||||
score=0.9,
|
||||
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
|
||||
)
|
||||
],
|
||||
)
|
||||
# Should not raise.
|
||||
results.assert_dimension_score_at_least("clarity", 3)
|
||||
|
||||
def test_dimension_below_threshold_raises(self) -> None:
|
||||
results = _rubric_results(
|
||||
[
|
||||
EvalScoreResult(
|
||||
name="policy",
|
||||
score=0.5,
|
||||
dimensions=[RubricScore(id="clarity", score=2, applicable=True, weight=1, reason="")],
|
||||
)
|
||||
],
|
||||
)
|
||||
with pytest.raises(EvalNotPassedError):
|
||||
results.assert_dimension_score_at_least("clarity", 3)
|
||||
|
||||
def test_non_applicable_skipped_by_default(self) -> None:
|
||||
results = _rubric_results(
|
||||
[
|
||||
EvalScoreResult(
|
||||
name="policy",
|
||||
score=1.0,
|
||||
dimensions=[RubricScore(id="clarity", score=None, applicable=False, weight=1, reason="n/a")],
|
||||
)
|
||||
],
|
||||
)
|
||||
# No applicable scores; default behaviour is to skip silently.
|
||||
results.assert_dimension_score_at_least("clarity", 3)
|
||||
|
||||
def test_require_applicable_raises_when_dimension_absent(self) -> None:
|
||||
results = _rubric_results(
|
||||
[EvalScoreResult(name="policy", score=1.0, dimensions=[])],
|
||||
)
|
||||
with pytest.raises(EvalNotPassedError, match="not applicable"):
|
||||
results.assert_dimension_score_at_least("clarity", 3, require_applicable=True)
|
||||
|
||||
def test_require_applicable_raises_when_filtered_evaluator_missing(self) -> None:
|
||||
# Regression: previously the (not evaluator or found_any) guard caused
|
||||
# this case to silently pass even with require_applicable=True.
|
||||
results = _rubric_results(
|
||||
[
|
||||
EvalScoreResult(
|
||||
name="other",
|
||||
score=0.9,
|
||||
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
|
||||
)
|
||||
],
|
||||
)
|
||||
with pytest.raises(EvalNotPassedError, match="not applicable"):
|
||||
results.assert_dimension_score_at_least("clarity", 3, evaluator="policy", require_applicable=True)
|
||||
|
||||
def test_evaluator_filter_isolates_offenders(self) -> None:
|
||||
results = _rubric_results(
|
||||
[
|
||||
EvalScoreResult(
|
||||
name="other",
|
||||
score=0.1,
|
||||
dimensions=[RubricScore(id="clarity", score=1, applicable=True, weight=1, reason="")],
|
||||
),
|
||||
EvalScoreResult(
|
||||
name="policy",
|
||||
score=0.9,
|
||||
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
|
||||
),
|
||||
],
|
||||
)
|
||||
# The low-scoring "other" evaluator is filtered out; "policy" passes.
|
||||
results.assert_dimension_score_at_least("clarity", 3, evaluator="policy")
|
||||
|
||||
@@ -12,6 +12,7 @@ from ._embedding_client import (
|
||||
)
|
||||
from ._foundry_evals import (
|
||||
FoundryEvals,
|
||||
GeneratedEvaluatorRef,
|
||||
evaluate_foundry_target,
|
||||
evaluate_traces,
|
||||
)
|
||||
@@ -33,6 +34,7 @@ __all__ = [
|
||||
"FoundryEmbeddingSettings",
|
||||
"FoundryEvals",
|
||||
"FoundryMemoryProvider",
|
||||
"GeneratedEvaluatorRef",
|
||||
"RawFoundryAgent",
|
||||
"RawFoundryAgentChatClient",
|
||||
"RawFoundryChatClient",
|
||||
|
||||
@@ -28,8 +28,9 @@ from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
-from collections.abc import Sequence
-from typing import TYPE_CHECKING, Any
+from collections.abc import Iterable, Sequence
+from dataclasses import dataclass
+from typing import TYPE_CHECKING, Any, cast
|
||||
|
||||
from agent_framework._evaluation import (
|
||||
AgentEvalConverter,
|
||||
@@ -39,6 +40,7 @@ from agent_framework._evaluation import (
|
||||
EvalItemResult,
|
||||
EvalResults,
|
||||
EvalScoreResult,
|
||||
RubricScore,
|
||||
)
|
||||
from agent_framework._feature_stage import ExperimentalFeature, experimental
|
||||
from openai import AsyncOpenAI
|
||||
@@ -51,6 +53,54 @@ if TYPE_CHECKING:
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# region Generated rubric evaluator references
|
||||
|
||||
|
||||
@experimental(feature_id=ExperimentalFeature.EVALS)
|
||||
@dataclass(frozen=True)
|
||||
class GeneratedEvaluatorRef:
|
||||
"""A reference to a rubric evaluator that already exists in Foundry.
|
||||
|
||||
Pass instances of this class to :class:`FoundryEvals` to score items
|
||||
with a pre-existing rubric evaluator (manually authored or
|
||||
auto-generated through the Foundry portal). agent-framework is a
|
||||
consumer here: it does not create or modify the evaluator definition;
|
||||
it only references the persisted version by name.
|
||||
|
||||
Pinning ``version`` is strongly recommended so evaluation runs are
|
||||
reproducible. ``version=None`` resolves to whichever version is
|
||||
current at execution time; :class:`FoundryEvals` emits a warning when
|
||||
a versionless reference is used. CI gates should always pass a
|
||||
concrete version.
|
||||
|
||||
Attributes:
|
||||
name: Evaluator name as stored in the Foundry project (for
|
||||
example ``"reservation-policy-rubric"``). Distinct from
|
||||
built-in evaluators such as ``"builtin.relevance"``.
|
||||
version: Pinned evaluator version. ``None`` means "latest" —
|
||||
this is discouraged for CI/repro and :class:`FoundryEvals`
|
||||
will emit a warning when used.
|
||||
display_name: Optional human-readable name used in result
|
||||
summaries. Defaults to ``name`` when unset.
|
||||
"""
|
||||
|
||||
name: str
|
||||
version: str | None = None
|
||||
display_name: str | None = None
|
||||
|
||||
@classmethod
|
||||
def latest(cls, name: str, *, display_name: str | None = None) -> GeneratedEvaluatorRef:
|
||||
"""Construct a versionless reference (resolves to the latest version at run time).
|
||||
|
||||
Discouraged for reproducible runs. Prefer the constructor with
|
||||
an explicit ``version`` so CI and replay evaluations stay stable
|
||||
when the evaluator is updated in Foundry.
|
||||
"""
|
||||
return cls(name=name, version=None, display_name=display_name)
|
||||
|
||||
|
||||
# endregion
|
||||
# Agent evaluators that accept query/response as conversation arrays.
|
||||
# Maintained manually — check https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk
|
||||
# for the latest evaluator list. These are the evaluators that need conversation-format input.
|
||||
@@ -166,7 +216,7 @@ def _resolve_evaluator(name: str) -> str:
|
||||
|
||||
|
||||
def _build_testing_criteria(
|
||||
-    evaluators: Sequence[str],
+    evaluators: Sequence[str | GeneratedEvaluatorRef],
|
||||
model: str,
|
||||
*,
|
||||
include_data_mapping: bool = False,
|
||||
@@ -175,7 +225,9 @@ def _build_testing_criteria(
|
||||
"""Build ``testing_criteria`` for ``evals.create()``.
|
||||
|
||||
Args:
|
||||
-        evaluators: Evaluator names.
+        evaluators: Evaluator names (built-in shorts / fully-qualified
+            ``builtin.*`` names) or :class:`GeneratedEvaluatorRef`
+            instances for generated rubric evaluators.
|
||||
model: Model deployment for the LLM judge.
|
||||
include_data_mapping: Whether to include field-level data mapping
|
||||
(required for the JSONL data source, not needed for response-based).
|
||||
@@ -183,7 +235,38 @@ def _build_testing_criteria(
|
||||
definitions.
|
||||
"""
|
||||
criteria: list[dict[str, Any]] = []
|
||||
-    for name in evaluators:
+    for entry_spec in evaluators:
+        if isinstance(entry_spec, GeneratedEvaluatorRef):
+            short = entry_spec.display_name or entry_spec.name
+            ref_entry: dict[str, Any] = {
+                "type": "azure_ai_evaluator",
+                "name": short,
+                "evaluator_name": entry_spec.name,
+                "initialization_parameters": {"deployment_name": model},
+            }
+            if entry_spec.version is not None:
+                ref_entry["evaluator_version"] = entry_spec.version
+            else:
+                logger.warning(
+                    "GeneratedEvaluatorRef '%s' has no pinned version; the eval run "
+                    "will resolve to whichever version is current at execution time. "
+                    "Pin the version for reproducible runs.",
+                    entry_spec.name,
+                )
+            if include_data_mapping:
+                # Rubric evaluators accept conversation arrays like agent
+                # evaluators, plus tool_definitions when items are tool-aware.
+                ref_mapping: dict[str, str] = {
+                    "query": "{{item.query_messages}}",
+                    "response": "{{item.response_messages}}",
+                }
+                if include_tool_definitions:
+                    ref_mapping["tool_definitions"] = "{{item.tool_definitions}}"
+                ref_entry["data_mapping"] = ref_mapping
+            criteria.append(ref_entry)
+            continue
+
+        name = entry_spec
|
||||
qualified = _resolve_evaluator(name)
|
||||
short = name if not name.startswith("builtin.") else name.split(".")[-1]
|
||||
|
||||
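
To make the effect of pinning concrete, here is a minimal standalone sketch of the criteria entry built for a pinned versus a versionless reference. `Ref` and `criteria_entry` are hypothetical stand-ins for `GeneratedEvaluatorRef` and the branch shown in this hunk, the data-mapping step is omitted, and the deployment name `"my-judge-deployment"` and version `"3"` are placeholders.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Any

logger = logging.getLogger("rubric_sketch")


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for GeneratedEvaluatorRef."""

    name: str
    version: str | None = None
    display_name: str | None = None


def criteria_entry(ref: Ref, model: str) -> dict[str, Any]:
    """Build one testing-criteria dict for a rubric evaluator reference."""
    entry: dict[str, Any] = {
        "type": "azure_ai_evaluator",
        "name": ref.display_name or ref.name,
        "evaluator_name": ref.name,
        "initialization_parameters": {"deployment_name": model},
    }
    if ref.version is not None:
        entry["evaluator_version"] = ref.version
    else:
        # Versionless references are allowed but flagged as non-reproducible.
        logger.warning("'%s' has no pinned version", ref.name)
    return entry


pinned = criteria_entry(Ref("reservation-policy-rubric", version="3"), "my-judge-deployment")
floating = criteria_entry(Ref("reservation-policy-rubric"), "my-judge-deployment")
```

The only structural difference between the two results is the `evaluator_version` key, which is why a versionless reference can silently change behaviour when the evaluator is updated in Foundry.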
@@ -247,9 +330,9 @@ def _build_item_schema(
|
||||
|
||||
|
||||
def _resolve_default_evaluators(
|
||||
-    evaluators: Sequence[str] | None,
+    evaluators: Sequence[str | GeneratedEvaluatorRef] | None,
     items: Sequence[EvalItem | dict[str, Any]] | None = None,
-) -> list[str]:
+) -> list[str | GeneratedEvaluatorRef]:
|
||||
"""Resolve evaluators, applying defaults when ``None``.
|
||||
|
||||
Defaults to relevance + coherence + task_adherence. Automatically adds
|
||||
@@ -258,7 +341,7 @@ def _resolve_default_evaluators(
|
||||
if evaluators is not None:
|
||||
return list(evaluators)
|
||||
|
||||
-    result = list(_DEFAULT_EVALUATORS)
+    result: list[str | GeneratedEvaluatorRef] = list(_DEFAULT_EVALUATORS)
|
||||
if items is not None:
|
||||
has_tools = any((item.tools if isinstance(item, EvalItem) else item.get("tool_definitions")) for item in items)
|
||||
if has_tools:
|
||||
@@ -267,14 +350,24 @@ def _resolve_default_evaluators(
|
||||
|
||||
|
||||
def _filter_tool_evaluators(
|
||||
-    evaluators: list[str],
+    evaluators: list[str | GeneratedEvaluatorRef],
     items: Sequence[EvalItem | dict[str, Any]],
-) -> list[str]:
-    """Remove tool evaluators if no items have tool definitions."""
+) -> list[str | GeneratedEvaluatorRef]:
+    """Remove tool evaluators if no items have tool definitions.
+
+    Generated rubric evaluators are tool-aware but not tool-required; they
+    are preserved regardless of whether items carry tool definitions.
+    """
|
||||
has_tools = any((item.tools if isinstance(item, EvalItem) else item.get("tool_definitions")) for item in items)
|
||||
if has_tools:
|
||||
return evaluators
|
||||
-    filtered = [e for e in evaluators if _resolve_evaluator(e) not in _TOOL_EVALUATORS]
+
+    def _is_tool_only(spec: str | GeneratedEvaluatorRef) -> bool:
+        if isinstance(spec, GeneratedEvaluatorRef):
+            return False
+        return _resolve_evaluator(spec) in _TOOL_EVALUATORS
+
+    filtered = [e for e in evaluators if not _is_tool_only(e)]
|
||||
if not filtered:
|
||||
raise ValueError(
|
||||
f"All requested evaluators {evaluators} require tool definitions, "
|
||||
@@ -282,7 +375,7 @@ def _filter_tool_evaluators(
|
||||
"or choose evaluators that do not require tools."
|
||||
)
|
||||
if len(filtered) < len(evaluators):
|
||||
-        removed = [e for e in evaluators if _resolve_evaluator(e) in _TOOL_EVALUATORS]
+        removed = [e for e in evaluators if _is_tool_only(e)]
|
||||
logger.info("Removed tool evaluators %s (no items have tools)", removed)
|
||||
return filtered
|
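The filtering rule in this hunk can be sketched in isolation as follows. This is a minimal illustration, not the package's code: `GeneratedRef`, `TOOL_EVALUATORS`, `resolve` and the `has_tools` flag are hypothetical stand-ins for `GeneratedEvaluatorRef`, `_TOOL_EVALUATORS`, `_resolve_evaluator` and the item scan, and the set contents are assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GeneratedRef:
    """Hypothetical stand-in for GeneratedEvaluatorRef."""

    name: str
    version: Optional[str] = None


# Assumed contents; the real set lives in the Foundry evals module.
TOOL_EVALUATORS = {"builtin.tool_call_accuracy"}


def resolve(name: str) -> str:
    """Map a short name to its fully-qualified form (assumed convention)."""
    return name if name.startswith("builtin.") else f"builtin.{name}"


def filter_tool_evaluators(evaluators, has_tools):
    """Drop tool-only built-ins when no item has tools; keep generated refs."""
    if has_tools:
        return list(evaluators)

    def is_tool_only(spec):
        # Generated rubric evaluators never count as tool-only.
        if isinstance(spec, GeneratedRef):
            return False
        return resolve(spec) in TOOL_EVALUATORS

    filtered = [e for e in evaluators if not is_tool_only(e)]
    if not filtered:
        raise ValueError("All requested evaluators require tool definitions")
    return filtered


ref = GeneratedRef("my-rubric", "1")
kept = filter_tool_evaluators(["relevance", ref, "tool_call_accuracy"], has_tools=False)
```

With no tool definitions present, `kept` holds the built-in `"relevance"` and the generated ref, and a list containing only the generated ref passes through without raising.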
||||
|
||||
@@ -354,6 +447,114 @@ def _extract_per_evaluator(run: RunRetrieveResponse) -> dict[str, dict[str, int]
|
||||
return per_eval
|
||||
|
||||
|
||||
_RUBRIC_DIMENSION_KEYS: tuple[str, ...] = ("dimension_scores", "rubric_scores")
|
||||
"""Property keys that may carry per-dimension rubric breakdowns.
|
||||
|
||||
The published Foundry rubric-evaluator output format uses
|
||||
``properties.dimension_scores`` (see the Microsoft Learn "Rubric
|
||||
evaluators" reference). Earlier preview builds and some SDK shapes
|
||||
used ``rubric_scores``; we accept both for defensive forward/backward
|
||||
compatibility.
|
||||
"""
|
||||
|
||||
|
||||
def _parse_dimension_entries(raw: Any) -> list[RubricScore]:
|
||||
"""Parse a raw list-like payload into ``RubricScore`` instances.
|
||||
|
||||
Returns an empty list when ``raw`` is falsy, not iterable, or
|
||||
contains no well-formed entries.
|
||||
"""
|
||||
if not raw:
|
||||
return []
|
||||
try:
|
||||
raw_iter: Iterable[Any] = iter(raw)
|
||||
except TypeError:
|
||||
return []
|
||||
|
||||
parsed: list[RubricScore] = []
|
||||
for raw_entry in raw_iter:
|
||||
entry: Any = raw_entry
|
||||
try:
|
||||
rid: Any
|
||||
score_val: Any
|
||||
applicable: Any
|
||||
weight: Any
|
||||
reason: Any
|
||||
if isinstance(entry, dict):
|
||||
entry_any = cast("dict[str, Any]", entry)
|
||||
rid = entry_any.get("id")
|
||||
score_val = entry_any.get("score")
|
||||
applicable = entry_any.get("applicable")
|
||||
weight = entry_any.get("weight")
|
||||
reason = entry_any.get("reason", "")
|
||||
else:
|
||||
rid = getattr(entry, "id", None)
|
||||
score_val = getattr(entry, "score", None)
|
||||
applicable = getattr(entry, "applicable", None)
|
||||
weight = getattr(entry, "weight", None)
|
||||
reason = getattr(entry, "reason", "") or ""
|
||||
if rid is None or weight is None or applicable is None:
|
||||
continue
|
||||
parsed.append(
|
||||
RubricScore(
|
||||
id=str(rid),
|
||||
score=int(score_val) if isinstance(score_val, (int, float)) else None,
|
||||
applicable=bool(applicable),
|
||||
weight=int(weight),
|
||||
reason=str(reason) if reason is not None else "",
|
||||
)
|
||||
)
|
||||
except (TypeError, ValueError):
|
||||
logger.debug("Skipping malformed rubric dimension entry: %s", cast("Any", entry), exc_info=True)
|
||||
return parsed
|
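The tolerant parsing described in the docstring above can be approximated with the sketch below. It is a simplified illustration under stated assumptions: `Score` is a hypothetical stand-in for `RubricScore`, `_field` is an invented helper, and malformed entries are skipped silently rather than logged at debug level.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class Score:
    """Hypothetical stand-in for RubricScore."""

    id: str
    score: Optional[int]
    applicable: bool
    weight: int
    reason: str = ""


def _field(entry: Any, name: str, default: Any = None) -> Any:
    """Read a field from either a dict entry or an attribute-style object."""
    if isinstance(entry, dict):
        return entry.get(name, default)
    return getattr(entry, name, default)


def parse_entries(raw: Any) -> list:
    """Return well-formed entries; falsy or non-iterable input yields []."""
    if not raw:
        return []
    try:
        entries = iter(raw)
    except TypeError:
        return []

    parsed = []
    for entry in entries:
        rid = _field(entry, "id")
        weight = _field(entry, "weight")
        applicable = _field(entry, "applicable")
        # id, weight and applicable are required; score may be absent.
        if rid is None or weight is None or applicable is None:
            continue
        score = _field(entry, "score")
        try:
            parsed.append(
                Score(
                    id=str(rid),
                    score=int(score) if isinstance(score, (int, float)) else None,
                    applicable=bool(applicable),
                    weight=int(weight),
                    reason=str(_field(entry, "reason", "") or ""),
                )
            )
        except (TypeError, ValueError):
            continue  # skip entries whose values cannot be coerced
    return parsed


rows = parse_entries(
    [
        {"id": "policy", "score": 4, "applicable": True, "weight": 1, "reason": "ok"},
        {"id": "no-weight", "score": 2, "applicable": True},
        SimpleNamespace(id="safety", score=None, applicable=False, weight=2, reason=None),
    ]
)
```

The second entry lacks `weight` and is dropped, so `rows` contains the `policy` and `safety` dimensions only.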
||||
|
||||
|
||||
def _extract_rubric_scores(sample: Any) -> list[RubricScore] | None:
|
||||
"""Extract typed ``RubricScore`` instances from an evaluator's raw sample payload.
|
||||
|
||||
Foundry rubric evaluators include a per-dimension breakdown under
|
||||
``properties.dimension_scores`` on each result (preview builds used
|
||||
``rubric_scores``; both keys are accepted, with the canonical
|
||||
``dimension_scores`` taking priority). The exact location may
|
||||
vary across SDK versions, so this helper accepts a few shapes:
|
||||
|
||||
* The SDK ``sample`` object exposes
|
||||
``properties.dimension_scores`` / ``properties.rubric_scores``.
|
||||
* The ``sample`` is a dict containing the same under
|
||||
``properties.<key>``.
|
||||
* The ``sample`` is a dict with ``dimension_scores`` /
|
||||
``rubric_scores`` at the top level.
|
||||
|
||||
Returns ``None`` when no rubric scores are present (i.e. the
|
||||
evaluator was not a rubric evaluator).
|
||||
"""
|
||||
if sample is None:
|
||||
return None
|
||||
|
||||
containers: list[Any] = []
|
||||
properties: Any = getattr(sample, "properties", None)
|
||||
if properties is not None:
|
||||
containers.append(properties)
|
||||
if isinstance(sample, dict):
|
||||
sample_any = cast("dict[str, Any]", sample)
|
||||
props_dict: Any = sample_any.get("properties")
|
||||
if props_dict is not None and props_dict is not properties:
|
||||
containers.append(props_dict)
|
||||
containers.append(sample_any)
|
||||
|
||||
for container in containers:
|
||||
for key in _RUBRIC_DIMENSION_KEYS:
|
||||
raw: Any = None
|
||||
if isinstance(container, dict):
|
||||
raw = cast("dict[str, Any]", container).get(key)
|
||||
elif hasattr(container, key):
|
||||
raw = getattr(container, key, None)
|
||||
parsed = _parse_dimension_entries(raw)
|
||||
if parsed:
|
||||
return parsed
|
||||
return None
|
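The lookup order (containers first, then `dimension_scores` before `rubric_scores`) can be shown with a small self-contained sketch. `find_dimension_payload` is a hypothetical name, and unlike the helper in this hunk it returns the first non-empty raw payload instead of parsed `RubricScore` objects.

```python
from types import SimpleNamespace
from typing import Any

# Canonical key first, preview-era key second.
KEYS = ("dimension_scores", "rubric_scores")


def find_dimension_payload(sample: Any) -> Any:
    """Return the first non-empty per-dimension payload, or None."""
    if sample is None:
        return None

    containers = []
    props = getattr(sample, "properties", None)
    if props is not None:
        containers.append(props)
    if isinstance(sample, dict):
        props_dict = sample.get("properties")
        if props_dict is not None and props_dict is not props:
            containers.append(props_dict)
        containers.append(sample)  # keys may also sit at the top level

    for container in containers:
        for key in KEYS:
            if isinstance(container, dict):
                raw = container.get(key)
            else:
                raw = getattr(container, key, None)
            if raw:
                return raw
    return None


both = {
    "properties": {
        "rubric_scores": [{"id": "old"}],
        "dimension_scores": [{"id": "new"}],
    }
}
canonical = find_dimension_payload(both)
attr_style = find_dimension_payload(
    SimpleNamespace(properties=SimpleNamespace(rubric_scores=["x"]))
)
```

When both keys are present the canonical `dimension_scores` payload wins; the legacy key is consulted only when the canonical one is missing or empty.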
||||
|
||||
|
||||
async def _fetch_output_items(
|
||||
client: AsyncOpenAI,
|
||||
eval_id: str,
|
||||
@@ -377,12 +578,15 @@ async def _fetch_output_items(
|
||||
# Extract per-evaluator scores
|
||||
scores: list[EvalScoreResult] = []
|
||||
for r in oi.results or []:
|
||||
+ sample = r.sample
+ dimensions = _extract_rubric_scores(sample)
|
||||
scores.append(
|
||||
EvalScoreResult(
|
||||
name=r.name,
|
||||
score=r.score,
|
||||
passed=r.passed,
|
||||
- sample=r.sample,
+ sample=sample,
+ dimensions=dimensions,
|
||||
)
|
||||
)
|
||||
|
||||
@@ -394,15 +598,18 @@ async def _fetch_output_items(
|
||||
output_text: str | None = None
|
||||
response_id: str | None = None
|
||||
|
||||
- sample = oi.sample
- if sample is not None: # pyright: ignore[reportUnnecessaryComparison]
- err = sample.error
- if err is not None and (err.code or err.message): # pyright: ignore[reportUnnecessaryComparison]
+ # mypy infers oi.sample as dict[str, object] | None, but the
+ # OpenAI SDK actually returns a typed Sample model. Cast to Any so
+ # both type checkers accept the attribute access pattern.
+ oi_sample: Any = oi.sample
+ if oi_sample is not None:
+ err = oi_sample.error
+ if err is not None and (err.code or err.message):
|
||||
error_code = err.code or None
|
||||
error_message = err.message or None
|
||||
|
||||
- usage = sample.usage
- if usage is not None and usage.total_tokens: # pyright: ignore[reportUnnecessaryComparison]
+ usage = oi_sample.usage
+ if usage is not None and usage.total_tokens:
|
||||
token_usage = {
|
||||
"prompt_tokens": usage.prompt_tokens,
|
||||
"completion_tokens": usage.completion_tokens,
|
||||
@@ -411,13 +618,13 @@ async def _fetch_output_items(
|
||||
}
|
||||
|
||||
# Extract input/output text
|
||||
- if sample.input:
- parts = [si.content for si in sample.input if si.role == "user"]
+ if oi_sample.input:
+ parts = [si.content for si in oi_sample.input if si.role == "user"]
|
||||
if parts:
|
||||
input_text = " ".join(parts)
|
||||
|
||||
- if sample.output:
- parts = [so.content or "" for so in sample.output if so.role == "assistant"]
+ if oi_sample.output:
+ parts = [so.content or "" for so in oi_sample.output if so.role == "assistant"]
|
||||
if parts:
|
||||
output_text = " ".join(parts)
|
||||
|
||||
@@ -472,7 +679,7 @@ async def _evaluate_via_responses_impl(
|
||||
*,
|
||||
client: AsyncOpenAI,
|
||||
response_ids: Sequence[str],
|
||||
- evaluators: list[str],
+ evaluators: list[str | GeneratedEvaluatorRef],
|
||||
model: str,
|
||||
eval_name: str,
|
||||
poll_interval: float,
|
||||
@@ -573,8 +780,11 @@ class FoundryEvals:
|
||||
(from ``azure.ai.projects.aio``). Provide this or *client*.
|
||||
model: Model deployment name for the evaluator LLM judge.
|
||||
Resolved from ``client.model`` when omitted.
|
||||
- evaluators: Evaluator names (e.g. ``["relevance", "tool_call_accuracy"]``).
- When ``None`` (default), uses smart defaults based on item data.
+ evaluators: Evaluator specifications. Entries may be built-in
+ short names (e.g. ``"relevance"``), fully-qualified
+ ``"builtin.*"`` names, or :class:`GeneratedEvaluatorRef`
+ instances for previously generated rubric evaluators. When
+ ``None`` (default), uses smart defaults based on item data.
|
||||
conversation_split: How to split multi-turn conversations into
|
||||
query/response halves. Defaults to ``LAST_TURN``. Pass a
|
||||
``ConversationSplit`` enum value or a custom callable — see
|
||||
@@ -623,7 +833,7 @@ class FoundryEvals:
|
||||
client: FoundryChatClient | None = None,
|
||||
project_client: AIProjectClient | None = None,
|
||||
model: str | None = None,
|
||||
- evaluators: Sequence[str] | None = None,
+ evaluators: Sequence[str | GeneratedEvaluatorRef] | None = None,
|
||||
conversation_split: ConversationSplitter = ConversationSplit.LAST_TURN,
|
||||
poll_interval: float = 5.0,
|
||||
timeout: float = 180.0,
|
||||
@@ -642,7 +852,9 @@ class FoundryEvals:
|
||||
"Model is required. Pass model= explicitly or use a FoundryChatClient that has a model configured."
|
||||
)
|
||||
self._model = resolved_model
|
||||
- self._evaluators = list(evaluators) if evaluators is not None else None
+ self._evaluators: list[str | GeneratedEvaluatorRef] | None = (
+ list(evaluators) if evaluators is not None else None
+ )
|
||||
self._conversation_split = conversation_split
|
||||
self._poll_interval = poll_interval
|
||||
self._timeout = timeout
|
||||
@@ -678,7 +890,7 @@ class FoundryEvals:
|
||||
async def _evaluate_via_dataset(
|
||||
self,
|
||||
items: Sequence[EvalItem],
|
||||
- evaluators: list[str],
+ evaluators: list[str | GeneratedEvaluatorRef],
|
||||
eval_name: str,
|
||||
) -> EvalResults:
|
||||
"""Evaluate using JSONL dataset upload path."""
|
||||
|
||||
@@ -25,16 +25,25 @@ from agent_framework._evaluation import (
|
||||
from agent_framework._workflows._workflow import WorkflowRunResult
|
||||
from openai import AsyncOpenAI
|
||||
|
||||
from agent_framework_foundry import GeneratedEvaluatorRef
|
||||
from agent_framework_foundry._foundry_evals import (
|
||||
_AGENT_EVALUATORS,
|
||||
_BUILTIN_EVALUATORS,
|
||||
_TOOL_EVALUATORS,
|
||||
FoundryEvals,
|
||||
_build_item_schema,
|
||||
_build_testing_criteria,
|
||||
_extract_per_evaluator,
|
||||
_extract_result_counts,
|
||||
_extract_rubric_scores,
|
||||
_fetch_output_items,
|
||||
_filter_tool_evaluators,
|
||||
_poll_eval_run,
|
||||
_resolve_default_evaluators,
|
||||
_resolve_evaluator,
|
||||
_resolve_openai_client,
|
||||
evaluate_foundry_target,
|
||||
evaluate_traces,
|
||||
)
|
||||
|
||||
|
||||
@@ -806,6 +815,67 @@ class TestBuildTestingCriteria:
|
||||
for c in criteria:
|
||||
assert "tool_definitions" in c["data_mapping"], f"{c['name']} missing tool_definitions"
|
||||
|
||||
def test_generated_evaluator_ref_pinned_version(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
|
||||
criteria = _build_testing_criteria([ref], "gpt-4o", include_data_mapping=True)
|
||||
|
||||
assert len(criteria) == 1
|
||||
c = criteria[0]
|
||||
assert c["type"] == "azure_ai_evaluator"
|
||||
assert c["evaluator_name"] == "my-rubric"
|
||||
assert c["evaluator_version"] == "1"
|
||||
assert c["name"] == "my-rubric"
|
||||
assert c["initialization_parameters"] == {"deployment_name": "gpt-4o"}
|
||||
assert c["data_mapping"] == {
|
||||
"query": "{{item.query_messages}}",
|
||||
"response": "{{item.response_messages}}",
|
||||
}
|
||||
|
||||
def test_generated_evaluator_ref_display_name_used_as_short(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="my-rubric", version="2", display_name="My Rubric")
|
||||
criteria = _build_testing_criteria([ref], "gpt-4o")
|
||||
|
||||
assert criteria[0]["name"] == "My Rubric"
|
||||
assert criteria[0]["evaluator_name"] == "my-rubric"
|
||||
|
||||
def test_generated_evaluator_ref_tool_definitions_added(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
|
||||
criteria = _build_testing_criteria(
|
||||
[ref],
|
||||
"gpt-4o",
|
||||
include_data_mapping=True,
|
||||
include_tool_definitions=True,
|
||||
)
|
||||
|
||||
assert criteria[0]["data_mapping"]["tool_definitions"] == "{{item.tool_definitions}}"
|
||||
|
||||
def test_generated_evaluator_ref_unpinned_warns(self, caplog: pytest.LogCaptureFixture) -> None:
|
||||
import logging
|
||||
|
||||
ref = GeneratedEvaluatorRef.latest("my-rubric")
|
||||
with caplog.at_level(logging.WARNING, logger="agent_framework_foundry._foundry_evals"):
|
||||
criteria = _build_testing_criteria([ref], "gpt-4o")
|
||||
|
||||
assert "evaluator_version" not in criteria[0]
|
||||
assert any("no pinned version" in r.message for r in caplog.records)
|
||||
|
||||
def test_generated_evaluator_ref_mixed_with_builtins(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
|
||||
criteria = _build_testing_criteria(
|
||||
["relevance", ref, "task_adherence"],
|
||||
"gpt-4o",
|
||||
include_data_mapping=True,
|
||||
)
|
||||
|
||||
assert [c["name"] for c in criteria] == ["relevance", "my-rubric", "task_adherence"]
|
||||
assert criteria[0]["evaluator_name"] == "builtin.relevance"
|
||||
assert criteria[1]["evaluator_name"] == "my-rubric"
|
||||
assert criteria[2]["evaluator_name"] == "builtin.task_adherence"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _build_item_schema
|
||||
@@ -1263,6 +1333,29 @@ class TestFilterToolEvaluators:
|
||||
items,
|
||||
)
|
||||
|
||||
def test_preserves_generated_ref_when_no_tools(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="rubric", version="1")
|
||||
items = [
|
||||
EvalItem(conversation=[Message("user", ["q"]), Message("assistant", ["r"])]),
|
||||
]
|
||||
result = _filter_tool_evaluators(
|
||||
["relevance", ref, "tool_call_accuracy"],
|
||||
items,
|
||||
)
|
||||
assert "relevance" in result
|
||||
assert ref in result
|
||||
assert "tool_call_accuracy" not in result
|
||||
|
||||
def test_generated_ref_alone_does_not_raise(self) -> None:
|
||||
|
||||
ref = GeneratedEvaluatorRef(name="rubric", version="1")
|
||||
items = [
|
||||
EvalItem(conversation=[Message("user", ["q"]), Message("assistant", ["r"])]),
|
||||
]
|
||||
result = _filter_tool_evaluators([ref], items)
|
||||
assert result == [ref]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# EvalResults
|
||||
@@ -2267,7 +2360,6 @@ class TestEvalResultsWithItems:
|
||||
|
||||
class TestFetchOutputItems:
|
||||
async def test_fetches_and_converts_output_items(self) -> None:
|
||||
from agent_framework_foundry._foundry_evals import _fetch_output_items
|
||||
|
||||
# Build mock output items matching the OpenAI SDK schema
|
||||
mock_result = MagicMock()
|
||||
@@ -2329,7 +2421,6 @@ class TestFetchOutputItems:
|
||||
assert item.error_code is None
|
||||
|
||||
async def test_handles_errored_item(self) -> None:
|
||||
from agent_framework_foundry._foundry_evals import _fetch_output_items
|
||||
|
||||
mock_error = MagicMock()
|
||||
mock_error.code = "QueryExtractionError"
|
||||
@@ -2361,7 +2452,6 @@ class TestFetchOutputItems:
|
||||
assert len(item.scores) == 0
|
||||
|
||||
async def test_handles_api_failure_gracefully(self) -> None:
|
||||
from agent_framework_foundry._foundry_evals import _fetch_output_items
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.evals.runs.output_items.list = AsyncMock(side_effect=TypeError("API error"))
|
||||
@@ -2369,6 +2459,166 @@ class TestFetchOutputItems:
|
||||
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
|
||||
assert items == []
|
||||
|
||||
async def test_extracts_rubric_scores_from_dict_sample(self) -> None:
|
||||
|
||||
mock_result = MagicMock()
|
||||
mock_result.name = "my-rubric"
|
||||
mock_result.score = 0.85
|
||||
mock_result.passed = True
|
||||
mock_result.sample = {
|
||||
"properties": {
|
||||
"rubric_scores": [
|
||||
{"id": "policy", "score": 4, "applicable": True, "weight": 1, "reason": "ok"},
|
||||
{"id": "safety", "score": None, "applicable": False, "weight": 1, "reason": "n/a"},
|
||||
]
|
||||
}
|
||||
}
|
||||
|
||||
mock_oi = MagicMock()
|
||||
mock_oi.id = "oi_1"
|
||||
mock_oi.status = "pass"
|
||||
mock_oi.results = [mock_result]
|
||||
mock_oi.sample = None
|
||||
mock_oi.datasource_item = {}
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.evals.runs.output_items.list = AsyncMock(return_value=_AsyncPage([mock_oi]))
|
||||
|
||||
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
|
||||
|
||||
assert len(items) == 1
|
||||
scores = items[0].scores
|
||||
assert len(scores) == 1
|
||||
assert scores[0].dimensions is not None
|
||||
assert len(scores[0].dimensions) == 2
|
||||
policy = next(d for d in scores[0].dimensions if d.id == "policy")
|
||||
assert policy.score == 4
|
||||
assert policy.applicable is True
|
||||
assert policy.weight == 1
|
||||
assert policy.reason == "ok"
|
||||
safety = next(d for d in scores[0].dimensions if d.id == "safety")
|
||||
assert safety.score is None
|
||||
assert safety.applicable is False
|
||||
|
||||
async def test_no_rubric_scores_when_absent(self) -> None:
|
||||
|
||||
mock_result = MagicMock()
|
||||
mock_result.name = "relevance"
|
||||
mock_result.score = 0.85
|
||||
mock_result.passed = True
|
||||
mock_result.sample = None
|
||||
|
||||
mock_oi = MagicMock()
|
||||
mock_oi.id = "oi_2"
|
||||
mock_oi.status = "pass"
|
||||
mock_oi.results = [mock_result]
|
||||
mock_oi.sample = None
|
||||
mock_oi.datasource_item = {}
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_client.evals.runs.output_items.list = AsyncMock(return_value=_AsyncPage([mock_oi]))
|
||||
|
||||
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
|
||||
|
||||
assert items[0].scores[0].dimensions is None
|
||||
|
||||
|
||||
class TestExtractRubricScores:
|
||||
def test_handles_attribute_style_properties(self) -> None:
|
||||
|
||||
rs = MagicMock()
|
||||
rs.id = "policy"
|
||||
rs.score = 5
|
||||
rs.applicable = True
|
||||
rs.weight = 2
|
||||
rs.reason = "ok"
|
||||
|
||||
sample = MagicMock()
|
||||
sample.properties = MagicMock()
|
||||
sample.properties.rubric_scores = [rs]
|
||||
|
||||
result = _extract_rubric_scores(sample)
|
||||
assert result is not None
|
||||
assert result[0].id == "policy"
|
||||
assert result[0].score == 5
|
||||
assert result[0].weight == 2
|
||||
|
||||
def test_top_level_rubric_scores_in_dict(self) -> None:
|
||||
|
||||
sample = {"rubric_scores": [{"id": "a", "score": 3, "applicable": True, "weight": 1, "reason": "r"}]}
|
||||
result = _extract_rubric_scores(sample)
|
||||
assert result is not None
|
||||
assert result[0].id == "a"
|
||||
|
||||
def test_returns_none_when_missing(self) -> None:
|
||||
|
||||
assert _extract_rubric_scores(None) is None
|
||||
assert _extract_rubric_scores({}) is None
|
||||
assert _extract_rubric_scores({"properties": {}}) is None
|
||||
|
||||
def test_skips_malformed_entries(self) -> None:
|
||||
|
||||
sample = {
|
||||
"properties": {
|
||||
"rubric_scores": [
|
||||
{"id": "good", "score": 3, "applicable": True, "weight": 1, "reason": "ok"},
|
||||
{"id": "bad-no-weight", "score": 2, "applicable": True, "reason": "x"},
|
||||
]
|
||||
}
|
||||
}
|
||||
result = _extract_rubric_scores(sample)
|
||||
assert result is not None
|
||||
assert len(result) == 1
|
||||
assert result[0].id == "good"
|
||||
|
||||
def test_canonical_dimension_scores_key_from_docs(self) -> None:
|
||||
"""Per the Microsoft Learn docs, runtime output uses ``properties.dimension_scores``."""
|
||||
|
||||
sample = {
|
||||
"properties": {
|
||||
"dimension_scores": [
|
||||
{
|
||||
"id": "intent_recognition",
|
||||
"score": 5,
|
||||
"applicable": True,
|
||||
"weight": 9,
|
||||
"reason": "Identified correctly.",
|
||||
},
|
||||
{
|
||||
"id": "general_quality",
|
||||
"score": 4,
|
||||
"applicable": True,
|
||||
"weight": 5,
|
||||
"reason": "Strong overall.",
|
||||
},
|
||||
]
|
||||
}
|
||||
}
|
||||
result = _extract_rubric_scores(sample)
|
||||
assert result is not None
|
||||
assert [r.id for r in result] == ["intent_recognition", "general_quality"]
|
||||
assert [r.score for r in result] == [5, 4]
|
||||
assert [r.weight for r in result] == [9, 5]
|
||||
|
||||
def test_dimension_scores_via_attribute(self) -> None:
|
||||
"""Canonical key also resolves when properties exposes ``dimension_scores`` as an attr."""
|
||||
|
||||
rs = MagicMock()
|
||||
rs.id = "policy_enforcement"
|
||||
rs.score = 1
|
||||
rs.applicable = True
|
||||
rs.weight = 5
|
||||
rs.reason = "violated"
|
||||
|
||||
sample = MagicMock()
|
||||
sample.properties = MagicMock(spec=["dimension_scores"])
|
||||
sample.properties.dimension_scores = [rs]
|
||||
|
||||
result = _extract_rubric_scores(sample)
|
||||
assert result is not None
|
||||
assert result[0].id == "policy_enforcement"
|
||||
assert result[0].score == 1
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _poll_eval_run — timeout / failed / canceled paths
|
||||
@@ -2378,7 +2628,6 @@ class TestFetchOutputItems:
|
||||
class TestPollEvalRun:
|
||||
async def test_timeout_returns_timeout_status(self) -> None:
|
||||
"""Poll timeout returns EvalResults with status='timeout'."""
|
||||
from agent_framework_foundry._foundry_evals import _poll_eval_run
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_pending = MagicMock()
|
||||
@@ -2392,7 +2641,6 @@ class TestPollEvalRun:
|
||||
|
||||
async def test_failed_run_returns_error(self) -> None:
|
||||
"""Failed run returns EvalResults with error message."""
|
||||
from agent_framework_foundry._foundry_evals import _poll_eval_run
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_failed = MagicMock()
|
||||
@@ -2410,7 +2658,6 @@ class TestPollEvalRun:
|
||||
|
||||
async def test_canceled_run_returns_canceled_status(self) -> None:
|
||||
"""Canceled run returns EvalResults with status='canceled'."""
|
||||
from agent_framework_foundry._foundry_evals import _poll_eval_run
|
||||
|
||||
mock_client = MagicMock()
|
||||
mock_canceled = MagicMock()
|
||||
@@ -2435,7 +2682,6 @@ class TestPollEvalRun:
|
||||
class TestEvaluateTraces:
|
||||
async def test_raises_without_required_args(self) -> None:
|
||||
"""Raises ValueError when no response_ids, trace_ids, or agent_id given."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_traces
|
||||
|
||||
mock_client = MagicMock()
|
||||
with pytest.raises(ValueError, match="Provide at least one of"):
|
||||
@@ -2446,7 +2692,6 @@ class TestEvaluateTraces:
|
||||
|
||||
async def test_response_ids_path(self) -> None:
|
||||
"""evaluate_traces with response_ids uses the responses API path."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_traces
|
||||
|
||||
mock_client = MagicMock()
|
||||
|
||||
@@ -2494,7 +2739,6 @@ class TestEvaluateTraces:
|
||||
|
||||
async def test_trace_ids_path(self) -> None:
|
||||
"""evaluate_traces with trace_ids builds azure_ai_traces data source."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_traces
|
||||
|
||||
mock_client = MagicMock()
|
||||
|
||||
@@ -2534,7 +2778,6 @@ class TestEvaluateTraces:
|
||||
class TestEvaluateFoundryTarget:
|
||||
async def test_happy_path(self) -> None:
|
||||
"""evaluate_foundry_target creates eval + run and polls to completion."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_foundry_target
|
||||
|
||||
mock_client = MagicMock()
|
||||
|
||||
@@ -2670,13 +2913,11 @@ class TestEvaluatorSetConsistency:
|
||||
"""Verify that _AGENT_EVALUATORS and _TOOL_EVALUATORS are subsets of _BUILTIN_EVALUATORS."""
|
||||
|
||||
def test_agent_evaluators_subset(self):
|
||||
from agent_framework_foundry._foundry_evals import _AGENT_EVALUATORS, _BUILTIN_EVALUATORS
|
||||
|
||||
diff = _AGENT_EVALUATORS - set(_BUILTIN_EVALUATORS.values())
|
||||
assert not diff, f"_AGENT_EVALUATORS has names not in _BUILTIN_EVALUATORS: {diff}"
|
||||
|
||||
def test_tool_evaluators_subset(self):
|
||||
from agent_framework_foundry._foundry_evals import _BUILTIN_EVALUATORS, _TOOL_EVALUATORS
|
||||
|
||||
diff = _TOOL_EVALUATORS - set(_BUILTIN_EVALUATORS.values())
|
||||
assert not diff, f"_TOOL_EVALUATORS has names not in _BUILTIN_EVALUATORS: {diff}"
|
||||
@@ -2690,7 +2931,6 @@ class TestEvaluatorSetConsistency:
|
||||
class TestEvaluateTracesAgentId:
|
||||
async def test_agent_id_only_path(self) -> None:
|
||||
"""evaluate_traces with agent_id only builds azure_ai_traces data source."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_traces
|
||||
|
||||
mock_client = MagicMock()
|
||||
|
||||
@@ -2748,7 +2988,6 @@ class TestFilterToolEvaluatorsRaises:
|
||||
class TestEvaluateFoundryTargetValidation:
|
||||
async def test_target_without_type_raises(self) -> None:
|
||||
"""target dict without 'type' key raises ValueError."""
|
||||
from agent_framework_foundry._foundry_evals import evaluate_foundry_target
|
||||
|
||||
mock_client = MagicMock()
|
||||
with pytest.raises(ValueError, match="'type' key"):
|
||||
|
||||