Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)

* Python: feat(evals): RubricScore type + EvalScoreResult.dimensions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): RubricDimension + GeneratedEvaluatorRef + accept in evaluators=

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(evals): parse rubric_scores from output items + assertion helpers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(evals): BaseAgent.as_eval_source / Workflow.as_eval_source

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): EvalGenerationSource + generate_rubric helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): YAML config loader + sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: fix(evals): address PR review feedback

Addresses 4 Copilot review comments on PR #6101:

1. assert_dimension_score_at_least: drop the (not evaluator or found_any) guard so require_applicable=True correctly raises when the named evaluator produces no entries for the dimension. Adds TestRubricAssertions covering the regression.

2. GeneratedEvaluatorRef docstring: reword to describe actual behaviour (pinning recommended, not required) so it matches the dataclass default and FoundryEvals warning path.

3. _poll_generation_job: switch from asyncio.get_event_loop() to get_running_loop() and bound the per-iteration sleep by remaining time, matching _poll_eval_run.

4. generate_rubric: type category as Literal['quality','safety'] and validate at the entry point with a ValueError; drop the silent 'invalid -> quality' rewrite in _generation_job_to_ref. Adds a regression test.
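
Item 3 can be illustrated with a minimal sketch. This is not the code from `_poll_generation_job` or `_poll_eval_run`; the names `poll_until_done` and `fetch_status` and the status strings are hypothetical and only show the pattern of using the running loop's clock and capping each sleep by the time left.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],  # hypothetical status getter
    *,
    poll_interval: float,
    timeout: float,
) -> str:
    """Poll until a terminal status is seen or the timeout elapses."""
    # get_running_loop() requires an active loop, which is always the case
    # inside a coroutine, so no implicit loop creation can happen here.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await fetch_status()
        if status in ("completed", "failed"):  # assumed terminal states
            return status
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        # Never sleep past the deadline, even if poll_interval is larger.
        await asyncio.sleep(min(poll_interval, remaining))
```

Capping the sleep means a large `poll_interval` cannot push the overall wait noticeably beyond `timeout`.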

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Python: feat(foundry-evals): hosted-agent-aware rubric generation

* Auto-detect hosted Foundry agents in agent_as_eval_source: when the
  agent's chat_client exposes a string agent_name (the convention used
  by RawFoundryAgentChatClient for PromptAgents/HostedAgents), emit a
  type='agent' EvalGenerationSource so the service fetches instructions
  and tools from the agent registry instead of relying on the local
  wrapper (which holds neither for hosted agents).
* Add hosted_agent_version kwarg and a new agent_version field on
  EvalGenerationSource so PromptAgent runs can pin to a specific hosted
  version for reproducible rubric generation.
* Add force_prompt_source escape hatch to bypass auto-detection and
  always emit a rendered prompt dossier - useful when the local wrapper
  carries overrides the service-side agent doesn't see.
* Fix _to_sdk_source for dataset sources: SDK ctor takes name=/version=,
  not dataset_name=/dataset_version=. The mismatch would raise TypeError
  against the real azure-ai-projects 2.3.0a* SDK; only unmocked
  integration paths were affected.

Tests cover: auto-detection happy path, versionless hosted agent,
explicit hosted_agent_version forwarding, force_prompt_source override,
non-string chat_client attrs (MagicMock test doubles) not mis-detected,
agent_version forwarded through _to_sdk_source, and the corrected
dataset SDK kwarg names.
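
The detection rule described above (a string `agent_name` on the chat client, with non-string attributes ignored) might look roughly like the sketch below. `detect_hosted_agent_name` is a hypothetical name, not the helper from this commit, and the real implementation may check more than this.

```python
from types import SimpleNamespace
from typing import Any, Optional
from unittest.mock import MagicMock


def detect_hosted_agent_name(agent: Any) -> Optional[str]:
    """Return the hosted agent name, or None when the agent looks local."""
    client = getattr(agent, "chat_client", None)
    name = getattr(client, "agent_name", None)
    # Only a non-empty str counts. A MagicMock auto-creates any attribute,
    # so a plain "is not None" check would mis-detect test doubles.
    if isinstance(name, str) and name:
        return name
    return None


hosted = SimpleNamespace(chat_client=SimpleNamespace(agent_name="my-agent"))
print(detect_hosted_agent_name(hosted))       # the string name
print(detect_hosted_agent_name(MagicMock()))  # None: attribute is a mock
```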

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(foundry-evals): accept canonical dimension_scores key per docs

The published Foundry rubric-evaluator output (Microsoft Learn 'Rubric evaluators' reference) places per-dimension breakdowns under properties.dimension_scores, not properties.rubric_scores. The parser now tries dimension_scores first and falls back to rubric_scores for preview-build compatibility, and tolerates non-list payloads (e.g. MagicMock auto-attrs) by trying the next candidate when parsing yields zero entries.
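
As a simplified illustration of the key fallback (the real parser also handles SDK objects and builds typed `RubricScore` values), the lookup order could be sketched like this. `pick_dimension_entries` is a hypothetical name, and treating an entry as well-formed when it is a dict with an `id` is an assumption made for brevity.

```python
from typing import Any

# Canonical key first, preview-build key second.
CANDIDATE_KEYS = ("dimension_scores", "rubric_scores")


def pick_dimension_entries(properties: dict[str, Any]) -> list[dict[str, Any]]:
    """Return entries from the first candidate key that yields any."""
    for key in CANDIDATE_KEYS:
        raw = properties.get(key)
        if not isinstance(raw, list):
            # Non-list payloads (None, mocks, scalars) are skipped, not fatal.
            continue
        entries = [e for e in raw if isinstance(e, dict) and "id" in e]
        if entries:
            return entries
        # Zero usable entries: fall through and try the next key.
    return []
```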

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(foundry-evals): add manual create_rubric_evaluator

Adds FoundryEvals.create_rubric_evaluator as the agent-framework surface over project_client.beta.evaluators.create_version. This is the manual counterpart to generate_rubric: callers supply RubricDimension instances (authored locally, ported from another framework, or hand-tuned) and we POST a RubricBasedEvaluatorDefinition. The service auto-attaches the non-editable residual dimension (general_quality for quality, general_policy_compliance for safety).

Per the Microsoft Learn 'Rubric evaluators' reference, the auto-generation path (create_generation_job) is primarily a portal/UI feature; external SDK clients with rich local agent context are better served by manual create_version. This keeps generate_rubric for users who want to round-trip through a Foundry-registered agent.

Validation up front: weight must be in [1,10], ids unique, descriptions non-empty, pass_threshold in [0,1]. The returned GeneratedEvaluatorRef is identical in shape to one obtained from generate_rubric, so downstream evaluators= lists work unchanged.
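
The up-front checks listed above could be sketched as follows. `Dimension` is a hypothetical stand-in for `RubricDimension` (whose exact fields are not shown in this commit message), and `validate_rubric` is an illustrative name; raising `ValueError` is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Dimension:  # hypothetical stand-in for RubricDimension
    id: str
    description: str
    weight: int


def validate_rubric(dimensions: Sequence[Dimension], pass_threshold: float) -> None:
    """Raise ValueError on the first rule violation; return None if valid."""
    if not 0.0 <= pass_threshold <= 1.0:
        raise ValueError(f"pass_threshold must be in [0, 1], got {pass_threshold}")
    seen: set[str] = set()
    for dim in dimensions:
        if not 1 <= dim.weight <= 10:
            raise ValueError(f"weight for '{dim.id}' must be in [1, 10], got {dim.weight}")
        if not dim.description.strip():
            raise ValueError(f"description for '{dim.id}' must be non-empty")
        if dim.id in seen:
            raise ValueError(f"duplicate dimension id '{dim.id}'")
        seen.add(dim.id)
```

Failing before any network call keeps malformed rubrics from reaching the service at all.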

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* samples(foundry-evals): manual rubric sample + namespace re-exports

Adds evaluate_with_manual_rubric_sample.py demonstrating the end-to-end dev scenario for FoundryEvals.create_rubric_evaluator: hand-author a list of RubricDimension, register via create_rubric_evaluator, then use the pinned GeneratedEvaluatorRef alongside built-in evaluators in an agent regression run.

Also re-exports RubricDimension, GeneratedEvaluatorRef, build_sources, and load_evals_config from agent_framework.foundry (both the lazy runtime shim and the type stub) so the rubric samples can import everything from a single namespace; the auto-generate sample was previously broken because the shim was missing build_sources / load_evals_config.

Updates the foundry-evals README with a chooser entry for the two rubric paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(foundry-evals): remove rubric creation flows; keep consumption only

Reframes agent-framework as a pure consumer of Foundry rubric evaluators: scoring against rubrics that already exist (authored in the Foundry portal or via the dedicated SDK / REST surface) instead of creating them from the SDK.

Removed creation surface area:

- FoundryEvals.generate_rubric (auto-generate path) and create_rubric_evaluator (manual path), plus all _GenerationSdkTypes / _ManualRubricSdkTypes / _to_sdk_dimensions / _coalesce_generation_sources / _to_sdk_source / _poll_generation_job / _generation_job_to_ref / _evaluator_version_to_ref / _get_beta_evaluators / _import_*_sdk_types helpers.

- EvalGenerationSource (the input source discriminator), RubricDimension (the input dimension type), agent_as_eval_source / workflow_as_eval_source / _detect_hosted_foundry_agent helpers, and the YAML-config loader (_evals_config.py with RubricGenerationSpec / RubricSourceSpec / parse_evals_config / load_evals_config / build_sources).

- BaseAgent.as_eval_source / Workflow.as_eval_source plus the _render_agent_dossier / _render_workflow_dossier helpers in core. These existed only to feed the now-removed generation pipeline.

- Samples evaluate_with_generated_rubric_sample.py, evaluate_with_manual_rubric_sample.py, and evaluators.yaml. Replaced with a short README section showing how to reference an existing rubric evaluator via GeneratedEvaluatorRef.

Kept (consumption surface):

- GeneratedEvaluatorRef, slimmed to (name, version, display_name). Still accepted alongside built-in evaluator strings in FoundryEvals(evaluators=[...]). Versionless refs still warn.

- RubricScore on EvalScoreResult.dimensions plus EvalResults.assert_dimension_score_at_least for per-dimension CI gates.

- _parse_dimension_entries / _extract_rubric_scores output parsing (both canonical dimension_scores and the legacy rubric_scores key).
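
To illustrate the retained reference type, here is a self-contained sketch of how a pinned or versionless reference maps onto a testing-criteria entry. `EvaluatorRef` is a local stand-in for `GeneratedEvaluatorRef`, `ref_to_criterion` is a hypothetical name, and the evaluator name, version, and deployment name in the usage are made up; the entry keys mirror the `_build_testing_criteria` change in this commit.

```python
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class EvaluatorRef:  # stand-in for GeneratedEvaluatorRef
    name: str
    version: Optional[str] = None
    display_name: Optional[str] = None


def ref_to_criterion(ref: EvaluatorRef, model: str) -> dict[str, Any]:
    """Build one testing-criteria entry for an existing rubric evaluator."""
    entry: dict[str, Any] = {
        "type": "azure_ai_evaluator",
        "name": ref.display_name or ref.name,
        "evaluator_name": ref.name,
        "initialization_parameters": {"deployment_name": model},
    }
    if ref.version is not None:
        entry["evaluator_version"] = ref.version
    else:
        # Versionless refs still work but are not reproducible.
        logger.warning("EvaluatorRef '%s' has no pinned version", ref.name)
    return entry


pinned = ref_to_criterion(EvaluatorRef("reservation-policy-rubric", "3"), "judge-deployment")
latest = ref_to_criterion(EvaluatorRef("reservation-policy-rubric"), "judge-deployment")
```

Only the pinned entry carries `evaluator_version`, which is why CI gates should prefer it.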

Tests: 160/160 foundry unit tests and 71/71 core local-eval tests pass; pyright is clean across changed files. The pre-existing tests/core/test_telemetry.py::test_detect_hosted_fallback_import_error failure is unrelated and reproduces on the prior commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* samples(foundry-evals): add evaluate_with_rubric_sample

Adds a runnable end-to-end sample showing how to consume a pre-existing rubric evaluator created in Foundry: reference it with GeneratedEvaluatorRef(name, version), mix it with built-in evaluators in FoundryEvals, and gate CI with assert_dimension_score_at_least on a specific dimension.
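
The per-dimension gate can be understood through the simplified sketch below. It does not import agent-framework; `DimScore` and `dimension_problems` are hypothetical stand-ins that follow the skip and `require_applicable` rules of `assert_dimension_score_at_least`, without the evaluator filter or sub-result recursion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimScore:  # trimmed stand-in for RubricScore
    id: str
    score: Optional[int]
    applicable: bool


def dimension_problems(
    items: dict[str, list[DimScore]],
    dimension_id: str,
    min_score: float,
    *,
    require_applicable: bool = False,
) -> list[str]:
    """Collect per-item problems for one rubric dimension."""
    problems: list[str] = []
    for item_id, dims in items.items():
        found_applicable = False
        for dim in dims:
            if dim.id != dimension_id or not dim.applicable:
                continue  # other dimensions and non-applicable scores are skipped
            found_applicable = True
            if dim.score is None or dim.score < min_score:
                problems.append(f"{item_id}/{dimension_id}={dim.score}")
        if require_applicable and not found_applicable:
            problems.append(f"{item_id}/{dimension_id}=missing")
    return problems
```

A CI job would fail when the returned list is non-empty; the real helper raises `EvalNotPassedError` instead.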

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(foundry-evals): satisfy mypy on _fetch_output_items

mypy infers OutputItemListResponse.sample as dict[str, object] | None while pyright correctly infers the typed Sample model. Cast to Any so both type checkers accept the attribute access pattern, rename the local to avoid shadowing the inner-loop sample binding, and drop the now-stale pyright suppressions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(foundry-evals): drop unpublished rubric-evaluators learn.microsoft.com link

The Adaptive Evals authoring docs are not yet published on Microsoft Learn, so the link 404s. Keep the descriptive text without the broken hyperlink; we can re-add it once the docs ship.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(foundry-evals): hoist repeated local imports to module top

Per code review feedback (eavanvalkenburg): the test file repeated 'from agent_framework_foundry._foundry_evals import ...' inside 22 test bodies and 'from agent_framework_foundry import GeneratedEvaluatorRef' inside 8 more. Move all of them to the existing top-level imports; the symbols are the same across tests and the local imports were redundant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Ben Thomas <25218250+alliscode@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ben Thomas
2026-06-01 16:01:56 -07:00
committed by GitHub
parent f36096ce1a
commit e0d0ad16a0
11 changed files with 951 additions and 54 deletions
@@ -71,6 +71,7 @@ from ._evaluation import (
Evaluator,
ExpectedToolCall,
LocalEvaluator,
RubricScore,
evaluate_agent,
evaluate_workflow,
evaluator,
@@ -460,6 +461,7 @@ __all__ = [
"ResponseStream",
"Role",
"RoleLiteral",
"RubricScore",
"RunContext",
"Runner",
"RunnerContext",
@@ -311,12 +311,15 @@ class EvalScoreResult:
score: Numeric score from the evaluator.
passed: Whether the item passed this evaluator's threshold.
sample: Optional raw evaluator output (rationale, metadata).
dimensions: Per-dimension scores when this evaluator is a rubric
evaluator. ``None`` for non-rubric (e.g. built-in) evaluators.
"""
name: str
score: float
passed: bool | None = None
sample: dict[str, Any] | None = None
dimensions: list[RubricScore] | None = None
@experimental(feature_id=ExperimentalFeature.EVALS)
@@ -496,6 +499,179 @@ class EvalResults:
detail += f" Errored items: {', '.join(summaries)}."
raise EvalNotPassedError(detail)
def assert_score_at_least(
self,
min_score: float,
*,
evaluator: str | None = None,
msg: str | None = None,
) -> None:
"""Assert every item's score (optionally filtered by evaluator) is ``>= min_score``.
Designed for CI gates on generated rubric evaluators (e.g.
``results.assert_score_at_least(0.80)``). Includes any
sub-results from workflow evaluations.
Args:
min_score: Minimum acceptable score (inclusive).
evaluator: When set, only check scores from the evaluator
whose ``EvalScoreResult.name`` matches.
msg: Optional custom failure message.
Raises:
EvalNotPassedError: When any matching score is below the threshold.
"""
offenders: list[str] = []
def _check(results: EvalResults) -> None:
for item in results.items:
for score in item.scores:
if evaluator is not None and score.name != evaluator:
continue
if score.score < min_score:
offenders.append(f"{item.item_id}/{score.name}={score.score:.3f}")
for sub in results.sub_results.values():
_check(sub)
_check(self)
if offenders:
detail = msg or (
f"{len(offenders)} score(s) below threshold {min_score}"
f"{' for ' + evaluator if evaluator else ''}: {', '.join(offenders[:5])}"
+ (f" (+{len(offenders) - 5} more)" if len(offenders) > 5 else "")
)
raise EvalNotPassedError(detail)
def assert_dimension_score_at_least(
self,
dimension_id: str,
min_score: float,
*,
evaluator: str | None = None,
require_applicable: bool = False,
msg: str | None = None,
) -> None:
"""Assert every item's score for a rubric *dimension* is ``>= min_score``.
Walks ``EvalScoreResult.dimensions`` looking for the named
dimension across all items (and sub-results). Non-applicable
dimensions are skipped by default; pass
``require_applicable=True`` to fail when no applicable score is
produced.
Args:
dimension_id: Dimension id (matches the rubric definition).
min_score: Minimum acceptable dimension score (inclusive).
evaluator: When set, only consider scores from the evaluator
whose ``EvalScoreResult.name`` matches.
require_applicable: When ``True``, missing or non-applicable
dimension scores raise. Defaults to ``False`` (skip).
msg: Optional custom failure message.
Raises:
EvalNotPassedError: When the dimension fails the threshold.
"""
offenders: list[str] = []
missing_items: list[str] = []
def _check(results: EvalResults) -> None:
for item in results.items:
found_applicable = False
for score in item.scores:
if evaluator is not None and score.name != evaluator:
continue
if not score.dimensions:
continue
for rs in score.dimensions:
if rs.id != dimension_id:
continue
if not rs.applicable:
continue
found_applicable = True
if rs.score is None or rs.score < min_score:
offenders.append(
f"{item.item_id}/{score.name}/{dimension_id}="
f"{rs.score if rs.score is not None else 'None'}"
)
if require_applicable and not found_applicable:
missing_items.append(item.item_id)
for sub in results.sub_results.values():
_check(sub)
_check(self)
problems: list[str] = []
if offenders:
problems.append(
f"{len(offenders)} dimension score(s) for '{dimension_id}' below {min_score}: "
f"{', '.join(offenders[:5])}" + (f" (+{len(offenders) - 5} more)" if len(offenders) > 5 else "")
)
if missing_items:
problems.append(
f"Dimension '{dimension_id}' not applicable on {len(missing_items)} item(s): "
f"{', '.join(missing_items[:5])}"
)
if problems:
raise EvalNotPassedError(msg or "; ".join(problems))
def assert_no_failed_items(self, msg: str | None = None) -> None:
"""Assert no item ended in ``fail`` or ``error`` status.
Includes any sub-results from workflow evaluations.
Args:
msg: Optional custom failure message.
Raises:
EvalNotPassedError: When any item failed or errored.
"""
bad: list[str] = []
def _check(results: EvalResults) -> None:
for item in results.items:
if item.is_failed or item.is_error:
bad.append(f"{item.item_id}:{item.status}")
for sub in results.sub_results.values():
_check(sub)
_check(self)
if bad:
detail = msg or (
f"{len(bad)} item(s) failed or errored: {', '.join(bad[:5])}"
+ (f" (+{len(bad) - 5} more)" if len(bad) > 5 else "")
)
raise EvalNotPassedError(detail)
# endregion
# region Generated rubric evaluators
@experimental(feature_id=ExperimentalFeature.EVALS)
@dataclass(frozen=True)
class RubricScore:
"""A single dimension's score from a rubric-based evaluator run.
Rubric evaluators emit one ``RubricScore`` per dimension per item.
Attached to :class:`EvalScoreResult` as a typed view of the raw
``properties.dimension_scores`` payload (``rubric_scores`` in preview
builds) returned by providers such as Foundry's rubric evaluators.
Attributes:
id: Dimension id (matches the rubric definition).
score: Numeric score, or ``None`` when the dimension was marked
non-applicable for this item.
applicable: Whether the dimension applied to this item.
weight: Dimension weight (mirrors the rubric definition).
reason: Short rationale produced by the evaluator.
"""
id: str
score: int | None
applicable: bool
weight: int
reason: str
# endregion
@@ -34,6 +34,7 @@ _IMPORTS: dict[str, tuple[str, str]] = {
"FoundryLocalChatOptions": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
"FoundryLocalClient": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
"FoundryLocalSettings": ("agent_framework_foundry_local", "agent-framework-foundry-local"),
"GeneratedEvaluatorRef": ("agent_framework_foundry", "agent-framework-foundry"),
"RawAnthropicFoundryClient": ("agent_framework_anthropic", "agent-framework-anthropic"),
"RawFoundryAgent": ("agent_framework_foundry", "agent-framework-foundry"),
"RawFoundryAgentChatClient": ("agent_framework_foundry", "agent-framework-foundry"),
@@ -20,6 +20,7 @@ from agent_framework_foundry import (
FoundryEmbeddingSettings,
FoundryEvals,
FoundryMemoryProvider,
GeneratedEvaluatorRef,
RawFoundryAgent,
RawFoundryAgentChatClient,
RawFoundryChatClient,
@@ -52,6 +53,7 @@ __all__ = [
"FoundryLocalClient",
"FoundryLocalSettings",
"FoundryMemoryProvider",
"GeneratedEvaluatorRef",
"RawAnthropicFoundryClient",
"RawFoundryAgent",
"RawFoundryAgentChatClient",
@@ -11,8 +11,13 @@ import pytest
from agent_framework._evaluation import (
CheckResult,
EvalItem,
EvalItemResult,
EvalNotPassedError,
EvalResults,
EvalScoreResult,
ExpectedToolCall,
LocalEvaluator,
RubricScore,
_coerce_result,
evaluator,
keyword_check,
@@ -1010,19 +1015,101 @@ class TestAllPassedSubResults:
# ---------------------------------------------------------------------------
# r5 review: _build_overall_item with empty outputs
# Rubric assertions (EvalResults.assert_*)
# ---------------------------------------------------------------------------
class TestBuildOverallItemEmpty:
"""Test _build_overall_item returns None for empty workflow outputs."""
def _rubric_results(*scores_per_item: list[EvalScoreResult]) -> EvalResults:
items = [
EvalItemResult(item_id=f"item-{i}", status="pass", scores=scores) for i, scores in enumerate(scores_per_item)
]
return EvalResults(
provider="test",
eval_id="ev1",
run_id="run1",
result_counts={"passed": len(items), "failed": 0, "errored": 0, "total": len(items)},
items=items,
)
def test_returns_none_for_empty_outputs(self):
from unittest.mock import MagicMock
from agent_framework._evaluation import _build_overall_item
class TestRubricAssertions:
"""Tests for EvalResults.assert_dimension_score_at_least."""
mock_result = MagicMock()
mock_result.get_outputs.return_value = []
item = _build_overall_item("Hello", mock_result)
assert item is None
def test_dimension_at_or_above_threshold_passes(self) -> None:
results = _rubric_results(
[
EvalScoreResult(
name="policy",
score=0.9,
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
)
],
)
# Should not raise.
results.assert_dimension_score_at_least("clarity", 3)
def test_dimension_below_threshold_raises(self) -> None:
results = _rubric_results(
[
EvalScoreResult(
name="policy",
score=0.5,
dimensions=[RubricScore(id="clarity", score=2, applicable=True, weight=1, reason="")],
)
],
)
with pytest.raises(EvalNotPassedError):
results.assert_dimension_score_at_least("clarity", 3)
def test_non_applicable_skipped_by_default(self) -> None:
results = _rubric_results(
[
EvalScoreResult(
name="policy",
score=1.0,
dimensions=[RubricScore(id="clarity", score=None, applicable=False, weight=1, reason="n/a")],
)
],
)
# No applicable scores; default behaviour is to skip silently.
results.assert_dimension_score_at_least("clarity", 3)
def test_require_applicable_raises_when_dimension_absent(self) -> None:
results = _rubric_results(
[EvalScoreResult(name="policy", score=1.0, dimensions=[])],
)
with pytest.raises(EvalNotPassedError, match="not applicable"):
results.assert_dimension_score_at_least("clarity", 3, require_applicable=True)
def test_require_applicable_raises_when_filtered_evaluator_missing(self) -> None:
# Regression: previously the (not evaluator or found_any) guard caused
# this case to silently pass even with require_applicable=True.
results = _rubric_results(
[
EvalScoreResult(
name="other",
score=0.9,
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
)
],
)
with pytest.raises(EvalNotPassedError, match="not applicable"):
results.assert_dimension_score_at_least("clarity", 3, evaluator="policy", require_applicable=True)
def test_evaluator_filter_isolates_offenders(self) -> None:
results = _rubric_results(
[
EvalScoreResult(
name="other",
score=0.1,
dimensions=[RubricScore(id="clarity", score=1, applicable=True, weight=1, reason="")],
),
EvalScoreResult(
name="policy",
score=0.9,
dimensions=[RubricScore(id="clarity", score=4, applicable=True, weight=1, reason="")],
),
],
)
# The low-scoring "other" evaluator is filtered out; "policy" passes.
results.assert_dimension_score_at_least("clarity", 3, evaluator="policy")
@@ -12,6 +12,7 @@ from ._embedding_client import (
)
from ._foundry_evals import (
FoundryEvals,
GeneratedEvaluatorRef,
evaluate_foundry_target,
evaluate_traces,
)
@@ -33,6 +34,7 @@ __all__ = [
"FoundryEmbeddingSettings",
"FoundryEvals",
"FoundryMemoryProvider",
"GeneratedEvaluatorRef",
"RawFoundryAgent",
"RawFoundryAgentChatClient",
"RawFoundryChatClient",
@@ -28,8 +28,9 @@ from __future__ import annotations
import asyncio
import logging
from collections.abc import Sequence
from typing import TYPE_CHECKING, Any
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from typing import TYPE_CHECKING, Any, cast
from agent_framework._evaluation import (
AgentEvalConverter,
@@ -39,6 +40,7 @@ from agent_framework._evaluation import (
EvalItemResult,
EvalResults,
EvalScoreResult,
RubricScore,
)
from agent_framework._feature_stage import ExperimentalFeature, experimental
from openai import AsyncOpenAI
@@ -51,6 +53,54 @@ if TYPE_CHECKING:
logger = logging.getLogger(__name__)
# region Generated rubric evaluator references
@experimental(feature_id=ExperimentalFeature.EVALS)
@dataclass(frozen=True)
class GeneratedEvaluatorRef:
"""A reference to a rubric evaluator that already exists in Foundry.
Pass instances of this class to :class:`FoundryEvals` to score items
with a pre-existing rubric evaluator (manually authored or
auto-generated through the Foundry portal). agent-framework is a
consumer here: it does not create or modify the evaluator definition;
it only references the persisted version by name.
Pinning ``version`` is strongly recommended so evaluation runs are
reproducible. ``version=None`` resolves to whichever version is
current at execution time; :class:`FoundryEvals` emits a warning when
a versionless reference is used. CI gates should always pass a
concrete version.
Attributes:
name: Evaluator name as stored in the Foundry project (for
example ``"reservation-policy-rubric"``). Distinct from
built-in evaluators such as ``"builtin.relevance"``.
version: Pinned evaluator version. ``None`` means "latest";
this is discouraged for CI/repro, and :class:`FoundryEvals`
will emit a warning when used.
display_name: Optional human-readable name used in result
summaries. Defaults to ``name`` when unset.
"""
name: str
version: str | None = None
display_name: str | None = None
@classmethod
def latest(cls, name: str, *, display_name: str | None = None) -> GeneratedEvaluatorRef:
"""Construct a versionless reference (resolves to the latest version at run time).
Discouraged for reproducible runs. Prefer the constructor with
an explicit ``version`` so CI and replay evaluations stay stable
when the evaluator is updated in Foundry.
"""
return cls(name=name, version=None, display_name=display_name)
# endregion
# Agent evaluators that accept query/response as conversation arrays.
# Maintained manually — check https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk
# for the latest evaluator list. These are the evaluators that need conversation-format input.
@@ -166,7 +216,7 @@ def _resolve_evaluator(name: str) -> str:
def _build_testing_criteria(
evaluators: Sequence[str],
evaluators: Sequence[str | GeneratedEvaluatorRef],
model: str,
*,
include_data_mapping: bool = False,
@@ -175,7 +225,9 @@ def _build_testing_criteria(
"""Build ``testing_criteria`` for ``evals.create()``.
Args:
evaluators: Evaluator names.
evaluators: Evaluator names (built-in short names or fully-qualified
``builtin.*`` names) or :class:`GeneratedEvaluatorRef`
instances for generated rubric evaluators.
model: Model deployment for the LLM judge.
include_data_mapping: Whether to include field-level data mapping
(required for the JSONL data source, not needed for response-based).
@@ -183,7 +235,38 @@ def _build_testing_criteria(
definitions.
"""
criteria: list[dict[str, Any]] = []
for name in evaluators:
for entry_spec in evaluators:
if isinstance(entry_spec, GeneratedEvaluatorRef):
short = entry_spec.display_name or entry_spec.name
ref_entry: dict[str, Any] = {
"type": "azure_ai_evaluator",
"name": short,
"evaluator_name": entry_spec.name,
"initialization_parameters": {"deployment_name": model},
}
if entry_spec.version is not None:
ref_entry["evaluator_version"] = entry_spec.version
else:
logger.warning(
"GeneratedEvaluatorRef '%s' has no pinned version; the eval run "
"will resolve to whichever version is current at execution time. "
"Pin the version for reproducible runs.",
entry_spec.name,
)
if include_data_mapping:
# Rubric evaluators accept conversation arrays like agent
# evaluators, plus tool_definitions when items are tool-aware.
ref_mapping: dict[str, str] = {
"query": "{{item.query_messages}}",
"response": "{{item.response_messages}}",
}
if include_tool_definitions:
ref_mapping["tool_definitions"] = "{{item.tool_definitions}}"
ref_entry["data_mapping"] = ref_mapping
criteria.append(ref_entry)
continue
name = entry_spec
qualified = _resolve_evaluator(name)
short = name if not name.startswith("builtin.") else name.split(".")[-1]
@@ -247,9 +330,9 @@ def _build_item_schema(
def _resolve_default_evaluators(
evaluators: Sequence[str] | None,
evaluators: Sequence[str | GeneratedEvaluatorRef] | None,
items: Sequence[EvalItem | dict[str, Any]] | None = None,
) -> list[str]:
) -> list[str | GeneratedEvaluatorRef]:
"""Resolve evaluators, applying defaults when ``None``.
Defaults to relevance + coherence + task_adherence. Automatically adds
@@ -258,7 +341,7 @@ def _resolve_default_evaluators(
if evaluators is not None:
return list(evaluators)
result = list(_DEFAULT_EVALUATORS)
result: list[str | GeneratedEvaluatorRef] = list(_DEFAULT_EVALUATORS)
if items is not None:
has_tools = any((item.tools if isinstance(item, EvalItem) else item.get("tool_definitions")) for item in items)
if has_tools:
@@ -267,14 +350,24 @@ def _resolve_default_evaluators(
def _filter_tool_evaluators(
evaluators: list[str],
evaluators: list[str | GeneratedEvaluatorRef],
items: Sequence[EvalItem | dict[str, Any]],
) -> list[str]:
"""Remove tool evaluators if no items have tool definitions."""
) -> list[str | GeneratedEvaluatorRef]:
"""Remove tool evaluators if no items have tool definitions.
Generated rubric evaluators are tool-aware but not tool-required; they
are preserved regardless of whether items carry tool definitions.
"""
has_tools = any((item.tools if isinstance(item, EvalItem) else item.get("tool_definitions")) for item in items)
if has_tools:
return evaluators
filtered = [e for e in evaluators if _resolve_evaluator(e) not in _TOOL_EVALUATORS]
def _is_tool_only(spec: str | GeneratedEvaluatorRef) -> bool:
if isinstance(spec, GeneratedEvaluatorRef):
return False
return _resolve_evaluator(spec) in _TOOL_EVALUATORS
filtered = [e for e in evaluators if not _is_tool_only(e)]
if not filtered:
raise ValueError(
f"All requested evaluators {evaluators} require tool definitions, "
@@ -282,7 +375,7 @@ def _filter_tool_evaluators(
"or choose evaluators that do not require tools."
)
if len(filtered) < len(evaluators):
removed = [e for e in evaluators if _resolve_evaluator(e) in _TOOL_EVALUATORS]
removed = [e for e in evaluators if _is_tool_only(e)]
logger.info("Removed tool evaluators %s (no items have tools)", removed)
return filtered
@@ -354,6 +447,114 @@ def _extract_per_evaluator(run: RunRetrieveResponse) -> dict[str, dict[str, int]
return per_eval
_RUBRIC_DIMENSION_KEYS: tuple[str, ...] = ("dimension_scores", "rubric_scores")
"""Property keys that may carry per-dimension rubric breakdowns.
The published Foundry rubric-evaluator output format uses
``properties.dimension_scores`` (see the Microsoft Learn "Rubric
evaluators" reference). Earlier preview builds and some SDK shapes
used ``rubric_scores``; we accept both for defensive forward/backward
compatibility.
"""
def _parse_dimension_entries(raw: Any) -> list[RubricScore]:
"""Parse a raw list-like payload into ``RubricScore`` instances.
Returns an empty list when ``raw`` is falsy, not iterable, or
contains no well-formed entries.
"""
if not raw:
return []
try:
raw_iter: Iterable[Any] = iter(raw)
except TypeError:
return []
parsed: list[RubricScore] = []
for raw_entry in raw_iter:
entry: Any = raw_entry
try:
rid: Any
score_val: Any
applicable: Any
weight: Any
reason: Any
if isinstance(entry, dict):
entry_any = cast("dict[str, Any]", entry)
rid = entry_any.get("id")
score_val = entry_any.get("score")
applicable = entry_any.get("applicable")
weight = entry_any.get("weight")
reason = entry_any.get("reason", "")
else:
rid = getattr(entry, "id", None)
score_val = getattr(entry, "score", None)
applicable = getattr(entry, "applicable", None)
weight = getattr(entry, "weight", None)
reason = getattr(entry, "reason", "") or ""
if rid is None or weight is None or applicable is None:
continue
parsed.append(
RubricScore(
id=str(rid),
score=int(score_val) if isinstance(score_val, (int, float)) else None,
applicable=bool(applicable),
weight=int(weight),
reason=str(reason) if reason is not None else "",
)
)
except (TypeError, ValueError):
logger.debug("Skipping malformed rubric dimension entry: %s", cast("Any", entry), exc_info=True)
return parsed
def _extract_rubric_scores(sample: Any) -> list[RubricScore] | None:
"""Extract typed ``RubricScore`` instances from an evaluator's raw sample payload.
Foundry rubric evaluators include a per-dimension breakdown under
``properties.dimension_scores`` on each result (preview builds used
``rubric_scores``; both keys are accepted, with the canonical
``dimension_scores`` taking priority). The exact location may
vary across SDK versions, so this helper accepts a few shapes:
* The SDK ``sample`` object exposes
``properties.dimension_scores`` / ``properties.rubric_scores``.
* The ``sample`` is a dict containing the same under
``properties.<key>``.
* The ``sample`` is a dict with ``dimension_scores`` /
``rubric_scores`` at the top level.
Returns ``None`` when no rubric scores are present (i.e. the
evaluator was not a rubric evaluator).
"""
if sample is None:
return None
containers: list[Any] = []
properties: Any = getattr(sample, "properties", None)
if properties is not None:
containers.append(properties)
if isinstance(sample, dict):
sample_any = cast("dict[str, Any]", sample)
props_dict: Any = sample_any.get("properties")
if props_dict is not None and props_dict is not properties:
containers.append(props_dict)
containers.append(sample_any)
for container in containers:
for key in _RUBRIC_DIMENSION_KEYS:
raw: Any = None
if isinstance(container, dict):
raw = cast("dict[str, Any]", container).get(key)
elif hasattr(container, key):
raw = getattr(container, key, None)
parsed = _parse_dimension_entries(raw)
if parsed:
return parsed
return None
async def _fetch_output_items(
client: AsyncOpenAI,
eval_id: str,
@@ -377,12 +578,15 @@ async def _fetch_output_items(
# Extract per-evaluator scores
scores: list[EvalScoreResult] = []
for r in oi.results or []:
sample = r.sample
dimensions = _extract_rubric_scores(sample)
scores.append(
EvalScoreResult(
name=r.name,
score=r.score,
passed=r.passed,
sample=r.sample,
sample=sample,
dimensions=dimensions,
)
)
@@ -394,15 +598,18 @@ async def _fetch_output_items(
output_text: str | None = None
response_id: str | None = None
sample = oi.sample
if sample is not None: # pyright: ignore[reportUnnecessaryComparison]
err = sample.error
if err is not None and (err.code or err.message): # pyright: ignore[reportUnnecessaryComparison]
# mypy infers oi.sample as dict[str, object] | None, but the
# OpenAI SDK actually returns a typed Sample model. Annotate as Any so
# both mypy and pyright accept the attribute access pattern.
oi_sample: Any = oi.sample
if oi_sample is not None:
err = oi_sample.error
if err is not None and (err.code or err.message):
error_code = err.code or None
error_message = err.message or None
usage = sample.usage
if usage is not None and usage.total_tokens: # pyright: ignore[reportUnnecessaryComparison]
usage = oi_sample.usage
if usage is not None and usage.total_tokens:
token_usage = {
"prompt_tokens": usage.prompt_tokens,
"completion_tokens": usage.completion_tokens,
@@ -411,13 +618,13 @@ async def _fetch_output_items(
}
# Extract input/output text
if sample.input:
parts = [si.content for si in sample.input if si.role == "user"]
if oi_sample.input:
parts = [si.content for si in oi_sample.input if si.role == "user"]
if parts:
input_text = " ".join(parts)
if sample.output:
parts = [so.content or "" for so in sample.output if so.role == "assistant"]
if oi_sample.output:
parts = [so.content or "" for so in oi_sample.output if so.role == "assistant"]
if parts:
output_text = " ".join(parts)
@@ -472,7 +679,7 @@ async def _evaluate_via_responses_impl(
*,
client: AsyncOpenAI,
response_ids: Sequence[str],
evaluators: list[str],
evaluators: list[str | GeneratedEvaluatorRef],
model: str,
eval_name: str,
poll_interval: float,
@@ -573,8 +780,11 @@ class FoundryEvals:
(from ``azure.ai.projects.aio``). Provide this or *client*.
model: Model deployment name for the evaluator LLM judge.
Resolved from ``client.model`` when omitted.
evaluators: Evaluator names (e.g. ``["relevance", "tool_call_accuracy"]``).
When ``None`` (default), uses smart defaults based on item data.
evaluators: Evaluator specifications. Entries may be built-in
short names (e.g. ``"relevance"``), fully-qualified
``"builtin.*"`` names, or :class:`GeneratedEvaluatorRef`
instances for previously generated rubric evaluators. When
``None`` (default), uses smart defaults based on item data.
conversation_split: How to split multi-turn conversations into
query/response halves. Defaults to ``LAST_TURN``. Pass a
``ConversationSplit`` enum value or a custom callable — see
@@ -623,7 +833,7 @@ class FoundryEvals:
client: FoundryChatClient | None = None,
project_client: AIProjectClient | None = None,
model: str | None = None,
evaluators: Sequence[str] | None = None,
evaluators: Sequence[str | GeneratedEvaluatorRef] | None = None,
conversation_split: ConversationSplitter = ConversationSplit.LAST_TURN,
poll_interval: float = 5.0,
timeout: float = 180.0,
@@ -642,7 +852,9 @@ class FoundryEvals:
"Model is required. Pass model= explicitly or use a FoundryChatClient that has a model configured."
)
self._model = resolved_model
self._evaluators = list(evaluators) if evaluators is not None else None
self._evaluators: list[str | GeneratedEvaluatorRef] | None = (
list(evaluators) if evaluators is not None else None
)
self._conversation_split = conversation_split
self._poll_interval = poll_interval
self._timeout = timeout
@@ -678,7 +890,7 @@ class FoundryEvals:
async def _evaluate_via_dataset(
self,
items: Sequence[EvalItem],
evaluators: list[str],
evaluators: list[str | GeneratedEvaluatorRef],
eval_name: str,
) -> EvalResults:
"""Evaluate using JSONL dataset upload path."""
@@ -25,16 +25,25 @@ from agent_framework._evaluation import (
from agent_framework._workflows._workflow import WorkflowRunResult
from openai import AsyncOpenAI
from agent_framework_foundry import GeneratedEvaluatorRef
from agent_framework_foundry._foundry_evals import (
_AGENT_EVALUATORS,
_BUILTIN_EVALUATORS,
_TOOL_EVALUATORS,
FoundryEvals,
_build_item_schema,
_build_testing_criteria,
_extract_per_evaluator,
_extract_result_counts,
_extract_rubric_scores,
_fetch_output_items,
_filter_tool_evaluators,
_poll_eval_run,
_resolve_default_evaluators,
_resolve_evaluator,
_resolve_openai_client,
evaluate_foundry_target,
evaluate_traces,
)
@@ -806,6 +815,67 @@ class TestBuildTestingCriteria:
for c in criteria:
assert "tool_definitions" in c["data_mapping"], f"{c['name']} missing tool_definitions"
def test_generated_evaluator_ref_pinned_version(self) -> None:
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
criteria = _build_testing_criteria([ref], "gpt-4o", include_data_mapping=True)
assert len(criteria) == 1
c = criteria[0]
assert c["type"] == "azure_ai_evaluator"
assert c["evaluator_name"] == "my-rubric"
assert c["evaluator_version"] == "1"
assert c["name"] == "my-rubric"
assert c["initialization_parameters"] == {"deployment_name": "gpt-4o"}
assert c["data_mapping"] == {
"query": "{{item.query_messages}}",
"response": "{{item.response_messages}}",
}
def test_generated_evaluator_ref_display_name_used_as_short(self) -> None:
ref = GeneratedEvaluatorRef(name="my-rubric", version="2", display_name="My Rubric")
criteria = _build_testing_criteria([ref], "gpt-4o")
assert criteria[0]["name"] == "My Rubric"
assert criteria[0]["evaluator_name"] == "my-rubric"
def test_generated_evaluator_ref_tool_definitions_added(self) -> None:
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
criteria = _build_testing_criteria(
[ref],
"gpt-4o",
include_data_mapping=True,
include_tool_definitions=True,
)
assert criteria[0]["data_mapping"]["tool_definitions"] == "{{item.tool_definitions}}"
def test_generated_evaluator_ref_unpinned_warns(self, caplog: pytest.LogCaptureFixture) -> None:
import logging
ref = GeneratedEvaluatorRef.latest("my-rubric")
with caplog.at_level(logging.WARNING, logger="agent_framework_foundry._foundry_evals"):
criteria = _build_testing_criteria([ref], "gpt-4o")
assert "evaluator_version" not in criteria[0]
assert any("no pinned version" in r.message for r in caplog.records)
def test_generated_evaluator_ref_mixed_with_builtins(self) -> None:
ref = GeneratedEvaluatorRef(name="my-rubric", version="1")
criteria = _build_testing_criteria(
["relevance", ref, "task_adherence"],
"gpt-4o",
include_data_mapping=True,
)
assert [c["name"] for c in criteria] == ["relevance", "my-rubric", "task_adherence"]
assert criteria[0]["evaluator_name"] == "builtin.relevance"
assert criteria[1]["evaluator_name"] == "my-rubric"
assert criteria[2]["evaluator_name"] == "builtin.task_adherence"
# ---------------------------------------------------------------------------
# _build_item_schema
@@ -1263,6 +1333,29 @@ class TestFilterToolEvaluators:
items,
)
def test_preserves_generated_ref_when_no_tools(self) -> None:
ref = GeneratedEvaluatorRef(name="rubric", version="1")
items = [
EvalItem(conversation=[Message("user", ["q"]), Message("assistant", ["r"])]),
]
result = _filter_tool_evaluators(
["relevance", ref, "tool_call_accuracy"],
items,
)
assert "relevance" in result
assert ref in result
assert "tool_call_accuracy" not in result
def test_generated_ref_alone_does_not_raise(self) -> None:
ref = GeneratedEvaluatorRef(name="rubric", version="1")
items = [
EvalItem(conversation=[Message("user", ["q"]), Message("assistant", ["r"])]),
]
result = _filter_tool_evaluators([ref], items)
assert result == [ref]
# ---------------------------------------------------------------------------
# EvalResults
@@ -2267,7 +2360,6 @@ class TestEvalResultsWithItems:
class TestFetchOutputItems:
async def test_fetches_and_converts_output_items(self) -> None:
from agent_framework_foundry._foundry_evals import _fetch_output_items
# Build mock output items matching the OpenAI SDK schema
mock_result = MagicMock()
@@ -2329,7 +2421,6 @@ class TestFetchOutputItems:
assert item.error_code is None
async def test_handles_errored_item(self) -> None:
from agent_framework_foundry._foundry_evals import _fetch_output_items
mock_error = MagicMock()
mock_error.code = "QueryExtractionError"
@@ -2361,7 +2452,6 @@ class TestFetchOutputItems:
assert len(item.scores) == 0
async def test_handles_api_failure_gracefully(self) -> None:
from agent_framework_foundry._foundry_evals import _fetch_output_items
mock_client = MagicMock()
mock_client.evals.runs.output_items.list = AsyncMock(side_effect=TypeError("API error"))
@@ -2369,6 +2459,166 @@ class TestFetchOutputItems:
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
assert items == []
async def test_extracts_rubric_scores_from_dict_sample(self) -> None:
mock_result = MagicMock()
mock_result.name = "my-rubric"
mock_result.score = 0.85
mock_result.passed = True
mock_result.sample = {
"properties": {
"rubric_scores": [
{"id": "policy", "score": 4, "applicable": True, "weight": 1, "reason": "ok"},
{"id": "safety", "score": None, "applicable": False, "weight": 1, "reason": "n/a"},
]
}
}
mock_oi = MagicMock()
mock_oi.id = "oi_1"
mock_oi.status = "pass"
mock_oi.results = [mock_result]
mock_oi.sample = None
mock_oi.datasource_item = {}
mock_client = MagicMock()
mock_client.evals.runs.output_items.list = AsyncMock(return_value=_AsyncPage([mock_oi]))
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
assert len(items) == 1
scores = items[0].scores
assert len(scores) == 1
assert scores[0].dimensions is not None
assert len(scores[0].dimensions) == 2
policy = next(d for d in scores[0].dimensions if d.id == "policy")
assert policy.score == 4
assert policy.applicable is True
assert policy.weight == 1
assert policy.reason == "ok"
safety = next(d for d in scores[0].dimensions if d.id == "safety")
assert safety.score is None
assert safety.applicable is False
async def test_no_rubric_scores_when_absent(self) -> None:
mock_result = MagicMock()
mock_result.name = "relevance"
mock_result.score = 0.85
mock_result.passed = True
mock_result.sample = None
mock_oi = MagicMock()
mock_oi.id = "oi_2"
mock_oi.status = "pass"
mock_oi.results = [mock_result]
mock_oi.sample = None
mock_oi.datasource_item = {}
mock_client = MagicMock()
mock_client.evals.runs.output_items.list = AsyncMock(return_value=_AsyncPage([mock_oi]))
items = await _fetch_output_items(mock_client, "eval_1", "run_1")
assert items[0].scores[0].dimensions is None
class TestExtractRubricScores:
def test_handles_attribute_style_properties(self) -> None:
rs = MagicMock()
rs.id = "policy"
rs.score = 5
rs.applicable = True
rs.weight = 2
rs.reason = "ok"
sample = MagicMock()
sample.properties = MagicMock()
sample.properties.rubric_scores = [rs]
result = _extract_rubric_scores(sample)
assert result is not None
assert result[0].id == "policy"
assert result[0].score == 5
assert result[0].weight == 2
def test_top_level_rubric_scores_in_dict(self) -> None:
sample = {"rubric_scores": [{"id": "a", "score": 3, "applicable": True, "weight": 1, "reason": "r"}]}
result = _extract_rubric_scores(sample)
assert result is not None
assert result[0].id == "a"
def test_returns_none_when_missing(self) -> None:
assert _extract_rubric_scores(None) is None
assert _extract_rubric_scores({}) is None
assert _extract_rubric_scores({"properties": {}}) is None
def test_skips_malformed_entries(self) -> None:
sample = {
"properties": {
"rubric_scores": [
{"id": "good", "score": 3, "applicable": True, "weight": 1, "reason": "ok"},
{"id": "bad-no-weight", "score": 2, "applicable": True, "reason": "x"},
]
}
}
result = _extract_rubric_scores(sample)
assert result is not None
assert len(result) == 1
assert result[0].id == "good"
def test_canonical_dimension_scores_key_from_docs(self) -> None:
"""Per the Microsoft Learn docs, runtime output uses ``properties.dimension_scores``."""
sample = {
"properties": {
"dimension_scores": [
{
"id": "intent_recognition",
"score": 5,
"applicable": True,
"weight": 9,
"reason": "Identified correctly.",
},
{
"id": "general_quality",
"score": 4,
"applicable": True,
"weight": 5,
"reason": "Strong overall.",
},
]
}
}
result = _extract_rubric_scores(sample)
assert result is not None
assert [r.id for r in result] == ["intent_recognition", "general_quality"]
assert [r.score for r in result] == [5, 4]
assert [r.weight for r in result] == [9, 5]
def test_dimension_scores_via_attribute(self) -> None:
"""Canonical key also resolves when properties exposes ``dimension_scores`` as an attr."""
rs = MagicMock()
rs.id = "policy_enforcement"
rs.score = 1
rs.applicable = True
rs.weight = 5
rs.reason = "violated"
sample = MagicMock()
sample.properties = MagicMock(spec=["dimension_scores"])
sample.properties.dimension_scores = [rs]
result = _extract_rubric_scores(sample)
assert result is not None
assert result[0].id == "policy_enforcement"
assert result[0].score == 1
# ---------------------------------------------------------------------------
# _poll_eval_run — timeout / failed / canceled paths
@@ -2378,7 +2628,6 @@ class TestFetchOutputItems:
class TestPollEvalRun:
async def test_timeout_returns_timeout_status(self) -> None:
"""Poll timeout returns EvalResults with status='timeout'."""
from agent_framework_foundry._foundry_evals import _poll_eval_run
mock_client = MagicMock()
mock_pending = MagicMock()
@@ -2392,7 +2641,6 @@ class TestPollEvalRun:
async def test_failed_run_returns_error(self) -> None:
"""Failed run returns EvalResults with error message."""
from agent_framework_foundry._foundry_evals import _poll_eval_run
mock_client = MagicMock()
mock_failed = MagicMock()
@@ -2410,7 +2658,6 @@ class TestPollEvalRun:
async def test_canceled_run_returns_canceled_status(self) -> None:
"""Canceled run returns EvalResults with status='canceled'."""
from agent_framework_foundry._foundry_evals import _poll_eval_run
mock_client = MagicMock()
mock_canceled = MagicMock()
@@ -2435,7 +2682,6 @@ class TestPollEvalRun:
class TestEvaluateTraces:
async def test_raises_without_required_args(self) -> None:
"""Raises ValueError when no response_ids, trace_ids, or agent_id given."""
from agent_framework_foundry._foundry_evals import evaluate_traces
mock_client = MagicMock()
with pytest.raises(ValueError, match="Provide at least one of"):
@@ -2446,7 +2692,6 @@ class TestEvaluateTraces:
async def test_response_ids_path(self) -> None:
"""evaluate_traces with response_ids uses the responses API path."""
from agent_framework_foundry._foundry_evals import evaluate_traces
mock_client = MagicMock()
@@ -2494,7 +2739,6 @@ class TestEvaluateTraces:
async def test_trace_ids_path(self) -> None:
"""evaluate_traces with trace_ids builds azure_ai_traces data source."""
from agent_framework_foundry._foundry_evals import evaluate_traces
mock_client = MagicMock()
@@ -2534,7 +2778,6 @@ class TestEvaluateTraces:
class TestEvaluateFoundryTarget:
async def test_happy_path(self) -> None:
"""evaluate_foundry_target creates eval + run and polls to completion."""
from agent_framework_foundry._foundry_evals import evaluate_foundry_target
mock_client = MagicMock()
@@ -2670,13 +2913,11 @@ class TestEvaluatorSetConsistency:
"""Verify that _AGENT_EVALUATORS and _TOOL_EVALUATORS are subsets of _BUILTIN_EVALUATORS."""
def test_agent_evaluators_subset(self):
from agent_framework_foundry._foundry_evals import _AGENT_EVALUATORS, _BUILTIN_EVALUATORS
diff = _AGENT_EVALUATORS - set(_BUILTIN_EVALUATORS.values())
assert not diff, f"_AGENT_EVALUATORS has names not in _BUILTIN_EVALUATORS: {diff}"
def test_tool_evaluators_subset(self):
from agent_framework_foundry._foundry_evals import _BUILTIN_EVALUATORS, _TOOL_EVALUATORS
diff = _TOOL_EVALUATORS - set(_BUILTIN_EVALUATORS.values())
assert not diff, f"_TOOL_EVALUATORS has names not in _BUILTIN_EVALUATORS: {diff}"
@@ -2690,7 +2931,6 @@ class TestEvaluatorSetConsistency:
class TestEvaluateTracesAgentId:
async def test_agent_id_only_path(self) -> None:
"""evaluate_traces with agent_id only builds azure_ai_traces data source."""
from agent_framework_foundry._foundry_evals import evaluate_traces
mock_client = MagicMock()
@@ -2748,7 +2988,6 @@ class TestFilterToolEvaluatorsRaises:
class TestEvaluateFoundryTargetValidation:
async def test_target_without_type_raises(self) -> None:
"""target dict without 'type' key raises ValueError."""
from agent_framework_foundry._foundry_evals import evaluate_foundry_target
mock_client = MagicMock()
with pytest.raises(ValueError, match="'type' key"):