Ben Thomas e0d0ad16a0 Python: feat(evals): Foundry Adaptive Evals integration (rubric-generation) (#6101)
e0d0ad16a0 · 2026-06-01 23:01:56 +00:00

Agent Framework Foundry

This package contains the Microsoft Foundry integrations for Microsoft Agent Framework, including Foundry chat clients, preconfigured Foundry agents, Foundry embedding clients, and Foundry memory providers.

Toolboxes

A toolbox is a named, versioned bundle of hosted tool configurations — code interpreter, file search, image generation, MCP, web search, and so on — stored inside a Microsoft Foundry project. Toolboxes let you manage tool configuration once and reuse it across agents.

Authoring a toolbox

Toolboxes can be authored two ways:

  • Foundry portal — create and version toolboxes through the UI without touching code.
  • Programmatically — use the azure-ai-projects SDK to create, update, and version toolboxes from Python.

Toolbox authoring APIs (ToolboxVersionObject, ToolboxObject, project_client.beta.toolboxes.*) require azure-ai-projects>=2.1.0. Earlier versions can only consume toolboxes that already exist.
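The version floor above can be checked at startup before any authoring call is attempted. The following is a minimal sketch using only the standard library; supports_toolbox_authoring and installed_sdk_supports_authoring are hypothetical helper names, not part of this package or of azure-ai-projects. It compares only the numeric release segment, so a pre-release of 2.1.0 is treated as 2.1.0.

```python
import re
from importlib import metadata

# Minimum release stated in this README for the toolbox authoring APIs.
MIN_AUTHORING_VERSION = (2, 1, 0)


def release_tuple(version: str) -> tuple:
    """Return the leading numeric release segment, e.g. '2.3.0a1' -> (2, 3, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_toolbox_authoring(version: str) -> bool:
    """Hypothetical helper: True when the release segment is at least 2.1.0.

    Pre-release suffixes (a1, b2, rc1) are ignored, so this is an approximation.
    """
    release = release_tuple(version)
    # Pad short versions so that '2.1' compares equal to (2, 1, 0).
    padded = release + (0,) * (len(MIN_AUTHORING_VERSION) - len(release))
    return padded >= MIN_AUTHORING_VERSION


def installed_sdk_supports_authoring() -> bool:
    """Check the installed azure-ai-projects distribution, if there is one."""
    try:
        return supports_toolbox_authoring(metadata.version("azure-ai-projects"))
    except metadata.PackageNotFoundError:
        return False
```

For strict PEP 440 ordering (where 2.1.0b1 sorts below 2.1.0), the third-party packaging library is the usual choice.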

Using toolboxes with FoundryAgent

For hosted FoundryAgent, the toolbox must already be attached to the agent in the Microsoft Foundry project. Once attached, the agent invokes its toolbox tools transparently — no client-side wiring required — and you interact with the agent the same way you would with any other tool-equipped Foundry agent.

Using toolboxes with FoundryChatClient

Each toolbox is reachable as an MCP server. Connect to the toolbox's MCP endpoint with MCPStreamableHTTPTool — the agent then discovers and calls its tools over MCP at runtime:

import asyncio

from agent_framework import Agent, MCPStreamableHTTPTool
from agent_framework.foundry import FoundryChatClient


async def main() -> None:
    # The toolbox is exposed as an MCP server; the agent discovers its tools at runtime.
    async with Agent(
        client=FoundryChatClient(...),
        instructions="You are a helpful assistant. Use the toolbox tools when useful.",
        tools=MCPStreamableHTTPTool(
            name="my_toolbox",
            description="Tools served by my Foundry toolbox",
            url="https://<your-toolbox-mcp-endpoint>",
        ),
    ) as agent:
        result = await agent.run("What tools are available?")
        print(result.text)


asyncio.run(main())

Hosted tool factories

FoundryChatClient exposes static factory methods that return Foundry SDK tool configurations ready to pass to an Agent's tools=[...] argument. These factories don't require a FoundryChatClient instance — you can call them statically and reuse the same tool configuration across agents.

from agent_framework import Agent
from agent_framework.foundry import FoundryChatClient

agent = Agent(
    client=FoundryChatClient(...),
    instructions="...",
    tools=[
        FoundryChatClient.get_web_search_tool(),
        FoundryChatClient.get_code_interpreter_tool(),
    ],
)

Generally available factories: get_code_interpreter_tool, get_file_search_tool, get_web_search_tool, get_image_generation_tool, get_mcp_tool.

Choosing a web grounding tool. get_web_search_tool is the recommended default — it requires no separate Bing resource and works with Azure OpenAI models out of the box. Reach for get_bing_grounding_tool (experimental, see below) when you need finer Bing parameters (count, freshness, market, set_lang), are grounding non-OpenAI Foundry models, or are migrating from Grounding with Bing Search on the classic platform — it requires a Grounding with Bing Search Azure resource that you manage. get_bing_custom_search_tool (also experimental) is for grounding restricted to a curated list of domains via a Bing Custom Search instance. See the web grounding overview for the full comparison.

Experimental — ExperimentalFeature.FOUNDRY_TOOLS. The following factories wrap GA Foundry tool SDK classes but are new wrappers in agent-framework-foundry and may change before the wrappers themselves reach GA. Calls emit an ExperimentalWarning the first time the FOUNDRY_TOOLS feature is exercised in a process (then deduplicated).

| Factory | Foundry SDK tool |
| --- | --- |
| get_azure_ai_search_tool(index_connection_id, index_name, ...) | AzureAISearchTool |
| get_bing_grounding_tool(connection_id, ...) | BingGroundingTool |

Experimental — ExperimentalFeature.FOUNDRY_PREVIEW_TOOLS. The following factories wrap preview Foundry tool SDK types — the underlying Foundry capability itself is in preview and may change or be removed before reaching GA. Calls emit a separate ExperimentalWarning the first time the FOUNDRY_PREVIEW_TOOLS feature is exercised in a process (then deduplicated). Use FOUNDRY_TOOLS for "wrapper is new" and FOUNDRY_PREVIEW_TOOLS for "underlying Foundry feature is preview".

| Factory | Foundry SDK tool |
| --- | --- |
| get_sharepoint_tool(connection_id) | SharepointPreviewTool |
| get_fabric_tool(connection_id) | MicrosoftFabricPreviewTool |
| get_memory_search_tool(memory_store_name, scope, ...) | MemorySearchPreviewTool |
| get_computer_use_tool(environment, display_width, display_height) | ComputerUsePreviewTool |
| get_browser_automation_tool(connection_id) | BrowserAutomationPreviewTool |
| get_bing_custom_search_tool(connection_id, instance_name, ...) | BingCustomSearchPreviewTool |
| get_a2a_tool(base_url=..., project_connection_id=..., ...) | A2APreviewTool |
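The "first use per feature, then deduplicated" behaviour described for FOUNDRY_TOOLS and FOUNDRY_PREVIEW_TOOLS can be pictured with a small standard-library sketch. The ExperimentalWarning class below is a stand-in defined locally so the example runs on its own (its base class is an assumption), and warn_experimental is a hypothetical helper; neither is the package's actual implementation.

```python
import warnings


class ExperimentalWarning(FutureWarning):
    """Local stand-in for the library's warning class (base class is assumed)."""


# Features that have already warned in this process.
_warned_features = set()


def warn_experimental(feature: str) -> None:
    """Warn the first time a feature is exercised, then stay silent for it."""
    if feature in _warned_features:
        return
    _warned_features.add(feature)
    warnings.warn(
        f"{feature} is experimental and may change.",
        ExperimentalWarning,
        stacklevel=2,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_experimental("FOUNDRY_TOOLS")
    warn_experimental("FOUNDRY_TOOLS")          # same feature: deduplicated
    warn_experimental("FOUNDRY_PREVIEW_TOOLS")  # different feature: warns separately

emitted = [str(w.message) for w in caught]
```

Here emitted holds two messages, one per feature. In application code the standard warnings filters can be used to silence or escalate a warning category once it has been imported from wherever the library exposes it.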

Publishing an agent as a Foundry prompt agent

Experimental — ExperimentalFeature.TO_PROMPT_AGENT. to_prompt_agent is a preview API and may change before reaching GA. The warning fires the first time the TO_PROMPT_AGENT feature is exercised in a process and is then deduplicated.

to_prompt_agent(agent) converts an Agent whose chat client is a FoundryChatClient into a Foundry PromptAgentDefinition that can be published with AIProjectClient.agents.create_version(...). The model is read from default_options["model"] first and falls back to the bound FoundryChatClient.model (matching Agent.__init__'s resolution order), so the same agent definition you run locally can be published as a hosted prompt agent without restating the model deployment name.
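The resolution order described above can be sketched in a few lines. StubClient, StubAgent, and resolve_model are hypothetical stand-ins written for this illustration, not the converter's real code; raising ValueError when neither source provides a model is an assumption modelled on the behaviour notes later in this section.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class StubClient:
    """Stand-in for a chat client that may carry a bound model name."""

    model: Optional[str] = None


@dataclass
class StubAgent:
    """Stand-in for an agent with a client and per-agent default options."""

    client: StubClient
    default_options: Dict[str, Any] = field(default_factory=dict)


def resolve_model(agent: StubAgent) -> str:
    """Prefer default_options['model'], then fall back to the client's model."""
    model = agent.default_options.get("model") or agent.client.model
    if not model:
        # Assumed failure mode: no model available from either source.
        raise ValueError("No model set in default_options or on the client.")
    return model


from_client = resolve_model(StubAgent(StubClient(model="gpt-4o")))
from_options = resolve_model(
    StubAgent(StubClient(model="gpt-4o"), {"model": "gpt-4o-mini"})
)
```

With these inputs, from_client is "gpt-4o" and from_options is "gpt-4o-mini", since the per-agent option takes precedence over the client binding.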

Every generation parameter that has an Agent Framework equivalent is sourced from agent.default_options and translated into the matching Foundry shape by _prepare_prompt_agent_options (a module-private helper in agent_framework_foundry._to_prompt_agent that reuses the chat client's own request-path helpers):

| default_options key | PromptAgentDefinition field |
| --- | --- |
| temperature | temperature |
| top_p | top_p |
| tool_choice (dropped when no tools) | tool_choice (str / ToolChoiceFunction / ToolChoiceAllowed) |
| reasoning (dict or Reasoning) | reasoning |
| response_format (dict or BaseModel) | text.format |
| verbosity | text.verbosity |
| text | merged into text |
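To make the mapping concrete, here is a rough sketch that applies the table to plain dictionaries. to_definition_fields is a hypothetical function written for illustration; it does not reproduce _prepare_prompt_agent_options, it skips the type conversions (Reasoning, BaseModel, ToolChoiceFunction), and the precedence it gives response_format and verbosity over keys already present in text is an assumption.

```python
from typing import Any, Dict


def to_definition_fields(options: Dict[str, Any], has_tools: bool) -> Dict[str, Any]:
    """Map default_options-style keys onto PromptAgentDefinition-style fields."""
    fields: Dict[str, Any] = {}

    # Keys that keep their name and value.
    for key in ("temperature", "top_p", "reasoning"):
        if key in options:
            fields[key] = options[key]

    # tool_choice is dropped when the agent has no tools.
    if has_tools and "tool_choice" in options:
        fields["tool_choice"] = options["tool_choice"]

    # Start from any explicit 'text' value, then merge in the nested keys.
    text: Dict[str, Any] = dict(options.get("text") or {})
    if "response_format" in options:
        text["format"] = options["response_format"]
    if "verbosity" in options:
        text["verbosity"] = options["verbosity"]
    if text:
        fields["text"] = text

    return fields


example = to_definition_fields(
    {
        "temperature": 0.3,
        "tool_choice": "auto",
        "response_format": {"type": "json_object"},
        "verbosity": "low",
    },
    has_tools=False,
)
```

In this run tool_choice disappears because has_tools is False, while response_format and verbosity end up nested under text.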

This keeps the Agent as the single source of truth for everything it can already express. Only Foundry-specific fields with no Agent Framework equivalent are accepted as keyword arguments on to_prompt_agent:

  • structured_inputs: dict[str, StructuredInputDefinition]
  • rai_config: RaiConfig

import asyncio
import os

from agent_framework import Agent
from agent_framework.foundry import FoundryChatClient, to_prompt_agent
from azure.ai.projects.aio import AIProjectClient
from azure.identity.aio import AzureCliCredential


async def main() -> None:
    credential = AzureCliCredential()
    project_endpoint = os.environ["FOUNDRY_PROJECT_ENDPOINT"]

    agent = Agent(
        client=FoundryChatClient(
            project_endpoint=project_endpoint,
            model="gpt-4o",
            credential=credential,
        ),
        name="travel-agent",
        description="Helps Contoso employees book travel.",
        instructions="You are a helpful travel assistant.",
        tools=[
            FoundryChatClient.get_web_search_tool(),
            FoundryChatClient.get_code_interpreter_tool(),
        ],
        # Generation parameters set on the Agent flow through automatically.
        default_options={
            "temperature": 0.3,
            "top_p": 0.95,
            "reasoning": {"effort": "medium"},
        },
    )

    definition = to_prompt_agent(agent)

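    # Use the client as an async context manager so its HTTP session is closed.
    async with AIProjectClient(endpoint=project_endpoint, credential=credential) as project_client:
        created = await project_client.agents.create_version(
            agent_name=agent.name,
            definition=definition,
            description=agent.description,
        )
    print(f"Published {created.name} v{created.version}")

    # Async credentials hold network resources too; close them explicitly.
    await credential.close()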


asyncio.run(main())

Behaviour:

  • agent.client must be a FoundryChatClient (or subclass) — otherwise the converter raises TypeError.

  • The bound client must have a model set — otherwise the converter raises ValueError.

  • Foundry SDK tool instances returned by FoundryChatClient.get_*_tool() are passed through unchanged.

  • AF FunctionTool instances (and @tool-decorated callables) are emitted as Foundry FunctionTool declarations — the prompt agent receives the schema only, not the Python implementation. To execute the function when invoking the deployed prompt agent, connect with FoundryAgent and pass the same callable via tools=:

    from agent_framework.foundry import FoundryAgent
    
    deployed = FoundryAgent(
        project_endpoint=project_endpoint,
        agent_name="travel-agent",
        credential=credential,
        tools=[book_hotel],  # same @tool-decorated callable used at publish time
    )
    result = await deployed.run("Book me a hotel in Seattle for 3 nights.")
    

    FoundryAgent runs the function locally when the prompt agent calls it, so the declaration on the server and the implementation on the client stay in sync via the shared @tool definition.

  • Local Agent Framework MCP tools cannot be published as prompt-agent tools — the converter raises ValueError and points at FoundryChatClient.get_mcp_tool(...) for hosted MCP servers.
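The split described in the FunctionTool bullet above (schema on the server, implementation on the client) can be illustrated without any SDK. In this sketch the declaration dictionary, the local_tools registry, and execute_tool_call are all hypothetical; the exact shape Foundry uses for function declarations and tool-call payloads is assumed, not taken from the SDK.

```python
import json
from typing import Any, Callable, Dict


def book_hotel(city: str, nights: int) -> str:
    """Local implementation; this code never leaves the client process."""
    return f"Booked {nights} nights in {city}"


# What a hosted agent would be given: a name and a parameter schema, no code.
# The field layout here is illustrative only.
declaration = {
    "type": "function",
    "name": "book_hotel",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "nights": {"type": "integer"},
        },
        "required": ["city", "nights"],
    },
}

# Client-side registry mapping declared names to local callables.
local_tools: Dict[str, Callable[..., Any]] = {"book_hotel": book_hotel}


def execute_tool_call(name: str, arguments_json: str) -> Any:
    """Run the local callable that matches a tool call requested by name."""
    func = local_tools.get(name)
    if func is None:
        raise KeyError(f"No local implementation registered for {name!r}")
    return func(**json.loads(arguments_json))


result = execute_tool_call("book_hotel", '{"city": "Seattle", "nights": 3}')
```

The point of the sketch is that the name is the only link between the two sides, which is why passing the same @tool-decorated callable at publish time and at invocation time keeps them aligned.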

See the runnable example under samples/02-agents/providers/foundry/:

  • foundry_prompt_agents.py — publish with to_prompt_agent, then connect back with FoundryAgent and execute the same local @tool callable that the deployed prompt agent invokes by name.