Python: [BREAKING] Refactor middleware layering and split Anthropic raw client (#4746)

* [BREAKING] Refactor middleware layering and raw clients

Reorder chat client layers so function invocation wraps chat middleware, and chat middleware stays outside telemetry while still running for each inner model call. Add middleware pipeline caching, refresh docs and samples, and split Anthropic into raw and public clients to match the standard layering model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Tighten typing ignores in ancillary modules

Add targeted typing ignores in workflow visualization and lab modules so pyright stays clean alongside the middleware refactor work.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix categorize_middleware to unpack tuple/Sequence and use relative MRO assertions

- Broaden isinstance check in categorize_middleware from list to Sequence so tuples and other Sequence types are properly unpacked instead of being appended as a single item.
- Replace fragile hardcoded MRO index assertions in anthropic test with relative ordering via mro.index().
- Add regression tests for categorize_middleware with tuple, list, and None inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix middleware string decomposition, add middleware param to FunctionInvocationLayer, and add tests (#4710)

- Guard categorize_middleware Sequence check against str/bytes to prevent character-by-character decomposition of accidentally passed strings
- Add explicit middleware parameter to FunctionInvocationLayer.get_response and merge it into client_kwargs before categorization, fixing the inconsistency where only OpenAIChatClient supported this parameter
- Add assertions that RawAnthropicClient does not inherit convenience layers
- Add chat middleware cache test with non-empty base middleware
- Add tests for single unwrapped middleware item and string input

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Apply pre-commit auto-fixes

* Apply pre-commit auto-fixes

* Address review feedback for #4710: review comment fixes

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
2026-06-16 21:04:09 +08:00 · 2026-03-20 01:43:37 +01:00
parent cefda44283
commit 0cd40f8354
41 changed files with 936 additions and 155 deletions
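The commit message describes broadening the `categorize_middleware` check from `list` to `Sequence` while guarding against `str`/`bytes`. The sketch below illustrates only that guard. `flatten_middleware` and the placeholder middleware functions are hypothetical names for illustration, not the framework's actual implementation.

```python
from collections.abc import Sequence
from typing import Any


def flatten_middleware(*sources: Any) -> list[Any]:
    """Flatten middleware sources into one list (hypothetical helper)."""
    flat: list[Any] = []
    for source in sources:
        if source is None:
            # Nothing was supplied for this source.
            continue
        # str and bytes are Sequences too, so exclude them explicitly;
        # otherwise a stray string would be split into single characters.
        if isinstance(source, Sequence) and not isinstance(source, (str, bytes)):
            flat.extend(source)
        else:
            # A single unwrapped item is appended as-is.
            flat.append(source)
    return flat


def log_calls() -> None:  # placeholder middleware, assumption for the demo
    """Stand-in for a real middleware callable."""


def count_calls() -> None:  # placeholder middleware, assumption for the demo
    """Stand-in for a real middleware callable."""


names = [mw.__name__ for mw in flatten_middleware([log_calls], (count_calls,), None)]
print(names)
print(flatten_middleware("oops"))
```

With a plain `isinstance(source, Sequence)` check and no `str`/`bytes` guard, the last call would yield four one-character strings instead of a single item.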
@@ -0,0 +1,37 @@
+# Middleware samples
+
+This folder contains focused middleware samples for `Agent`, chat clients, tools, sessions, and runtime context behavior.
+
+## Files
+
+| File | Description |
+|------|-------------|
+| [`agent_and_run_level_middleware.py`](./agent_and_run_level_middleware.py) | Demonstrates combining agent-level and run-level middleware. |
+| [`chat_middleware.py`](./chat_middleware.py) | Shows class-based and function-based chat middleware that can observe, modify, and override model calls. |
+| [`class_based_middleware.py`](./class_based_middleware.py) | Shows class-based agent and function middleware. |
+| [`decorator_middleware.py`](./decorator_middleware.py) | Demonstrates middleware registration with decorators. |
+| [`exception_handling_with_middleware.py`](./exception_handling_with_middleware.py) | Shows how middleware can handle failures and recover cleanly. |
+| [`function_based_middleware.py`](./function_based_middleware.py) | Shows function-based agent and function middleware. |
+| [`middleware_termination.py`](./middleware_termination.py) | Demonstrates stopping a middleware pipeline early. |
+| [`override_result_with_middleware.py`](./override_result_with_middleware.py) | Shows how middleware can replace the normal result. |
+| [`runtime_context_delegation.py`](./runtime_context_delegation.py) | Demonstrates delegating work with runtime context data. |
+| [`session_behavior_middleware.py`](./session_behavior_middleware.py) | Shows how middleware interacts with session-backed runs. |
+| [`shared_state_middleware.py`](./shared_state_middleware.py) | Demonstrates sharing mutable state across middleware invocations. |
+| [`usage_tracking_middleware.py`](./usage_tracking_middleware.py) | Demonstrates one chat middleware function that tracks per-call usage in non-streaming and streaming tool-loop runs. |
+
+## Running the usage tracking sample
+
+The usage tracking sample uses `OpenAIResponsesClient`, so set the OpenAI Responses environment variables first:
+
+```bash
+export OPENAI_API_KEY="your-openai-api-key"
+export OPENAI_RESPONSES_MODEL_ID="gpt-4.1-mini"
+```
+
+Then run:
+
+```bash
+uv run samples/02-agents/middleware/usage_tracking_middleware.py
+```
+
+The sample forces a tool call so you can see middleware output for each inner model call in both non-streaming and streaming modes.
@@ -51,10 +51,10 @@ Agent Middleware Execution Order:
    - Run middleware wraps only the agent for that specific run
    - Each middleware can modify the context before AND after calling next()

-    Note: Function and chat middleware (e.g., ``function_logging_middleware``) execute
-    during tool invocation *inside* the agent execution, not in the outer agent-middleware
-    chain shown above. They follow the same ordering principle: agent-level function/chat
-    middleware runs before run-level function/chat middleware.
+    Note: Function middleware executes during tool invocation, and chat middleware
+    executes around each model call inside the agent execution, not in the outer
+    agent-middleware chain shown above. They follow the same ordering principle:
+    agent-level function/chat middleware runs before run-level function/chat middleware.
 """


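The ordering rule in the note above (agent-level middleware runs before run-level middleware, and each middleware can act before and after calling next) can be illustrated with a small standalone pipeline. This is a minimal sketch under assumptions: `make_middleware`, `build_pipeline`, and `model_call` are hypothetical names, and the framework's real pipeline construction is not shown in this diff.

```python
import asyncio
from collections.abc import Awaitable, Callable

Next = Callable[[], Awaitable[None]]
Handler = Callable[[list[str]], Awaitable[None]]
Middleware = Callable[[list[str], Next], Awaitable[None]]


def make_middleware(name: str) -> Middleware:
    """Create a middleware that records when it runs (hypothetical helper)."""

    async def middleware(trace: list[str], call_next: Next) -> None:
        trace.append(f"{name}: before")
        await call_next()
        trace.append(f"{name}: after")

    return middleware


def build_pipeline(middleware: list[Middleware], handler: Handler) -> Handler:
    """Wrap the handler so the first middleware in the list is outermost."""
    pipeline = handler
    for mw in reversed(middleware):

        def wrap(inner: Handler, current: Middleware = mw) -> Handler:
            async def run(trace: list[str]) -> None:
                await current(trace, lambda: inner(trace))

            return run

        pipeline = wrap(pipeline)
    return pipeline


async def model_call(trace: list[str]) -> None:
    """Stand-in for the innermost operation, such as a model call."""
    trace.append("model call")


def run_demo() -> list[str]:
    agent_level = [make_middleware("agent-level")]
    run_level = [make_middleware("run-level")]
    trace: list[str] = []
    # Agent-level middleware is listed first, so it wraps run-level middleware.
    asyncio.run(build_pipeline(agent_level + run_level, model_call)(trace))
    return trace


print(run_demo())
```

The trace shows agent-level middleware entering first and exiting last, which is the nesting the docstring describes.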
@@ -0,0 +1,185 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+"""
+This sample demonstrates a single chat middleware that tracks per-model-call usage
+for both non-streaming and streaming tool-loop runs.
+"""
+
+import asyncio
+from collections.abc import Awaitable, Callable
+from random import randint
+from typing import Annotated
+
+from agent_framework import (
+    Agent,
+    ChatContext,
+    ChatResponse,
+    ChatResponseUpdate,
+    ResponseStream,
+    chat_middleware,
+    tool,
+)
+from agent_framework.openai import OpenAIResponsesClient
+from dotenv import load_dotenv
+from pydantic import Field
+
+# Load environment variables from .env file
+load_dotenv()
+
+
+NON_STREAMING_CALL_COUNT = 0
+STREAMING_CALL_COUNT = 0
+
+
+# NOTE: approval_mode="never_require" is for sample brevity. Use "always_require" in production;
+# see samples/02-agents/tools/function_tool_with_approval.py
+# and samples/02-agents/tools/function_tool_with_approval_and_sessions.py.
+@tool(approval_mode="never_require")
+def get_weather(
+    location: Annotated[str, Field(description="The location to get the weather for.")],
+) -> str:
+    """Get the weather for a given location."""
+    conditions = ["sunny", "cloudy", "rainy", "stormy"]
+    return f"The weather in {location} is {conditions[randint(0, 3)]} with a high of {randint(10, 30)}°C."
+
+
+def _reset_usage_counters() -> None:
+    """Reset call counters between sample runs."""
+    global NON_STREAMING_CALL_COUNT, STREAMING_CALL_COUNT
+    NON_STREAMING_CALL_COUNT = 0
+    STREAMING_CALL_COUNT = 0
+
+
+def _create_agent(
+) -> Agent:
+    """Create the shared agent used by both demonstrations."""
+    return Agent(
+        client=OpenAIResponsesClient(),
+        instructions=(
+            "You are a weather assistant. Always call the weather tool before answering weather questions, "
+            "then summarize the tool result in one short paragraph."
+        ),
+        tools=[get_weather],
+        middleware=[print_usage],
+    )
+
+
+@chat_middleware
+async def print_usage(
+    context: ChatContext,
+    call_next: Callable[[], Awaitable[None]],
+) -> None:
+    """Print usage for each inner model call in both non-streaming and streaming runs."""
+    global NON_STREAMING_CALL_COUNT, STREAMING_CALL_COUNT
+
+    if context.stream:
+        STREAMING_CALL_COUNT += 1
+        call_number = STREAMING_CALL_COUNT
+        usage_seen_in_updates = False
+
+        def capture_usage_update(update: ChatResponseUpdate) -> ChatResponseUpdate:
+            nonlocal usage_seen_in_updates
+
+            for content in update.contents:
+                if content.type == "usage":
+                    usage_seen_in_updates = True
+                    print(f"\n[Streaming model call #{call_number}] Usage update: {content.usage_details}")
+            return update
+
+        def capture_final_usage(result: ChatResponse) -> ChatResponse:
+            if not usage_seen_in_updates and result.usage_details:
+                print(f"\n[Streaming model call #{call_number}] Final usage: {result.usage_details}")
+            return result
+
+        context.stream_transform_hooks.append(capture_usage_update)
+        context.stream_result_hooks.append(capture_final_usage)
+        await call_next()
+        return
+
+    NON_STREAMING_CALL_COUNT += 1
+    call_number = NON_STREAMING_CALL_COUNT
+
+    await call_next()
+
+    response = context.result
+    if isinstance(response, ChatResponse) and response.usage_details:
+        print(f"[Non-streaming model call #{call_number}] Usage: {response.usage_details}")
+
+
+async def non_streaming_usage_example() -> None:
+    """Run the non-streaming usage tracking example."""
+    _reset_usage_counters()
+    print("\n=== Non-streaming per-call usage tracking ===")
+
+    # 1. Create an agent with middleware that prints usage after each inner model call.
+    agent = _create_agent()
+
+    # 2. Run a weather question and require a tool call so the function loop performs multiple model calls.
+    query = "What is the weather in Seattle, and should I bring an umbrella?"
+    print(f"User: {query}")
+    result = await agent.run(
+        query,
+        options={"tool_choice": "required"},
+    )
+
+    # 3. Print the final user-visible answer after the middleware has already logged per-call usage.
+    print(f"Assistant: {result.text}")
+
+
+async def streaming_usage_example() -> None:
+    """Run the streaming usage tracking example."""
+    _reset_usage_counters()
+    print("\n=== Streaming per-call usage tracking ===")
+
+    # 1. Create an agent with middleware that watches streaming usage for each inner model call.
+    agent = _create_agent()
+
+    # 2. Start a streaming run and force tool usage so the function loop performs multiple model calls.
+    query = "What is the weather in Portland, and should I bring a jacket?"
+    print(f"User: {query}")
+    print("Assistant: ", end="", flush=True)
+    stream: ResponseStream = agent.run(
+        query,
+        stream=True,
+        options={"tool_choice": "required"},
+    )
+
+    # 3. Consume the stream normally while the middleware reports usage in the background.
+    async for update in stream:
+        if update.text:
+            print(update.text, end="", flush=True)
+    print()
+
+    # 4. Finalize the stream so you can inspect the final response if needed.
+    final_response = await stream.get_final_response()
+    print(f"Final assistant message: {final_response.text}")
+
+
+async def main() -> None:
+    """Run both usage tracking demonstrations."""
+    print("=== Usage Tracking Middleware Example ===")
+
+    await non_streaming_usage_example()
+    await streaming_usage_example()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
+
+"""
+Sample output:
+=== Usage Tracking Middleware Example ===
+
+=== Non-streaming per-call usage tracking ===
+User: What is the weather in Seattle, and should I bring an umbrella?
+[Non-streaming model call #1] Usage: {'input_tokens': ..., 'output_tokens': ..., ...}
+[Non-streaming model call #2] Usage: {'input_tokens': ..., 'output_tokens': ..., ...}
+Assistant: Based on the weather in Seattle, ...
+
+=== Streaming per-call usage tracking ===
+User: What is the weather in Portland, and should I bring a jacket?
+Assistant: Based on the weather in Portland, ...
+[Streaming model call #1] Usage update: {'input_tokens': ..., 'output_tokens': ..., ...}
+[Streaming model call #2] Usage update: {'input_tokens': ..., 'output_tokens': ..., ...}
+Final assistant message: Based on the weather in Portland, ...
+"""