# Tools and Middleware: Request Flow Architecture

This document describes the complete request flow when using an Agent with middleware and tools, from the initial `Agent.run()` call through middleware layers, function invocation, and back to the caller.

## Overview

The Agent Framework uses a layered architecture with three distinct middleware/processing layers:

1. **Agent Middleware Layer** - Wraps the entire agent execution
2. **Chat Middleware Layer** - Wraps calls to the chat client
3. **Function Middleware Layer** - Wraps individual tool/function invocations

Each layer provides interception points where you can modify inputs, inspect outputs, or alter behavior.

## Flow Diagram

```mermaid
sequenceDiagram
    participant User
    participant Agent as Agent.run()
    participant AML as AgentMiddlewareLayer
    participant AMP as AgentMiddlewarePipeline
    participant RawAgent as RawChatAgent.run()
    participant CML as ChatMiddlewareLayer
    participant CMP as ChatMiddlewarePipeline
    participant FIL as FunctionInvocationLayer
    participant Client as BaseChatClient._inner_get_response()
    participant LLM as LLM Service
    participant FMP as FunctionMiddlewarePipeline
    participant Tool as FunctionTool.invoke()

    User->>Agent: run(messages, thread, options, middleware)

    Note over Agent,AML: Agent Middleware Layer
    Agent->>AML: run() with middleware param
    AML->>AML: categorize_middleware() → split by type
    AML->>AMP: execute(AgentContext)

    loop Agent Middleware Chain
        AMP->>AMP: middleware[i].process(context, call_next)
        Note right of AMP: Can modify: messages, options, thread
    end

    AMP->>RawAgent: run() via final_handler

    alt Non-Streaming (stream=False)
        RawAgent->>RawAgent: _prepare_run_context() [async]
        Note right of RawAgent: Builds: thread_messages, chat_options, tools
        RawAgent->>CML: chat_client.get_response(stream=False)
    else Streaming (stream=True)
        RawAgent->>RawAgent: ResponseStream.from_awaitable()
        Note right of RawAgent: Defers async prep to stream consumption
        RawAgent-->>User: Returns ResponseStream immediately
        Note over RawAgent,CML: Async work happens on iteration
        RawAgent->>RawAgent: _prepare_run_context() [deferred]
        RawAgent->>CML: chat_client.get_response(stream=True)
    end

    Note over CML,CMP: Chat Middleware Layer
    CML->>CMP: execute(ChatContext)

    loop Chat Middleware Chain
        CMP->>CMP: middleware[i].process(context, call_next)
        Note right of CMP: Can modify: messages, options
    end

    CMP->>FIL: get_response() via final_handler

    Note over FIL,Tool: Function Invocation Loop
    loop Max Iterations (default: 40)
        FIL->>Client: _inner_get_response(messages, options)
        Client->>LLM: API Call
        LLM-->>Client: Response (may include tool_calls)
        Client-->>FIL: ChatResponse

        alt Response has function_calls
            FIL->>FIL: _extract_function_calls()
            FIL->>FIL: _try_execute_function_calls()

            Note over FIL,Tool: Function Middleware Layer
            loop For each function_call
                FIL->>FMP: execute(FunctionInvocationContext)
                loop Function Middleware Chain
                    FMP->>FMP: middleware[i].process(context, call_next)
                    Note right of FMP: Can modify: arguments
                end
                FMP->>Tool: invoke(arguments)
                Tool-->>FMP: result
                FMP-->>FIL: Content.from_function_result()
            end

            FIL->>FIL: Append tool results to messages

            alt tool_choice == "required"
                Note right of FIL: Return immediately with function call + result
                FIL-->>CMP: ChatResponse
            else tool_choice == "auto" or other
                Note right of FIL: Continue loop for text response
            end
        else No function_calls
            FIL-->>CMP: ChatResponse
        end
    end

    CMP-->>CML: ChatResponse
    Note right of CMP: Can observe/modify result

    CML-->>RawAgent: ChatResponse / ResponseStream

    alt Non-Streaming
        RawAgent->>RawAgent: _finalize_response_and_update_thread()
    else Streaming
        Note right of RawAgent: .map() transforms updates
        Note right of RawAgent: .with_result_hook() runs post-processing
    end

    RawAgent-->>AMP: AgentResponse / ResponseStream
    Note right of AMP: Can observe/modify result
    AMP-->>AML: AgentResponse
    AML-->>Agent: AgentResponse
    Agent-->>User: AgentResponse / ResponseStream
```

## Layer Details

### 1. Agent Middleware Layer (`AgentMiddlewareLayer`)

**Entry Point:** `Agent.run(messages, thread, options, middleware)`

**Context Object:** `AgentContext`

| Field | Type | Description |
|-------|------|-------------|
| `agent` | `SupportsAgentRun` | The agent being invoked |
| `messages` | `list[ChatMessage]` | Input messages (mutable) |
| `thread` | `AgentThread \| None` | Conversation thread |
| `options` | `Mapping[str, Any]` | Chat options dict |
| `stream` | `bool` | Whether streaming is enabled |
| `metadata` | `dict` | Shared data between middleware |
| `result` | `AgentResponse \| None` | Set after `call_next()` is called |
| `kwargs` | `Mapping[str, Any]` | Additional run arguments |

**Key Operations:**

1. `categorize_middleware()` separates middleware by type (agent, chat, function)
2. Chat and function middleware are forwarded to `chat_client`
3. `AgentMiddlewarePipeline.execute()` runs the agent middleware chain
4. Final handler calls `RawChatAgent.run()`

**What Can Be Modified:**

- `context.messages` - Add, remove, or modify input messages
- `context.options` - Change model parameters, temperature, etc.
- `context.thread` - Replace or modify the thread
- `context.result` - Override the final response (after `call_next()`)

### 2. Chat Middleware Layer (`ChatMiddlewareLayer`)

**Entry Point:** `chat_client.get_response(messages, options)`

**Context Object:** `ChatContext`

| Field | Type | Description |
|-------|------|-------------|
| `chat_client` | `ChatClientProtocol` | The chat client |
| `messages` | `Sequence[ChatMessage]` | Messages to send |
| `options` | `Mapping[str, Any]` | Chat options |
| `stream` | `bool` | Whether streaming is enabled |
| `metadata` | `dict` | Shared data between middleware |
| `result` | `ChatResponse \| None` | Set after `call_next()` is called |
| `kwargs` | `Mapping[str, Any]` | Additional arguments |

**Key Operations:**

1. `ChatMiddlewarePipeline.execute()` runs the chat middleware chain
2. Final handler calls `FunctionInvocationLayer.get_response()`
3. Stream hooks can be registered for streaming responses

**What Can Be Modified:**

- `context.messages` - Inject system prompts, filter content
- `context.options` - Change model, temperature, tool_choice
- `context.result` - Override the response (after `call_next()`)

### 3. Function Invocation Layer (`FunctionInvocationLayer`)

**Entry Point:** `FunctionInvocationLayer.get_response()`

This layer manages the tool execution loop:

1. **Calls** `BaseChatClient._inner_get_response()` to get the LLM response
2. **Extracts** function calls from the response
3. **Executes** functions through the Function Middleware Pipeline
4. **Appends** results to messages and loops back to step 1

**Configuration:** `FunctionInvocationConfiguration`

| Setting | Default | Description |
|---------|---------|-------------|
| `enabled` | `True` | Enable auto-invocation |
| `max_iterations` | `40` | Maximum tool execution loops |
| `max_consecutive_errors_per_request` | `3` | Error threshold before stopping |
| `terminate_on_unknown_calls` | `False` | Raise error for unknown tools |
| `additional_tools` | `[]` | Extra tools to register |
| `include_detailed_errors` | `False` | Include exceptions in results |

**`tool_choice` Behavior:**

The `tool_choice` option controls how the model uses available tools:

| Value | Behavior |
|-------|----------|
| `"auto"` | Model decides whether to call a tool or respond with text. After tool execution, the loop continues to get a text response. |
| `"none"` | Model is prevented from calling tools and will only respond with text. |
| `"required"` | Model **must** call a tool. After tool execution, returns immediately with the function call and result; **no additional model call** is made. |
| `{"mode": "required", "required_function_name": "fn"}` | Model must call the specified function. Same return behavior as `"required"`. |

**Why `tool_choice="required"` returns immediately:**

When you set `tool_choice="required"`, your intent is to force one or more tool calls (not all models support multiple calls, whether a specific function is named or `required` is used without a name). The framework respects this by:

1. Getting the model's function call(s)
2. Executing the tool(s)
3. Returning the response(s) with both the function call message(s) and the function result(s)

This avoids an infinite loop (model forced to call tools → executes → model forced to call tools again) and gives you direct access to the tool result.

```python
# With tool_choice="required", response contains function call + result only
response = await client.get_response(
    "What's the weather?",
    options={"tool_choice": "required", "tools": [get_weather]}
)

# response.messages contains:
# [0] Assistant message with function_call content
# [1] Tool message with function_result content
# (No text response from model)

# To get a text response after tool execution, use tool_choice="auto"
response = await client.get_response(
    "What's the weather?",
    options={"tool_choice": "auto", "tools": [get_weather]}
)
# response.text contains the model's interpretation of the weather data
```

### 4. Function Middleware Layer (`FunctionMiddlewarePipeline`)

**Entry Point:** Called per function invocation within `_auto_invoke_function()`

**Context Object:** `FunctionInvocationContext`

| Field | Type | Description |
|-------|------|-------------|
| `function` | `FunctionTool` | The function being invoked |
| `arguments` | `BaseModel` | Validated Pydantic arguments |
| `metadata` | `dict` | Shared data between middleware |
| `result` | `Any` | Set after `call_next()` is called |
| `kwargs` | `Mapping[str, Any]` | Runtime kwargs |

**What Can Be Modified:**

- `context.arguments` - Modify validated arguments before execution
- `context.result` - Override the function result (after `call_next()`)
- Raise `MiddlewareTermination` to skip execution and terminate the function invocation loop

**Special Behavior:** When `MiddlewareTermination` is raised in function middleware, it signals that the function invocation loop should exit **without making another LLM call**. This is useful when middleware determines that no further processing is needed (e.g., a termination condition is met).

```python
class TerminatingMiddleware(FunctionMiddleware):
    async def process(self, context: FunctionInvocationContext, call_next):
        if self.should_terminate(context):
            context.result = "terminated by middleware"
            raise MiddlewareTermination  # Exit function invocation loop
        await call_next(context)
```

## Arguments Added/Altered at Each Layer

### Agent Layer → Chat Layer

```python
# RawChatAgent._prepare_run_context() builds:
{
    "thread": AgentThread,           # Validated/created thread
    "input_messages": [...],         # Normalized input messages
    "thread_messages": [...],        # Messages from thread + context + input
    "agent_name": "...",             # Agent name for attribution
    "chat_options": {
        "model_id": "...",
        "conversation_id": "...",    # From thread.service_thread_id
        "tools": [...],              # Normalized tools + MCP tools
        "temperature": ...,
        "max_tokens": ...,
        # ... other options
    },
    "filtered_kwargs": {...},        # kwargs minus 'chat_options'
    "finalize_kwargs": {...},        # kwargs with 'thread' added
}
```

### Chat Layer → Function Layer

```python
# Passed through to FunctionInvocationLayer:
{
    "messages": [...],               # Prepared messages
    "options": {...},                # Mutable copy of chat_options
    "function_middleware": [...],    # Function middleware from kwargs
}
```

### Function Layer → Tool Invocation

```python
# FunctionInvocationContext receives:
{
    "function": FunctionTool,        # The tool to invoke
    "arguments": BaseModel,          # Validated from function_call.arguments
    "kwargs": {
        # Runtime kwargs (filtered, no conversation_id)
    },
}
```

### Tool Result → Back Up

```python
# Content.from_function_result() creates:
{
    "type": "function_result",
    "call_id": "...",                # From function_call.call_id
    "result": ...,                   # Serialized tool output
    "exception": "..." | None,       # Error message if failed
}
```

## Middleware Control Flow

There are three ways to exit a middleware's `process()` method:

### 1. Return Normally (with or without calling `call_next`)

Returns control to the upstream middleware, allowing its post-processing code to run.

```python
class CachingMiddleware(FunctionMiddleware):
    async def process(self, context: FunctionInvocationContext, call_next):
        # Option A: Return early WITHOUT calling call_next (skip downstream)
        if cached := self.cache.get(context.function.name):
            context.result = cached
            return  # Upstream post-processing still runs

        # Option B: Call call_next, then return normally
        await call_next(context)
        self.cache[context.function.name] = context.result
        return  # Normal completion
```

### 2. Raise `MiddlewareTermination`

Immediately exits the entire middleware chain. Upstream middleware's post-processing code is **skipped**.

```python
class BlockedFunctionMiddleware(FunctionMiddleware):
    async def process(self, context: FunctionInvocationContext, call_next):
        if context.function.name in self.blocked_functions:
            context.result = "Function blocked by policy"
            raise MiddlewareTermination("Blocked")  # Skips ALL post-processing
        await call_next(context)
```

### 3. Raise Any Other Exception

Bubbles up to the caller. The middleware chain is aborted and the exception propagates.

```python
class ValidationMiddleware(FunctionMiddleware):
    async def process(self, context: FunctionInvocationContext, call_next):
        if not self.is_valid(context.arguments):
            raise ValueError("Invalid arguments")  # Bubbles up to user
        await call_next(context)
```

## `return` vs `raise MiddlewareTermination`

The key difference is what happens to **upstream middleware's post-processing**:

```python
class MiddlewareA(AgentMiddleware):
    async def process(self, context, call_next):
        print("A: before")
        await call_next(context)
        print("A: after")  # Does this run?

class MiddlewareB(AgentMiddleware):
    async def process(self, context, call_next):
        print("B: before")
        context.result = "early result"
        # Choose one:
        return  # Option 1
        # raise MiddlewareTermination()  # Option 2
```

With middleware registered as `[MiddlewareA, MiddlewareB]`:

| Exit Method | Output |
|-------------|--------|
| `return` | `A: before` → `B: before` → `A: after` |
| `raise MiddlewareTermination` | `A: before` → `B: before` (no `A: after`) |

**Use `return`** when you want upstream middleware to still process the result (e.g., logging, metrics).

**Use `raise MiddlewareTermination`** when you want to completely bypass all remaining processing (e.g., blocking a request, returning cached response without any modification).

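The two exit paths compared above can be reproduced with a small stand-alone model. The sketch below is a toy dispatcher, not the framework's pipeline: `run_chain`, `make_chain`, `trace`, the dict-based context, and the locally defined `MiddlewareTermination` are all hypothetical stand-ins chosen for illustration. It assumes only that the pipeline nests `call_next` closures and catches the termination exception at the outermost level.

```python
import asyncio


class MiddlewareTermination(Exception):
    """Local stand-in for the framework's exception (hypothetical definition)."""


async def run_chain(middleware, context, final_handler):
    """Toy dispatcher: each middleware receives a call_next that runs the rest."""

    async def dispatch(index, ctx):
        if index == len(middleware):
            await final_handler(ctx)  # The "actual operation"
            return

        async def call_next(next_ctx):
            await dispatch(index + 1, next_ctx)

        await middleware[index](ctx, call_next)

    try:
        await dispatch(0, context)
    except MiddlewareTermination:
        pass  # Chain stops here; context["result"] keeps whatever was set
    return context


def make_chain(terminate):
    """Build [A, B]; B exits with `return` or by raising, depending on the flag."""

    async def middleware_a(ctx, call_next):
        ctx["log"].append("A: before")
        await call_next(ctx)
        ctx["log"].append("A: after")  # Skipped when B raises

    async def middleware_b(ctx, call_next):
        ctx["log"].append("B: before")
        ctx["result"] = "early result"
        if terminate:
            raise MiddlewareTermination()
        return  # Early return without call_next

    return [middleware_a, middleware_b]


async def final_handler(ctx):
    ctx["log"].append("operation")
    ctx["result"] = "real result"


def trace(terminate):
    """Run the chain once and report the call order and the final result."""
    ctx = {"log": [], "result": None}
    asyncio.run(run_chain(make_chain(terminate), ctx, final_handler))
    return ctx["log"], ctx["result"]


if __name__ == "__main__":
    print(trace(terminate=False))
    print(trace(terminate=True))
```

In both runs the final handler never executes because `middleware_b` does not call `call_next`; the only difference is whether `A: after` is recorded.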
## Calling `call_next()` or Not

The decision to call `call_next(context)` determines whether downstream middleware and the actual operation execute:

### Without calling `call_next()` - Skip downstream

```python
async def process(self, context, call_next):
    context.result = "replacement result"
    return  # Downstream middleware and actual execution are SKIPPED
```

- Downstream middleware: ❌ NOT executed
- Actual operation (LLM call, function invocation): ❌ NOT executed
- Upstream middleware post-processing: ✅ Still runs (unless `MiddlewareTermination` raised)
- Result: Whatever you set in `context.result`

### With calling `call_next()` - Full execution

```python
async def process(self, context, call_next):
    # Pre-processing
    await call_next(context)  # Execute downstream + actual operation
    # Post-processing (context.result now contains real result)
    return
```

- Downstream middleware: ✅ Executed
- Actual operation: ✅ Executed
- Upstream middleware post-processing: ✅ Runs
- Result: The actual result (possibly modified in post-processing)

### Summary Table

| Exit Method | Call `call_next()`? | Downstream Executes? | Actual Op Executes? | Upstream Post-Processing? |
|-------------|---------------------|----------------------|---------------------|---------------------------|
| `return` (or implicit) | Yes | ✅ | ✅ | ✅ Yes |
| `return` | No | ❌ | ❌ | ✅ Yes |
| `raise MiddlewareTermination` | No | ❌ | ❌ | ❌ No |
| `raise MiddlewareTermination` | Yes | ✅ | ✅ | ❌ No |
| `raise OtherException` | Either | Depends | Depends | ❌ No (exception propagates) |

> **Note:** The first row (`return` after calling `call_next()`) is the default behavior. Python functions implicitly return `None` at the end, so simply calling `await call_next(context)` without an explicit `return` statement achieves this pattern.

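The first two rows of the summary table can be checked by counting how often the underlying operation actually runs. This is a minimal sketch under the same assumption of nested `call_next` closures; `build`, `run_chain`, `demo`, and the dict-based context are hypothetical names for illustration, not framework APIs.

```python
import asyncio


def build(calls, cache):
    """Create a two-middleware chain plus the operation it wraps."""

    async def operation(ctx):
        calls["operation"] += 1  # Counts real executions
        return ctx["query"].upper()

    async def metrics(ctx, call_next):
        await call_next(ctx)
        calls["post"] += 1  # Upstream post-processing

    async def caching(ctx, call_next):
        if ctx["query"] in cache:
            ctx["result"] = cache[ctx["query"]]
            return  # No call_next: downstream and operation are skipped
        await call_next(ctx)  # Full execution path
        cache[ctx["query"]] = ctx["result"]

    return [metrics, caching], operation


async def run_chain(chain, operation, ctx):
    """Toy dispatcher: the operation runs only if every middleware calls call_next."""

    async def dispatch(index, current):
        if index == len(chain):
            current["result"] = await operation(current)
            return

        async def call_next(next_ctx):
            await dispatch(index + 1, next_ctx)

        await chain[index](current, call_next)

    await dispatch(0, ctx)


def demo():
    """Issue the same query twice and report results and counters."""
    calls = {"operation": 0, "post": 0}
    cache = {}
    chain, operation = build(calls, cache)
    results = []
    for _ in range(2):
        ctx = {"query": "weather", "result": None}
        asyncio.run(run_chain(chain, operation, ctx))
        results.append(ctx["result"])
    return results, calls


if __name__ == "__main__":
    print(demo())
```

The second request is served from the cache, so the operation counter stays at one while the upstream post-processing counter reaches two.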
## Streaming vs Non-Streaming

The `run()` method handles streaming and non-streaming differently:

### Non-Streaming (`stream=False`)

Returns `Awaitable[AgentResponse]`:

```python
async def _run_non_streaming():
    ctx = await self._prepare_run_context(...)  # Async preparation
    response = await self.chat_client.get_response(stream=False, ...)
    await self._finalize_response_and_update_thread(...)
    return AgentResponse(...)
```

### Streaming (`stream=True`)

Returns `ResponseStream[AgentResponseUpdate, AgentResponse]` **synchronously**:

```python
# Async preparation is deferred using ResponseStream.from_awaitable()
async def _get_stream():
    ctx = await self._prepare_run_context(...)  # Deferred until iteration
    return self.chat_client.get_response(stream=True, ...)

return (
    ResponseStream.from_awaitable(_get_stream())
    .map(
        transform=map_chat_to_agent_update,  # Transform each update
        finalizer=self._finalize_response_updates,  # Build final response
    )
    .with_result_hook(_post_hook)  # Post-processing after finalization
)
```

Key points:

- `ResponseStream.from_awaitable()` wraps an async function, deferring execution until the stream is consumed
- `.map()` transforms `ChatResponseUpdate` → `AgentResponseUpdate` and provides the finalizer
- `.with_result_hook()` runs after finalization (e.g., notify thread of new messages)

## See Also

- [Middleware Samples](../../getting_started/middleware/) - Examples of custom middleware
- [Function Tool Samples](../../getting_started/tools/) - Creating and using tools