mirror of https://github.com/microsoft/agent-framework.git synced 2026-06-16 21:04:09 +08:00

Eduard van Valkenburg aab80d9ed9 Python: Fix Eval samples (#4033)

* fix red team sample

* Updated self-reflection

* fix for workflow eval sample

* fix test

2026-02-18 19:50:33 +00:00

2.6 KiB

Self-Reflection Evaluation Sample

This sample demonstrates the self-reflection pattern using Agent Framework and Azure AI Foundry's Groundedness Evaluator. For details, see Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023).

Overview

What it demonstrates:

Iterative self-reflection loop that automatically improves responses based on groundedness evaluation
Batch processing of prompts from JSONL files with progress tracking
Using AzureOpenAIResponsesClient with a Project Endpoint and Azure CLI authentication
Comprehensive summary statistics and detailed result tracking
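The batch-input step can be pictured with a small loader. This is a minimal sketch, not the sample's actual code: `load_prompts` is a hypothetical helper name, and it assumes only that the input file holds one JSON object per line, which is what the JSONL format implies.

```python
import json
from itertools import islice
from pathlib import Path


def load_prompts(path, limit=None):
    """Read one JSON object per non-empty line of a JSONL file.

    `limit` keeps only the first N records; None keeps all of them.
    (Hypothetical helper for illustration, not part of the sample.)
    """
    with Path(path).open(encoding="utf-8") as handle:
        # Blank lines are skipped so a trailing newline does not break parsing.
        records = (json.loads(line) for line in handle if line.strip())
        # list() consumes the generator while the file is still open.
        return list(islice(records, limit))
```

Reading lazily and cutting off with `islice` means a small limit never parses the rest of a large file.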

Prerequisites

Azure Resources

Azure OpenAI Responses in Foundry: Deploy models (default: gpt-5.2 for both agent and judge)
Azure CLI: Run az login to authenticate

Python Environment

pip install agent-framework-core pandas --pre

Environment Variables

AZURE_AI_PROJECT_ENDPOINT=https://<your-ai-resource>.services.ai.azure.com/api/projects/<your-ai-project>/
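A quick pre-flight check can catch a missing or malformed endpoint before any request is made. The sketch below is an assumption about what such a check might look like; `read_project_endpoint` is a hypothetical helper, and the sample itself may validate the variable differently or not at all.

```python
import os
from urllib.parse import urlparse


def read_project_endpoint(env=None):
    """Return AZURE_AI_PROJECT_ENDPOINT, or raise if it looks wrong.

    Hypothetical helper: the shape check mirrors the URL pattern shown
    above and is an assumption, not a rule enforced by the service.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_AI_PROJECT_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AZURE_AI_PROJECT_ENDPOINT is not set")
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or "/api/projects/" not in parsed.path:
        raise RuntimeError(
            "AZURE_AI_PROJECT_ENDPOINT should look like "
            "https://<resource>.services.ai.azure.com/api/projects/<project>/"
        )
    return endpoint
```

Passing a dictionary as `env` makes the check easy to exercise without touching the real environment.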

Running the Sample

# Basic usage
python self_reflection.py

# With options
python self_reflection.py --input my_prompts.jsonl \
                          --output results.jsonl \
                          --max-reflections 5 \
                          -n 10

CLI Options:

--input, -i: Input JSONL file
--output, -o: Output JSONL file
--agent-model, -m: Agent model name (default: gpt-4.1)
--judge-model, -e: Evaluator model name (default: gpt-4.1)
--max-reflections: Max iterations (default: 3)
--limit, -n: Process only first N prompts
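The options above map naturally onto an `argparse` parser. The following is an illustrative reconstruction, not the parser in `self_reflection.py`: flag names and the stated defaults come from the list above, while the `None` defaults for `--input`, `--output`, and `--limit` are placeholders because the README does not state them.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI options (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Self-reflection evaluation (illustrative parser)"
    )
    # Defaults for input/output are not documented; None is a placeholder.
    parser.add_argument("--input", "-i", default=None, help="Input JSONL file")
    parser.add_argument("--output", "-o", default=None, help="Output JSONL file")
    parser.add_argument("--agent-model", "-m", default="gpt-4.1",
                        help="Agent model name")
    parser.add_argument("--judge-model", "-e", default="gpt-4.1",
                        help="Evaluator model name")
    parser.add_argument("--max-reflections", type=int, default=3,
                        help="Max iterations")
    parser.add_argument("--limit", "-n", type=int, default=None,
                        help="Process only first N prompts")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

`argparse` converts dashes in long option names to underscores, so `--max-reflections` is read back as `args.max_reflections`.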

Understanding Results

The agent iteratively improves responses:

Generate initial response
Evaluate groundedness (1-5 scale)
If score < 5, provide feedback and retry
Stop at max iterations or perfect score (5/5)
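The four steps above can be sketched as a plain loop. This is a simplified illustration under stated assumptions: `generate` and `evaluate` are hypothetical callables standing in for the agent call and the groundedness evaluator, and the feedback string format is invented for the example. The real sample wires these to Agent Framework and Foundry.

```python
from dataclasses import dataclass


@dataclass
class ReflectionResult:
    best_response: str
    best_score: int
    best_iteration: int   # 1-based iteration that produced the best score
    iterations_run: int   # how many iterations actually executed


def self_reflect(generate, evaluate, prompt, max_reflections=3, perfect_score=5):
    """Run the generate/evaluate/retry loop and keep the best response.

    generate(prompt, feedback) -> str          (hypothetical agent call)
    evaluate(prompt, response) -> (int, str)   (hypothetical judge: score, reasoning)
    """
    if max_reflections < 1:
        raise ValueError("max_reflections must be at least 1")
    feedback = None
    best = None
    for iteration in range(1, max_reflections + 1):
        response = generate(prompt, feedback)
        score, reasoning = evaluate(prompt, response)
        # Keep the highest-scoring response; earlier iterations win ties.
        if best is None or score > best.best_score:
            best = ReflectionResult(response, score, iteration, iteration)
        best.iterations_run = iteration
        if score >= perfect_score:
            break  # perfect score reached, no further reflection needed
        # Feed the judge's reasoning back into the next attempt.
        feedback = f"Groundedness score {score}/{perfect_score}. {reasoning}"
    return best
```

Tracking the best response separately from the latest one matters because a later iteration can score lower than an earlier one.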

Example output:

[1/31] Processing prompt 0...
  Self-reflection iteration 1/3...
  Groundedness score: 3/5
  Self-reflection iteration 2/3...
  Groundedness score: 5/5
  ✓ Perfect groundedness score achieved!
  ✓ Completed with score: 5/5 (best at iteration 2/3)
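Per-prompt outcomes like the one above can be rolled up into summary statistics. The sketch below uses only the standard library and assumes each result is a dictionary with `best_score` and `best_iteration` keys; those field names are hypothetical and may not match the sample's output schema.

```python
from statistics import mean


def summarize(results, perfect_score=5):
    """Aggregate per-prompt results into a few headline numbers.

    Each result is assumed to be a dict with hypothetical keys
    'best_score' and 'best_iteration'.
    """
    if not results:
        return {"count": 0}
    scores = [r["best_score"] for r in results]
    return {
        "count": len(results),
        "mean_score": mean(scores),
        # Number of prompts that reached the top of the scale.
        "perfect": sum(s >= perfect_score for s in scores),
        "mean_best_iteration": mean(r["best_iteration"] for r in results),
    }
```

The mean best iteration gives a rough sense of how often reflection was needed at all: a value close to 1 means most prompts peaked on the first attempt.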

In the Foundry UI, under Build/Evaluations you can view detailed results for each prompt, including:

Context
Query
Response
Groundedness scores and reasoning for each iteration of each prompt