mirror of https://github.com/microsoft/agent-framework.git synced 2026-06-16 21:04:09 +08:00


David Wu e5b63a1041 Python: Move evaluation folders to under evaluations (#2355)

* Move evaluation folders to under evaluations

* Change folder path

2025-11-20 20:50:23 +00:00


Self-Reflection Evaluation Sample

This sample demonstrates the self-reflection pattern using Agent Framework and Azure AI Foundry's Groundedness Evaluator. For details, see Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023).

Overview

What it demonstrates:

Iterative self-reflection loop that automatically improves responses based on groundedness evaluation
Batch processing of prompts from Parquet files with progress tracking
Using AzureOpenAIChatClient with Azure CLI authentication
Comprehensive summary statistics and detailed result tracking

Prerequisites

Azure Resources

Azure OpenAI: Deploy the models you plan to use (default: gpt-4.1 for both the agent and the judge)
Azure CLI: Run az login to authenticate

Python Environment

pip install agent-framework-core azure-ai-evaluation pandas --pre

Environment Variables

# .env file
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key  # Optional with Azure CLI
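The two variables above could be validated before any Azure call is made. The following is a minimal sketch, not the sample's actual code: the function name load_azure_settings and the returned dictionary shape are hypothetical, and it assumes the values have already been exported into the process environment (it does not parse the .env file itself).

```python
import os
from typing import Mapping, Optional


def load_azure_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the endpoint and optional API key from a mapping (defaults to os.environ).

    Hypothetical helper for illustration; the sample may load settings differently.
    """
    env = os.environ if env is None else env

    # The endpoint is required in every configuration.
    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AZURE_OPENAI_ENDPOINT is not set")

    # The key is optional: when it is absent, Azure CLI credentials are expected instead.
    api_key = env.get("AZURE_OPENAI_API_KEY") or None
    return {"endpoint": endpoint, "api_key": api_key}
```

Passing a plain dictionary instead of relying on os.environ makes the check easy to exercise without touching the real environment.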

Running the Sample

# Basic usage
python self_reflection.py

# With options
python self_reflection.py --input my_prompts.parquet \
                          --output results.parquet \
                          --max-reflections 5 \
                          -n 10

CLI Options:

--input, -i: Input Parquet file
--output, -o: Output Parquet file
--agent-model, -m: Agent model name (default: gpt-4.1)
--judge-model, -e: Evaluator (judge) model name (default: gpt-4.1)
--max-reflections: Maximum number of reflection iterations per prompt (default: 3)
--limit, -n: Process only the first N prompts
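For readers who want to see how these options fit together, here is an illustrative argparse parser that mirrors the flags and defaults listed above. It is a sketch under assumptions: the function name build_parser is hypothetical, the defaults for --input and --output are left unset because this README does not state them, and the real self_reflection.py may define its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(description="Self-reflection evaluation (illustrative)")
    # File defaults are not documented here, so they are left as None.
    parser.add_argument("--input", "-i", help="Input Parquet file")
    parser.add_argument("--output", "-o", help="Output Parquet file")
    parser.add_argument("--agent-model", "-m", default="gpt-4.1", help="Agent model name")
    parser.add_argument("--judge-model", "-e", default="gpt-4.1", help="Evaluator model name")
    parser.add_argument("--max-reflections", type=int, default=3, help="Max iterations")
    parser.add_argument("--limit", "-n", type=int, default=None, help="Process only first N prompts")
    return parser


# Parse the same options used in the "With options" example.
args = build_parser().parse_args(
    ["--input", "my_prompts.parquet", "--output", "results.parquet",
     "--max-reflections", "5", "-n", "10"]
)
```

Note that argparse converts dashes to underscores, so --max-reflections is read back as args.max_reflections.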

Understanding Results

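For each prompt, the agent iteratively improves its response:

Generate an initial response
Evaluate its groundedness (1-5 scale)
If the score is below 5, feed the evaluation back to the agent and retry
Stop when the maximum number of iterations is reached or a perfect score (5/5) is achieved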
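The steps above can be sketched as a small loop. This is a minimal illustration, not the sample's implementation: the reflect function, the generate and evaluate callables, and the feedback wording are all hypothetical stand-ins for the agent call and the groundedness evaluator. The loop also tracks the best-scoring iteration, matching the "best at iteration" figure in the sample's output.

```python
from typing import Callable, Optional, Tuple


def reflect(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical: (prompt, feedback) -> response
    evaluate: Callable[[str], int],                 # hypothetical: response -> score from 1 to 5
    max_reflections: int = 3,
) -> Tuple[str, int, int]:
    """Return (best_response, best_score, best_iteration)."""
    best_response, best_score, best_iteration = "", 0, 0
    feedback: Optional[str] = None

    for iteration in range(1, max_reflections + 1):
        response = generate(prompt, feedback)
        score = evaluate(response)

        # Remember the best attempt so far, not just the last one.
        if score > best_score:
            best_response, best_score, best_iteration = response, score, iteration

        # A perfect score ends the loop early.
        if score == 5:
            break

        # Otherwise, hand the evaluation back to the agent for the next attempt.
        feedback = (
            f"Groundedness score was {score}/5. "
            "Revise the answer so every claim is supported by the context."
        )

    return best_response, best_score, best_iteration


# Stub usage: the first draft scores 3, the second scores 5.
scores = iter([3, 5])
feedback_seen = []


def fake_generate(prompt: str, feedback: Optional[str]) -> str:
    feedback_seen.append(feedback)
    return f"draft {len(feedback_seen)}"


def fake_evaluate(response: str) -> int:
    return next(scores)


result = reflect("What is X?", fake_generate, fake_evaluate)
```

With these stubs the loop stops after the second iteration, so the third allowed attempt is never used.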

Example output:

[1/31] Processing prompt 0...
  Self-reflection iteration 1/3...
  Groundedness score: 3/5
  Self-reflection iteration 2/3...
  Groundedness score: 5/5
  ✓ Perfect groundedness score achieved!
  ✓ Completed with score: 5/5 (best at iteration 2/3)
