mirror of
https://github.com/microsoft/agent-framework.git
synced 2026-06-16 21:04:09 +08:00
Python: Use AI Foundry evaluators for self-reflection (#2250)
* First working version * Simplify the implementations * Remove unused env var * Update Python syntax * Address feedbacks * Fix a typo * Update names as review suggestions * Citation for self-reflection * Move to independent folder * Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/README.md Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com> * Updated from parquet to JSONL and hide the default environment variables * As review feedback, remove the purpose of using `run_self_reflection_batch` as a library, only use it as sample code * Update python/samples/getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com> --------- Co-authored-by: Eduard van Valkenburg <eavanvalkenburg@users.noreply.github.com>
This commit is contained in:
committed by
GitHub
Unverified
parent
92df9e14bf
commit
b3e96b80ae
@@ -185,6 +185,7 @@ This directory contains samples demonstrating the capabilities of Microsoft Agen
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| [`getting_started/evaluation/azure_ai_foundry/red_team_agent_sample.py`](./getting_started/evaluation/azure_ai_foundry/red_team_agent_sample.py) | Red team agent evaluation sample for Azure AI Foundry |
|
||||
| [`getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py`](./getting_started/evaluation/azure_ai_foundry/evaluation/self_reflection.py) | LLM self-reflection with AI Foundry graders example |
|
||||
|
||||
## MCP (Model Context Protocol)
|
||||
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
AZURE_OPENAI_ENDPOINT="..."
|
||||
AZURE_OPENAI_API_KEY="..."
|
||||
@@ -0,0 +1,75 @@
|
||||
# Self-Reflection Evaluation Sample
|
||||
|
||||
This sample demonstrates the self-reflection pattern using Agent Framework and Azure AI Foundry's Groundedness Evaluator. For details, see [Reflexion: Language Agents with Verbal Reinforcement Learning](https://arxiv.org/abs/2303.11366) (NeurIPS 2023).
|
||||
|
||||
## Overview
|
||||
|
||||
**What it demonstrates:**
|
||||
- Iterative self-reflection loop that automatically improves responses based on groundedness evaluation
|
||||
- Batch processing of prompts from Parquet files with progress tracking
|
||||
- Using `AzureOpenAIChatClient` with Azure CLI authentication
|
||||
- Comprehensive summary statistics and detailed result tracking
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Azure Resources
|
||||
- **Azure OpenAI**: Deploy models (default: gpt-4.1 for both agent and judge)
|
||||
- **Azure CLI**: Run `az login` to authenticate
|
||||
|
||||
### Python Environment
|
||||
```bash
|
||||
pip install agent-framework-core azure-ai-evaluation pandas --pre
|
||||
```
|
||||
|
||||
### Environment Variables
|
||||
```bash
|
||||
# .env file
|
||||
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
|
||||
AZURE_OPENAI_API_KEY=your-api-key # Optional with Azure CLI
|
||||
```
|
||||
|
||||
## Running the Sample
|
||||
|
||||
```bash
|
||||
# Basic usage
|
||||
python self_reflection.py
|
||||
|
||||
# With options
|
||||
python self_reflection.py --input my_prompts.parquet \
|
||||
--output results.parquet \
|
||||
--max-reflections 5 \
|
||||
-n 10
|
||||
```
|
||||
|
||||
**CLI Options:**
|
||||
- `--input`, `-i`: Input parquet file
|
||||
- `--output`, `-o`: Output parquet file
|
||||
- `--agent-model`, `-m`: Agent model name (default: gpt-4.1)
|
||||
- `--judge-model`, `-e`: Evaluator model name (default: gpt-4.1)
|
||||
- `--max-reflections`: Max iterations (default: 3)
|
||||
- `--limit`, `-n`: Process only first N prompts
|
||||
|
||||
## Understanding Results
|
||||
|
||||
The agent iteratively improves responses:
|
||||
1. Generate initial response
|
||||
2. Evaluate groundedness (1-5 scale)
|
||||
3. If score < 5, provide feedback and retry
|
||||
4. Stop at max iterations or perfect score (5/5)
|
||||
|
||||
**Example output:**
|
||||
```
|
||||
[1/31] Processing prompt 0...
|
||||
Self-reflection iteration 1/3...
|
||||
Groundedness score: 3/5
|
||||
Self-reflection iteration 2/3...
|
||||
Groundedness score: 5/5
|
||||
✓ Perfect groundedness score achieved!
|
||||
✓ Completed with score: 5/5 (best at iteration 2/3)
|
||||
```
|
||||
|
||||
## Related Resources
|
||||
|
||||
- [Reflexion Paper](https://arxiv.org/abs/2303.11366)
|
||||
- [Azure AI Evaluation SDK](https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk)
|
||||
- [Agent Framework](https://github.com/microsoft/agent-framework)
|
||||
+31
File diff suppressed because one or more lines are too long
+381
@@ -0,0 +1,381 @@
|
||||
"""
|
||||
Self-Reflection LLM Runner
|
||||
|
||||
Reflexion: language agents with verbal reinforcement learning.
|
||||
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.
|
||||
In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 377, 8634–8652.
|
||||
https://arxiv.org/abs/2303.11366
|
||||
|
||||
This module implements a self-reflection loop for LLM responses using groundedness evaluation.
|
||||
It loads prompts from a JSONL file, runs them through an LLM with self-reflection,
|
||||
and saves the results.
|
||||
|
||||
|
||||
Usage as CLI:
|
||||
python self_reflection.py
|
||||
|
||||
Usage as CLI with extra options:
|
||||
python self_reflection.py --input resources/suboptimal_groundedness_prompts.jsonl \\
|
||||
--output resources/results.jsonl \\
|
||||
--max-reflections 3 \\
|
||||
-n 10 # Optional: process only first 10 prompts
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import os
|
||||
import time
|
||||
import argparse
|
||||
import pandas as pd
|
||||
from typing import Dict, Any, Optional
|
||||
from dotenv import load_dotenv
|
||||
|
||||
from agent_framework import ChatAgent, ChatMessage
|
||||
from agent_framework.azure import AzureOpenAIChatClient
|
||||
from azure.identity import AzureCliCredential
|
||||
from azure.ai.evaluation import GroundednessEvaluator, AzureOpenAIModelConfiguration
|
||||
|
||||
|
||||
DEFAULT_AGENT_MODEL = "gpt-4.1"
|
||||
DEFAULT_JUDGE_MODEL = "gpt-4.1"
|
||||
|
||||
|
||||
def create_groundedness_evaluator(judge_model: str) -> GroundednessEvaluator:
|
||||
"""
|
||||
Create a groundedness evaluator.
|
||||
|
||||
Args:
|
||||
judge_model: Model deployment name for evaluation
|
||||
Returns:
|
||||
Configured GroundednessEvaluator
|
||||
"""
|
||||
judge_model_config = AzureOpenAIModelConfiguration(
|
||||
azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
|
||||
api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
|
||||
api_version="2024-12-01-preview",
|
||||
azure_deployment=judge_model,
|
||||
)
|
||||
return GroundednessEvaluator(model_config=judge_model_config)
|
||||
|
||||
|
||||
async def execute_query_with_self_reflection(
|
||||
*,
|
||||
agent: ChatAgent,
|
||||
full_user_query: str,
|
||||
context: str,
|
||||
evaluator: GroundednessEvaluator,
|
||||
max_self_reflections: int = 3,
|
||||
) -> dict[str, Any]:
|
||||
"""
|
||||
Execute a query with self-reflection loop.
|
||||
|
||||
Args:
|
||||
agent: ChatAgent instance to use for generating responses
|
||||
full_user_query: Complete prompt including system prompt, user request, and context
|
||||
context: Context document for groundedness evaluation
|
||||
evaluator: Groundedness evaluator function
|
||||
max_self_reflections: Maximum number of self-reflection iterations
|
||||
|
||||
Returns:
|
||||
Dictionary containing:
|
||||
- best_response: The best response achieved
|
||||
- best_response_score: Best groundedness score
|
||||
- best_iteration: Iteration number where best score was achieved
|
||||
- iteration_scores: List of groundedness scores for each iteration
|
||||
- messages: Full conversation history
|
||||
- usage_metadata: Token usage information
|
||||
- num_retries: Number of iterations performed
|
||||
- total_groundedness_eval_time: Time spent on evaluations (seconds)
|
||||
- total_end_to_end_time: Total execution time (seconds)
|
||||
"""
|
||||
messages = [ChatMessage(role="user", text=full_user_query)]
|
||||
|
||||
best_score = 0
|
||||
max_score = 5
|
||||
best_response = None
|
||||
best_iteration = 0
|
||||
raw_response = None
|
||||
total_groundedness_eval_time = 0.0
|
||||
start_time = time.time()
|
||||
iteration_scores = [] # Store all iteration scores in structured format
|
||||
|
||||
for i in range(max_self_reflections):
|
||||
print(f" Self-reflection iteration {i+1}/{max_self_reflections}...")
|
||||
|
||||
raw_response = await agent.run(messages=messages)
|
||||
agent_response = raw_response.text
|
||||
|
||||
# Evaluate groundedness
|
||||
start_time_eval = time.time()
|
||||
groundedness_res = evaluator(
|
||||
query=full_user_query,
|
||||
response=agent_response,
|
||||
context=context
|
||||
)
|
||||
end_time_eval = time.time()
|
||||
total_groundedness_eval_time += (end_time_eval - start_time_eval)
|
||||
|
||||
feedback = groundedness_res['groundedness_reason']
|
||||
score = int(groundedness_res['groundedness'])
|
||||
|
||||
# Store score in structured format
|
||||
iteration_scores.append(score)
|
||||
|
||||
# Show groundedness score
|
||||
print(f" Groundedness score: {score}/{max_score}")
|
||||
|
||||
# Update best response if improved
|
||||
if score > best_score:
|
||||
if best_score > 0:
|
||||
print(f" ✓ Score improved from {best_score} to {score}/{max_score}")
|
||||
best_score = score
|
||||
best_response = agent_response
|
||||
best_iteration = i + 1
|
||||
if score == max_score:
|
||||
print(f" ✓ Perfect groundedness score achieved!")
|
||||
break
|
||||
else:
|
||||
print(f" → No improvement (score: {score}/{max_score}). Trying again...")
|
||||
|
||||
# Add to conversation history
|
||||
messages.append(ChatMessage(role="assistant", text=agent_response))
|
||||
|
||||
# Request improvement
|
||||
reflection_prompt = (
|
||||
f"The groundedness score of your response is {score}/{max_score}. "
|
||||
f"Explanation for score: [{feedback}]. "
|
||||
f"Reflect on your answer and improve it to get the maximum score of {max_score} "
|
||||
f"considering the explanation. Now please provide an updated response, taking into "
|
||||
f"account the feedback, but make your answer sound as if it was your first response. "
|
||||
f"Don't refer to the feedback in your answer."
|
||||
)
|
||||
messages.append(ChatMessage(role="user", text=reflection_prompt))
|
||||
|
||||
end_time = time.time()
|
||||
latency = end_time - start_time
|
||||
|
||||
# Handle edge case where no response improved the score
|
||||
if best_response is None and raw_response is not None and len(raw_response.messages) > 0:
|
||||
best_response = raw_response.messages[0].text
|
||||
best_iteration = i + 1
|
||||
|
||||
return {
|
||||
"best_response": best_response,
|
||||
"best_response_score": best_score,
|
||||
"best_iteration": best_iteration,
|
||||
"iteration_scores": iteration_scores, # Structured list of all scores
|
||||
"messages": [message.to_json() for message in messages],
|
||||
"num_retries": i + 1,
|
||||
"total_groundedness_eval_time": total_groundedness_eval_time,
|
||||
"total_end_to_end_time": latency,
|
||||
}
|
||||
|
||||
|
||||
async def run_self_reflection_batch(
|
||||
input_file: str,
|
||||
output_file: str,
|
||||
agent_model: str = DEFAULT_AGENT_MODEL,
|
||||
judge_model: str = DEFAULT_JUDGE_MODEL,
|
||||
max_self_reflections: int = 3,
|
||||
env_file: str | None = None,
|
||||
limit: int | None = None
|
||||
):
|
||||
"""
|
||||
Run self-reflection on a batch of prompts.
|
||||
|
||||
Args:
|
||||
input_file: Path to input JSONL file with prompts
|
||||
output_file: Path to save output JSONL file
|
||||
agent_model: Model to use for generating responses
|
||||
judge_model: Model to use for groundedness evaluation
|
||||
max_self_reflections: Maximum number of self-reflection iterations
|
||||
env_file: Optional path to .env file
|
||||
limit: Optional limit to process only the first N prompts
|
||||
"""
|
||||
# Load environment variables
|
||||
if env_file and os.path.exists(env_file):
|
||||
load_dotenv(env_file, override=True)
|
||||
else:
|
||||
load_dotenv(override=True)
|
||||
|
||||
# Create agent, it loads environment variables AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT automatically
|
||||
agent = AzureOpenAIChatClient(
|
||||
credential=AzureCliCredential(),
|
||||
deployment_name=agent_model,
|
||||
).create_agent(
|
||||
instructions="You are a helpful agent.",
|
||||
)
|
||||
|
||||
# Load input data
|
||||
print(f"Loading prompts from: {input_file}")
|
||||
df = pd.read_json(input_file, lines=True)
|
||||
print(f"Loaded {len(df)} prompts")
|
||||
|
||||
# Apply limit if specified
|
||||
if limit is not None and limit > 0:
|
||||
df = df.head(limit)
|
||||
print(f"Processing first {len(df)} prompts (limited by -n {limit})")
|
||||
|
||||
# Validate required columns
|
||||
required_columns = ['system_instruction', 'user_request', 'context_document',
|
||||
'full_prompt', 'domain', 'type', 'high_level_type']
|
||||
missing_columns = [col for col in required_columns if col not in df.columns]
|
||||
if missing_columns:
|
||||
raise ValueError(f"Input file missing required columns: {missing_columns}")
|
||||
|
||||
# Configure clients
|
||||
print(f"Configuring Azure OpenAI client...")
|
||||
|
||||
print(f"Creating groundedness evaluator with model: {judge_model}")
|
||||
evaluator = create_groundedness_evaluator(judge_model)
|
||||
|
||||
# Process each prompt
|
||||
print(f"Max self-reflections: {max_self_reflections}\n")
|
||||
|
||||
results = []
|
||||
for counter, (idx, row) in enumerate(df.iterrows(), start=1):
|
||||
print(f"[{counter}/{len(df)}] Processing prompt {row.get('original_index', idx)}...")
|
||||
|
||||
try:
|
||||
result = await execute_query_with_self_reflection(
|
||||
agent=agent,
|
||||
full_user_query=row['full_prompt'],
|
||||
context=row['context_document'],
|
||||
evaluator=evaluator,
|
||||
max_self_reflections=max_self_reflections,
|
||||
)
|
||||
|
||||
# Prepare result data
|
||||
result_data = {
|
||||
"original_index": row.get('original_index', idx),
|
||||
"domain": row['domain'],
|
||||
"question_type": row['type'],
|
||||
"high_level_type": row['high_level_type'],
|
||||
"full_prompt": row['full_prompt'],
|
||||
"system_prompt": row['system_instruction'],
|
||||
"user_request": row['user_request'],
|
||||
"context_document": row['context_document'],
|
||||
"agent_response_model": agent_model,
|
||||
"agent_response": result,
|
||||
"error": None,
|
||||
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
|
||||
}
|
||||
results.append(result_data)
|
||||
|
||||
print(f" ✓ Completed with score: {result['best_response_score']}/5 "
|
||||
f"(best at iteration {result['best_iteration']}/{result['num_retries']}, "
|
||||
f"time: {result['total_end_to_end_time']:.1f}s)\n")
|
||||
|
||||
except Exception as e:
|
||||
print(f" ✗ Error: {str(e)}\n")
|
||||
|
||||
# Save error information
|
||||
error_data = {
|
||||
"original_index": row.get('original_index', idx),
|
||||
"domain": row['domain'],
|
||||
"question_type": row['type'],
|
||||
"high_level_type": row['high_level_type'],
|
||||
"full_prompt": row['full_prompt'],
|
||||
"system_prompt": row['system_instruction'],
|
||||
"user_request": row['user_request'],
|
||||
"context_document": row['context_document'],
|
||||
"agent_response_model": agent_model,
|
||||
"agent_response": None,
|
||||
"error": str(e),
|
||||
"timestamp": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
|
||||
}
|
||||
results.append(error_data)
|
||||
continue
|
||||
|
||||
# Create DataFrame and save
|
||||
results_df = pd.DataFrame(results)
|
||||
|
||||
print(f"\nSaving results to: {output_file}")
|
||||
results_df.to_json(output_file, orient='records', lines=True)
|
||||
|
||||
# Generate detailed summary
|
||||
successful_runs = results_df[results_df['error'].isna()]
|
||||
failed_runs = results_df[results_df['error'].notna()]
|
||||
|
||||
print("\n" + "="*60)
|
||||
print("SUMMARY")
|
||||
print("="*60)
|
||||
print(f"Total prompts processed: {len(results_df)}")
|
||||
print(f" ✓ Successful: {len(successful_runs)}")
|
||||
print(f" ✗ Failed: {len(failed_runs)}")
|
||||
|
||||
if len(successful_runs) > 0:
|
||||
# Extract scores and iteration data from nested agent_response dict
|
||||
best_scores = [r['best_response_score'] for r in successful_runs['agent_response'] if r is not None]
|
||||
iterations = [r['best_iteration'] for r in successful_runs['agent_response'] if r is not None]
|
||||
iteration_scores_list = [r['iteration_scores'] for r in successful_runs['agent_response'] if r is not None and 'iteration_scores' in r]
|
||||
|
||||
if best_scores:
|
||||
avg_score = sum(best_scores) / len(best_scores)
|
||||
perfect_scores = sum(1 for s in best_scores if s == 5)
|
||||
print(f"\nGroundedness Scores:")
|
||||
print(f" Average best score: {avg_score:.2f}/5")
|
||||
print(f" Perfect scores (5/5): {perfect_scores}/{len(best_scores)} ({100*perfect_scores/len(best_scores):.1f}%)")
|
||||
|
||||
# Calculate improvement metrics
|
||||
if iteration_scores_list:
|
||||
first_scores = [scores[0] for scores in iteration_scores_list if len(scores) > 0]
|
||||
last_scores = [scores[-1] for scores in iteration_scores_list if len(scores) > 0]
|
||||
improvements = [last - first for first, last in zip(first_scores, last_scores)]
|
||||
improved_count = sum(1 for imp in improvements if imp > 0)
|
||||
|
||||
if first_scores and last_scores:
|
||||
avg_first_score = sum(first_scores) / len(first_scores)
|
||||
avg_last_score = sum(last_scores) / len(last_scores)
|
||||
avg_improvement = sum(improvements) / len(improvements)
|
||||
|
||||
print(f"\nImprovement Analysis:")
|
||||
print(f" Average first score: {avg_first_score:.2f}/5")
|
||||
print(f" Average final score: {avg_last_score:.2f}/5")
|
||||
print(f" Average improvement: +{avg_improvement:.2f}")
|
||||
print(f" Responses that improved: {improved_count}/{len(improvements)} ({100*improved_count/len(improvements):.1f}%)")
|
||||
|
||||
# Show iteration statistics
|
||||
if iterations:
|
||||
avg_iteration = sum(iterations) / len(iterations)
|
||||
first_try = sum(1 for it in iterations if it == 1)
|
||||
print(f"\nIteration Statistics:")
|
||||
print(f" Average best iteration: {avg_iteration:.2f}")
|
||||
print(f" Best on first try: {first_try}/{len(iterations)} ({100*first_try/len(iterations):.1f}%)")
|
||||
|
||||
print("="*60)
|
||||
|
||||
|
||||
async def main():
|
||||
"""CLI entry point."""
|
||||
parser = argparse.ArgumentParser(description="Run self-reflection loop on LLM prompts with groundedness evaluation")
|
||||
parser.add_argument('--input', '-i', default="resources/suboptimal_groundedness_prompts.jsonl", help='Input JSONL file with prompts')
|
||||
parser.add_argument('--output', '-o', default="resources/results.jsonl", help='Output JSONL file for results')
|
||||
parser.add_argument('--agent-model', '-m', default=DEFAULT_AGENT_MODEL, help=f'Agent model deployment name (default: {DEFAULT_AGENT_MODEL})')
|
||||
parser.add_argument('--judge-model', '-e', default=DEFAULT_JUDGE_MODEL, help=f'Judge model deployment name (default: {DEFAULT_JUDGE_MODEL})')
|
||||
parser.add_argument('--max-reflections', type=int, default=3, help='Maximum number of self-reflection iterations (default: 3)')
|
||||
parser.add_argument('--env-file', help='Path to .env file with Azure OpenAI credentials')
|
||||
parser.add_argument('--limit', '-n', type=int, default=None, help='Process only the first N prompts from the input file')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Run the batch processing
|
||||
try:
|
||||
await run_self_reflection_batch(
|
||||
input_file=args.input,
|
||||
output_file=args.output,
|
||||
agent_model=args.agent_model,
|
||||
judge_model=args.judge_model,
|
||||
max_self_reflections=args.max_reflections,
|
||||
env_file=args.env_file,
|
||||
limit=args.limit
|
||||
)
|
||||
print("\n✓ Processing complete!")
|
||||
|
||||
except Exception as e:
|
||||
print(f"\n✗ Error: {str(e)}")
|
||||
return 1
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
exit(asyncio.run(main()))
|
||||
Reference in New Issue
Block a user