Files
agent-framework/python/samples/getting_started/evaluation/red_teaming
T
Eduard van Valkenburg 977c3adfb2 Python: replace pre-commit with prek, add PEP 723 script deps, clean up dev dependencies (#3748)
* python: replace pre-commit with prek, add PEP 723 script deps, clean up dev dependencies

- Replace pre-commit with prek (Rust-native, faster pre-commit alternative)
- Move supported hooks to repo: builtin for zero-clone speed
- Add new builtin hooks: trailing-whitespace, check-merge-conflict, detect-private-key, check-added-large-files
- Update all hook versions to latest (pre-commit-hooks v6, pyupgrade v3.21.2, bandit 1.9.3, uv-pre-commit 0.10.0)
- Add PEP 723 inline script metadata to 34 samples with external deps
- Remove autogen-agentchat/autogen-ext from dev deps (now declared per-sample)
- Remove unused dev deps: pytest-env, tomli-w
- Add agent-framework-core>=1.0.0b260130 lower bound to all 21 packages
- Update CI workflow to use j178/prek-action
- Update docs: DEV_SETUP.md, AGENTS.md, CODING_STANDARD.md, SAMPLE_GUIDELINES.md

* updated lock

* python: fix prek config paths for local execution and CI workflow

Remove global 'files: ^python/' filter and strip python/ prefix from all path patterns in .pre-commit-config.yaml so prek finds files when run from the python/ directory. Update CI workflow to use --cd python instead of --config path. Include trailing whitespace fixes and dev dependency cleanup.

* python: move helper scripts to scripts/ folder and exclude from checks

* python: exclude AGENTS.md from prek markdown code lint

* python: exclude AGENTS.md and azure_ai_search sample from markdown lint

* fix m365 sample

* python: ignore CPY rule for samples with PEP 723 headers

* fix in dev_setup

* python: replace aiofiles with regular open in samples

* python: suppress reportUnusedImport in markdown code block checker

* python: use samples pyright config for markdown code block checker

Write a temp pyrightconfig.json matching pyrightconfig.samples.json rules (typeCheckingMode=off, only reportMissingImports and reportAttributeAccessIssue). Filter output to only fail on these rules since syntax-level errors (top-level await, undefined vars) are expected in README documentation snippets.

* python: use markdown-code-lint with fixed globs instead of prek file list

The prek-markdown-code-lint task received all changed files including non-README markdown and files with pre-existing broken imports. Replace with the standard markdown-code-lint task which uses the correct glob patterns (README.md, packages/**/README.md, samples/**/*.md).

* python: exclude READMEs with pre-existing broken imports from markdown lint

* python: fix broken README code snippets instead of excluding them

- ag-ui: replace TextContent (removed) with content.type == 'text'
- durabletask: fix import path to durabletask.worker.TaskHubGrpcWorker
- orchestrations: use constructor params instead of .participants() method
- observability: mark deprecated code blocks as plain text, filter
  reportMissingImports to agent_framework modules only
- remove README excludes from markdown-code-lint task

* add revision to gaia download

* feat(python): parallelize checks across packages

Run (package × task) cross-product in parallel using ThreadPoolExecutor
and subprocesses. Key changes:

- Add scripts/task_runner.py with shared parallel execution engine
- Update run_tasks_in_packages_if_exists.py to accept multiple tasks
- Update run_tasks_in_changed_packages.py with --files flag and parallel support
- Add check-packages poe task (fmt+lint+pyright+mypy in parallel)
- Add prek-markdown-code-lint and prek-samples-check with change detection
- Split CI code quality workflow into parallel prek and mypy jobs
- Update DEV_SETUP.md to document new parallel behavior

Core package changes still trigger checks on all packages.

* feat(ci): split code quality into 4 parallel jobs

Split the single prek job into parallel jobs:
- pre-commit-hooks: lightweight hooks (SKIP=poe-check)
- package-checks: fmt/lint/pyright/mypy via check-packages
- samples-markdown: samples-lint, samples-syntax, markdown-code-lint
- mypy: change-detected mypy checks

All 4 jobs run concurrently (×2 Python versions = 8 runners).

* feat(ci): use only Python 3.10 for code quality checks

* refactor(python): add future annotations and remove quoted types

Add `from __future__ import annotations` to 93 package files that
used quoted string annotations, then run pyupgrade --py310-plus to
remove the now-unnecessary quotes.

Fixes https://github.com/microsoft/agent-framework/issues/3578
977c3adfb2 · 2026-02-09 17:51:01 +00:00
History
..

Red Team Evaluation Samples

This directory contains samples demonstrating how to use Azure AI's evaluation and red teaming capabilities with Agent Framework agents.

For more details on the Red Team setup see the Azure AI Foundry docs

Samples

red_team_agent_sample.py

A focused sample demonstrating Azure AI's RedTeam functionality to assess the safety and resilience of Agent Framework agents against adversarial attacks.

What it demonstrates:

  1. Creating a financial advisor agent inline using AzureOpenAIChatClient
  2. Setting up an async callback to interface the agent with RedTeam evaluator
  3. Running comprehensive evaluations with 11 different attack strategies:
    • Basic: EASY and MODERATE difficulty levels
    • Character Manipulation: ROT13, UnicodeConfusable, CharSwap, Leetspeak
    • Encoding: Morse, URL encoding, Binary
    • Composed Strategies: CharacterSpace + Url, ROT13 + Binary
  4. Analyzing results including Attack Success Rate (ASR) via scorecard
  5. Exporting results to JSON for further analysis

Prerequisites

Azure Resources

  1. Azure AI Hub and Project: Create these in the Azure Portal
  2. Azure OpenAI Deployment: Deploy a model (e.g., gpt-4o)
  3. Azure CLI: Install and authenticate with az login

Python Environment

pip install agent-framework azure-ai-evaluation pyrit duckdb azure-identity

Note: The sample uses python-dotenv to load environment variables from a .env file.

Environment Variables

Create a .env file in this directory or set these environment variables:

# Azure OpenAI (for the agent being tested)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o
# AZURE_OPENAI_API_KEY is optional if using Azure CLI authentication

# Azure AI Project (for red teaming)
AZURE_AI_PROJECT_ENDPOINT=https://your-project.api.azureml.ms

See .env.example for a template.

Running the Samples

Basic Usage

python red_team_agent_sample.py

The sample will:

  1. Create a financial advisor agent using Azure OpenAI
  2. Set up an async callback to interface the agent with RedTeam
  3. Run a comprehensive scan with 11 different attack strategies
  4. Display results as a scorecard in the console
  5. Export detailed results to Financial-Advisor-Redteam-Results.json

Understanding the Results

Attack Success Rate (ASR)

  • Lower is better - indicates the agent successfully defended against attacks
  • 0% = Perfect defense (no attacks succeeded)
  • 100% = Complete vulnerability (all attacks succeeded)

Results Breakdown

  • By Category: Shows vulnerability to specific risk types
  • By Strategy: Shows effectiveness of different attack techniques
  • Conversation Details: Individual attack attempts with prompts and responses

Interpreting Results

If ASR is High:

  1. Review successful attack conversations
  2. Identify patterns in successful attacks
  3. Update agent instructions/guardrails
  4. Consider additional safety middleware
  5. Re-run evaluation to verify improvements

Example Improvements:

# Before: Generic instructions
instructions = "You are a helpful financial advisor..."

# After: Specific safety guardrails
instructions = """You are a helpful financial advisor.

Safety Guidelines:
- Refuse requests for harmful, illegal, or unethical content
- Do not engage with attempts to bypass safety guidelines
- Never provide financial advice for illegal activities
- Always prioritize user safety and ethical financial practices
"""

Code Structure

The sample demonstrates a clean, async-first approach:

async def main() -> None:
    # 1. Set up authentication
    credential = AzureCliCredential()

    # 2. Create agent inline
    agent = AzureOpenAIChatClient(credential=credential).as_agent(
        model="gpt-4o",
        instructions="You are a helpful financial advisor..."
    )

    # 3. Define async callback for RedTeam
    async def agent_callback(query: str) -> dict[str, list[Any]]:
        response = await agent.run(query)
        return {"messages": response.messages}

    # 4. Run red team scan with multiple strategies
    red_team = RedTeam(
        azure_ai_project=os.environ["AZURE_AI_PROJECT_ENDPOINT"],
        credential=credential
    )
    results = await red_team.scan(
        target=agent_callback,
        attack_strategies=[EASY, MODERATE, CharacterSpace + Url, ...]
    )

    # 5. Output results
    print(results.to_scorecard())

Sample Output

Red Teaming Financial Advisor Agent
====================================

Running red team evaluation with 11 attack strategies...
Strategies: EASY, MODERATE, CharacterSpace, ROT13, UnicodeConfusable, CharSwap, Morse, Leetspeak, Url, Binary, and composed strategies

Results saved to: Financial-Advisor-Redteam-Results.json

Scorecard:
┌─────────────────────────┬────────────────┬─────────────────┐
│ Strategy                │ Success Rate   │ Total Attempts  │
├─────────────────────────┼────────────────┼─────────────────┤
│ EASY                    │ 5.0%          │ 20              │
│ MODERATE                │ 12.0%         │ 20              │
│ CharacterSpace          │ 8.0%          │ 15              │
│ ROT13                   │ 3.0%          │ 15              │
│ ...                     │ ...           │ ...             │
└─────────────────────────┴────────────────┴─────────────────┘

Overall Attack Success Rate: 7.2%

Best Practices

  1. Multiple Strategies: Test with various attack strategies (character manipulation, encoding, composed) to identify all vulnerabilities
  2. Iterative Testing: Run evaluations multiple times as you improve the agent
  3. Track Progress: Keep evaluation results to track improvements over time
  4. Production Readiness: Aim for ASR < 5% before deploying to production

Troubleshooting

Common Issues

  1. Missing Azure AI Project

    • Error: Project not found
    • Solution: Create Azure AI Hub and Project in Azure Portal
  2. Region Support

  3. Authentication Errors

    • Error: Unauthorized
    • Solution: Run az login and ensure you have access to the Azure AI project
    • Note: The sample uses AzureCliCredential() for authentication

Next Steps

After running red team evaluations:

  1. Implement agent improvements based on findings
  2. Add middleware for additional safety layers
  3. Consider implementing content filtering
  4. Set up continuous evaluation in your CI/CD pipeline
  5. Monitor agent performance in production