Ben Thomas 8e54f0b0e7 Python: Shell tool with support for local and Docker (#5664)
* feat(tools): add cross-OS LocalShellTool in new agent-framework-tools package

Introduces a safe, cross-OS local shell tool as the first citizen of a new
agent-framework-tools workspace package. Supports persistent (default) and
stateless modes across pwsh/powershell.exe/bash/sh, with policy denylist,
allowlist, approval gating, process-tree kill on timeout, output truncation,
and audit hooks. Integrates with existing provider get_shell_tool(func=...)
factories via FunctionTool kind='shell'.

See docs/decisions/0026-builtin-tools-local-shell.md for the full design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): security hardening for LocalShellTool

Codifies what LocalShellTool does and does not defend against, and
delegates the security-relevant lifecycle primitive to a battle-tested
library instead of hand-rolled per-OS code.

Changes:

- Adopt psutil for cross-OS process-tree termination (executor + session).
  Replaces hand-rolled taskkill/killpg with one canonical implementation.
- Resolve taskkill.exe to absolute %SystemRoot%\System32 path so PATH
  poisoning cannot redirect us to an attacker-supplied binary.
- Reframe ShellPolicy docstring + ADR + README: denylist is a guardrail,
  not a security boundary.
- Require acknowledge_unsafe=True to set approval_mode='never_require',
  making the unsafe path explicitly opt-in with a self-documenting name.
- Add tests/test_security.py codifying named CVE-style cases. Defenses
  we DO claim are asserted; non-defenses (denylist bypasses via
  backslash insertion, variable expansion, interpreter escape, base64,
  alternative tools, PowerShell-native verbs) are documented as
  expected-to-pass tests so residual risk stays visible.
- Add Threat Model + Confidence Strategy sections to ADR 0026.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): add DockerShellTool sandboxed shell tier

Adds a container-backed shell executor as the recommended pattern for untrusted-input shell workflows. The container provides the security boundary (--network none, non-root user, --read-only, --cap-drop ALL, no-new-privileges, memory/pids limits, tmpfs /tmp), so approval gating is optional unlike LocalShellTool.

Also introduces a ShellExecutor Protocol so callers can plug in custom backends (Firecracker, SSH, WASI) without forking the framework.

Removes the planned HyperlightShellExecutor follow-up from ADR 0026: Hyperlight is a WASM code sandbox with no kernel/userland/shell binary, so a Hyperlight-backed shell is not viable. Docker is the realistic sandbox tier for shell.

Tests: 11 unit tests for argv builders + lifecycle (no Docker daemon required); 3 integration tests gated on is_docker_available().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): backport shell-tool fixes from .NET parity review

Applies the applicable subset of bug fixes accumulated during the
.NET shell-tool PR review (microsoft/agent-framework#5604) to the
Python shell tool.

A1 - Quote workdir safely in _maybe_reanchor

  Previously _tool.py used double-quote interpolation when emitting
  the cd/Set-Location prefix, which expanded $VAR, $(), and backticks
  in the workdir path. A workdir containing shell metacharacters could
  trigger arbitrary command execution before the user command ran.

  Replaced with single-quote escaping helpers _quote_posix and
  _quote_powershell that emit literal-string forms safe for both
  hosts.
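
The single-quote escaping described in A1 can be sketched roughly as below. This is an illustration only, not the code from _tool.py: the names quote_posix and quote_powershell and their bodies are assumptions based on the standard quoting rules of POSIX shells and PowerShell.

```python
def quote_posix(path: str) -> str:
    """Wrap path in single quotes for a POSIX shell (hypothetical helper)."""
    # Nothing is expanded inside '...'. A literal ' is written by closing the
    # quote, emitting an escaped quote, and reopening: ' becomes '\''
    return "'" + path.replace("'", "'\\''") + "'"


def quote_powershell(path: str) -> str:
    """Wrap path in single quotes for PowerShell (hypothetical helper)."""
    # Single-quoted PowerShell strings are literal; ' is escaped by doubling.
    return "'" + path.replace("'", "''") + "'"


# A workdir that would run a command if it were placed inside double quotes.
workdir = "/tmp/$(touch pwned)/it's"
print("cd " + quote_posix(workdir))
print("Set-Location -LiteralPath " + quote_powershell(workdir))
```

With double quotes the $(touch pwned) substitution would run before the user command; in the single-quoted forms it stays literal text. A production helper may need to cover more cases than this sketch does.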

A5/A6 - Consolidate truncation to a single byte-aware helper

  Extracted a shared truncate_head_tail / truncate_text_head_tail
  helper in _truncate.py. The new implementation distributes odd
  caps so head receives floor(cap/2) and tail receives ceil(cap/2)
  bytes, matching the .NET round-9 fix and ensuring no input bytes
  are silently dropped on the boundary.

  _session.py previously truncated by Python str length while the
  caller passed _max_output_bytes - the unit mismatch is now gone:
  raw byte buffers go through truncate_head_tail and decoded text
  goes through truncate_text_head_tail.
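
The odd-cap split described in A5/A6 can be sketched as follows. The function name matches the helper named above, but the signature, return shape, and body are assumptions; the real helper in _truncate.py may differ, for example by inserting a marker between head and tail. The cap <= 0 check mirrors the ValueError behaviour listed under the round-2 review fixes further down.

```python
def truncate_head_tail(data: bytes, cap: int) -> tuple[bytes, bool]:
    """Keep the first floor(cap/2) and last ceil(cap/2) bytes of data.

    Returns (possibly truncated bytes, truncated flag). Assumed shape only.
    """
    if cap <= 0:
        raise ValueError("cap must be a positive number of bytes")
    if len(data) <= cap:
        return data, False
    head_len = cap // 2          # floor(cap / 2)
    tail_len = cap - head_len    # ceil(cap / 2), so head + tail == cap
    return data[:head_len] + data[len(data) - tail_len:], True


print(truncate_head_tail(b"abcdefg", 5))
```

Computing the tail as cap - head_len rather than a second cap // 2 is what keeps an odd cap from losing one byte of budget.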

Unit tests added for the truncate and quote helpers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(tools): tone down narrative and overconfident comments in shell tool

The shell tool's docstrings and comments contained two patterns that
the .NET review pushed back on:

- Narrative framing about implementation history ("hard-won",
  "we sidestep", "design inspiration: ...", competitor framework
  name-drops in module docstrings).
- Overstated security guarantees ("battle-tested",
  "reasonable for untrusted input", "recommended executor for any
  agent that runs commands from untrusted input",
  "destructive commands are blocked", "safe local shell tool",
  "blocks shell injection").

Rewrites the affected docstrings and comments to describe what the
code does in neutral terms. Behaviour is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): add ShellEnvironmentProvider for the Python shell tool

Ports the .NET ShellEnvironmentProvider as a Python ContextProvider
so agents using LocalShellTool or DockerShellTool can be primed with
an accurate description of the shell they're talking to (family,
version, OS, working directory, and which CLIs are available).

The provider runs probes through any ShellExecutor, caches the
resulting snapshot, and on every before_run extends the session
instructions with a markdown block describing the shell idiom to
use. A failed first probe leaves the cache empty so the next call
retries (no permanent poisoning).
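
The caching behaviour described above (one shared probe for concurrent first callers, and no caching of a failed probe) can be sketched with an asyncio.Lock. CachedProbe and the probe callable are hypothetical names for illustration; the actual provider's internals are not shown in this commit message.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class CachedProbe:
    """Cache the result of an async probe after its first success."""

    def __init__(self, probe: Callable[[], Awaitable[str]]) -> None:
        self._probe = probe
        self._snapshot: Optional[str] = None
        self._lock = asyncio.Lock()

    async def get(self) -> str:
        if self._snapshot is not None:
            return self._snapshot
        async with self._lock:
            # Re-check under the lock: a concurrent caller may have filled it.
            if self._snapshot is None:
                # If the probe raises, nothing is stored, so the next call retries.
                self._snapshot = await self._probe()
            return self._snapshot


async def demo() -> tuple[int, list[str]]:
    calls = 0

    async def flaky_probe() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)
        if calls == 1:
            raise RuntimeError("first probe fails")
        return "bash 5.2"

    cached = CachedProbe(flaky_probe)
    try:
        await cached.get()
    except RuntimeError:
        pass  # the cache stays empty after a failure
    results = await asyncio.gather(cached.get(), cached.get())
    return calls, list(results)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the demo the probe runs twice in total: once for the failed attempt and once for the two concurrent callers, which share a single successful run.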

Probe failures from a narrow set of expected error types
(ShellCommandError, ShellExecutionError, ShellTimeoutError, and
asyncio.TimeoutError from the per-probe timeout) are recorded as
None fields in the snapshot. Other exceptions propagate. Tool
names are validated against ^[A-Za-z0-9._-]+$ before being
interpolated into a probe command.
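
A minimal sketch of the tool-name check, assuming the name is validated before being formatted into a probe string. validate_tool_name, build_probe, and the "command -v" probe format are hypothetical. The character class is the one quoted above; the sketch uses re.fullmatch instead of ^...$ because in Python "$" also matches just before a trailing newline.

```python
import re

# Same character class as the pattern quoted in the commit message.
_TOOL_NAME = re.compile(r"[A-Za-z0-9._-]+")


def validate_tool_name(name: str) -> str:
    """Return name unchanged if it is safe to interpolate, else raise."""
    # fullmatch requires the whole string to match, including any newline.
    if not _TOOL_NAME.fullmatch(name):
        raise ValueError(f"invalid tool name: {name!r}")
    return name


def build_probe(name: str) -> str:
    """Build a probe command (hypothetical format) for a validated name."""
    return f"command -v {validate_tool_name(name)}"


print(build_probe("python3.12"))
for bad in ("git; rm -rf /", "git\n", ""):
    try:
        build_probe(bad)
    except ValueError as exc:
        print("rejected:", exc)
```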

Includes 12 unit tests covering happy path, stderr fallback,
timeout handling, expected/unexpected exception paths, malicious
tool name rejection, case-insensitive deduplication, retry after
failure, concurrent first-callers sharing one probe, and the
default and custom formatter paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(tools): document ShellEnvironmentProvider and finish comment cleanup

Add a README section introducing ShellEnvironmentProvider, soften two remaining overconfident security-boundary comments in _executor_base.py and the DockerShellTool class docstring, and add a sample (shell_with_environment_provider.py) that demonstrates the provider in stateless and persistent modes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(tools): move shell samples to python/samples/02-agents/tools

The repository convention is to host samples under python/samples/ rather than inside the package directory. Move the two net-new shell samples (allow-list and environment-provider) to python/samples/02-agents/tools/ and drop the in-package samples/ directory; the existing top-level providers/openai/client_with_local_shell.py already covers the basic LocalShellTool walkthrough.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(tools): cover confine_workdir default and ShellResult.format_for_model

Two new tests in test_local_shell_tool.py exercise the default confine_workdir=True behaviour on POSIX and PowerShell, asserting that 'cd' inside one persistent-mode call does not leak into the next. A new test_shell_result.py module provides direct unit coverage for every conditional branch of ShellResult.format_for_model (stdout, truncated, stderr, timed_out, exit_code) so regressions in the LLM-facing format are caught immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): address PR #5664 review feedback

- _tool.py: detect PowerShell via is_powershell() helper instead of basename string match

- _environment.py: use public ContextProvider import (no private _ prefix)

- _session.py: trim _stdout_buf/_stderr_buf after copying to avoid unbounded retention across calls

- _docker.py: short-circuit start()/close() in stateless mode; add configurable shell kwarg (default bash, e.g. 'sh' for alpine)

- tests: parenthesized multi-line assert; alpine integration tests now pass shell='sh'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): satisfy CI quality gates

- pyupgrade: drop quoted self-class refs in __aenter__/method annotations

- ruff format: reflow long lines per workspace style

- pyright: assert psutil non-None in optional-import branch; lowercase mutable module globals; annotate _approval_mode as Literal so tool() Literal-typed kwarg is accepted; add ... body to ShellExecutor.run protocol; remove unused deprecated _kill_tree wrapper

- tests: skip docker integration tests on win32 (Windows containers don't support --read-only / alpine images)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove DEFAULT_DENYLIST; document single-session ownership; fix bandit findings

Mirrors the .NET PR #5604 cleanup:

- Remove DEFAULT_DENYLIST from ShellPolicy. ShellPolicy() now ships with an empty deny-list; operators opt into site-specific patterns explicitly. No major agent framework uses regex matching as a primary security control; AutoGen v2 removed theirs. Approval gating + sandbox tier remain the real boundaries.

- Rewrite module / class docstrings to frame ShellPolicy as a UX pre-filter, not a security control.

- Add Single-session ownership paragraphs to ShellExecutor, ShellSession, LocalShellTool, and DockerShellTool: a persistent-mode tool is owned by exactly one conversation / agent session; do not share across users or concurrent conversations.

- Tests now supply explicit deny patterns instead of relying on a default.

- Address Pre-commit Hooks (bandit) CI failures: convert internal-invariant asserts to explicit RuntimeError, annotate intentional subprocess/shell usage with # nosec, document container-internal /tmp paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #5664 round-2 review feedback

Deny-list documentation drift:

- README and the OpenAI/local-shell sample no longer claim a built-in deny-list of destructive commands. ShellPolicy is described as an optional, operator-supplied UX pre-filter; the real boundaries remain approval gating and the sandbox tier.

Behavioural fixes called out in review:

- ShellPolicy.evaluate() now denies empty / whitespace-only commands explicitly instead of returning allow with no rationale.

- truncate_head_tail() raises ValueError for cap <= 0 instead of silently returning the full input with truncated=False, which previously could defeat output-capping in callers that mis-configured the budget.

- LocalShellTool.as_function() / DockerShellTool.as_function() return the ShellCommandError text directly so the model sees a single, non-redundant 'Command rejected by policy: …' message instead of the prior duplicated 'Command blocked by policy: Command rejected …' wrapping.

- ShellSession POSIX sentinel trailer now snapshots and restores the prior errexit (set -e) state around the trailer, so a user 'set -e' in the persistent shell is no longer permanently disabled by the next run().

Tests:

- New test_shell_parse_rc.py covers the full _parse_rc() edge-case surface (zero, positive, negative, CRLF, no newline, missing prefix, empty input, non-digits, trailing garbage, partial digits).

- test_policy.py asserts the new empty-command deny.

- test_shell_truncate_and_quote.py asserts ValueError for cap=0 and cap<0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback for shell tool

- _resolve.py: reject empty/whitespace shell override string
- _tool.py / _docker.py: mode-aware default tool description (persistent vs stateless)
- _tool.py: fix misleading workdir docstring (re-anchor, not blocking)
- _types.py: emit stream-agnostic [output truncated] marker
- _policy.py: declare _denies/_allows as dataclass fields
- _environment.py: use $(pwd) instead of $PWD in POSIX probe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback: shell override flag + probe timeout safety

- _resolve.py: in stateless mode, ensure shell overrides end with -c/-Command so commands aren't misinterpreted as script-file paths.
- ShellExecutor.run / LocalShellTool.run / DockerShellTool.run now accept an optional timeout kwarg; ShellEnvironmentProvider drops the outer asyncio.wait_for and lets the executor enforce the probe timeout internally, so cancellation no longer risks leaving a hung subprocess or corrupted session.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review feedback: docker isolation + lifecycle robustness

- pyproject.toml: bump agent-framework-core minimum from 1.2.0 to 1.2.2 to align with the rest of the workspace.
- _docker.py: validate extra_run_args at construction time and reject flags that would dismantle the isolation defaults (--privileged, --cap-add, --security-opt, --network/--net, -v/--volume/--mount, --device, --pid, --ipc, --userns, --user, --read-only, --tmpfs, --add-host, --gpus, --cgroupns, --device-cgroup-rule); the docstring now documents this restriction.
- _docker._stop_container: retry docker rm -f once and log a warning/error when it does not succeed, so operators can audit leaked containers instead of getting a silent success.
- _docker._run_stateless timeout path: fall back to docker rm -f when docker kill fails or times out (--rm only reaps on clean exit), and log instead of silently swallowing communicate() errors.
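
The extra_run_args check in the list above might look roughly like this. The flag list is copied from the bullet; the function name, the handling of the --flag=value form, and the error message are assumptions. The sketch does not catch a short option with an attached value (for example -v/host:/ctr), which a complete validator would need to handle.

```python
from typing import Sequence

# Flags taken from the commit message; each would weaken the isolation defaults.
FORBIDDEN_FLAGS = frozenset({
    "--privileged", "--cap-add", "--security-opt", "--network", "--net",
    "-v", "--volume", "--mount", "--device", "--pid", "--ipc", "--userns",
    "--user", "--read-only", "--tmpfs", "--add-host", "--gpus",
    "--cgroupns", "--device-cgroup-rule",
})


def validate_extra_run_args(args: Sequence[str]) -> list[str]:
    """Reject docker-run flags that override isolation (hypothetical helper)."""
    for arg in args:
        # Treat "--network=host" and "--network host" the same way.
        flag = arg.split("=", 1)[0]
        if flag in FORBIDDEN_FLAGS:
            raise ValueError(f"extra_run_args may not contain {flag!r}")
    return list(args)


print(validate_extra_run_args(["--env", "FOO=bar"]))
try:
    validate_extra_run_args(["--network=host"])
except ValueError as exc:
    print("rejected:", exc)
```

Validating at construction time, as the bullet describes, means a misconfigured tool fails before any container is started.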

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: alliscode <bentho@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: alliscode <25218250+alliscode@users.noreply.github.com>
2026-05-22 00:29:59 +00:00


# Copyright (c) Microsoft. All rights reserved.

import os
import sys

import pytest
from agent_framework_tools.shell import LocalShellTool, ShellCommandError, ShellPolicy

pytestmark = pytest.mark.asyncio


async def test_stateless_echo() -> None:
    tool = LocalShellTool(mode="stateless", approval_mode="never_require", acknowledge_unsafe=True)
    cmd = "Write-Output hello" if sys.platform == "win32" else "echo hello"
    result = await tool.run(cmd)
    assert "hello" in result.stdout
    assert result.exit_code == 0
    assert result.timed_out is False


async def test_stateless_exit_code_propagates() -> None:
    tool = LocalShellTool(mode="stateless", approval_mode="never_require", acknowledge_unsafe=True)
    cmd = "exit 7" if sys.platform == "win32" else "sh -c 'exit 7'"
    result = await tool.run(cmd)
    assert result.exit_code == 7


async def test_stateless_timeout_kills_long_command() -> None:
    tool = LocalShellTool(mode="stateless", approval_mode="never_require", acknowledge_unsafe=True, timeout=0.5)
    cmd = "Start-Sleep -Seconds 5" if sys.platform == "win32" else "sleep 5"
    result = await tool.run(cmd)
    assert result.timed_out is True


async def test_policy_denies_before_execution() -> None:
    tool = LocalShellTool(
        mode="stateless",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        policy=ShellPolicy(denylist=[r"\brm\s+(?:-[a-zA-Z]*[rf][a-zA-Z]*\s+)+(?:/|~|\*)"]),
    )
    with pytest.raises(ShellCommandError):
        await tool.run("rm -rf /")


async def test_allowlist_narrows_to_approved_commands() -> None:
    tool = LocalShellTool(
        mode="stateless",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        policy=ShellPolicy(allowlist=[r"^echo\b", r"^Write-Output\b"]),
    )
    cmd = "Write-Output ok" if sys.platform == "win32" else "echo ok"
    result = await tool.run(cmd)
    assert "ok" in result.stdout
    with pytest.raises(ShellCommandError):
        await tool.run("ls -la")


async def test_audit_hook_fires_for_allowed_commands() -> None:
    seen: list[str] = []
    tool = LocalShellTool(
        mode="stateless",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        on_command=seen.append,
    )
    cmd = "Write-Output hi" if sys.platform == "win32" else "echo hi"
    await tool.run(cmd)
    assert seen == [cmd]


@pytest.mark.skipif(sys.platform == "win32", reason="persistent-mode sentinel on POSIX")
async def test_persistent_preserves_cwd_and_exports_across_calls(tmp_path: os.PathLike[str]) -> None:
    async with LocalShellTool(
        mode="persistent",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        workdir=str(tmp_path),
        confine_workdir=False,
    ) as tool:
        await tool.run("export AGENT_FRAMEWORK_TEST_MARKER=xyz")
        result = await tool.run("echo $AGENT_FRAMEWORK_TEST_MARKER")
        assert "xyz" in result.stdout

        subdir = os.path.join(str(tmp_path), "sub")
        os.mkdir(subdir)
        await tool.run(f"cd {subdir}")
        pwd = await tool.run("pwd")
        # subdir resolves to itself modulo symlinks
        assert os.path.realpath(pwd.stdout.strip()) == os.path.realpath(subdir)


@pytest.mark.skipif(sys.platform != "win32", reason="PowerShell-specific error handling")
async def test_persistent_powershell_propagates_cmdlet_error() -> None:
    """Cmdlet failures (not just native-process exits) should surface as non-zero rc."""
    async with LocalShellTool(mode="persistent", approval_mode="never_require", acknowledge_unsafe=True) as tool:
        # Get-Item on a missing path raises; $ErrorActionPreference='Stop' +
        # our catch block should map this to exit_code != 0.
        result = await tool.run("Get-Item C:\\this\\path\\does\\not\\exist\\for\\af")
        assert result.exit_code != 0
        assert result.stderr  # message surfaced


@pytest.mark.skipif(sys.platform != "win32", reason="PowerShell-specific encoding")
async def test_persistent_powershell_utf8_roundtrip() -> None:
    """Non-ASCII output should round-trip without mojibake."""
    async with LocalShellTool(mode="persistent", approval_mode="never_require", acknowledge_unsafe=True) as tool:
        result = await tool.run("Write-Output 'café'")
        assert "café" in result.stdout


async def test_concurrent_first_calls_do_not_spawn_two_sessions() -> None:
    """Regression: startup must be serialised so two concurrent first callers
    don't each spawn their own subprocess."""
    import asyncio as _asyncio

    tool = LocalShellTool(mode="persistent", approval_mode="never_require", acknowledge_unsafe=True)
    try:
        cmd = "Write-Output $PID" if sys.platform == "win32" else "echo $$"
        r1, r2 = await _asyncio.gather(tool.run(cmd), tool.run(cmd))
        assert r1.stdout.strip() == r2.stdout.strip(), (
            f"Different PIDs => multiple subprocesses spawned: {r1.stdout!r} vs {r2.stdout!r}"
        )
    finally:
        await tool.close()


@pytest.mark.skipif(sys.platform != "win32", reason="persistent-mode sentinel on PowerShell")
async def test_persistent_preserves_state_powershell(tmp_path: os.PathLike[str]) -> None:
    async with LocalShellTool(
        mode="persistent",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        workdir=str(tmp_path),
        confine_workdir=False,
    ) as tool:
        await tool.run("$env:AGENT_FRAMEWORK_TEST_MARKER = 'xyz'")
        result = await tool.run("Write-Output $env:AGENT_FRAMEWORK_TEST_MARKER")
        assert "xyz" in result.stdout
        r2 = await tool.run("$x = 42; Write-Output $x")
        assert "42" in r2.stdout


async def test_as_function_wires_kind_and_approval() -> None:
    tool = LocalShellTool(approval_mode="always_require")
    ft = tool.as_function(name="shell_exec")
    assert ft.name == "shell_exec"
    assert ft.kind == "shell"
    assert ft.approval_mode == "always_require"


@pytest.mark.skipif(sys.platform == "win32", reason="POSIX persistent reanchor test")
async def test_persistent_confines_workdir_by_default(tmp_path: os.PathLike[str]) -> None:
    """With the default ``confine_workdir=True``, a ``cd`` in one call
    must not leak into the next: each command is reanchored to ``workdir``."""
    subdir = os.path.join(str(tmp_path), "sub")
    os.mkdir(subdir)
    async with LocalShellTool(
        mode="persistent",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        workdir=str(tmp_path),
    ) as tool:
        await tool.run(f"cd {subdir}")
        pwd = await tool.run("pwd")
        assert os.path.realpath(pwd.stdout.strip()) == os.path.realpath(str(tmp_path))


@pytest.mark.skipif(sys.platform != "win32", reason="PowerShell persistent reanchor test")
async def test_persistent_confines_workdir_by_default_powershell(tmp_path: os.PathLike[str]) -> None:
    """PowerShell counterpart of the POSIX confinement check."""
    subdir = os.path.join(str(tmp_path), "sub")
    os.mkdir(subdir)
    async with LocalShellTool(
        mode="persistent",
        approval_mode="never_require",
        acknowledge_unsafe=True,
        workdir=str(tmp_path),
    ) as tool:
        await tool.run(f"Set-Location -LiteralPath '{subdir}'")
        pwd = await tool.run("(Get-Location).Path")
        assert os.path.realpath(pwd.stdout.strip()) == os.path.realpath(str(tmp_path))