agent-framework/python/packages/tools/tests/test_shell_environment_provider.py
Ben Thomas 8e54f0b0e7 Python: Shell tool with support for local and Docker (#5664)
* feat(tools): add cross-OS LocalShellTool in new agent-framework-tools package

Introduces a safe, cross-OS local shell tool as the first citizen of a new
agent-framework-tools workspace package. Supports persistent (default) and
stateless modes across pwsh/powershell.exe/bash/sh, with policy denylist,
allowlist, approval gating, process-tree kill on timeout, output truncation,
and audit hooks. Integrates with existing provider get_shell_tool(func=...)
factories via FunctionTool kind='shell'.

See docs/decisions/0026-builtin-tools-local-shell.md for the full design.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): security hardening for LocalShellTool

Codifies what LocalShellTool does and does not defend against, and
delegates the security-relevant lifecycle primitive to a battle-tested
library instead of hand-rolled per-OS code.

Changes:

- Adopt psutil for cross-OS process-tree termination (executor + session).
  Replaces hand-rolled taskkill/killpg with one canonical implementation.

- Resolve taskkill.exe to absolute %SystemRoot%\System32 path so PATH
  poisoning cannot redirect us to an attacker-supplied binary.

- Reframe ShellPolicy docstring + ADR + README: denylist is a guardrail,
  not a security boundary.

- Require acknowledge_unsafe=True to set approval_mode='never_require',
  making the unsafe path explicitly opt-in with a self-documenting name.

- Add tests/test_security.py codifying named CVE-style cases. Defenses
  we DO claim are asserted; non-defenses (denylist bypasses via
  backslash insertion, variable expansion, interpreter escape, base64,
  alternative tools, PowerShell-native verbs) are documented as
  expected-to-pass tests so residual risk stays visible.

- Add Threat Model + Confidence Strategy sections to ADR 0026.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): add DockerShellTool sandboxed shell tier

Adds a container-backed shell executor as the recommended pattern for untrusted-input shell workflows. The container provides the security boundary (--network none, non-root user, --read-only, --cap-drop ALL, no-new-privileges, memory/pids limits, tmpfs /tmp), so approval gating is optional unlike LocalShellTool.

Also introduces a ShellExecutor Protocol so callers can plug in custom backends (Firecracker, SSH, WASI) without forking the framework.

Removes the planned HyperlightShellExecutor follow-up from ADR 0026: Hyperlight is a WASM code sandbox with no kernel/userland/shell binary, so a Hyperlight-backed shell is not viable. Docker is the realistic sandbox tier for shell.

Tests: 11 unit tests for argv builders + lifecycle (no Docker daemon required); 3 integration tests gated on is_docker_available().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): backport shell-tool fixes from .NET parity review

Applies the applicable subset of bug fixes accumulated during the
.NET shell-tool PR review (microsoft/agent-framework#5604) to the
Python shell tool.

A1 - Quote workdir safely in _maybe_reanchor

  Previously _tool.py used double-quote interpolation when emitting
  the cd/Set-Location prefix, which expanded $VAR, $(), and backticks
  in the workdir path. A workdir containing shell metacharacters could
  trigger arbitrary command execution before the user command ran.

  Replaced with single-quote escaping helpers _quote_posix and
  _quote_powershell that emit literal-string forms safe for both
  hosts.
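
The single-quote escaping described in A1 can be sketched roughly as follows. This is a minimal illustration of the standard literal-string quoting rules for POSIX shells and PowerShell, not the actual _quote_posix / _quote_powershell code; the function names and the cd / Set-Location prefixes shown are assumptions.

```python
def quote_posix_sketch(value: str) -> str:
    """Wrap value in single quotes for a POSIX shell (hypothetical helper)."""
    # Nothing is expanded inside single quotes. A literal ' is written by
    # closing the quote, emitting an escaped quote, and reopening: '\''
    return "'" + value.replace("'", "'\\''") + "'"


def quote_powershell_sketch(value: str) -> str:
    """Wrap value in single quotes for PowerShell (hypothetical helper)."""
    # In a PowerShell single-quoted string, ' is escaped by doubling it.
    return "'" + value.replace("'", "''") + "'"


# A workdir that would run a command under double-quote interpolation.
workdir = "/tmp/$(touch pwned)/it's"
print("cd -- " + quote_posix_sketch(workdir))
print("Set-Location -LiteralPath " + quote_powershell_sketch(workdir))
```

With either helper the $(...) sequence reaches the shell as plain text, so the workdir cannot trigger command substitution.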

A5/A6 - Consolidate truncation to a single byte-aware helper

  Extracted a shared truncate_head_tail / truncate_text_head_tail
  helper in _truncate.py. The new implementation distributes odd
  caps so head receives floor(cap/2) and tail receives ceil(cap/2)
  bytes, matching the .NET round-9 fix and ensuring no input bytes
  are silently dropped on the boundary.

  _session.py previously truncated by Python str length while the
  caller passed _max_output_bytes - the unit mismatch is now gone:
  raw byte buffers go through truncate_head_tail and decoded text
  goes through truncate_text_head_tail.
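
The floor/ceil split for odd caps can be illustrated with a short sketch. It assumes a bytes-in, (bytes, truncated) out shape and omits any truncation marker; the real truncate_head_tail signature and marker handling may differ. The ValueError for a non-positive cap follows the round-2 review fix described later in this commit message.

```python
def truncate_head_tail_sketch(data: bytes, cap: int) -> tuple[bytes, bool]:
    """Keep the first floor(cap/2) and last ceil(cap/2) bytes (hypothetical helper)."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    if len(data) <= cap:
        return data, False
    head_len = cap // 2        # floor(cap / 2)
    tail_len = cap - head_len  # ceil(cap / 2), always >= 1 because cap >= 1
    return data[:head_len] + data[-tail_len:], True


print(truncate_head_tail_sketch(b"abcdefg", 5))
```

Because head_len + tail_len always equals cap, an odd cap is used in full rather than losing one byte at the boundary.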

Unit tests added for the truncate and quote helpers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(tools): tone down narrative and overconfident comments in shell tool

The shell tool's docstrings and comments contained two patterns that
the .NET review pushed back on:

- Narrative framing about implementation history ("hard-won",
  "we sidestep", "design inspiration: ...", competitor framework
  name-drops in module docstrings).
- Overstated security guarantees ("battle-tested",
  "reasonable for untrusted input", "recommended executor for any
  agent that runs commands from untrusted input",
  "destructive commands are blocked", "safe local shell tool",
  "blocks shell injection").

Rewrites the affected docstrings and comments to describe what the
code does in neutral terms. Behaviour is unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(tools): add ShellEnvironmentProvider for the Python shell tool

Ports the .NET ShellEnvironmentProvider as a Python ContextProvider
so agents using LocalShellTool or DockerShellTool can be primed with
an accurate description of the shell they're talking to (family,
version, OS, working directory, and which CLIs are available).

The provider runs probes through any ShellExecutor, caches the
resulting snapshot, and on every before_run extends the session
instructions with a markdown block describing the shell idiom to
use. A failed first probe leaves the cache empty so the next call
retries (no permanent poisoning).
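
One way to get the cache-with-retry behaviour described above is a lock-guarded lazy cache. The sketch below is an assumption about the general pattern, not the provider's actual code; CachedProbe and its probe callable are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable


class CachedProbe:
    """Cache the first successful probe result (hypothetical helper)."""

    def __init__(self, probe: Callable[[], Awaitable[dict]]) -> None:
        self._probe = probe
        self._snapshot: dict | None = None
        self._lock = asyncio.Lock()

    async def get(self) -> dict:
        if self._snapshot is not None:
            return self._snapshot
        async with self._lock:
            # Re-check under the lock: a concurrent caller may have finished.
            if self._snapshot is None:
                # If the probe raises, nothing is stored, so the next call retries.
                self._snapshot = await self._probe()
            return self._snapshot


async def demo() -> tuple[int, bool]:
    calls = 0

    async def probe() -> dict:
        nonlocal calls
        calls += 1
        if calls == 1:
            raise RuntimeError("transient")
        await asyncio.sleep(0)
        return {"shell_version": "5"}

    cache = CachedProbe(probe)
    try:
        await cache.get()  # first probe fails and leaves the cache empty
    except RuntimeError:
        pass
    a, b = await asyncio.gather(cache.get(), cache.get())
    return calls, a is b


print(asyncio.run(demo()))
```

The two concurrent callers after the failure share one probe run and receive the same snapshot object.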

Probe failures from a narrow set of expected error types
(ShellCommandError, ShellExecutionError, ShellTimeoutError, and
asyncio.TimeoutError from the per-probe timeout) are recorded as
None fields in the snapshot. Other exceptions propagate. Tool
names are validated against ^[A-Za-z0-9._-]+$ before being
interpolated into a probe command.
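
The tool-name check can be sketched as below. The character class comes from the pattern quoted above and the "<name> --version" probe shape matches the commands used in the unit tests; the helper names are hypothetical. The sketch uses re.fullmatch because in Python a bare $ also matches just before a trailing newline.

```python
from __future__ import annotations

import re

_TOOL_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_tool_name(name: str) -> bool:
    """Return True only if every character is in the allowed set."""
    return _TOOL_NAME.fullmatch(name) is not None


def build_version_probe(name: str) -> str | None:
    """Build the probe command, or None when the name is rejected."""
    return f"{name} --version" if is_safe_tool_name(name) else None


for candidate in ("git", "git; rm -rf /", ""):
    print(repr(candidate), "->", build_version_probe(candidate))
```

Rejected names never reach the executor, which is what test_invalid_tool_name_is_rejected_before_probing asserts.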

Includes 12 unit tests covering happy path, stderr fallback,
timeout handling, expected/unexpected exception paths, malicious
tool name rejection, case-insensitive deduplication, retry after
failure, concurrent first-callers sharing one probe, and the
default and custom formatter paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(tools): document ShellEnvironmentProvider and finish comment cleanup

Add a README section introducing ShellEnvironmentProvider, soften two remaining overconfident security-boundary comments in _executor_base.py and the DockerShellTool class docstring, and add a sample (shell_with_environment_provider.py) that demonstrates the provider in stateless and persistent modes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(tools): move shell samples to python/samples/02-agents/tools

The repository convention is to host samples under python/samples/ rather than inside the package directory. Move the two net-new shell samples (allow-list and environment-provider) to python/samples/02-agents/tools/ and drop the in-package samples/ directory; the existing top-level providers/openai/client_with_local_shell.py already covers the basic LocalShellTool walkthrough.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(tools): cover confine_workdir default and ShellResult.format_for_model

Two new tests in test_local_shell_tool.py exercise the default confine_workdir=True behaviour on POSIX and PowerShell, asserting that 'cd' inside one persistent-mode call does not leak into the next. A new test_shell_result.py module provides direct unit coverage for every conditional branch of ShellResult.format_for_model (stdout, truncated, stderr, timed_out, exit_code) so regressions in the LLM-facing format are caught immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): address PR #5664 review feedback

- _tool.py: detect PowerShell via is_powershell() helper instead of basename string match

- _environment.py: use public ContextProvider import (no private _ prefix)

- _session.py: trim _stdout_buf/_stderr_buf after copying to avoid unbounded retention across calls

- _docker.py: short-circuit start()/close() in stateless mode; add configurable shell kwarg (default bash, e.g. 'sh' for alpine)

- tests: parenthesized multi-line assert; alpine integration tests now pass shell='sh'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(tools): satisfy CI quality gates

- pyupgrade: drop quoted self-class refs in __aenter__/method annotations

- ruff format: reflow long lines per workspace style

- pyright: assert psutil non-None in optional-import branch; lowercase mutable module globals; annotate _approval_mode as Literal so tool() Literal-typed kwarg is accepted; add ... body to ShellExecutor.run protocol; remove unused deprecated _kill_tree wrapper

- tests: skip docker integration tests on win32 (Windows containers don't support --read-only / alpine images)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove DEFAULT_DENYLIST; document single-session ownership; fix bandit findings

Mirrors the .NET PR #5604 cleanup:

- Remove DEFAULT_DENYLIST from ShellPolicy. ShellPolicy() now ships with an empty deny-list; operators opt into site-specific patterns explicitly. No major agent framework uses regex matching as a primary security control; AutoGen v2 removed theirs. Approval gating + sandbox tier remain the real boundaries.

- Rewrite module / class docstrings to frame ShellPolicy as a UX pre-filter, not a security control.

- Add Single-session ownership paragraphs to ShellExecutor, ShellSession, LocalShellTool, and DockerShellTool: a persistent-mode tool is owned by exactly one conversation / agent session; do not share across users or concurrent conversations.

- Tests now supply explicit deny patterns instead of relying on a default.

- Address Pre-commit Hooks (bandit) CI failures: convert internal-invariant asserts to explicit RuntimeError, annotate intentional subprocess/shell usage with # nosec, document container-internal /tmp paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR #5664 round-2 review feedback

Deny-list documentation drift:

- README and the OpenAI/local-shell sample no longer claim a built-in deny-list of destructive commands. ShellPolicy is described as an optional, operator-supplied UX pre-filter; the real boundaries remain approval gating and the sandbox tier.

Behavioural fixes called out in review:

- ShellPolicy.evaluate() now denies empty / whitespace-only commands explicitly instead of returning allow with no rationale.

- truncate_head_tail() raises ValueError for cap <= 0 instead of silently returning the full input with truncated=False, which previously could defeat output-capping in callers that mis-configured the budget.

- LocalShellTool.as_function() / DockerShellTool.as_function() return the ShellCommandError text directly so the model sees a single, non-redundant 'Command rejected by policy: …' message instead of the prior duplicated 'Command blocked by policy: Command rejected …' wrapping.

- ShellSession POSIX sentinel trailer now snapshots and restores the prior errexit (set -e) state around the trailer, so a user 'set -e' in the persistent shell is no longer permanently disabled by the next run().

Tests:

- New test_shell_parse_rc.py covers the full _parse_rc() edge-case surface (zero, positive, negative, CRLF, no newline, missing prefix, empty input, non-digits, trailing garbage, partial digits).

- test_policy.py asserts the new empty-command deny.

- test_shell_truncate_and_quote.py asserts ValueError for cap=0 and cap<0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback for shell tool

- _resolve.py: reject empty/whitespace shell override string
- _tool.py / _docker.py: mode-aware default tool description (persistent vs stateless)
- _tool.py: fix misleading workdir docstring (re-anchor, not blocking)
- _types.py: emit stream-agnostic [output truncated] marker
- _policy.py: declare _denies/_allows as dataclass fields
- _environment.py: use $(pwd) instead of $PWD in POSIX probe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address PR review feedback: shell override flag + probe timeout safety

- _resolve.py: in stateless mode, ensure shell overrides end with -c/-Command so commands aren't misinterpreted as script-file paths.
- ShellExecutor.run / LocalShellTool.run / DockerShellTool.run now accept an optional timeout kwarg; ShellEnvironmentProvider drops the outer asyncio.wait_for and lets the executor enforce the probe timeout internally, so cancellation no longer risks leaving a hung subprocess or corrupted session.
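
The shell-override check from the first bullet might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and whether _resolve.py appends the flag or rejects the override is not stated here, so this version appends it.

```python
def ensure_command_flag(argv: list[str], *, powershell: bool) -> list[str]:
    """Return argv ending in -c (POSIX) or -Command (PowerShell)."""
    flag = "-Command" if powershell else "-c"
    # PowerShell parameter names are case-insensitive; POSIX -c is not.
    last = argv[-1] if argv else ""
    matches = last.lower() == flag.lower() if powershell else last == flag
    return list(argv) if matches else [*argv, flag]


# Without -c, bash would treat the command string as a script path.
print(ensure_command_flag(["bash", "--noprofile"], powershell=False))
print(ensure_command_flag(["pwsh", "-NoProfile", "-command"], powershell=True))
```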

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Address review feedback: docker isolation + lifecycle robustness

- pyproject.toml: bump agent-framework-core minimum from 1.2.0 to 1.2.2 to align with the rest of the workspace.
- _docker.py: validate extra_run_args at construction time and reject flags that would dismantle the isolation defaults (--privileged, --cap-add, --security-opt, --network/--net, -v/--volume/--mount, --device, --pid, --ipc, --userns, --user, --read-only, --tmpfs, --add-host, --gpus, --cgroupns, --device-cgroup-rule). The docstring now carries a matching warning.
- _docker._stop_container: retry docker rm -f once and log a warning/error when it does not succeed, so operators can audit leaked containers instead of getting a silent success.
- _docker._run_stateless timeout path: fall back to docker rm -f when docker kill fails or times out (--rm only reaps on clean exit), and log instead of silently swallowing communicate() errors.
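
The extra_run_args validation can be sketched as a construction-time flag check. The flag list is taken from the bullet above; the function name, error message, and matching rules are assumptions. In particular this sketch only handles the "--flag value" and "--flag=value" spellings, not a short flag with an attached value such as -v/host:/ctr.

```python
FORBIDDEN_FLAGS = frozenset({
    "--privileged", "--cap-add", "--security-opt", "--network", "--net",
    "-v", "--volume", "--mount", "--device", "--pid", "--ipc", "--userns",
    "--user", "--read-only", "--tmpfs", "--add-host", "--gpus",
    "--cgroupns", "--device-cgroup-rule",
})


def validate_extra_run_args(args: list[str]) -> list[str]:
    """Raise ValueError if any argument is an isolation-weakening flag."""
    for arg in args:
        # Normalise "--network=host" to "--network" before the lookup.
        flag = arg.split("=", 1)[0]
        if flag in FORBIDDEN_FLAGS:
            raise ValueError(f"extra_run_args may not contain {flag!r}")
    return list(args)


print(validate_extra_run_args(["--env", "FOO=1"]))
```

Rejecting at construction time means a misconfigured tool fails before any container is started.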

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: alliscode <bentho@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: alliscode <25218250+alliscode@users.noreply.github.com>
2026-05-22 00:29:59 +00:00

356 lines
11 KiB
Python

# Copyright (c) Microsoft. All rights reserved.

"""Unit tests for :class:`ShellEnvironmentProvider`."""

from __future__ import annotations

import asyncio
from typing import Any

import pytest
from agent_framework_tools.shell import (
    ShellCommandError,
    ShellEnvironmentProvider,
    ShellEnvironmentProviderOptions,
    ShellExecutionError,
    ShellFamily,
    ShellResult,
    default_instructions_formatter,
)

pytestmark = pytest.mark.asyncio


class _FakeExecutor:
    """In-memory ShellExecutor stub. Maps command-prefix -> response."""

    def __init__(self, responses: dict[str, ShellResult | Exception | float]) -> None:
        self._responses = responses
        self.start_calls = 0
        self.run_calls: list[str] = []

    async def start(self) -> None:
        self.start_calls += 1

    async def close(self) -> None: ...

    async def __aenter__(self) -> _FakeExecutor:
        await self.start()
        return self

    async def __aexit__(self, *_: object) -> None:
        await self.close()

    async def run(self, command: str, *, timeout: float | None = None) -> ShellResult:
        self.run_calls.append(command)
        for prefix, response in self._responses.items():
            if command.startswith(prefix) or prefix in command:
                if isinstance(response, Exception):
                    raise response
                if isinstance(response, (int, float)):
                    # Honor timeout in the fake the same way a real executor
                    # is required to: stop sleeping when timeout elapses and
                    # report a timed-out result rather than blocking forever.
                    sleep_for = float(response)
                    if timeout is not None and sleep_for > timeout:
                        await asyncio.sleep(timeout)
                        return ShellResult(
                            stdout="",
                            stderr="",
                            exit_code=124,
                            duration_ms=0,
                            timed_out=True,
                        )
                    await asyncio.sleep(sleep_for)
                    return ShellResult(stdout="", stderr="", exit_code=0, duration_ms=0)
                return response
        return ShellResult(stdout="", stderr="", exit_code=127, duration_ms=0)


def _ok(stdout: str = "", stderr: str = "", exit_code: int = 0) -> ShellResult:
    return ShellResult(stdout=stdout, stderr=stderr, exit_code=exit_code, duration_ms=1)


async def test_probe_collects_shell_version_cwd_and_tools() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5.2.21\nCWD=/repo\n"),
        "git --version": _ok(stdout="git version 2.40.0\n"),
        "node --version": _ok(stdout="v20.11.1\n"),
    })
    options = ShellEnvironmentProviderOptions(
        probe_tools=("git", "node", "missing-tool"),
        override_family=ShellFamily.POSIX,
    )
    provider = ShellEnvironmentProvider(executor, options)
    snapshot = await provider.refresh()
    assert snapshot.family is ShellFamily.POSIX
    assert snapshot.shell_version == "5.2.21"
    assert snapshot.working_directory == "/repo"
    assert snapshot.tool_versions["git"] == "git version 2.40.0"
    assert snapshot.tool_versions["node"] == "v20.11.1"
    assert snapshot.tool_versions["missing-tool"] is None
    assert executor.start_calls >= 1


async def test_probe_falls_back_to_stderr_for_version_when_stdout_empty() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=unknown\nCWD=/x\n"),
        "java --version": _ok(stdout="", stderr="openjdk 21 2024-09-17\n"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("java",),
            override_family=ShellFamily.POSIX,
        ),
    )
    snapshot = await provider.refresh()
    assert snapshot.tool_versions["java"] == "openjdk 21 2024-09-17"
    assert snapshot.shell_version is None  # "unknown" is normalised away


async def test_probe_timeout_yields_none_field_not_exception() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5.0\nCWD=/r\n"),
        "git --version": 5.0,  # sleeps 5s, probe_timeout below is 0.05s
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("git",),
            override_family=ShellFamily.POSIX,
            probe_timeout=0.05,
        ),
    )
    snapshot = await provider.refresh()
    assert snapshot.tool_versions["git"] is None


async def test_probe_swallows_expected_executor_failures() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5\nCWD=/r\n"),
        "git --version": ShellCommandError("blocked"),
        "node --version": ShellExecutionError("spawn failed"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("git", "node"),
            override_family=ShellFamily.POSIX,
        ),
    )
    snapshot = await provider.refresh()
    assert snapshot.tool_versions == {"git": None, "node": None}


async def test_unexpected_exception_propagates() -> None:
    class Boom(RuntimeError): ...

    executor = _FakeExecutor({"echo": Boom("kaboom")})
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=(),
            override_family=ShellFamily.POSIX,
        ),
    )
    with pytest.raises(Boom):
        await provider.refresh()


async def test_invalid_tool_name_is_rejected_before_probing() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5\nCWD=/r\n"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("git; rm -rf /", "good", ""),
            override_family=ShellFamily.POSIX,
        ),
    )
    snapshot = await provider.refresh()
    assert snapshot.tool_versions["git; rm -rf /"] is None
    # Verify no probe command was actually issued for the malicious entry.
    assert not any("git; rm -rf /" in c for c in executor.run_calls)


async def test_duplicate_tools_are_deduplicated_case_insensitively() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5\nCWD=/r\n"),
        "git --version": _ok(stdout="git version 2\n"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("git", "GIT", "Git"),
            override_family=ShellFamily.POSIX,
        ),
    )
    snapshot = await provider.refresh()
    assert list(snapshot.tool_versions.keys()) == ["git"]


async def test_failed_probe_does_not_poison_subsequent_calls() -> None:
    calls = {"n": 0}

    class Flaky:
        start_calls = 0

        async def start(self) -> None:
            self.start_calls += 1

        async def close(self) -> None: ...

        async def __aenter__(self) -> Flaky:
            return self

        async def __aexit__(self, *_: object) -> None: ...

        async def run(self, command: str, *, timeout: float | None = None) -> ShellResult:
            calls["n"] += 1
            if calls["n"] == 1:
                raise RuntimeError("transient")
            return _ok(stdout="VERSION=5\nCWD=/r\n")

    provider = ShellEnvironmentProvider(
        Flaky(),
        ShellEnvironmentProviderOptions(
            probe_tools=(),
            override_family=ShellFamily.POSIX,
        ),
    )
    with pytest.raises(RuntimeError):
        await provider._get_or_probe()  # type: ignore[attr-defined]
    snapshot = await provider._get_or_probe()  # type: ignore[attr-defined]
    assert snapshot.shell_version == "5"


async def test_concurrent_first_callers_share_a_single_probe() -> None:
    started = asyncio.Event()
    release = asyncio.Event()
    call_count = {"n": 0}

    class Slow:
        async def start(self) -> None: ...

        async def close(self) -> None: ...

        async def __aenter__(self) -> Slow:
            return self

        async def __aexit__(self, *_: object) -> None: ...

        async def run(self, command: str, *, timeout: float | None = None) -> ShellResult:
            if command.startswith("echo"):
                call_count["n"] += 1
                started.set()
                await release.wait()
                return _ok(stdout="VERSION=5\nCWD=/r\n")
            return _ok()

    provider = ShellEnvironmentProvider(
        Slow(),
        ShellEnvironmentProviderOptions(
            probe_tools=(),
            override_family=ShellFamily.POSIX,
        ),
    )
    a = asyncio.create_task(provider._get_or_probe())  # type: ignore[attr-defined]
    b = asyncio.create_task(provider._get_or_probe())  # type: ignore[attr-defined]
    await started.wait()
    release.set()
    s1, s2 = await asyncio.gather(a, b)
    assert s1 is s2
    assert call_count["n"] == 1


async def test_before_run_extends_instructions() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5.2.21\nCWD=/repo\n"),
        "git --version": _ok(stdout="git version 2.40.0\n"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=("git",),
            override_family=ShellFamily.POSIX,
        ),
    )
    received: list[tuple[str, Any]] = []

    class FakeContext:
        def extend_instructions(self, source_id: str, instructions: Any) -> None:
            received.append((source_id, instructions))

    await provider.before_run(
        agent=None,  # type: ignore[arg-type]
        session=None,  # type: ignore[arg-type]
        context=FakeContext(),  # type: ignore[arg-type]
        state={},
    )
    assert len(received) == 1
    src, text = received[0]
    assert src == "shell_environment"
    assert "POSIX shell 5.2.21" in text
    assert "Working directory: /repo" in text
    assert "git (git version 2.40.0)" in text


async def test_default_formatter_powershell_block_uses_pwsh_idioms() -> None:
    from agent_framework_tools.shell import ShellEnvironmentSnapshot

    snapshot = ShellEnvironmentSnapshot(
        family=ShellFamily.POWERSHELL,
        os_description="Windows 11",
        shell_version="7.4.0",
        working_directory=r"C:\repo",
        tool_versions={"git": "2.40", "rust": None},
    )
    text = default_instructions_formatter(snapshot)
    assert "PowerShell 7.4.0" in text
    assert "$env:NAME" in text
    assert r"C:\repo" in text
    assert "Available CLIs: git (2.40)" in text
    assert "Not installed: rust" in text


async def test_custom_formatter_is_used_when_provided() -> None:
    executor = _FakeExecutor({
        "echo": _ok(stdout="VERSION=5\nCWD=/r\n"),
    })
    provider = ShellEnvironmentProvider(
        executor,
        ShellEnvironmentProviderOptions(
            probe_tools=(),
            override_family=ShellFamily.POSIX,
            instructions_formatter=lambda snap: f"FAMILY={snap.family.value}",
        ),
    )
    received: list[tuple[str, Any]] = []

    class FakeContext:
        def extend_instructions(self, source_id: str, instructions: Any) -> None:
            received.append((source_id, instructions))

    await provider.before_run(
        agent=None,  # type: ignore[arg-type]
        session=None,  # type: ignore[arg-type]
        context=FakeContext(),  # type: ignore[arg-type]
        state={},
    )
    assert received[0][1] == "FAMILY=posix"