Commit Graph

25 Commits

  • feat: make sandbox read access configurable with ReadOnlyAccess (#11387)
    `SandboxPolicy::ReadOnly` previously implied broad read access and could
    not express a narrower read surface.
    This change introduces an explicit read-access model so we can support
    user-configurable read restrictions in follow-up work, while preserving
    current behavior today.
    
    It also ensures unsupported backends fail closed for restricted-read
    policies instead of silently granting broader access than intended.
    
    ## What
    
    - Added `ReadOnlyAccess` in protocol with:
      - `Restricted { include_platform_defaults, readable_roots }`
      - `FullAccess`
    - Updated `SandboxPolicy` to carry read-access configuration:
      - `ReadOnly { access: ReadOnlyAccess }`
      - `WorkspaceWrite { ..., read_only_access: ReadOnlyAccess }`
    - Preserved existing behavior by defaulting current construction paths
    to `ReadOnlyAccess::FullAccess`.
    - Threaded the new fields through sandbox policy consumers and call
    sites across `core`, `tui`, `linux-sandbox`, `windows-sandbox`, and
    related tests.
    - Updated Seatbelt policy generation to honor restricted read roots by
    emitting scoped read rules when full read access is not granted.
    - Added fail-closed behavior on Linux and Windows backends when
    restricted read access is requested but not yet implemented there
    (`UnsupportedOperation`).
    - Regenerated app-server protocol schema and TypeScript artifacts,
    including `ReadOnlyAccess`.
    
    ## Compatibility / rollout
    
    - Runtime behavior remains unchanged by default (`FullAccess`).
    - API/schema changes are in place so future config wiring can enable
    restricted read access without another policy-shape migration.
  • Fix test_shell_command_interruption flake (#10649)
    ## Human summary
    Sandboxing (specifically `LandlockRestrict`) is means that e.g. `sleep
    10` fails immediately. Therefore it cannot be interrupted.
    
    In suite::interrupt::test_shell_command_interruption, sleep 10 is issued
    at 17:28:16.554 (ToolCall: shell_command {"command":"sleep 10"...}),
    then fails at 17:28:16.589 with duration_ms=34, success=false,
    exit_code=101, and
        Sandbox(LandlockRestrict).
    
    ## Codex summary
    - set `sandbox_mode = "danger-full-access"` in `interrupt` and
    `v2/turn_interrupt` integration tests
    - set `sandbox: Some(SandboxMode::DangerFullAccess)` in
    `test_codex_jsonrpc_conversation_flow`
    - set `sandbox_policy: Some(SandboxPolicy::DangerFullAccess)` in
    `command_execution_notifications_include_process_id`
    
    ## Why
    On some Linux CI environments, command execution fails immediately with
    `LandlockRestrict` when sandboxed. These tests are intended to validate
    JSON-RPC/task lifecycle behavior (interrupt semantics, command
    notification shape/process id, request flow), but early sandbox startup
    failure changes turn flow and can trigger extra follow-up requests,
    causing flakes.
    
    This change removes environment-specific sandbox startup dependency from
    these tests while preserving their primary intent.
    
    ## Testing
    - not run in this environment (per request)
  • Fix flakey conversation flow test (#9784)
    I've seen this test fail with:
    
    ```
     - Mock #1.
            	Expected range of matching incoming requests: == 2
            	Number of matched incoming requests: 1
    ```
    
    This is because we pop the wrong task_complete events and then the test
    exits. I think this is because the MCP events are now buffered after
    https://github.com/openai/codex/pull/8874.
    
    So:
    1. clear the buffer before we do any user message sending
    2. additionally listen for task start before task complete
    3. use the ID from task start to find the correct task complete event.
  • Add text element metadata to protocol, app server, and core (#9331)
    The second part of breaking up PR
    https://github.com/openai/codex/pull/9116
    
    Summary:
    
    - Add `TextElement` / `ByteRange` to protocol user inputs and user
    message events with defaults.
    - Thread `text_elements` through app-server v1/v2 request handling and
    history rebuild.
    - Preserve UI metadata only in user input/events (not `ContentItem`)
    while keeping local image attachments in user events for rehydration.
    
    Details:
    
    - Protocol: `UserInput::Text` carries `text_elements`;
    `UserMessageEvent` carries `text_elements` + `local_images`.
    Serialization includes empty vectors for backward compatibility.
    - app-server-protocol: v1 defines `V1TextElement` / `V1ByteRange` in
    camelCase with conversions; v2 uses its own camelCase wrapper.
    - app-server: v1/v2 input mapping includes `text_elements`; thread
    history rebuilds include them.
    - Core: user event emission preserves UI metadata while model history
    stays clean; history replay round-trips the metadata.
  • [chore] move app server tests from chat completion to responses (#8939)
    We are deprecating chat completions. Move all app server tests from chat
    completion to responses.
  • [fix] app server flaky send_messages test (#8874)
    Fix flakiness of CI test:
    https://github.com/openai/codex/actions/runs/20350530276/job/58473691434?pr=8282
    
    This PR does two things:
    1. move the flakiness test to use responses API instead of chat
    completion API
    2. make mcp_process agnostic to the order of
    responses/notifications/requests that come in, by buffering messages not
    read
  • chore: unify conversation with thread name (#8830)
    Done and verified by Codex + refactor feature of RustRover
  • feat: expose outputSchema to user_turn/turn_start app_server API (#8377)
    What changed
    - Added `outputSchema` support to the app-server APIs, mirroring `codex
    exec --output-schema` behavior.
    - V1 `sendUserTurn` now accepts `outputSchema` and constrains the final
    assistant message for that turn.
    - V2 `turn/start` now accepts `outputSchema` and constrains the final
    assistant message for that turn (explicitly per-turn only).
    
    Core behavior
    - `Op::UserTurn` already supported `final_output_json_schema`; now V1
    `sendUserTurn` forwards `outputSchema` into that field.
    - `Op::UserInput` now carries `final_output_json_schema` for per-turn
    settings updates; core maps it into
    `SessionSettingsUpdate.final_output_json_schema` so it applies to the
    created turn context.
    - V2 `turn/start` does NOT persist the schema via `OverrideTurnContext`
    (it’s applied only for the current turn). Other overrides
    (cwd/model/etc) keep their existing persistent behavior.
    
    API / docs
    - `codex-rs/app-server-protocol/src/protocol/v1.rs`: add `output_schema:
    Option<serde_json::Value>` to `SendUserTurnParams` (serialized as
    `outputSchema`).
    - `codex-rs/app-server-protocol/src/protocol/v2.rs`: add `output_schema:
    Option<JsonValue>` to `TurnStartParams` (serialized as `outputSchema`).
    - `codex-rs/app-server/README.md`: document `outputSchema` for
    `turn/start` and clarify it applies only to the current turn.
    - `codex-rs/docs/codex_mcp_interface.md`: document `outputSchema` for v1
    `sendUserTurn` and v2 `turn/start`.
    
    Tests added/updated
    - New app-server integration tests asserting `outputSchema` is forwarded
    into outbound `/responses` requests as `text.format`:
      - `codex-rs/app-server/tests/suite/output_schema.rs`
      - `codex-rs/app-server/tests/suite/v2/output_schema.rs`
    - Added per-turn semantics tests (schema does not leak to the next
    turn):
      - `send_user_turn_output_schema_is_per_turn_v1`
      - `turn_start_output_schema_is_per_turn_v2`
    - Added protocol wire-compat tests for the merged op:
      - serialize omits `final_output_json_schema` when `None`
      - deserialize works when field is missing
      - serialize includes `final_output_json_schema` when `Some(schema)`
    
    Call site updates (high level)
    - Updated all `Op::UserInput { .. }` constructions to include
    `final_output_json_schema`:
      - `codex-rs/app-server/src/codex_message_processor.rs`
      - `codex-rs/core/src/codex_delegate.rs`
      - `codex-rs/mcp-server/src/codex_tool_runner.rs`
      - `codex-rs/tui/src/chatwidget.rs`
      - `codex-rs/tui2/src/chatwidget.rs`
      - plus impacted core tests.
    
    Validation
    - `just fmt`
    - `cargo test -p codex-core`
    - `cargo test -p codex-app-server`
    - `cargo test -p codex-mcp-server`
    - `cargo test -p codex-tui`
    - `cargo test -p codex-tui2`
    - `cargo test -p codex-protocol`
    - `cargo clippy --all-features --tests --profile dev --fix -- -D
    warnings`
  • fix: introduce AbsolutePathBuf as part of sandbox config (#7856)
    Changes the `writable_roots` field of the `WorkspaceWrite` variant of
    the `SandboxPolicy` enum from `Vec<PathBuf>` to `Vec<AbsolutePathBuf>`.
    This is helpful because now callers can be sure the value is an absolute
    path rather than a relative one. (Though when using an absolute path in
    a Seatbelt config policy, we still have to _canonicalize_ it first.)
    
    Because `writable_roots` can be read from a config file, it is important
    that we are able to resolve relative paths properly using the parent
    folder of the config file as the base path.
  • Removed experimental "command risk assessment" feature (#7799)
    This experimental feature received lukewarm reception during internal
    testing. Removing from the code base.
  • Migrate model preset (#7542)
    - Introduce `openai_models` in `/core`
    - Move `PRESETS` under it
    - Move `ModelPreset`, `ModelUpgrade`, `ReasoningEffortPreset`,
    `ReasoningEffortPreset`, and `ReasoningEffortPreset` to `protocol`
    - Introduce `Op::ListModels` and `EventMsg::AvailableModels`
    
    Next steps:
    - migrate `app-server` and `tui` to use the introduced Operation
  • chore: use anyhow::Result for all app-server integration tests (#5836)
    There's a lot of visual noise in app-server's integration tests due to
    the number of `.expect("<some_msg>")` lines which are largely redundant
    / not very useful. Clean them up by using `anyhow::Result` + `?`
    consistently.
    
    Replaces the existing pattern of:
    ```
        let codex_home = TempDir::new().expect("create temp dir");
        create_config_toml(codex_home.path()).expect("write config.toml");
    
        let mut mcp = McpProcess::new(codex_home.path())
            .await
            .expect("spawn mcp process");
        timeout(DEFAULT_READ_TIMEOUT, mcp.initialize())
            .await
            .expect("initialize timeout")
            .expect("initialize request");
    ```
    
    With:
    ```
        let codex_home = TempDir::new()?;
        create_config_toml(codex_home.path())?;
    
        let mut mcp = McpProcess::new(codex_home.path()).await?;
        timeout(DEFAULT_READ_TIMEOUT, mcp.initialize()).await??;
    ```
  • Added model summary and risk assessment for commands that violate sandbox policy (#5536)
    This PR adds support for a model-based summary and risk assessment for
    commands that violate the sandbox policy and require user approval. This
    aids the user in evaluating whether the command should be approved.
    
    The feature works by taking a failed command and passing it back to the
    model and asking it to summarize the command, give it a risk level (low,
    medium, high) and a risk category (e.g. "data deletion" or "data
    exfiltration"). It uses a new conversation thread so the context in the
    existing thread doesn't influence the answer. If the call to the model
    fails or takes longer than 5 seconds, it falls back to the current
    behavior.
    
    For now, this is an experimental feature and is gated by a config key
    `experimental_sandbox_command_assessment`.
    
    Here is a screen shot of the approval prompt showing the risk assessment
    and summary.
    
    <img width="723" height="282" alt="image"
    src="https://github.com/user-attachments/assets/4597dd7c-d5a0-4e9f-9d13-414bd082fd6b"
    />
  • Add new thread items and rewire event parsing to use them (#5418)
    1. Adds AgentMessage,  Reasoning,  WebSearch items.
    2. Switches the ResponseItem parsing to use new items and then also emit
    3. Removes user-item kind and filters out "special" (environment) user
    items when returning to clients.
  • feat: add Vec<ParsedCommand> to ExecApprovalRequestEvent (#5222)
    This adds `parsed_cmd: Vec<ParsedCommand>` to `ExecApprovalRequestEvent`
    in the core protocol (`protocol/src/protocol.rs`), which is also what
    this field is named on `ExecCommandBeginEvent`. Honestly, I don't love
    the name (it sounds like a single command, but it is actually a list of
    them), but I don't want to get distracted by a naming discussion right
    now.
    
    This also adds `parsed_cmd` to `ExecCommandApprovalParams` in
    `codex-rs/app-server-protocol/src/protocol.rs`, so it will be available
    via `codex app-server`, as well.
    
    For consistency, I also updated `ExecApprovalElicitRequestParams` in
    `codex-rs/mcp-server/src/exec_approval.rs` to include this field under
    the name `codex_parsed_cmd`, as that struct already has a number of
    special `codex_*` fields. Note this is the code for when Codex is used
    as an MCP _server_ and therefore has to conform to the official spec for
    an MCP elicitation type.
  • chore: sandbox refactor 2 (#4653)
    Revert the revert and fix the UI issue
  • chore: sanbox extraction (#4286)
    # Extract and Centralize Sandboxing
    - Goal: Improve safety and clarity by centralizing sandbox planning and
    execution.
      - Approach:
    - Add planner (ExecPlan) and backend registry (Direct/Seatbelt/Linux)
    with run_with_plan.
    - Refactor codex.rs to plan-then-execute; handle failures/escalation via
    the plan.
    - Delegate apply_patch to the codex binary and run it with an empty env
    for determinism.
  • fix: remove mcp-types from app server protocol (#4537)
    We continue the separation between `codex app-server` and `codex
    mcp-server`.
    
    In particular, we introduce a new crate, `codex-app-server-protocol`,
    and migrate `codex-rs/protocol/src/mcp_protocol.rs` into it, renaming it
    `codex-rs/app-server-protocol/src/protocol.rs`.
    
    Because `ConversationId` was defined in `mcp_protocol.rs`, we move it
    into its own file, `codex-rs/protocol/src/conversation_id.rs`, and
    because it is referenced in a ton of places, we have to touch a lot of
    files as part of this PR.
    
    We also decide to get away from proper JSON-RPC 2.0 semantics, so we
    also introduce `codex-rs/app-server-protocol/src/jsonrpc_lite.rs`, which
    is basically the same `JSONRPCMessage` type defined in `mcp-types`
    except with all of the `"jsonrpc": "2.0"` removed.
    
    Getting rid of `"jsonrpc": "2.0"` makes our serialization logic
    considerably simpler, as we can lean heavier on serde to serialize
    directly into the wire format that we use now.
  • fix: use macros to ensure request/response symmetry (#4529)
    Manually curating `protocol-ts/src/lib.rs` was error-prone, as expected.
    I finally asked Codex to write some Rust macros so we can ensure that:
    
    - For every variant of `ClientRequest` and `ServerRequest`, there is an
    associated `params` and `response` type.
    - All response types are included automatically in the output of `codex
    generate-ts`.
  • fix: separate codex mcp into codex mcp-server and codex app-server (#4471)
    This is a very large PR with some non-backwards-compatible changes.
    
    Historically, `codex mcp` (or `codex mcp serve`) started a JSON-RPC-ish
    server that had two overlapping responsibilities:
    
    - Running an MCP server, providing some basic tool calls.
    - Running the app server used to power experiences such as the VS Code
    extension.
    
    This PR aims to separate these into distinct concepts:
    
    - `codex mcp-server` for the MCP server
    - `codex app-server` for the "application server"
    
    Note `codex mcp` still exists because it already has its own subcommands
    for MCP management (`list`, `add`, etc.)
    
    The MCP logic continues to live in `codex-rs/mcp-server` whereas the
    refactored app server logic is in the new `codex-rs/app-server` folder.
    Note that most of the existing integration tests in
    `codex-rs/mcp-server/tests/suite` were actually for the app server, so
    all the tests have been moved with the exception of
    `codex-rs/mcp-server/tests/suite/mod.rs`.
    
    Because this is already a large diff, I tried not to change more than I
    had to, so `codex-rs/app-server/tests/common/mcp_process.rs` still uses
    the name `McpProcess` for now, but I will do some mechanical renamings
    to things like `AppServer` in subsequent PRs.
    
    While `mcp-server` and `app-server` share some overlapping functionality
    (like reading streams of JSONL and dispatching based on message types)
    and some differences (completely different message types), I ended up
    doing a bit of copypasta between the two crates, as both have somewhat
    similar `message_processor.rs` and `outgoing_message.rs` files for now,
    though I expect them to diverge more in the near future.
    
    One material change is that of the initialize handshake for `codex
    app-server`, as we no longer use the MCP types for that handshake.
    Instead, we update `codex-rs/protocol/src/mcp_protocol.rs` to add an
    `Initialize` variant to `ClientRequest`, which takes the `ClientInfo`
    object we need to update the `USER_AGENT_SUFFIX` in
    `codex-rs/app-server/src/message_processor.rs`.
    
    One other material change is in
    `codex-rs/app-server/src/codex_message_processor.rs` where I eliminated
    a use of the `send_event_as_notification()` method I am generally trying
    to deprecate (because it blindly maps an `EventMsg` into a
    `JSONNotification`) in favor of `send_server_notification()`, which
    takes a `ServerNotification`, as that is intended to be a custom enum of
    all notification types supported by the app server. So to make this
    update, I had to introduce a new variant of `ServerNotification`,
    `SessionConfigured`, which is a non-backwards compatible change with the
    old `codex mcp`, and clients will have to be updated after the next
    release that contains this PR. Note that
    `codex-rs/app-server/tests/suite/list_resume.rs` also had to be update
    to reflect this change.
    
    I introduced `codex-rs/utils/json-to-toml/src/lib.rs` as a small utility
    crate to avoid some of the copying between `mcp-server` and
    `app-server`.