3916 Commits

  • [codex] [3/4] Activate endpoint plugin recommendations (#27704)
    Summary
    - Await endpoint recommendation selection while constructing each
    authenticated turn, removing the first-turn cache race.
    - Snapshot and filter endpoint candidates once per turn, then use that
    same set for the bounded contextual user fragment, tool exposure, and
    exact install validation.
    - Keep recommendation selection ephemeral: do not persist
    recommendation state in or gate resumed threads on prior context.
    - Hide the legacy list tool in endpoint mode and preserve legacy
    discovery unchanged when the endpoint is disabled or unavailable.
    - Keep remote plugin and connector app identities out of model-visible
    context and attach them only to Codex-owned elicitation metadata.
    
    Stack
    - 3/4, based on #28400.
    - Endpoint client and cache: #28399.
    - Generalized suggestion presentation: #28400.
    - Install-schema follow-up: #28403.
    
    Validation
    - 
    - 
    - 
    - 
    - Full : 2,649 passed and 88 environment-dependent tests failed because
    this sandbox cannot write , nest Seatbelt, or locate auxiliary test
    binaries.
  • [codex] [2/4] Generalize plugin suggestion presentation (#28400)
    Summary
    - Add list-backed and developer-context presentations for plugin
    suggestion candidates.
    - Let tool planning, install validation, and request-tool copy follow
    the selected presentation.
    - Keep every production caller on the existing list-backed presentation,
    preserving the current list tool, request schema, connector behavior,
    and model-visible copy.
    - Leave developer-context presentation latent until the final PR in the
    stack.
    
    Stack
    - 2/3, based on #28399.
    - Follow-up: #27704 activates endpoint recommendations.
    
    Validation
    - `just test -p codex-core request_plugin_install`
    - `just test -p codex-core spec_plan`
    - `just fix -p codex-core`
    - `just fmt`
    - `git diff --check`
  • app-server: preserve target-native environment cwd (#28146)
    ## Why
    
    app-server may run on a different OS from the selected exec-server
    environment. Parsing that environment’s cwd with the Codex host’s path
    rules prevents thread startup.
    
    ## What
    
    Carry environment cwd values as `LegacyAppPathString` at the app-server
    boundary and `PathUri` internally. Existing tool-call schemas and
    relative-path behavior stay host-native; remaining local-only consumers
    convert explicitly and leave follow-up TODOs.
    
    The Wine integration test verifies app-server can start a thread and
    complete an ordinary turn with a Windows environment cwd from Linux.
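    
    To illustrate why host path rules fail here, the following is a minimal
    sketch of a target-aware absoluteness check. `TargetOs` and
    `is_absolute_for` are hypothetical names for illustration, not the
    `PathUri` implementation from this change.
    
    ```rust
    /// Hypothetical stand-in for the OS of the selected exec-server environment.
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub enum TargetOs {
        Posix,
        Windows,
    }
    
    /// Decide whether `path` is absolute under the target's rules,
    /// independent of the OS the caller happens to run on.
    pub fn is_absolute_for(target: TargetOs, path: &str) -> bool {
        match target {
            TargetOs::Posix => path.starts_with('/'),
            TargetOs::Windows => {
                let bytes = path.as_bytes();
                // Drive-qualified form such as `C:\work` or `C:/work`.
                let drive = bytes.len() >= 3
                    && bytes[0].is_ascii_alphabetic()
                    && bytes[1] == b':'
                    && (bytes[2] == b'\\' || bytes[2] == b'/');
                // UNC form such as `\\server\share`.
                drive || path.starts_with(r"\\")
            }
        }
    }
    ```
    
    Under this sketch, `C:\work` is absolute for a Windows target but not for
    a POSIX one, which is the mismatch a Linux host hits when it parses a
    Windows cwd with its own rules.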
    
    ## Validation
    
    - `bazel test //codex-rs/core/tests/remote_env_windows:smoke-test
    --test_output=errors`
    - focused app-server environment-selection and protocol schema tests
    - scoped Clippy for `codex-core` and `codex-app-server-protocol`
  • Clarify model-generated and legacy app path types (#28577)
    ## Why
    
    `ApiPathString` implies that it can be used anywhere we pull a path out
    of JSON, but it is not appropriate for tool arguments, where the model
    might generate relative paths.
    
    Prefer `String` for model-generated paths. For now we can handle the
    conversion per feature, and define a shared abstraction later if it
    makes sense.
    
    ## What
    
    Rename `ApiPathString` to `AppLegacyPathString` to clarify its role.
    
    Expand the `path-types` skill to tell the model to leave tool args as
    bare strings.
  • [codex] test exec relative additional permissions (#28587)
    ## Why
    
    Review caught some would-be regressions in changes to unified_exec that
    weren't surfaced in CI.
    
    ## What
    
    Add coverage for requesting permissions through unified exec when there
    are additional permissions. Previously this flow was only tested against
    shell_command.
  • code-mode: extend test coverage to lock in cell lifecycle (#28468)
    This PR establishes the intended behavior as an executable contract
    before a refactor of the cell runtime begins. It also fixes cases where
    a second observer or termination request could replace an existing
    response channel and leave the original caller unresolved.
    
    ### Behavior codified
    - A cell can yield output and subsequently resume to completion.
    - A caller can run a cell until it has no immediately runnable work,
    receive its accumulated output and outstanding tool-call IDs, and then
    resume the same cell when the awaited work is available.
    - Each cell admits one active observer:
      - a second observer receives an explicit busy error
      - the existing observer remains registered and is not displaced
    - A natural result (conclusion of the JS module) that has already
    reached the cell controller wins over a later termination request.
    - Otherwise, termination preempts execution and resolves both:
      - the active observer, if present
      - the caller requesting termination
    - Repeated termination requests are rejected while termination is
    already in progress.
    - Terminal responses are sent only after outstanding callback work has
    been handled:
      - natural completion drains notifications and cancels outstanding tool
      calls
      - termination cancels and drains both notification and tool callbacks
    - Cell removal and the `cell_closed` notification happen after callback
    cleanup.
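    
    The single-observer rule above can be sketched as follows. This is an
    illustration under simplifying assumptions, not the cell runtime itself:
    `Cell`, `ObserveError`, and the string result are hypothetical, and the
    sketch omits termination and callback draining.
    
    ```rust
    use std::sync::mpsc::{channel, Receiver, Sender};
    
    /// Hypothetical error returned to a second observer.
    #[derive(Debug, PartialEq)]
    pub enum ObserveError {
        Busy,
    }
    
    /// Hypothetical cell that admits at most one active observer.
    #[derive(Default)]
    pub struct Cell {
        observer: Option<Sender<String>>,
    }
    
    impl Cell {
        /// Register an observer, refusing to displace one already registered.
        pub fn observe(&mut self) -> Result<Receiver<String>, ObserveError> {
            if self.observer.is_some() {
                return Err(ObserveError::Busy);
            }
            let (tx, rx) = channel();
            self.observer = Some(tx);
            Ok(rx)
        }
    
        /// Deliver the terminal response to the registered observer, if any,
        /// and free the observer slot.
        pub fn complete(&mut self, result: &str) {
            if let Some(tx) = self.observer.take() {
                // A dropped receiver is not an error for the cell.
                let _ = tx.send(result.to_string());
            }
        }
    }
    ```
    
    The point of the `is_some` check is that a second `observe` call fails
    fast instead of replacing the stored sender, so the original caller is
    still resolved by `complete`.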
  • [codex] re-enable absolute workdir integration test (#28581)
    ## Why
    
    In #28146 I missed the invariant that an absolute `exec_command` workdir
    must override the environment cwd. The existing integration test would
    have caught that regression, but it was ignored as flaky.
    
    ## What
    
    Re-enable `unified_exec_respects_workdir_override`.
    
    ## Validation
    
    `just test -p codex-core unified_exec_respects_workdir_override`
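    
    The invariant can be illustrated with a small sketch. `resolve_workdir`
    is a hypothetical helper, and it uses host-native `std::path` rules for
    brevity rather than the target-aware types from #28146.
    
    ```rust
    use std::path::{Path, PathBuf};
    
    /// Resolve the directory a command runs in (hypothetical helper).
    pub fn resolve_workdir(env_cwd: &Path, workdir: Option<&str>) -> PathBuf {
        match workdir {
            // `Path::join` replaces the base when the argument is absolute,
            // so an absolute workdir overrides the environment cwd, while a
            // relative workdir is resolved against it.
            Some(dir) => env_cwd.join(dir),
            None => env_cwd.to_path_buf(),
        }
    }
    ```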
  • [codex] Route MCP file uploads through environment filesystem (#27923)
    ## Why
    
    Codex Apps tools can mark arguments with `openai/fileParams`, but the
    execution path resolved and opened those files directly on the host.
    That bypassed the selected turn environment and prevented annotated file
    arguments from working with remote environments.
    
    ## What changed
    
    - resolve annotated file arguments against the primary turn environment
    - read file metadata and contents through that environment's sandboxed
    `ExecutorFileSystem`
    - reject files over the 512 MiB limit from metadata before reading or
    transferring them
    - retain the buffered upload-size check as defense in depth
    - make the OpenAI upload API accept a filename and buffered contents
    instead of owning local filesystem access
    - describe the model-visible argument as a path in the primary
    environment
    
    This builds on #27927, which added `size` to internal filesystem
    metadata.
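    
    The two size checks can be sketched as below. The 512 MiB limit comes
    from the description above; `UploadError`, `check_metadata_size`, and
    `check_buffered_size` are hypothetical names, not the functions in this
    change.
    
    ```rust
    /// Upload limit stated in the change description: 512 MiB.
    pub const MAX_UPLOAD_BYTES: u64 = 512 * 1024 * 1024;
    
    #[derive(Debug, PartialEq)]
    pub enum UploadError {
        TooLarge { size: u64, limit: u64 },
    }
    
    /// First check: uses only the size reported by filesystem metadata,
    /// so an oversized file is rejected before any bytes are read.
    pub fn check_metadata_size(size: u64) -> Result<(), UploadError> {
        if size > MAX_UPLOAD_BYTES {
            return Err(UploadError::TooLarge { size, limit: MAX_UPLOAD_BYTES });
        }
        Ok(())
    }
    
    /// Second check (defense in depth): re-validate the bytes actually
    /// buffered, for example if the metadata turned out to be stale.
    pub fn check_buffered_size(contents: &[u8]) -> Result<(), UploadError> {
        check_metadata_size(contents.len() as u64)
    }
    ```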
    
    ## Testing
    
    - `just test -p codex-api upload_openai_file_returns_canonical_uri`
    - `just test -p codex-mcp
    tool_with_model_visible_input_schema_masks_file_params`
    - `just test -p codex-core mcp_openai_file`
    - `just test -p codex-core
    codex_apps_file_params_upload_environment_files_before_mcp_tool_call`
  • [codex] Warn clearly when code mode output is truncated (#28467)
    ## Summary
    
    - make `formatted_truncate_text` prepend `Warning: truncated output
    (original token count: N)` above the existing `Total output lines`
    header
    - update direct formatter, unified-exec, user-shell, and code-mode
    expectations
    - add core unit coverage that runs in Bazel without requiring the
    skipped V8-backed code-mode integration suite
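    
    As a rough sketch of the resulting layout: the warning text is quoted
    from the summary above, while `with_truncation_warning` and the exact
    form of the existing header are assumptions for illustration.
    
    ```rust
    /// Prepend the truncation warning above an already formatted body.
    /// `existing` stands in for the text that begins with the
    /// `Total output lines` header; its exact format is assumed here.
    pub fn with_truncation_warning(original_token_count: usize, existing: &str) -> String {
        format!(
            "Warning: truncated output (original token count: {})\n{}",
            original_token_count, existing
        )
    }
    ```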
    
    ## Validation
    
    - `cargo test -p codex-utils-output-truncation -- --nocapture` (17
    passed)
    - `cargo test -p codex-core --lib
    truncated_text_output_starts_with_warning -- --nocapture`
    - `cargo test -p codex-core --test all
    clamps_model_requested_max_output_tokens_to_policy -- --nocapture` (2
    passed)
    - `cargo test -p codex-core --test all
    unified_exec_formats_large_output_summary -- --nocapture`
    - `cargo test -p codex-core --test all
    user_shell_command_output_is_truncated_in_history -- --nocapture`
    - Bazel CI exercises the shared formatter and downstream integration
    expectations
  • [codex] exec-server: stream files in chunks (#28354)
    ## Why
    
    `fs/readFile` buffers the entire file in one response, which makes large
    remote reads expensive and prevents callers from applying backpressure.
    We need an opt-in streaming path with bounded block sizes while
    preserving the existing single-call API for small and sandboxed reads.
    
    ## What changed
    
    - Add `ExecServerClient::stream`, returning a named `FileReadStream`
    that implements `futures::Stream` and yields immutable 1 MiB byte
    blocks.
    - Add internal `fs/open`, `fs/readBlock`, and `fs/close` RPCs.
    `fs/readBlock` accepts an explicit offset and length.
    - Keep unsandboxed files open between block reads, cap open handles per
    connection, and clean them up on EOF, error, stream drop, explicit
    close, or connection shutdown.
    - Reject platform-sandboxed streaming opens instead of turning the
    one-shot sandbox helper into a persistent server. Existing `fs/readFile`
    behavior is unchanged.
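    
    To make the block arithmetic concrete, here is a sketch of the
    `(offset, length)` requests a sequential reader would issue. The 1 MiB
    block size is from the description; `plan_blocks` is a hypothetical
    helper, and the change itself describes a `futures::Stream` rather than
    an up-front plan.
    
    ```rust
    /// Block size stated in the change description: 1 MiB.
    pub const BLOCK_SIZE: u64 = 1024 * 1024;
    
    /// Plan the `(offset, length)` pairs a sequential reader would request
    /// for a file of `file_len` bytes.
    pub fn plan_blocks(file_len: u64) -> Vec<(u64, u64)> {
        let mut blocks = Vec::new();
        let mut offset = 0;
        while offset < file_len {
            // The final block may be shorter than BLOCK_SIZE.
            let len = BLOCK_SIZE.min(file_len - offset);
            blocks.push((offset, len));
            offset += len;
        }
        blocks
    }
    ```
    
    A file whose length is an exact multiple of the block size yields only
    full blocks, so a reader following this plan has to detect EOF from the
    next read rather than from a short block. That is the block-boundary
    case listed under Testing.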
    
    ## Testing
    
    - `just test -p codex-exec-server`
    - Integration coverage for 1 MiB chunking, exact block-boundary EOF,
    sandbox rejection, and continued reads from the opened file after path
    replacement.
    - Handle-manager coverage for non-sequential offsets, variable block
    lengths, the 128-handle limit, and capacity release after close.
  • core: surface terminal subagent errors to parent agents (#28375)
    ## Why
    
    When a subagent exhausts its retries, it emits an `Error`, but the
    generic task lifecycle then emits `TurnComplete(None)`. That completion
    used to overwrite the subagent's `Errored` status with
    `Completed(None)`, so the parent received an empty completion
    notification.
    
    This made a failed child look indistinguishable from a child that
    completed without an answer. In unattended or long-running multi-agent
    work, the root could silently continue without knowing that delegated
    work failed or how to restart it.
    
    ## Behavior
    
    Before, a terminal stream failure was reduced to an empty completion:
    
    ```text
    <subagent_notification>
    {"agent_path":"/root/worker","status":{"completed":null}}
    </subagent_notification>
    ```
    
    Now the parent receives the actual terminal error, bounded to 1,000
    tokens, together with an actionable recovery hint:
    
    ```text
    <subagent_notification>
    {
      "agent_path": "/root/worker",
      "status": {
        "errored": "stream disconnected before completion: stream closed before response.completed"
      },
      "next_action": "This agent's turn failed. If you still need this agent, use `followup_task` to give it another task."
    }
    </subagent_notification>
    ```
    
    The notification remains queue-only: it does not wake the root or replay
    the failed request. The root sees it at the next sampling boundary and
    can use `followup_task` to start a new turn for that agent.
    
    ## What changed
    
    - Added terminal-error precedence to the [agent status
    reducer](https://github.com/openai/codex/blob/e95fcfe2bb6a02f1a75650afa20048859f556511/codex-rs/core/src/agent/status.rs#L23-L34),
    so a closing `TurnComplete` cannot erase an immediately preceding
    `Errored` status.
    - Made MultiAgentV2 completion forwarding use the retained session
    status instead of re-deriving `Completed(None)` from the final event.
    - Extended the [subagent notification
    fragment](https://github.com/openai/codex/blob/e95fcfe2bb6a02f1a75650afa20048859f556511/codex-rs/core/src/context/subagent_notification.rs#L6-L60)
    with a `next_action` for terminal errors and a hard cap on model-visible
    error text.
    - Kept successful completions and interrupted turns unchanged.
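    
    The precedence rule can be sketched with simplified stand-ins. The
    `AgentStatus`, `Event`, and `reduce` definitions below are illustrative
    assumptions and are not copied from the linked reducer.
    
    ```rust
    /// Simplified stand-in for the agent status.
    #[derive(Clone, Debug, PartialEq)]
    pub enum AgentStatus {
        Running,
        Completed(Option<String>),
        Errored(String),
    }
    
    /// Simplified stand-in for the lifecycle events of one turn.
    #[derive(Clone, Debug)]
    pub enum Event {
        Error(String),
        TurnComplete(Option<String>),
    }
    
    /// Fold one event into the current status. A retained `Errored` status
    /// takes precedence over the `TurnComplete` that closes the same turn.
    pub fn reduce(current: AgentStatus, event: Event) -> AgentStatus {
        match (current, event) {
            (_, Event::Error(message)) => AgentStatus::Errored(message),
            (AgentStatus::Errored(message), Event::TurnComplete(_)) => {
                AgentStatus::Errored(message)
            }
            (_, Event::TurnComplete(answer)) => AgentStatus::Completed(answer),
        }
    }
    ```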
    
    ## Verification
    
    - Added a status-reducer test proving that `Errored` survives the
    trailing `TurnComplete`.
    - Added an integration test that exhausts a subagent's stream retries
    and verifies the exact `agent_message` delivered to the parent,
    including the error and `followup_task` guidance.
    - Re-ran the existing successful-completion and interrupted-turn
    notification tests.
  • [tests] Keep Apps out of generic core test harness (#28508)
    ## Summary
    
    - disable the stable Apps feature in the generic `test_codex()`
    integration-test harness
    - keep Apps-specific tests explicit: their builders re-enable Apps and
    point it at a local mock server
    
    ## Why
    
    Generic tests that use dummy ChatGPT auth were also enabling the
    host-owned `codex_apps` MCP server. That made unrelated tests contact
    `chatgpt.com` and wait for MCP startup, causing the Bazel timeouts
    observed on #28368.
    
    The generic harness should be hermetic and should not start an external
    service that the test did not request. This is test-only; production
    Apps behavior is unchanged. The broader optional-MCP startup behavior is
    being handled separately in #28407.
    
    ## Testing
    
    - `just test -p codex-core -E
    'test(pre_sampling_compact_runs_when_comp_hash_changes) |
    test(model_switch_to_smaller_model_updates_token_context_window) |
    test(codex_apps_file_params_upload_local_paths_before_mcp_tool_call)'`
    - `just fix -p codex-core`
    - `just fmt`
  • feat: render typed envelopes for multi-agent v2 messages (#28368)
    ## Why
    
    Multi-agent v2 messages need a consistent, model-visible envelope that
    identifies what kind of interaction occurred, who sent it, and which
    agent it targets. Previously, encrypted deliveries exposed only
    `encrypted_content`, while child completion used the legacy
    `<subagent_notification>` shape. That meant the client could not
    consistently present `NEW_TASK`, `MESSAGE`, and `FINAL_ANSWER` using the
    same format.
    
    This change adds the routing envelope as plaintext while keeping task
    and message payloads encrypted. No new Responses API field is required:
    an encrypted delivery is represented as an `input_text` header
    immediately followed by its existing `encrypted_content` item.
    
    Every envelope now follows this shape:
    
    ```text
    Message Type: <NEW_TASK | MESSAGE | FINAL_ANSWER>
    Task name: <recipient agent path>
    Sender: <author agent path>
    Payload:
    <message payload>
    ```
    
    ## Message types
    
    ### `NEW_TASK`
    
    `NEW_TASK` is used when the recipient should begin a new turn, including
    an initial `spawn_agent` task and a later `followup_task`.
    
    For a root agent spawning `/root/worker`, the request contains a
    plaintext envelope followed by the encrypted task:
    
    ```json
    {
      "type": "agent_message",
      "author": "/root",
      "recipient": "/root/worker",
      "content": [
        {
          "type": "input_text",
          "text": "Message Type: NEW_TASK\nTask name: /root/worker\nSender: /root\nPayload:\n"
        },
        {
          "type": "encrypted_content",
          "encrypted_content": "<encrypted task payload>"
        }
      ]
    }
    ```
    
    Conceptually, the model receives:
    
    ```text
    Message Type: NEW_TASK
    Task name: /root/worker
    Sender: /root
    Payload:
    Review the authentication changes and report any regressions.
    ```
    
    ### `MESSAGE`
    
    `MESSAGE` is used for a queued `send_message` delivery. It communicates
    with an existing agent without starting a new turn.
    
    For `/root/worker` reporting progress to the root agent, the request
    contains:
    
    ```json
    {
      "type": "agent_message",
      "author": "/root/worker",
      "recipient": "/root",
      "content": [
        {
          "type": "input_text",
          "text": "Message Type: MESSAGE\nTask name: /root\nSender: /root/worker\nPayload:\n"
        },
        {
          "type": "encrypted_content",
          "encrypted_content": "<encrypted message payload>"
        }
      ]
    }
    ```
    
    Conceptually, the model receives:
    
    ```text
    Message Type: MESSAGE
    Task name: /root
    Sender: /root/worker
    Payload:
    The protocol tests pass; I am checking the resume path now.
    ```
    
    ### `FINAL_ANSWER`
    
    `FINAL_ANSWER` is emitted when a child agent reaches a terminal state
    and reports its result to its parent. Completion payloads are already
    available locally, so the complete envelope is represented as plaintext
    rather than as a plaintext header plus encrypted content.
    
    For `/root/worker` completing work for the root agent, the request
    contains:
    
    ```json
    {
      "type": "agent_message",
      "author": "/root/worker",
      "recipient": "/root",
      "content": [
        {
          "type": "input_text",
          "text": "Message Type: FINAL_ANSWER\nTask name: /root\nSender: /root/worker\nPayload:\nNo regressions found."
        }
      ]
    }
    ```
    
    The model-visible form is:
    
    ```text
    Message Type: FINAL_ANSWER
    Task name: /root
    Sender: /root/worker
    Payload:
    No regressions found.
    ```
    
    Errored, shut down, and missing agents also use `FINAL_ANSWER`, with a
    terminal-status description in the payload.
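    
    The three examples above share one header format, which can be sketched
    as a pair of helpers. `MessageType`, `render_header`, and
    `render_plaintext` are hypothetical names; the real rendering lives in
    `InterAgentCommunication::to_model_input_item` and may differ.
    
    ```rust
    /// The three message types named in the envelope.
    #[derive(Clone, Copy, Debug)]
    pub enum MessageType {
        NewTask,
        Message,
        FinalAnswer,
    }
    
    impl MessageType {
        fn label(self) -> &'static str {
            match self {
                MessageType::NewTask => "NEW_TASK",
                MessageType::Message => "MESSAGE",
                MessageType::FinalAnswer => "FINAL_ANSWER",
            }
        }
    }
    
    /// Render the plaintext header. For encrypted deliveries the payload
    /// is not appended here; it follows as a separate encrypted item.
    pub fn render_header(kind: MessageType, recipient: &str, sender: &str) -> String {
        format!(
            "Message Type: {}\nTask name: {}\nSender: {}\nPayload:\n",
            kind.label(),
            recipient,
            sender
        )
    }
    
    /// Render a fully plaintext envelope, as described for `FINAL_ANSWER`.
    pub fn render_plaintext(
        kind: MessageType,
        recipient: &str,
        sender: &str,
        payload: &str,
    ) -> String {
        format!("{}{}", render_header(kind, recipient, sender), payload)
    }
    ```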
    
    ## What changed
    
    - Render `NEW_TASK` or `MESSAGE` in
    `InterAgentCommunication::to_model_input_item`, based on whether the
    encrypted delivery starts a turn.
    - Replace the multi-agent v2 `<subagent_notification>` completion
    payload with a model-visible `FINAL_ANSWER` envelope.
    - Document `Task name`, `Sender`, and `Payload` consistently in the
    multi-agent developer instructions.
    - Prevent local-only history projections from treating an encrypted
    message's plaintext header as the complete assistant message.
    - Preserve rollout-trace interaction edges when an agent message
    contains both plaintext and encrypted content.
    
    Legacy multi-agent behavior remains unchanged.
    
    ## Verification
    
    - `just test -p codex-protocol`
    - `just test -p codex-rollout-trace`
    - `just test -p codex-web-search-extension`
    - `just test -p codex-core
    encrypted_multi_agent_v2_spawn_sends_agent_message_to_child`
    - `just test -p codex-core
    plaintext_multi_agent_v2_completion_sends_agent_message`
    - `just test -p codex-core
    multi_agent_v2_followup_task_completion_notifies_parent_on_every_turn`
    - `just test -p codex-core
    multi_agent_v2_completion_queues_message_for_direct_parent`
  • [codex] Use local environment for user shell commands (#28163)
    ## Why
    
    User shell commands still read the legacy turn cwd and session shell
    even though execution context is now owned by selected turn
    environments. App-server also defines `thread/shellCommand` as a
    local-host escape hatch, so it must use an available local environment
    even when a remote environment is primary.
    
    ## What changed
    
    - Add `ResolvedTurnEnvironments::local()` to find the selected local
    environment.
    - Resolve the user shell command cwd and shell from that local
    `TurnEnvironment`.
    - Emit the standard `shell is unavailable in this session` error when no
    selected local environment or resolved local shell is available.
    - Add an integration test covering `/shell` without a local environment.
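    
    The lookup described above might look roughly like this. The
    `TurnEnvironment` struct and `user_shell_context` function are
    simplified assumptions; only the error text is taken from the
    description.
    
    ```rust
    /// Simplified stand-in for a resolved turn environment.
    #[derive(Debug, PartialEq)]
    pub struct TurnEnvironment {
        pub id: String,
        pub is_local: bool,
        pub cwd: String,
        /// `None` models a local environment whose shell was not resolved.
        pub shell: Option<String>,
    }
    
    /// Error text quoted in the change description.
    pub const SHELL_UNAVAILABLE: &str = "shell is unavailable in this session";
    
    /// Pick the cwd and shell for a user shell command from the first local
    /// environment, even when a remote environment is listed first.
    pub fn user_shell_context(
        envs: &[TurnEnvironment],
    ) -> Result<(&str, &str), &'static str> {
        let local = envs
            .iter()
            .find(|env| env.is_local)
            .ok_or(SHELL_UNAVAILABLE)?;
        let shell = local.shell.as_deref().ok_or(SHELL_UNAVAILABLE)?;
        Ok((local.cwd.as_str(), shell))
    }
    ```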
    
    ## Test plan
    
    - `just test -p codex-core
    user_shell_command_without_local_environment_emits_error`
  • [codex] Use expect in integration tests (#28441)
    The workspace denies `clippy::expect_used` in production. Although
    `clippy.toml` allows `expect` in tests, Bazel Clippy compiles
    integration-test helper code in a way that does not receive that
    exemption, which encouraged verbose `unwrap_or_else(... panic!(...))`
    and equivalent `match`/`let else` forms.
    
    This allows `clippy::expect_used` once at each integration-test crate
    root (including aggregated suites and test-support libraries), then
    replaces manual panic-based Result and Option unwraps with
    `expect`/`expect_err`. Standalone `tests/*.rs` files remain their own
    crate roots. Intentional assertion and unexpected-variant panics remain
    unchanged, and the production `expect_used = "deny"` lint remains in
    place.
    
    The cleanup is mechanical and net-negative in line count.
  • [codex] Add interruptible sleep tool (#28429)
    ## Why
    
    Models sometimes need to pause briefly while waiting for external work,
    but using a shell command for that delay ties the wait to a process and
    does not naturally resume when new turn input arrives.
    
    ## What changed
    
    - add a built-in `sleep` tool behind the under-development `sleep_tool`
    feature
    - accept a bounded `duration_ms` argument, matching the millisecond
    convention used by unified exec
    - end the sleep early when either steered user input or mailbox input
    arrives
    - include elapsed wall-clock time in completed and interrupted outputs
    - emit a dedicated core `SleepItem` through `item/started` and
    `item/completed`
    - expose the sleep item as app-server v2 `ThreadItem::Sleep` and retain
    it in reconstructed thread history
    - regenerate the configuration schema for the new feature flag
    - regenerate app-server JSON and TypeScript schema fixtures
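    
    The early-wake behavior can be sketched with a standard-library
    channel. This is an illustration only: `interruptible_sleep`,
    `SleepOutcome`, and the `MAX_SLEEP_MS` bound are assumptions, and a
    single channel stands in for both steered user input and mailbox input.
    
    ```rust
    use std::sync::mpsc::{Receiver, RecvTimeoutError};
    use std::time::{Duration, Instant};
    
    /// Upper bound for this sketch only; the actual bound is not stated above.
    pub const MAX_SLEEP_MS: u64 = 60_000;
    
    #[derive(Debug, PartialEq)]
    pub enum SleepOutcome {
        Completed,
        Interrupted,
    }
    
    /// Sleep for up to `duration_ms`, returning early if anything arrives on
    /// `input`. Returns the outcome and the elapsed wall-clock time.
    pub fn interruptible_sleep(
        duration_ms: u64,
        input: &Receiver<()>,
    ) -> (SleepOutcome, Duration) {
        let started = Instant::now();
        let budget = Duration::from_millis(duration_ms.min(MAX_SLEEP_MS));
        let outcome = match input.recv_timeout(budget) {
            Ok(()) => SleepOutcome::Interrupted,
            Err(RecvTimeoutError::Timeout) => SleepOutcome::Completed,
            // No sender is left to interrupt, so wait out the remaining time.
            Err(RecvTimeoutError::Disconnected) => {
                std::thread::sleep(budget.saturating_sub(started.elapsed()));
                SleepOutcome::Completed
            }
        };
        (outcome, started.elapsed())
    }
    ```
    
    Waiting on a channel rather than spawning a shell `sleep` is what lets
    new input end the wait without a process being involved.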
    
    ## Test plan
    
    - `just test -p codex-core sleep_tool_follows_feature_gate`
    - `just test -p codex-core any_new_input_interrupts_sleep`
    - `just test -p codex-app-server-protocol`
    - `just test -p codex-app-server
    sleep_emits_started_and_completed_items`
  • [codex] Bind shell snapshots to retained thread environments (#28421)
    ## Why
    
    Shell snapshots are currently session-scoped even though shell and cwd
    are properties of a selected turn environment. That makes snapshot
    refresh depend on separate session-cwd plumbing, prevents retained
    environments from retaining their snapshot work, and can make snapshot
    construction use a different shell than command execution.
    
    This follows #27955 by making the retained thread-environment service
    own environment snapshot lifecycles. Session configuration remains the
    requested selection state, while `ThreadEnvironments` remains the source
    of successfully resolved environments.
    
    ## What changed
    
    - Configure the shell-snapshot builder before initial environment
    resolution.
    - Start each local environment snapshot task when its `TurnEnvironment`
    is built and retain that shared task while environment ID and cwd still
    match.
    - Inherit retained environment snapshots into spawned child threads.
    - Carry the selected `TurnEnvironment` through shell runtimes so
    snapshot construction and command execution use the same
    environment-specific shell and cwd.
    - Load project instructions and warm plugins/skills after initial
    environment resolution.
    - Continue decoding invalid UTF-8 instruction files lossily without
    emitting a startup warning.
    - Keep requested selections in `SessionConfiguration`; failed or
    duplicate resolutions only affect the resolved environment snapshot.
    
    ## Validation
    
    - `cargo check -p codex-core --tests`
    - `just test -p codex-home instructions` (6 passed)
    - Focused environment, instruction, shell-snapshot, and user-shell tests
    (84 passed)
    - Focused shell-snapshot, user-shell, and unified-exec tests (126
    passed; two event-timing tests passed on retry)
  • Run core integration tests against a Wine-backed Windows executor (#28401)
    ## Why
    
    We want to exercise a Linux app-server against a Windows exec-server
    without having to repeat every test case. The remote Docker test setup
    provides some precedent for this approach.
    
    ## What
    
    Run the shared `codex-core` integration suite against Windows
    exec-server behavior from Linux. This makes cross-OS path and shell
    regressions visible while keeping unsupported cases owned by individual
    tests.
    
    - Add `local`, `docker`, and `wine-exec` test environment selection with
    legacy Docker compatibility.
    - Extend `codex_rust_crate` to generate a sharded Wine-exec variant
    using a cross-built Windows server and pinned Bazel Wine/PowerShell
    runtimes.
    - Teach remote-aware helpers about Windows paths and track temporary
    incompatibilities with source-local `skip_if_wine_exec!` calls and
    follow-up reasons.
  • Add a toggle for realtime startup context (#28405)
    ## Summary
    - Add `includeStartupContext` to realtime start requests so callers can
    explicitly skip Codex startup context while keeping the backend prompt
    - Thread the new flag through protocol types, request processing, and
    realtime session config
    - Update app-server docs and coverage for the new default and opt-out
    behavior
    
    ## Testing
    - Added protocol serialization coverage for `includeStartupContext`
    - Added realtime integration coverage for starting a session with
    startup context disabled
  • [codex] Fix missing response item metadata in tests (#28415)
    Summary
    - Add the two missing `metadata: None` initializers after #28355 made
    response-item metadata required.
    - Restore test compilation for `codex-core` and `codex-api` on main.
    
    Validation
    - `git diff --check`
    - `just fmt` (Rust formatting passed; unrelated Python formatter steps
    could not use the sandboxed shared `uv` cache)
    - Focused crate tests are running after PR creation.
  • Use PathUri in filesystem permission paths for exec-server (#28165)
    ## Why
    
    This is a step toward letting app-server and exec-server run on
    different platforms, specifically for sandbox configuration.
    
    ## What
    
    - Make the filesystem path containment hierarchy generic, defaulting to
    `AbsolutePathBuf` for now.
    - Have clients specify `AbsolutePathBuf` or `PathUri` directly where
    needed.
    - Use `PathUri` throughout exec-server filesystem protocol and trait
    boundaries.
    - Implement `From` for conversion to path URIs and `TryFrom` for
    fallible conversion to absolute paths through the generic type
    hierarchy.
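    
    The asymmetry between the two conversions can be sketched as follows.
    Both types here are hypothetical simplifications (POSIX-style paths
    only, no percent-encoding) and are not the real `AbsolutePathBuf` or
    `PathUri`.
    
    ```rust
    use std::convert::TryFrom;
    
    /// Hypothetical host-native absolute path (POSIX-style in this sketch).
    #[derive(Clone, Debug, PartialEq)]
    pub struct AbsolutePath(pub String);
    
    /// Hypothetical platform-neutral path URI.
    #[derive(Clone, Debug, PartialEq)]
    pub struct PathUri(pub String);
    
    impl AbsolutePath {
        pub fn new(path: &str) -> Option<Self> {
            path.starts_with('/').then(|| AbsolutePath(path.to_string()))
        }
    }
    
    /// Every absolute path can be expressed as a URI, so this is infallible.
    impl From<AbsolutePath> for PathUri {
        fn from(path: AbsolutePath) -> Self {
            PathUri(format!("file://{}", path.0))
        }
    }
    
    /// Not every URI names a path that is valid on this host, so this
    /// direction can fail.
    impl TryFrom<PathUri> for AbsolutePath {
        type Error = String;
    
        fn try_from(uri: PathUri) -> Result<Self, Self::Error> {
            let rest = uri.0.strip_prefix("file://").ok_or("not a file URI")?;
            AbsolutePath::new(rest)
                .ok_or_else(|| format!("not absolute on this host: {}", rest))
        }
    }
    ```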
  • Add realtime speech append control (#27917)
    ## Why
    
    Realtime voice harness tuning needs app-side control over what backend
    Codex text is spoken. Backend orchestrator text is written for a reading
    UI, so automatically speaking every preamble, progress update, or final
    assistant message can make the realtime voice model too chatty.
    
    For experimentation, clients need two simple controls: keep app/client
    text-item injection on the existing item-create path, and add an
    explicit speakable path that app code can call only when it wants
    realtime to speak. Automatic Codex output also needs an opt-in way to
    switch from the protocol's default speakable path to regular realtime
    items, with a caller-provided prefix so prompt wording can be tuned
    outside core.
    
    The default remains unchanged: if a client omits the new start fields
    and never calls `appendSpeech`, automatic backend output continues down
    the existing speakable path for the selected realtime protocol.
    
    ## What Changed
    
    - Adds experimental `thread/realtime/appendSpeech` for app-provided
    speakable text.
    - Keeps existing `thread/realtime/appendText` as the item-create API for
    app-provided realtime text items.
    - Adds `codexResponsesAsItems` / `codex_responses_as_items` on
    `thread/realtime/start` to send automatic Codex responses with
    `conversation.item.create` instead of the protocol's default speakable
    output path.
    - Adds `codexResponseItemPrefix` / `codex_response_item_prefix` so
    clients can prepend experiment instructions to those automatic Codex
    response items.
    - Keeps literal `conversation.handoff.append` routing scoped to the v1
    speakable path; v2 default speech uses its item/function-output plus
    `response.create` behavior.
    - Removes the earlier public silent-context API and hardcoded
    silent-context prefix.
    - Updates realtime tests to cover default automatic speakable behavior,
    opt-in automatic item-create behavior, and explicit `appendSpeech`
    behavior.
    
    ## Validation
    
    - `cargo check -p codex-core -p codex-app-server -p codex-api`
    - `just test -p codex-app-server realtime_conversation`
    - `just test -p codex-core realtime_conversation` (50/51 passed in the
    filtered parallel run; the lone failure passed when rerun in isolation)
    - `just test -p codex-core
    conversation_mirrors_assistant_message_text_to_realtime_handoff`
    - `just test -p codex-api
    e2e_connect_and_exchange_events_against_mock_ws_server`
    - `just fix -p codex-core`
    - `just fix -p codex-app-server`
    - `cargo build -p codex-cli`
  • [codex] retain resolved environments across turns (#27955)
    ## Why
    
    Selected execution environments are thread-scoped resources, but startup
    and turn construction repeatedly resolved their IDs and working
    directories. That discarded existing environment handles and shell
    metadata even when a selection had not changed.
    
    Session configuration updates also need to affect future turns without
    changing the resolved environment set already captured by a running
    turn.
    
    ## What changed
    
    - Create a `ThreadEnvironments` service inside `Codex` from the spawned
    `EnvironmentManager` and raw environment selections, then store it on
    `SessionServices`.
    - Split service construction from `update_selections`, allowing session
    configuration updates to mutate the resolved set in place.
    - Retain an existing `TurnEnvironment` when its environment ID and
    working directory match; resolve only added or changed selections and
    remove selections that are no longer present.
    - Normalize duplicate IDs by keeping the first selection and skip
    individual selections that fail to resolve instead of rejecting the
    entire update.
    - Give each `TurnContext` a cloned `TurnEnvironmentSnapshot`, so later
    session configuration updates affect future turns without rewriting an
    active turn.
    - Reuse the service-owned environment manager and resolved snapshot for
    startup work, MCP initialization, and child-thread spawning instead of
    flowing resolved environments through spawn arguments.
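    
    The retain, dedupe, and skip rules above can be sketched over plain
    vectors. The types and `update_selections` signature below are
    assumptions for illustration (the description says the service mutates
    its set in place), and `generation` exists only to show which entries
    were re-resolved.
    
    ```rust
    use std::collections::HashSet;
    
    #[derive(Clone, Debug, PartialEq)]
    pub struct Selection {
        pub id: String,
        pub cwd: String,
    }
    
    /// Simplified resolved environment.
    #[derive(Clone, Debug, PartialEq)]
    pub struct TurnEnvironment {
        pub id: String,
        pub cwd: String,
        pub generation: u32,
    }
    
    /// Build the next resolved set from the current set and new selections.
    pub fn update_selections(
        current: &[TurnEnvironment],
        selections: &[Selection],
        resolve: impl Fn(&Selection) -> Option<TurnEnvironment>,
    ) -> Vec<TurnEnvironment> {
        let mut seen: HashSet<&str> = HashSet::new();
        let mut next = Vec::new();
        for selection in selections {
            // Duplicate IDs: only the first selection is considered.
            if !seen.insert(selection.id.as_str()) {
                continue;
            }
            // Retain the existing environment when both ID and cwd match.
            let retained = current
                .iter()
                .find(|env| env.id == selection.id && env.cwd == selection.cwd)
                .cloned();
            // Otherwise resolve; a failure skips only this selection.
            if let Some(env) = retained.or_else(|| resolve(selection)) {
                next.push(env);
            }
        }
        // Selections that are no longer present are dropped implicitly.
        next
    }
    ```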
    
    ## Test plan
    
    - `cargo check -p codex-core --tests`
    - `just test -p codex-core environment_selection`
    - `just test -p codex-core turn_environments`
    - `just test -p codex-core
    session_update_settings_does_not_rewrite_sticky_environment_cwds`
    - `just test -p codex-core
    default_turn_does_not_overlay_legacy_fallback_cwd_onto_stored_thread_environments`
  • Deflake realtime handoff steering test (#28300)
    ## Summary
    - keep the realtime mock websocket open for the handoff steering test
    after scripted responses
    - avoid racing the mock server close before the standalone handoff
    append is observed, which was showing up as a Windows timeout in CI
    
    __Details__:
    The sampled failures appear to be caused by the following sequence:
    1. The mock websocket sends `conversation.handoff.requested`.
    2. The mock immediately closes the websocket because
    `start_websocket_server(...)` defaults to `close_after_requests: true`.
    3. On Windows, that close often surfaces as `os error 10053` or
    `os error 10054`.
    4. The realtime stream shuts down before the routed handoff finishes
    creating/steering the follow-up request.
    5. The test waits for the expected follow-up event and times out.
    
    The PR changes only step 2: for this test, the mock websocket stays open
    after sending the scripted handoff event. The same handoff event is
    still sent, and the test still asserts the important steering behavior:
    1. first Responses request has the original prompt
    2. first request does not contain realtime delegation
    3. second Responses request does contain the realtime delegation
    
    ## Validation
    - `just fmt`
    - `just test -p codex-core --test all
    suite::realtime_conversation::inbound_handoff_request_steers_active_turn`
    
    ## Recent CI failures with the same signature
    
    - https://github.com/openai/codex/actions/runs/27538033492/job/81392362858
      - 2026-06-15, `[codex] update multi-agent v2 prompts`
      - same test failed after `conversation.handoff.requested`; websocket
        read failed with `os error 10053`
    - https://github.com/openai/codex/actions/runs/27543877820/job/81412200651
      - 2026-06-15, `feat: dispatch queued user messages through core idle extensions`
      - same test failed; websocket read failed with `os error 10054`
    - https://github.com/openai/codex/actions/runs/27544342375/job/81413801641
      - 2026-06-15, `[codex] Make marketplace loading capability aware`
      - same test failed; websocket read failed with `os error 10053`
  • [codex] Reuse Apps policy evaluation across MCP tool exposure (#27813)
    ## Summary
    
    - move `AppToolPolicyEvaluator` and the Apps config/requirements policy
    logic from `codex-core` into `codex-connectors`
    - resolve one immutable policy snapshot per exposure build and reuse it
    across every Codex Apps MCP tool
    - keep core as a thin adapter from MCP metadata to connector-owned
    policy input while preserving the call-time defense-in-depth check
    
    ## Why
    
    `build_mcp_tool_exposure` evaluates every Codex Apps tool on each
    sampling request. The old path rebuilt effective Apps configuration for
    every tool, and the policy implementation lived in the already-large
    core crate even though it is connector-specific.
    
    The connector-owned evaluator keeps the expensive config merge/decode
    out of the loop and gives core only the effective policy result it
    needs.
    
    ## Performance
    
    With the real 557-tool Apps corpus, `build_mcp_tool_exposure` measured
    3.74 ms and 3.33 ms after the extraction (3.54 ms mean). The original
    path measured 807 ms mean, so the final result retains the 99.6%
    reduction.
    
    ## Validation
    
    - `cargo check -p codex-connectors -p codex-core`
    - `just test -p codex-connectors` — 15 passed
    - `just test -p codex-core --lib connectors` — 35 passed
    - `just test -p codex-core --lib mcp_tool_exposure` — 5 passed
    - `just test -p codex-core --lib mcp_tool_call` — 72 passed
    - `just bazel-lock-update`
    - `just bazel-lock-check`
    - `just fix -p codex-connectors`
    - `just fix -p codex-core`
    - `just fmt`
  • Respect blocking PostToolUse hooks in code mode (#28365)
    ## Summary
    
    Make blocking hook behavior reliable for tools invoked from code mode.
    
    Previously, a `PostToolUse` hook could block a completed tool result,
    but code mode would still return the original typed result to
    JavaScript. The hook appeared blocked in hook telemetry while the
    running script continued with the result.
    
    This change:
    
    - rejects the nested JavaScript tool promise when `PostToolUse` blocks
    - normalizes `decision: "block"` and exit code 2 to the same blocking
    behavior
    - surfaces the hook feedback as the rejected promise's error
    - adds end-to-end coverage for the relevant PreToolUse and PostToolUse
    interactions
    
    ## Hook semantics in code mode
    
    | Hook behavior | Code-mode result |
    |---|---|
    | PreToolUse block | Reject the promise before the tool executes |
    | PreToolUse `updatedInput` | Execute the rewritten invocation and
    return its result |
    | PostToolUse `decision: "block"` | Execute the tool, then reject the
    promise with the hook reason |
    | PostToolUse exit code 2 | Same behavior as `decision: "block"` |
    | PostToolUse `continue: false` | Preserve the existing feedback-only
    behavior; do not reject the promise |
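
    The PostToolUse rows of the table can be modeled as a small mapping
    from hook outcome to promise settlement. This is a sketch under
    assumptions: `PostToolUseOutcome` and `settle_promise` are hypothetical
    names, and a Rust `Result` stands in for the JavaScript promise.

    ```rust
    // Hypothetical model of what a PostToolUse hook can report.
    #[derive(Debug, PartialEq)]
    enum PostToolUseOutcome {
        DecisionBlock { reason: String },
        ExitCode2 { feedback: String },
        ContinueFalse,
        Allow,
    }

    /// `Err` models a rejected promise, `Ok` a resolved one.
    fn settle_promise(
        outcome: PostToolUseOutcome,
        tool_result: String,
    ) -> Result<String, String> {
        match outcome {
            // Both blocking forms reject with the hook feedback; the
            // original tool result never reaches the script.
            PostToolUseOutcome::DecisionBlock { reason } => Err(reason),
            PostToolUseOutcome::ExitCode2 { feedback } => Err(feedback),
            // `continue: false` stays feedback-only: the promise resolves.
            PostToolUseOutcome::ContinueFalse | PostToolUseOutcome::Allow => {
                Ok(tool_result)
            }
        }
    }
    ```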
    
    ## Test coverage
    
    Added or strengthened end-to-end coverage proving that:
    
    - a PreToolUse block rejects the JavaScript promise before execution
    - a PreToolUse input rewrite executes only the rewritten command
    - JavaScript receives the rewritten command's result
    - PostToolUse `decision: "block"` rejects after the command executes
    - PostToolUse exit code 2 has the same behavior
    - the hook observes the original completed tool response
    - the blocked original result does not reach JavaScript
    - existing direct-mode replacement behavior remains intact
    - `continue: false` without a reason produces deterministic fallback
    feedback
  • [codex] Add created-by-me remote plugin marketplace (#28203)
    ## Summary
    - add the `created-by-me-remote` marketplace backed by paginated
    `scope=USER` plugin directory and installed-plugin requests
    - include USER plugins in installed-plugin caching, bundle sync, and
    stale-cache cleanup without client-side discoverability filtering
    - expose the marketplace through app-server v2 and regenerate the
    protocol schemas
    
    ## Testing
    - `cargo build -p codex-app-server --bin codex-app-server`
    - production-auth `plugin/list` smoke test for `created-by-me-remote`
    (returned the expected USER plugin as installed and enabled)
    - `just test -p codex-core-plugins` (221 passed)
    - `just test -p codex-app-server-protocol` (231 passed)
    - `just test -p codex-app-server suite::v2::plugin_list::` (37 passed)
    - `just fix -p codex-core-plugins -p codex-app-server-protocol -p
    codex-app-server`
    - `just fmt`
  • feat(core): add metadata field to ResponseItem (#28355)
    ## Description
    
    This PR adds an optional `metadata` field to `ResponseItem` for
    Responses API calls. It is mechanical plumbing only: no values are
    populated or sent yet. Even so, adding a new field to `ResponseItem`
    already has a large blast radius.
    
    This change is backwards compatible because `metadata` is optional and
    omitted when absent, so existing response items and rollout history
    without it still deserialize and requests that do not set it keep the
    same wire shape. For provider compatibility, we strip out `metadata`
    before non-OpenAI Responses requests so Azure and AWS Bedrock never see
    this field.
    
    My followup PR here will actually make use of it to start storing and
    passing along `turn_id`: https://github.com/openai/codex/pull/28360
    
    ## What changed
    
    - Added `ResponseItemMetadata` with optional `turn_id`, plus optional
    `metadata` on Responses API item variants and inter-agent communication.
    - Preserved item metadata through response-item rewrites such as
    truncation, missing tool-output synthesis, compaction history
    rebuilding, visible-history conversion, rollout/resume, and generated
    app-server schemas/types.
    - Strip item metadata from non-OpenAI Responses requests while
    preserving it for OpenAI-shaped requests.
    - Updated the mechanical fixture/test construction churn required by the
    new optional field.
  • core: cache the tool search handler per session (#27258)
    ## Why
    
    Tool router construction rebuilds the deferred-tool BM25 index during
    session initialization and before each sampling continuation, even when
    the searchable tool metadata is unchanged. Local profiling measured
    `append_tool_search_executor` at roughly 113 ms per continuation, making
    repeated index construction the largest measured router-building cost.
    
    ## What changed
    
    - Add a session-scoped `ToolSearchHandlerCache` so continuations and
    user turns can reuse the existing handler.
    - Key reuse on the complete ordered `Vec<ToolSearchInfo>`, rebuilding
    when searchable text, loadable tool specs, source metadata, or ordering
    changes.
    - Build handlers outside the cache lock and recheck before publishing
    them, avoiding holding the mutex during index construction.
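
    A minimal sketch of the build-outside-the-lock pattern, assuming a
    single cached slot keyed on the full ordered input. The field layout,
    the `Handler` stand-in, and `get_or_build` are illustrative guesses, not
    the actual `ToolSearchHandlerCache` implementation.

    ```rust
    use std::sync::{Arc, Mutex};

    // Hypothetical fields; the real `ToolSearchInfo` differs.
    #[derive(Clone, Debug, PartialEq)]
    struct ToolSearchInfo {
        name: String,
        text: String,
    }

    // Stand-in for the handler that owns the BM25 index.
    struct Handler {
        index_size: usize,
    }

    #[derive(Default)]
    struct ToolSearchHandlerCache {
        slot: Mutex<Option<(Vec<ToolSearchInfo>, Arc<Handler>)>>,
    }

    impl ToolSearchHandlerCache {
        fn lookup(&self, infos: &[ToolSearchInfo]) -> Option<Arc<Handler>> {
            let slot = self.slot.lock().unwrap();
            match slot.as_ref() {
                // Slice equality compares contents and order.
                Some((key, handler)) if key.as_slice() == infos => {
                    Some(Arc::clone(handler))
                }
                _ => None,
            }
        }

        fn get_or_build(
            &self,
            infos: &[ToolSearchInfo],
            build: impl FnOnce(&[ToolSearchInfo]) -> Handler,
        ) -> Arc<Handler> {
            if let Some(hit) = self.lookup(infos) {
                return hit;
            }
            // The expensive build runs with the mutex released.
            let built = Arc::new(build(infos));
            let mut slot = self.slot.lock().unwrap();
            // Recheck: another caller may have published the same key.
            if let Some((key, handler)) = slot.as_ref() {
                if key.as_slice() == infos {
                    return Arc::clone(handler);
                }
            }
            *slot = Some((infos.to_vec(), Arc::clone(&built)));
            built
        }
    }
    ```

    The trade-off is that two racing callers may both build an index, but
    only one result is published and neither blocks the other while
    building.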
    
    ## Verification
    
    - `cache_reuses_identical_search_infos_and_rebuilds_changed_inputs`
    covers exact cache reuse and invalidation when the ordered search
    metadata changes.
    - Local rollout profiling showed the initial router build populating the
    cache and unchanged later continuations reusing it:
      - uncached: 118 ms median across 14 spans from 3 rollouts
      - cached: 4 ms median across 12 spans from 3 rollouts
  • Add hidden Windows sandbox wrapper entrypoint (#28358)
    ## Why
    
    This is the second PR in the Windows fs-helper sandbox stack. The
    fs-helper path needs a Windows sandbox launcher that has the same
    argv-shaped contract as macOS `sandbox-exec` and `codex-linux-sandbox`,
    but this PR only introduces that hidden launcher. It does not route
    fs-helper through it yet.
    
    The hidden launcher still needs to be policy-complete before later
    direct-spawn callers use it. In particular, it has to carry the same
    Windows sandbox policy details that the existing spawn paths already
    understand: proxy enforcement, read/write root overrides, and
    deny-read/deny-write overrides.
    
    ## What Changed
    
    - Added the hidden `codex.exe --run-as-windows-sandbox` arg1 dispatch
    path.
    - Added `windows-sandbox-rs/src/wrapper.rs`, which parses the wrapper
    argv, launches the requested command through the shared Windows sandbox
    session runner from PR1, and forwards stdio.
    - Added `create_windows_sandbox_command_args_for_permission_profile()`
    so later direct-spawn callers can build the wrapper argv consistently.
    - Made the wrapper argv round-trip the full Windows sandbox policy
    surface it needs later: workspace roots, environment, permission
    profile, sandbox level, private desktop, proxy enforcement, read/write
    root overrides, and deny-read/deny-write overrides.
    - Carried `proxy_enforced` through the shared Windows session request so
    proxy-managed executions continue to use the offline/elevated sandbox
    identity.
    - Added wrapper argument round-trip coverage for the full policy fields.
    
    ## Verification
    
    - `just test -p codex-windows-sandbox windows_wrapper_args_round_trip`
    - `just test -p codex-arg0`
    - `just test -p codex-core exec::tests::windows_`
    - `just fix -p codex-windows-sandbox -p codex-core -p codex-cli`
    
    Local note: the full `just fmt` command still fails on this workstation
    in non-Rust formatter setup (`uv` cache access denied and missing
    `dotslash`/buildifier), but the Rust formatter phase completed.
  • Add Windows unified exec yield floor (#27086)
    ## Why
    
    The Windows `unified_exec` experiment regressed at the turn level in a
    way that points to premature backgrounding / extra command cycles rather
    than individual responses getting heavier:
    
    - `codex_local_tool_calls_per_turn` was up about 20.7%.
    - `codex_local_blended_tokens_per_turn` was up about 4.1%, and
    `codex_local_output_tokens_per_turn` was up about 4.0%.
    - `codex_local_response_latency_per_turn` was up about 8.3%.
    - The primary activity metrics also moved down: `codex_turns` about
    -6.6%, `codex_dau` about -1.0%, and `codex_local_hourly_active_users`
    about -3.0%.
    
    At the same time, the per-response metrics moved in the other direction:
    blended tokens per response, output tokens per response, and latency per
    response were all lower in test. That suggests the bad turn-level shape
    is largely about extra tool/model cycles, not each response being slower
    or more expensive on its own.
    
    Local Windows benchmarking showed the likely mechanism: shell-wrapped
    commands pay a large PowerShell startup/teardown tax before the actual
    command has much time to run. In the benchmark, the PowerShell wrapper
    added roughly 0.7-1.0s versus direct exec:
    
    - Windows PowerShell: about 740ms p50 / 800ms p90 overhead versus direct
    exec.
    - PowerShell 7 (`pwsh`): about 930ms p50 / 980ms p90 overhead versus
    direct exec.
    
    The model commonly asks for a 1s initial yield. On Windows, that can
    spend nearly the whole window waiting on PowerShell machinery, so
    otherwise-short commands are more likely to return as background
    sessions and require follow-up polling/tool calls.
    
    This is intentionally a temporary unlock. It gives Windows closer to the
    same useful post-shell command window as other platforms while we work
    on reducing the PowerShell tax directly, for example with persistent
    PowerShell workers or conservative direct-exec paths for commands that
    do not need shell semantics.
    
    ## What changed
    
    - Adds a Windows-only 2s floor to `unified_exec`'s initial
    `yield_time_ms` clamp.
    - Keeps larger model-requested waits unchanged, including the existing
    10s default.
    - Keeps the existing 30s max clamp.
    - Leaves non-Windows behavior unchanged.
    - Adds platform-gated tests for both the Windows floor and the
    non-Windows clamp behavior.
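
    As a rough illustration of the clamp, using the 2s floor, 10s default,
    and 30s max from the bullets above. The function name, the
    `existing_min_ms` parameter (the pre-existing lower bound, whose value
    is not stated here), and the `is_windows` flag are assumptions; real
    code would more likely gate on the target platform at compile time.

    ```rust
    const WINDOWS_FLOOR_MS: u64 = 2_000;
    const DEFAULT_YIELD_MS: u64 = 10_000;
    const MAX_YIELD_MS: u64 = 30_000;

    /// Clamps the model-requested initial yield.
    /// Assumes `existing_min_ms <= MAX_YIELD_MS`.
    fn clamp_initial_yield_ms(
        requested_ms: Option<u64>,
        existing_min_ms: u64,
        is_windows: bool,
    ) -> u64 {
        // Only Windows raises the lower bound; other platforms keep theirs.
        let floor = if is_windows {
            existing_min_ms.max(WINDOWS_FLOOR_MS)
        } else {
            existing_min_ms
        };
        requested_ms
            .unwrap_or(DEFAULT_YIELD_MS)
            .clamp(floor, MAX_YIELD_MS)
    }
    ```

    With this shape, a 1s request becomes 2s on Windows and stays 1s
    elsewhere, while requests above the floor pass through until the 30s
    cap.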
    
    ## Verification
    
    - `just test -p codex-core unified_exec`
  • core: let steer interrupt wait_agent (#28341)
    ## Why
    
    `wait_agent` can block for a long timeout while waiting for sub-agent
    mailbox activity. Although same-turn user steer is accepted during that
    tool call, the input remains pending until the wait returns, so an
    explicit request to change direction can appear unresponsive.
    
    ## What changed
    
    - Notify active `wait_agent` calls when user input is steered into the
    current turn.
    - Check for already-pending steer input when subscribing so input that
    races with tool startup is not missed.
    - Distinguish mailbox activity, steered input, and timeout outcomes,
    returning `Wait interrupted by new input.` for the steer path.
    - Update the `wait_agent` tool description to document the early-return
    behavior.
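
    The three outcomes can be sketched with a blocking channel. This is a
    synchronous stand-in built on `std::sync::mpsc`, not the actual
    notification mechanism; `WakeReason`, `WaitOutcome`, and the
    `steer_already_pending` flag are hypothetical.

    ```rust
    use std::sync::mpsc::Receiver;
    use std::time::Duration;

    // Hypothetical wake-up signals delivered to a waiting tool call.
    #[derive(Debug, PartialEq)]
    enum WakeReason {
        MailboxActivity,
        SteeredInput,
    }

    #[derive(Debug, PartialEq)]
    enum WaitOutcome {
        MailboxActivity,
        Interrupted(&'static str),
        TimedOut,
    }

    const INTERRUPTED: &str = "Wait interrupted by new input.";

    fn wait_agent(
        wakeups: &Receiver<WakeReason>,
        steer_already_pending: bool,
        timeout: Duration,
    ) -> WaitOutcome {
        // Input that raced with tool startup is checked at subscription time.
        if steer_already_pending {
            return WaitOutcome::Interrupted(INTERRUPTED);
        }
        match wakeups.recv_timeout(timeout) {
            Ok(WakeReason::MailboxActivity) => WaitOutcome::MailboxActivity,
            Ok(WakeReason::SteeredInput) => WaitOutcome::Interrupted(INTERRUPTED),
            // Timeout and a dropped sender are both treated as a timeout here.
            Err(_) => WaitOutcome::TimedOut,
        }
    }
    ```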
    
    ## Testing
    
    - `just test -p codex-core input_queue_`
    - `just test -p codex-core wait_agent`
    
    The coverage includes steer notification before and after subscription,
    plus an end-to-end test that verifies the interrupted wait result and
    steered user input are both included exactly once in the follow-up model
    request.
  • guardian: isolate review context from skills and memories (#28285)
    ## Why
    
    Guardian reviews embed the parent session transcript as untrusted
    evidence. Skill or plugin mentions in that transcript must not be
    interpreted as requests to inject more instructions into the Guardian
    request, and memory context adds unrelated model-visible context to an
    approval decision.
    
    Keeping those sources out of the nested review session makes the request
    smaller and preserves the trust boundary around the transcript being
    assessed.
    
    ## What changed
    
    - Skip skill and plugin discovery when building turns for Guardian
    reviewer sessions.
    - Disable memory context and dedicated memory tools in the derived
    Guardian configuration.
    - Extend the Guardian request-layout coverage to verify that a `$skill`
    mention remains visible only as transcript evidence while neither the
    skill body nor memory context is injected.
    - Expand the Guardian configuration test to cover the disabled memory
    settings.
    
    ## Testing
    
    - Updated the Guardian review request snapshot and assertions for skill
    and memory isolation.
    - Extended the Guardian session configuration test to cover memories.
  • [codex] preserve explicit environment cwd (#27995)
    ## Why
    
    `TurnEnvironmentSelections::new` rewrote the primary environment's
    explicit `cwd` to the legacy fallback cwd. For a remote-first selection,
    this could replace the remote working directory with a local fallback
    path and make the legacy cwd overlay authoritative over
    environment-owned state.
    
    ## What changed
    
    - Preserve every explicit environment cwd when constructing turn
    environment selections.
    - Keep `cwd`-only app-server updates compatible by rebuilding the
    default environment selections at the requested cwd.
    - Cover both explicit primary cwd preservation and cwd-only updates
    reaching the model-visible execution environment.
    
    ## Testing
    
    - `just test -p codex-core
    session_update_settings_does_not_rewrite_sticky_environment_cwds`
    - `just test -p codex-core
    environment_settings_preserve_explicit_primary_cwd`
    - `just test -p codex-app-server
    thread_settings_update_cwd_retargets_default_environment`
  • [codex] remove stale PathExt import (#28344)
    ## Why
    
    `main` fails dev-profile Cargo and Bazel Clippy builds because
    `core/src/tools/runtimes/mod_tests.rs` imports `PathExt` after its last
    use was removed. With warnings denied, that stale import prevents
    `codex-core` test targets from compiling across platforms.
    
    ## What changed
    
    Remove the unused `PathExt` import. Remaining `.abs()` calls in the
    module operate on `PathBuf` and continue to use `PathBufExt`.
    
    ## Validation
    
    - `just fmt`
    - Focused `codex-core` test compile attempted; blocked locally by disk
    exhaustion before compilation completed. The CI failure itself is the
    unused-import diagnostic this change removes.
  • avoid cloning websocket request history (#28313)
    ## Why
    
    WebSocket continuations only send the new part of a request. Checking
    whether a request could be continued was cloning the full previous
    request, the current request, and their input history.
    
    For long conversations or large tool lists, that meant copying several
    request-sized values on every continuation.
    
    ## What changed
    
    - compare the request settings by reference
    - check the previous input and server response as borrowed prefixes
    - allocate only the new input items that will be sent
    
    The reuse rules stay the same, including ignoring `client_metadata` for
    this check.
    
    The comparison is still `O(n)`, but it removes several `O(n)`
    allocations and copies. Temporary memory no longer grows by multiple
    full request sizes for each continuation.
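
    The borrowed-prefix check can be illustrated with `slice::strip_prefix`.
    This is a simplified sketch: `incremental_tail` is a hypothetical
    helper, and items are compared with plain `PartialEq` rather than the
    real request types.

    ```rust
    /// Returns the new tail of `current` when it starts with the previous
    /// input followed by the server's response items, borrowing throughout.
    fn incremental_tail<'a, T: PartialEq>(
        previous_input: &[T],
        server_output: &[T],
        current: &'a [T],
    ) -> Option<&'a [T]> {
        current
            .strip_prefix(previous_input)?
            .strip_prefix(server_output)
    }

    /// Only the tail that will actually be sent is allocated.
    fn items_to_send<T: PartialEq + Clone>(
        previous_input: &[T],
        server_output: &[T],
        current: &[T],
    ) -> Option<Vec<T>> {
        incremental_tail(previous_input, server_output, current)
            .map(|tail| tail.to_vec())
    }
    ```

    When the prefix does not match, the helper returns `None` and the caller
    would fall back to sending a full request.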
    
    ## Performance
    
    Local rollout traces show continuation checks on turns around 260k input
    tokens. Before this change the reuse gate cloned the previous request,
    the current request, and the previous input history before deciding
    whether it could continue incrementally. After this change it borrows
    those structures and allocates only the incremental tail. For large
    continuations with a small delta, that removes roughly three
    request-sized copies from the hot path and reduces temporary memory from
    multiple full request sizes to just the new tail.
    
    ## Validation
    
    - `just test -p codex-core
    responses_websocket_v2_creates_with_previous_response_id_on_prefix`
    - `just test -p codex-core
    responses_websocket_v2_creates_without_previous_response_id_when_non_input_fields_change`
  • avoid cloning sampling request input (#28306)
    ## Why
    
    Every model request cloned the full prepared input just to keep it for
    the legacy after-agent hook. That copy gets more expensive as the
    conversation grows.
    
    ## What
    
    Move the prepared input into the sampling loop and return it with the
    result. If the request retries, keep the first input so the hook still
    sees the same data as before.
    
    This removes one `O(n)` clone per sampling request, where `n` is the
    size of the prepared input. It saves `O(n)` copy work and `O(n)`
    temporary memory.
    
    No behavior change is intended.
    
    ## Performance
    
    Local rollout traces show turns reaching roughly 260k input tokens. On
    turns of that size, this removes the only unconditional full
    prepared-input clone on the happy path. That avoids one request-sized
    allocation/copy per sampling attempt for large conversations, and the
    savings scale linearly with request size.
    
    ## Testing
    
    - `just test -p codex-core continue_after_stream_error`
    - `just fix -p codex-core`
  • linearize history output normalization (#28309)
    ## Why
    
    When we prepare the conversation history, every tool call needs a
    matching output.
    
    Before this change, we scanned the full history again for every call. In
    a tool-heavy conversation, that makes the work `O(items x calls)`, or
    `O(n^2)` in the worst case.
    
    ## What
    
    Scan the history once and collect the IDs of existing outputs. Then each
    call can check its ID with an expected `O(1)` lookup.
    
    The full normalization step is now expected `O(n)`. The output order and
    missing-output behavior stay the same.
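
    A minimal sketch of the single-pass approach, assuming a simplified
    `Item` enum. The `"aborted"` placeholder body and placing the
    synthesized output directly after its call are illustrative choices,
    not necessarily what the real normalization does.

    ```rust
    use std::collections::HashSet;

    // Simplified stand-in for history items.
    #[derive(Clone, Debug, PartialEq)]
    enum Item {
        Message(String),
        Call { id: String },
        Output { id: String, body: String },
    }

    fn ensure_call_outputs(items: &[Item]) -> Vec<Item> {
        // Pass 1: collect the IDs of outputs that already exist.
        let have: HashSet<&str> = items
            .iter()
            .filter_map(|item| match item {
                Item::Output { id, .. } => Some(id.as_str()),
                _ => None,
            })
            .collect();

        // Pass 2: each call checks its ID with an expected O(1) lookup.
        let mut out = Vec::with_capacity(items.len());
        for item in items {
            out.push(item.clone());
            if let Item::Call { id } = item {
                if !have.contains(id.as_str()) {
                    out.push(Item::Output {
                        id: id.clone(),
                        body: "aborted".to_string(), // assumed placeholder
                    });
                }
            }
        }
        out
    }
    ```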
    
    ## Performance
    
    Based on local rollout traces, one tool-heavy session reached roughly
    17,050 transcript items with about 4,292 tool-call items. On a history
    of that shape, the old `calls x items` scan does about 73.2 million
    membership checks, while the new pass does about 21.3 thousand set
    inserts/lookups. That is roughly 3.4k times less membership work in this
    normalization step.
    
    ## Validation
    
    - `just test -p codex-core normalize_` (19 passed)
  • [codex] simplify memory read metrics (#28164)
    ## Why
    
    Memory read telemetry currently reconstructs the executable shell
    command after a tool call finishes. That duplicates shell, login-policy,
    and cwd resolution owned by the tool handlers, and can diverge from the
    environment-specific command that unified exec actually ran.
    
    ## What changed
    
    - Expose the existing restricted shell-script parser directly for raw
    script text.
    - Parse `shell_command` and `exec_command` input into plain command argv
    before classifying memory reads.
    - Preserve all-or-nothing safe-command validation for multi-command
    scripts.
    - Remove cwd resolution, shell selection, and the unnecessary async
    boundary from memory read metric emission.
    
    ## Testing
    
    - `just test -p codex-shell-command`
    - `cargo check -p codex-core`
  • [codex] simplify shell snapshot ownership (#27756)
    ## Why
    
    Shell snapshot lifecycle state was split between `Shell` and
    `SessionServices`: `Shell` carried the receiver while session code
    exposed and forwarded the raw sender. That coupled shell identity to
    mutable snapshot state and made refresh, inheritance, and file lifetime
    harder to reason about.
    
    ## What changed
    
    - make each `Arc<ShellSnapshot>` represent one cwd-specific snapshot
    generation
    - store the active generation in `SessionServices` with `ArcSwapOption`
    - have construction start the background build and expose only a
    cwd-validated snapshot path
    - use `ShellSnapshotFile` ownership to delete snapshot files
    automatically
    - pass snapshot paths explicitly to shell runtimes instead of storing
    snapshot state on `Shell`
    - preserve inherited and in-flight generations by pinning their `Arc`
    while they are in use
    
    ## Test plan
    
    - `cargo check -p codex-core --lib`
    - `just test -p codex-core 'shell_snapshot::tests'`
    - `just test -p codex-core
    shell_command_snapshot_still_intercepts_apply_patch`
    - `just test -p codex-core
    shell_snapshot_deleted_after_shutdown_with_skills`
  • skills: hide orchestrator skills with a local executor (#28333)
    ## Why
    
    App-server threads without a local executor need orchestrator-owned
    skills from the hosted `codex_apps` MCP server. Threads with the local
    executor already discover installed skills from the local filesystem.
    
    After the orchestrator skill provider was enabled for every app-server
    thread, local-executor threads also received the hosted skill catalog
    and the `skills.list` and `skills.read` tools. This changed the existing
    local behavior and could expose a second hosted copy of a skill that was
    already installed locally.
    
    ## What changed
    
    - Expose the thread's selected execution environments to extensions at
    thread startup.
    - Enable orchestrator skills only when the reserved local environment is
    not selected.
    - Apply that decision consistently to hosted skill catalog discovery,
    explicit skill injection, and the `skills.list` and `skills.read` tools.
    
    ## Verification
    
    - The existing no-executor app-server test continues to verify hosted
    skill discovery, invocation, and child-resource reads.
    - A new app-server test verifies that local-executor threads do not
    receive hosted skill context or `skills.*` tools.
  • Represent dynamic tools with explicit namespaces internally (#27365)
    Follow-up to #27356.
    
    ## Stack note
    
    This PR changes Codex's internal dynamic-tool shape while leaving
    `thread/start` unchanged. App-server therefore converts the existing
    per-tool input into explicit functions and namespaces before passing it
    to core.
    
    [#27371](https://github.com/openai/codex/pull/27371) updates
    `thread/start` to use the same explicit shape and removes this temporary
    conversion.
    
    ## Why
    
    Dynamic tools repeat namespace metadata on every function. Core should
    keep one explicit namespace with its member tools so descriptions and
    membership stay consistent across sessions and runtime planning.
    
    ## What changed
    
    - Represent dynamic tools as top-level functions or explicit namespaces
    in protocol and session state.
    - Read old flat rollout metadata and write the canonical hierarchy.
    - Flatten namespace members only when registering callable tools.
    - Keep `thread/start.dynamicTools` flat for now and normalize it at the
    app-server boundary.
    
    New builds can read old rollout metadata. Older builds cannot read newly
    written hierarchical metadata.
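
    The flat-to-hierarchy conversion might look roughly like this. The
    `FlatTool` and `DynamicTool` shapes are guesses for illustration only
    and do not mirror the actual protocol types.

    ```rust
    // Assumed legacy shape: every function repeats its namespace metadata.
    #[derive(Clone, Debug, PartialEq)]
    struct FlatTool {
        namespace: Option<String>,
        namespace_description: Option<String>,
        name: String,
    }

    // Assumed canonical shape: top-level functions or explicit namespaces.
    #[derive(Clone, Debug, PartialEq)]
    enum DynamicTool {
        Function { name: String },
        Namespace {
            name: String,
            description: Option<String>,
            tools: Vec<String>,
        },
    }

    fn normalize(flat: Vec<FlatTool>) -> Vec<DynamicTool> {
        let mut out: Vec<DynamicTool> = Vec::new();
        for tool in flat {
            let Some(ns) = tool.namespace else {
                out.push(DynamicTool::Function { name: tool.name });
                continue;
            };
            // Find an already-emitted namespace with the same name.
            let pos = out.iter().position(|entry| {
                matches!(entry, DynamicTool::Namespace { name, .. } if *name == ns)
            });
            match pos {
                Some(i) => {
                    if let DynamicTool::Namespace { tools, .. } = &mut out[i] {
                        tools.push(tool.name);
                    }
                }
                // The first member's description is kept for the namespace.
                None => out.push(DynamicTool::Namespace {
                    name: ns,
                    description: tool.namespace_description,
                    tools: vec![tool.name],
                }),
            }
        }
        out
    }
    ```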
    
    ## Test plan
    
    - `just test -p codex-app-server
    thread_start_normalizes_legacy_dynamic_tools_into_model_request`
    - `just test -p codex-protocol
    session_meta_normalizes_legacy_dynamic_tools`
    - `just test -p codex-core
    resume_restores_dynamic_tools_from_rollout_with_sqlite_enabled`
    - `just test -p codex-core
    tool_search_returns_deferred_dynamic_tool_and_routes_follow_up_call`
    - `just test -p codex-core code_mode_can_call_hidden_dynamic_tools`
    - `just test -p codex-tools`
  • [codex] update multi-agent v2 prompts (#28283)
    ## Summary
    
    - align the default multi-agent v2 root and subagent hints with the
    evaluated prompt guidance for direct collaboration-tool calls, parallel
    delegation, and shared workspaces
    - keep the current `interrupt_agent` tool name and existing
    concurrency-hint placement, with the explicit no-spawn instruction last
    - document the context tradeoff between `fork_turns="none"` and
    `fork_turns="all"` in the v2 `spawn_agent` description
    - extend the focused prompt and tool-surface tests
    
    ## Why
    
    The evaluated multi-agent prompt includes operational guidance that is
    missing from the current Codex defaults. This applies that guidance to
    the current tool surface without restoring stale `close_agent` or
    duplicated concurrency wording.
    
    ## User impact
    
    Multi-agent v2 receives clearer instructions about when and how to
    parallelize work, how agent workspaces interact, and how `fork_turns`
    affects subagent context. The existing default opt-out behavior remains
    in place.
    
    ## Testing
    
    - `just fmt`
    - `just test -p codex-core
    multi_agent_v2_default_usage_hints_use_configured_thread_cap`
    - `just test -p codex-core
    multi_agent_feature_selects_one_agent_tool_family`
  • Add selected-plugin precedence and attribution to the MCP catalog (#27884)
    ## Why
    
    **In short:** this PR resolves already-discovered MCP registrations. It
    does not read selected plugins or discover their MCP servers.
    
    The resolved MCP catalog currently builds config and auto-discovered
    plugin registrations before runtime contributors are applied. A
    thread-selected plugin needs a distinct precedence tier in that same
    initial resolution pass: otherwise a disabled lower-precedence winner
    can leave stale name-level state behind, and the winning MCP tools
    cannot be attributed to the selected package reliably.
    
    This PR adds that catalog boundary before executor discovery is
    connected.
    
    ## What changed
    
    - Added an explicit selected-plugin registration tier between
    auto-discovered plugins and explicit config.
    - Collected selected-plugin contributions before the initial catalog
    build, while leaving compatibility and generic extension overlays in
    their existing runtime phase.
    - Retained the winning plugin ID and display name directly on
    plugin-owned catalog registrations.
    - Derived MCP tool provenance from the winning catalog entry instead of
    joining against local-only plugin summaries.
    - Retained the winning selected server's tool approval policy in the
    running connection manager, so a selected registration cannot inherit
    approval behavior from a losing local plugin.
    - Kept remembered approval session-scoped for selected plugins until
    there is an authority-aware persistence contract; Codex will not write
    approval back to an unrelated local plugin.
    - Preserved existing name-level disabled vetoes for discovered plugins
    and config, while keeping a selected package's own disabled registration
    scoped to that registration.
    - Preserved deterministic selection order and existing config,
    compatibility, and extension precedence.
    
    The resulting order is:
    
    ```text
    auto-discovered plugin
      < selected plugin
      < explicit config
      < compatibility registration
      < extension overlay
    ```
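
    One way to encode this order is an enum whose derived `Ord` follows
    declaration order. The sketch below is hypothetical: it covers only
    precedence and attribution, leaves out the disabled-veto rules, and
    assumes the first registration wins a tie within a tier.

    ```rust
    use std::collections::BTreeMap;

    // Declared lowest to highest, so derived `Ord` matches the list above.
    #[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
    enum Tier {
        AutoDiscoveredPlugin,
        SelectedPlugin,
        ExplicitConfig,
        CompatibilityRegistration,
        ExtensionOverlay,
    }

    #[derive(Clone, Debug, PartialEq)]
    struct Registration {
        server_name: String,
        tier: Tier,
        // Attribution travels with the registration that wins.
        plugin_id: Option<String>,
    }

    fn resolve_catalog(regs: Vec<Registration>) -> BTreeMap<String, Registration> {
        let mut winners: BTreeMap<String, Registration> = BTreeMap::new();
        for reg in regs {
            // Replace only on strictly higher tier (assumed tie-break).
            let replace = winners
                .get(&reg.server_name)
                .map_or(true, |current| reg.tier > current.tier);
            if replace {
                winners.insert(reg.server_name.clone(), reg);
            }
        }
        winners
    }
    ```

    Because provenance is read from the winning entry, a selected plugin
    that beats a local plugin with the same server name is attributed to
    the selected package.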
    
    ## Behavior and scope
    
    This is a catalog and provenance change only. No production host
    contributes selected-plugin MCP registrations yet, so existing local MCP
    behavior remains unchanged.
    
    The stacked follow-up, #27870, installs the executor plugin provider
    that produces these registrations. App-server activation remains a
    separate final step.
    
    ## Verification
    
    Focused tests cover precedence, deterministic selected-plugin conflicts,
    disabled-veto behavior across catalog phases, managed requirements
    before selected-plugin resolution, winning-server approval policy, and
    attribution when local and selected packages share an ID or server name.
    CI owns execution of the test suite.
  • feat(app-server): filter threads by parent (#26662)
    ## Why
    
    Clients that display or coordinate spawned subagents need an
    authoritative snapshot of a thread's immediate spawned children when
    they connect to app-server or recover after missing live events.
    `thread/list` cannot query by parent, so clients must instead scan
    unrelated threads or reconstruct relationships from rollout history and
    transient events.
    
    The direct spawn relationship already exists in persisted
    `thread_spawn_edges` state. Review and Guardian threads do not
    participate in that lifecycle and are intentionally outside this
    filter's scope.
    
    ## What changed
    
    This adds an experimental `parentThreadId` filter to `thread/list`.
    Parent-filtered requests return direct spawned children from persisted
    state while preserving the existing response shape, explicit filters,
    sorting, and timestamp-only cursor behavior. The lookup does not read
    rollout transcripts or recursively return descendants.
    
    Supersedes #25112 with the narrower `thread/list` filter approach.
    
    ## How it works
    
    1. An experimental client passes a valid thread ID as `parentThreadId`.
    2. App-server routes the list through the existing thread-store and
    state-database boundaries.
    3. SQLite selects threads whose IDs have a direct persisted spawn edge
    from that parent.
    4. Omitted provider and source filters include all values; explicit
    filters keep ordinary `thread/list` semantics.
    5. Grandchildren, Review threads, and Guardian threads are excluded.
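
    The selection in steps 3 to 5 can be sketched in memory as follows. This
    is not the SQLite query; `SpawnEdge`, `ThreadSummary`, and
    `list_children` are hypothetical stand-ins, and the source values in any
    usage are made up for illustration.

    ```rust
    use std::collections::HashSet;

    /// Hypothetical in-memory stand-in for a persisted `thread_spawn_edges` row.
    struct SpawnEdge {
        parent_id: String,
        child_id: String,
    }

    /// Hypothetical thread summary; the real response shape has more fields.
    #[derive(Debug, Clone, PartialEq, Eq)]
    struct ThreadSummary {
        id: String,
        source: String,
    }

    /// Return threads that have a direct spawn edge from `parent_id`.
    /// `source_filter = None` includes every source; `Some` narrows the result.
    fn list_children(
        threads: &[ThreadSummary],
        edges: &[SpawnEdge],
        parent_id: &str,
        source_filter: Option<&str>,
    ) -> Vec<ThreadSummary> {
        // Only edges whose parent matches count, so grandchildren never appear.
        let direct: HashSet<&str> = edges
            .iter()
            .filter(|e| e.parent_id == parent_id)
            .map(|e| e.child_id.as_str())
            .collect();
        threads
            .iter()
            .filter(|t| direct.contains(t.id.as_str()))
            .filter(|t| source_filter.map_or(true, |s| t.source == s))
            .cloned()
            .collect()
    }
    ```

    Threads without a spawn edge, such as Review and Guardian threads, fall
    out of the result without any special casing because the lookup starts
    from the edge set.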
    
    ## Verification
    
    State (144 tests), rollout (69 tests), and focused app-server
    thread-list (31 tests) suites passed. Scoped Clippy checks and
    repository formatting also passed. Coverage includes direct spawned
    children, omitted grandchildren, pagination, malformed IDs, mixed source
    kinds, explicit filters, and operation without rollout files.
  • [codex] exec-server honors remote environment cwd and shell (#28122)
    ## Why
    
    The next slice needed to make progress on the `remote_env_windows` test is
    support for passing a Windows cwd to the remote environment and for using
    that environment's native shell. This lets the test run a real Windows
    process instead of only recording an early path or shell mismatch.
    
    ## What
    
    - change `TurnEnvironmentSelection.cwd` from `AbsolutePathBuf` to
    `PathUri`
    - convert local cwd values to URIs when constructing selections
    - preserve a remote primary cwd instead of replacing it with the local
    legacy fallback
    - prefer the selected environment's discovered shell for unified exec,
    falling back to the session shell when unavailable
    - convert back to a host-native absolute path at current native-only
    consumer boundaries
    - reject or deny unsupported foreign cwd values at the existing
    request-permissions boundary, with TODOs for its future migration
    - extend the hermetic Wine test to execute Windows PowerShell in
    `C:\windows` and verify successful process completion
    - record the current app-server rejection against the same Wine-backed
    remote Windows fixture when its cwd is supplied as a native Windows path
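
    The two fallback rules in the list above (shell and primary cwd) amount
    to "use the remote value when it exists". A minimal sketch, assuming a
    hypothetical `Shell` struct and plain string cwd values rather than the
    real session and `PathUri` types:

    ```rust
    /// Hypothetical shell descriptor; the real session and environment types differ.
    #[derive(Debug, Clone, PartialEq, Eq)]
    struct Shell {
        program: String,
    }

    /// Prefer the shell discovered for the selected environment and fall back
    /// to the session shell when discovery produced nothing.
    fn shell_for_unified_exec<'a>(discovered: Option<&'a Shell>, session: &'a Shell) -> &'a Shell {
        discovered.unwrap_or(session)
    }

    /// Keep a remote primary cwd when one is present; use the local legacy
    /// fallback only when it is absent.
    fn primary_cwd<'a>(remote_primary: Option<&'a str>, local_fallback: &'a str) -> &'a str {
        remote_primary.unwrap_or(local_fallback)
    }
    ```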
  • build: run buildifier from just fmt (#28125)
    ## Intent
    
    Keep Bazel and Starlark files consistently formatted without requiring
    contributors to install or version buildifier themselves.
    
    ## Implementation
    
    - Add a SHA-256-pinned, cross-platform DotSlash manifest for buildifier
    v8.5.1.
    - Run buildifier from the shared `just fmt` and `just fmt-check` driver,
    with Windows-safe explicit DotSlash invocation.
    - Provision DotSlash in formatting CI and contributor devcontainers, and
    document the source-build prerequisite.
    - Apply the initial mechanical buildifier formatting baseline.
  • [codex] Dedupe plugin MCPs by app declaration name (#27607)
    ## Context
    
    This is the next step in the plugin auth-routing stack. The earlier PRs
    make `PluginsManager` auth-aware and move the broad App/MCP surface
    decision into that layer. This PR narrows the ChatGPT/SIWC behavior so
    we only hide a plugin MCP server when it conflicts with an App
    declaration of the same name.
    
    In product terms: if a plugin exposes both an App route and an MCP route
    for `foo`, ChatGPT/SIWC sessions should use the App route for `foo`. If
    the same plugin also exposes a separate MCP server like `foo2`, that MCP
    server should remain available.
    
    ```json
    // .app.json
    {
      "apps": {
        "foo": {
          "id": "connector_abc"
        }
      }
    }
    ```
    
    ```json
    // .mcp.json
    {
      "mcpServers": {
        "foo": {
          "url": "https://mcp.foo.com/mcp"
        },
        "foo2": {
          "url": "https://mcp.foo2.com/mcp"
        }
      }
    }
    ```
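
    Applied to the two manifests above, the rule can be sketched as a filter
    over MCP server names. `AuthMode` and `visible_mcp_servers` are
    hypothetical names for this example; passing every server through for
    non-ChatGPT auth is a placeholder assumption, since that routing is
    decided elsewhere in the stack.

    ```rust
    use std::collections::{BTreeMap, BTreeSet};

    /// Auth modes relevant to this sketch; the names are illustrative.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum AuthMode {
        ChatGpt,
        ApiKey,
    }

    /// Hide an MCP server only when the session uses ChatGPT/SIWC auth and an
    /// App declaration has the same name. Other servers stay visible.
    fn visible_mcp_servers(
        auth: AuthMode,
        app_names: &BTreeSet<String>,
        mcp_servers: &BTreeMap<String, String>, // server name -> URL
    ) -> BTreeMap<String, String> {
        mcp_servers
            .iter()
            .filter(|(name, _)| auth != AuthMode::ChatGpt || !app_names.contains(*name))
            .map(|(name, url)| (name.clone(), url.clone()))
            .collect()
    }
    ```

    With App name `foo` and MCP servers `foo` and `foo2`, a ChatGPT/SIWC
    session keeps only `foo2` on the MCP side and reaches `foo` through the
    App route.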
    
    ## Stack
    
    - PR1: #27652 seed plugin manager auth at construction.
    - PR2: #27459 route plugin surfaces by auth mode.
    - PR3: #27607 dedupe plugin MCP servers by App declaration name.
    - PR4: #27602 preserve plugin Apps in connector listings.
    - PR5: #27461 skip install-time plugin MCP OAuth for matching App
    routes.
    
    ## Summary
    
    - Preserve App declaration names in loaded plugin metadata.
    - Keep public effective App outputs as deduped connector IDs for
    existing callers.
    - For ChatGPT/SIWC, suppress only plugin MCP servers whose names match
    declared App names.
    
    ## Validation
    
    ```bash
    cargo fmt --all
    cargo test -p codex-core-plugins plugin_auth_projection
    cargo test -p codex-core-plugins effective_apps
    cargo test -p codex-core-plugins read_plugin_for_config_installed_git_source_reads_from_cache_without_cloning
    cargo test -p codex-core explicit_plugin_mentions_use_apps_for_chatgpt_dual_surface_plugins
    cargo test -p codex-core explicit_plugin_mentions_keep_non_conflicting_mcp_for_chatgpt_auth
    cargo test -p codex-app-server --test all plugin_install_filters_disallowed_apps_needing_auth
    git diff --check
    ```
    
    ---------
    
    Co-authored-by: Xin Lin <xl@openai.com>
  • [codex] Carry exec-server cwd as PathUri (#28032)
    ## Why
    
    This is the second-to-last place in the exec-server protocol that needs
    to migrate to URIs to support cross-OS operation.
    
    ## What
    
    - Change `ExecParams.cwd` to `PathUri`.
    - Keep the cwd URI-shaped through core and rmcp producers, converting it
    to `AbsolutePathBuf` only in `LocalProcess::start_process`.
    - Reject non-native cwd URIs before launch and update the affected
    protocol documentation and call sites.
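
    The "convert late, reject foreign values" shape described above might
    look like the sketch below. `CwdUri`, `PathStyle`, and `native_cwd` are
    hypothetical stand-ins invented for this example; they do not reflect
    the real `PathUri` or `AbsolutePathBuf` APIs.

    ```rust
    use std::path::PathBuf;

    /// Path flavor carried by the hypothetical URI wrapper below.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum PathStyle {
        Posix,
        Windows,
    }

    /// Hypothetical stand-in for a URI-shaped cwd. It is not the real `PathUri`.
    #[derive(Debug, Clone, PartialEq, Eq)]
    struct CwdUri {
        style: PathStyle,
        path: String,
    }

    /// Path style of the machine this code is compiled for.
    fn host_style() -> PathStyle {
        if cfg!(windows) {
            PathStyle::Windows
        } else {
            PathStyle::Posix
        }
    }

    /// Convert to a host path at the launch boundary, rejecting foreign styles
    /// before any process is started.
    fn native_cwd(cwd: &CwdUri, host: PathStyle) -> Result<PathBuf, String> {
        if cwd.style != host {
            return Err(format!("non-native cwd: {}", cwd.path));
        }
        Ok(PathBuf::from(&cwd.path))
    }
    ```

    Callers would pass `host_style()` as `host`; taking it as a parameter
    keeps the rejection path testable on any platform.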
  • [codex] Send turn state through compact requests (#28002)
    ## Context
    
    Inline compaction is part of the active logical turn. Compact requests
    and the sampling requests around them should use the same turn state,
    including when compaction is the first request to establish it.
    
    ## Change
    
    Pass the turn-scoped `OnceLock` directly to inline v1 compaction so
    `/responses/compact` includes an established value in the existing HTTP
    header. Capture `x-codex-turn-state` from the compact response into that
    same lock, allowing pre-turn compact to establish the value that
    subsequent sampling reuses.
    
    V2 compact already uses the normal Responses HTTP/WebSocket path and
    continues to share the same `OnceLock` without separate plumbing. The
    first returned value wins for the logical turn.
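
    The "first returned value wins" behavior follows from
    `std::sync::OnceLock`, whose `set` call fails once a value is present. A
    minimal sketch, assuming a hypothetical `TurnState` wrapper shared by the
    compact and sampling paths (the real plumbing is not shown in this
    description):

    ```rust
    use std::sync::{Arc, OnceLock};

    /// Hypothetical per-turn holder for the `x-codex-turn-state` value.
    /// Clones share the same underlying lock.
    #[derive(Clone, Default)]
    struct TurnState {
        value: Arc<OnceLock<String>>,
    }

    impl TurnState {
        /// Value to send on the next request, if one has been established.
        fn outgoing_header(&self) -> Option<&str> {
            self.value.get().map(String::as_str)
        }

        /// Record a value seen on a response. `OnceLock::set` returns `Err`
        /// when a value already exists, so later values are ignored.
        fn capture(&self, from_response: Option<&str>) {
            if let Some(v) = from_response {
                let _ = self.value.set(v.to_string());
            }
        }
    }
    ```

    Because a pre-turn compact request and the first sampling request hold
    clones of the same lock, whichever response arrives first establishes
    the value that every later request in the turn sends.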
    
    ## Test plan
    
    Integration coverage verifies that:
    
    - pre-turn v1 compact can establish state for the first sampling request
    - inline v1 compact receives established state over HTTP
    - inline v2 compact reuses established state over HTTP
    - inline v2 compact reuses established state over WebSocket
    
    CI validates the full change.