23 Commits

  • [codex] consume pushed exec-server process events (#30273)
    ## Summary
    
    - complete unified-exec processes from the ordered event stream instead
    of issuing a final zero-wait `process/read`
    - add optional executor sandbox-denial state to `process/exited`
    - retain `process/read` as a retained-output and compatibility fallback
    for receiver lag, sequence gaps, and legacy servers
    - recover sandbox-denial state across transport reconnection
    - cover the real `TestCodex` remote-exec path without adding a public
    test-only event constructor
    
    ## Why
    
    A successful one-shot tool call currently receives its output and
    terminal notifications, then pays another wide-area `process/read` round
    trip before returning. Staging traces showed that remote response wait
    accounted for more than 99.8% of RPC time; local serialization,
    queueing, and deserialization were below 0.6 ms.
    
    ## Measured impact
    
    A direct staging A/B used the same build and route and changed only
    completion mode. Each arm ran three times with 30 one-shot
    `/usr/bin/true` calls per run. The table reports the median of the three
    per-run percentiles.
    
    | Metric | Final `process/read` | Pushed events | Change |
    | --- | ---: | ---: | ---: |
    | End-to-end completion p50 | 159.5 ms | 118.7 ms | -40.8 ms (-25.6%) |
    | End-to-end completion p95 | 182.4 ms | 131.7 ms | -50.6 ms (-27.8%) |
    | Completion-wait p50 | 80.1 ms | 41.5 ms | -38.5 ms (-48.1%) |
    | Final `process/read` RPC p50 | 79.9 ms | eliminated | -79.9 ms |
    
    TCP_NODELAY was enabled in both A/B arms, so its effect cancels out. The
    successful, complete, in-order event path issued zero final
    `process/read` calls.
    
    ## Compatibility and recovery
    
    - new servers send `sandboxDenied` on `process/exited`
    - legacy servers omit it, which triggers one compatibility
    `process/read`
    - broadcast lag or a sequence gap triggers a retained-output read
    - recovery remains bounded by the server's existing 1 MiB
    retained-output window
    - complete, in-order event streams issue no completion read
    - sandbox denial is attached to the exit event before consumers can
    observe process completion
    - server-first and client-first rollouts remain wire-compatible;
    server-first realizes the latency win immediately
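    
    The rules above reduce to a small decision at the moment the exit event arrives. A minimal sketch follows; `ExitObservation`, `CompletionAction`, and `completion_action` are hypothetical names, not the actual client types, and checking lag or gaps before the legacy case is an assumption.
    
    ```rust
    /// What the client knows when `process/exited` arrives (hypothetical type).
    pub struct ExitObservation {
        /// `None` models a legacy server that omitted `sandboxDenied`.
        pub sandbox_denied: Option<bool>,
        /// True if the broadcast receiver lagged and dropped events.
        pub receiver_lagged: bool,
        /// True if a sequence number was skipped in the event stream.
        pub sequence_gap: bool,
    }
    
    #[derive(Debug, PartialEq, Eq)]
    pub enum CompletionAction {
        /// Complete from pushed events alone; no `process/read`.
        NoRead,
        /// One `process/read` to learn denial state from a legacy server.
        CompatibilityRead,
        /// Recover missed events from the retained-output window.
        RetainedOutputRead,
    }
    
    pub fn completion_action(obs: &ExitObservation) -> CompletionAction {
        // Assumption: missing events take priority over missing metadata.
        if obs.receiver_lagged || obs.sequence_gap {
            CompletionAction::RetainedOutputRead
        } else if obs.sandbox_denied.is_none() {
            CompletionAction::CompatibilityRead
        } else {
            CompletionAction::NoRead
        }
    }
    ```
    
    Only the last branch corresponds to the zero-read path measured in the table above.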
    
    ## Integration coverage
    
    The `TestCodex` suite exercises four distinct remote-exec contracts:
    
    - complete pushed output/exit/close with zero reads
    - direct pushed sandbox denial with zero reads
    - legacy missing denial metadata with exactly one compatibility read
    - count-bounded replay eviction recovered from retained output without
    duplication
    
    ## Validation
    
    - `just test -p codex-core
    exec_command_consumes_pushed_remote_process_events`: 4 passed
    - `just test -p codex-core unified_exec::process_tests::`: 4 passed
    - `just test -p codex-exec-server`: 294 passed, 2 skipped
    - `just test -p codex-exec-server-protocol`: 5 passed
    - `just test -p codex-rmcp-client`: 89 passed, 2 skipped
    - focused Bazel `//codex-rs/core:core-all-test`: passed across 16 shards
    - scoped `just fix` passed for core and exec-server
    - `just fmt` passed
    
    The complete workspace suite was not rerun; focused Cargo and Bazel
    coverage passed for the changed behavior.
  • [codex] Record exec-server lifecycle metrics (#27467)
    ## Summary
    
    - Record bounded connection, request, and process lifecycle metrics.
    - Report active gauges from callbacks on every collection, including
    delta exports.
    - Serialize active-count updates so concurrent starts and finishes
    cannot publish stale values.
    - Serialize process exit, explicit termination, and shutdown through the
    process registry so exactly one completion result wins.
    - Keep the implementation small with single-owner RAII guards and one
    real OTLP/HTTP integration test using the existing `wiremock`
    dependency.
    
    ## Root cause
    
    Process exit and session shutdown previously used cloned completion
    state. That avoided duplicate emission, but it duplicated lifecycle
    ownership and made the ordering harder to reason about. The process
    registry mutex already defines the lifecycle ordering, so the final
    implementation stores the metric guard and termination flag directly on
    the process entry. Whichever path claims the entry first owns the
    completion result.
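    
    As a rough illustration of "whichever path claims the entry first", the sketch below removes the entry under the registry lock, so only one caller can observe it. `Registry`, `Completion`, and `claim` are hypothetical; the real entry also carries the metric guard and termination flag.
    
    ```rust
    use std::collections::HashMap;
    use std::sync::Mutex;
    
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Completion {
        Exited(i32),
        Terminated,
        Shutdown,
    }
    
    /// Hypothetical registry: its mutex defines the lifecycle ordering.
    #[derive(Default)]
    pub struct Registry {
        entries: Mutex<HashMap<u64, &'static str>>,
    }
    
    impl Registry {
        pub fn insert(&self, id: u64, name: &'static str) {
            self.entries.lock().unwrap().insert(id, name);
        }
    
        /// Removes the entry while holding the lock. The first caller gets
        /// `Some(result)`; every later caller for the same id gets `None`.
        pub fn claim(&self, id: u64, result: Completion) -> Option<Completion> {
            self.entries.lock().unwrap().remove(&id).map(|_| result)
        }
    }
    ```
    
    If process exit and session shutdown race on the same id, exactly one `claim` returns `Some`.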
    
    Production metric export uses delta temporality. Event-only synchronous
    gauge recordings disappear after the next collection when no count
    changes, so active counts now use observable callbacks that report
    current state on every collection.
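    
    A minimal sketch of that pattern, using only the standard library in place of the real OTLP plumbing: a single-owner RAII guard adjusts a shared count, and a callback-style `observe` reads the current value on each collection. `ActiveCount` and `ActiveGuard` are hypothetical names.
    
    ```rust
    use std::sync::atomic::{AtomicI64, Ordering};
    use std::sync::Arc;
    
    /// Shared active count; stands in for the state a gauge callback reads.
    #[derive(Clone, Default)]
    pub struct ActiveCount(Arc<AtomicI64>);
    
    impl ActiveCount {
        /// Called on every collection, so an unchanged count is still reported.
        pub fn observe(&self) -> i64 {
            self.0.load(Ordering::SeqCst)
        }
    
        /// Increments the count and returns the guard that will decrement it.
        pub fn start(&self) -> ActiveGuard {
            self.0.fetch_add(1, Ordering::SeqCst);
            ActiveGuard(self.clone())
        }
    }
    
    /// Single-owner RAII guard: dropping it decrements the count exactly once.
    pub struct ActiveGuard(ActiveCount);
    
    impl Drop for ActiveGuard {
        fn drop(&mut self) {
            (self.0).0.fetch_sub(1, Ordering::SeqCst);
        }
    }
    ```
    
    Two consecutive `observe` calls with no intervening start or finish return the same value, which is the property a delta export needs.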
    
    The cleanup also removes the constant `result="accepted"` connection
    tag, redundant route and response assertions, a custom HTTP collector,
    and fallback initialization machinery that did not add behavior.
    
    ## Stack
    
    Review and land this stack in order:
    
    1. #27466 — trace exec-server JSON-RPC requests
    2. #27467 — record bounded connection, request, and process lifecycle
    metrics **(this PR)**
    3. #27470 — observe remote registration and Noise rendezvous lifecycle
    
    ## Validation
    
    - `just test -p codex-exec-server --lib` (158 passed)
    - `just test -p codex-cli --test exec_server` (3 passed)
    - `just test -p codex-otel
    observable_gauge_is_collected_on_every_delta_snapshot` (1 passed)
    - `CARGO_BUILD_JOBS=1 just fix -p codex-otel -p codex-exec-server`
    - `just fmt`
    - `git diff --check`
  • protocol: separate app and exec RPC ownership (#29714)
    ## Why
    
    The app-server and exec-server expose separate JSON-RPC APIs, but
    exec-server currently sources its serialized protocol and envelope types
    through app-server-oriented code. Giving each API an explicit owner
    makes the crate boundary legible without introducing shared generic
    envelopes.
    
    ## What changed
    
    - Added `codex-exec-server-protocol` to own exec DTOs, process IDs, and
    JSON-RPC envelopes.
    - Updated exec-server clients, transports, handlers, and tests to use
    the new crate.
    - Exposed app-server's existing JSON-RPC types through a public `rpc`
    module while retaining root re-exports.
    - Preserved existing wire shapes, including exec `PathUri` behavior.
    
    ## Stack
    
    This is PR 1 of 6. Next: [PR
    #29721](https://github.com/openai/codex/pull/29721), which moves auth
    mode below the app wire boundary.
    
    ## Validation
    
    - Exec-server protocol and server coverage passed in the focused
    protocol test runs.
    - App-server protocol schema fixtures passed.
  • Prepare managed network sandbox context (#29456)
    ## Why
    
    Managed network configures commands to use local HTTP and SOCKS proxies.
    For commands delegated to the exec server, the proxy environment and the
    sandbox policy were prepared separately. On macOS, that meant a command
    could receive `HTTPS_PROXY=http://127.0.0.1:43123` while Seatbelt still
    denied access to port `43123`.
    
    ## What changed
    
    `NetworkProxy` now prepares the command environment and sandbox context
    together from the same runtime snapshot:
    
    ```text
    Prepared managed network
    ├── command environment: HTTPS_PROXY=http://127.0.0.1:43123
    └── sandbox context: allow outbound to 127.0.0.1:43123
    ```
    
    That context travels with remote exec requests. The exec server
    preserves the managed proxy and CA environment, and macOS Seatbelt
    allows only the prepared loopback proxy ports without enabling broad
    network access or local binding.
    
    The protocol field is optional and the existing enforcement flag remains
    in place, preserving compatibility with callers that do not send the new
    context.
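    
    The idea of deriving both halves from one snapshot can be sketched as below. `ProxySnapshot`, `PreparedNetwork`, and `prepare` are hypothetical names, and the `ALL_PROXY` variable with a `socks5h` scheme is an assumption used only to show the SOCKS side.
    
    ```rust
    /// Hypothetical runtime snapshot of the local proxy listeners.
    pub struct ProxySnapshot {
        pub http_port: u16,
        pub socks_port: u16,
    }
    
    /// Environment and sandbox context prepared together.
    pub struct PreparedNetwork {
        pub env: Vec<(String, String)>,
        pub allowed_loopback_ports: Vec<u16>,
    }
    
    pub fn prepare(snapshot: &ProxySnapshot) -> PreparedNetwork {
        let http = format!("http://127.0.0.1:{}", snapshot.http_port);
        // Assumption: the SOCKS proxy is advertised through ALL_PROXY.
        let socks = format!("socks5h://127.0.0.1:{}", snapshot.socks_port);
        PreparedNetwork {
            env: vec![
                ("HTTPS_PROXY".to_string(), http),
                ("ALL_PROXY".to_string(), socks),
            ],
            // The same ports that appear in the environment are the only
            // loopback ports the sandbox context allows.
            allowed_loopback_ports: vec![snapshot.http_port, snapshot.socks_port],
        }
    }
    ```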
  • path-uri: clarify host-native path conversion (#29501)
    ## Why
    
    The generic name of this conversion is producing confusing code in
    downstream refactors. Encoding the specific conversion approach in the
    method name makes the call sites clearer.
    
    ## What
    
    Rename `PathUri::from_path` to `PathUri::from_host_native_path` and
    update its Rust call sites.
  • Report remote sandbox denials semantically (#29424)
    ## Why
    
    #29113 moved remote sandbox setup and enforcement to the exec server.
    That gives the executor ownership of the platform-specific work: a Linux
    executor chooses and runs a Linux sandbox even when the Codex
    orchestrator is running on macOS or Windows.
    
    It also means the orchestrator no longer knows which concrete sandbox
    the executor selected. When that sandbox blocks a remote command, the
    orchestrator currently sees only a failed process and can treat the
    denial as an ordinary command failure. The existing sandbox approval and
    retry path is then skipped.
    
    This PR lets the executor report one portable fact:
    
    > This command probably failed because the executor sandbox blocked it.
    
    The executor keeps its concrete sandbox type private. The protocol sends
    only the semantic result.
    
    ## Example
    
    Suppose a local macOS Codex session asks a Linux devbox to write outside
    the allowed workspace.
    
    Before this PR:
    
    ```text
    Linux sandbox blocks the write
        -> remote process exits with "Permission denied"
        -> local orchestrator sees an ordinary command failure
        -> the normal sandbox approval and retry path can be skipped
    ```
    
    With this PR:
    
    ```text
    Linux sandbox blocks the write
        -> executor reports sandboxDenied: true
        -> unified exec returns UnifiedExecError::SandboxDenied
        -> the existing approval prompt is shown
        -> an approved retry runs through the existing unsandboxed retry path
    ```
    
    ## What changes
    
    ### The executor remembers its selected sandbox
    
    The prepared remote process now retains the executor-selected
    `SandboxType`. This value never crosses the executor boundary.
    
    Commands started without a sandbox retain `SandboxType::None` and are
    never reported as sandbox denials.
    
    ### The executor uses the existing denial heuristic
    
    The existing local denial heuristic moves from `codex-core` into the
    shared `codex-sandboxing` crate.
    
    When a sandboxed remote process exits, the executor:
    
    1. waits the same short output grace period used by local unified exec;
    2. reads the output currently available in the existing retained output
    buffer;
    3. runs the existing heuristic using the exit code and common denial
    messages;
    4. stores the yes/no result before publishing the process exit.
    
    This deliberately matches the old local unified-exec behavior. It does
    not add a new streaming classifier, another output buffer, or stronger
    output-retention guarantees.
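    
    For orientation, a heuristic of this shape might look like the sketch below. It is not the code in `codex-sandboxing`: the marker list and the `likely_sandbox_denied` name are assumptions chosen for illustration.
    
    ```rust
    /// Illustrative markers only; the real heuristic's list is not shown here.
    const DENIAL_MARKERS: [&str; 3] = [
        "permission denied",
        "operation not permitted",
        "read-only file system",
    ];
    
    /// Returns true when a sandboxed process failed and its retained output
    /// contains a common denial message.
    pub fn likely_sandbox_denied(sandboxed: bool, exit_code: i32, retained_output: &str) -> bool {
        // Unsandboxed commands are never reported as sandbox denials.
        if !sandboxed || exit_code == 0 {
            return false;
        }
        let lower = retained_output.to_lowercase();
        DENIAL_MARKERS.iter().any(|marker| lower.contains(marker))
    }
    ```
    
    The result is a single yes/no value, which is what the executor stores before publishing the exit.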
    
    ### The protocol reports a portable boolean
    
    `process/read` gains `sandboxDenied`:
    
    ```json
    {
      "exited": true,
      "exitCode": 1,
      "closed": false,
      "sandboxDenied": true
    }
    ```
    
    The field defaults to `false` when an older executor omits it. The
    response does not expose the executor sandbox implementation or
    executor-native paths.
    
    ### Unified exec uses the existing error path
    
    The exec-server client carries `sandboxDenied` into the unified process
    state. If it is true, unified exec returns the existing `SandboxDenied`
    error instead of trying to classify remote output using an
    orchestrator-side sandbox type.
    
    Remote process exit remains visible as soon as the process exits. This
    PR does not wait for stdout or stderr to close and does not change the
    existing process lifecycle.
    
    ## Scope
    
    This PR is intentionally limited to matching the existing local
    unified-exec behavior for the initial command execution path.
    
    It does not add:
    
    - incremental denial tracking across the full output stream;
    - new denial handling for commands completed later through
    `write_stdin`;
    - new guarantees for preserving the semantic flag during the narrow
    reconnect-recovery race.
    
    Those can be considered separately if the same behavior is added for
    local execution.
    
    ## Test coverage
    
    One remote end-to-end integration test covers the complete intended
    flow:
    
    ```text
    remote read-only sandbox
        -> denied write
        -> executor reports the denial
        -> Codex requests approval
        -> user approves
        -> retry succeeds on the remote executor
    ```
    
    Existing lifecycle coverage continues to verify that remote process exit
    is reported before late output streams close.
  • Apply sandbox intent inside remote exec servers (#29113)
    ## Why
    
    PR #29108 lets the orchestrator send sandbox intent with `process/start`
    without wrapping the command for its own operating system.
    
    This PR completes that boundary by making the executor interpret and
    enforce the intent using its own filesystem paths and sandbox
    implementation.
    
    For example, a macOS TUI targeting a Linux devbox sends `/bin/bash -lc
    pwd`. The Linux executor turns that into its own `codex-linux-sandbox
    ... /bin/bash -lc pwd` launch.
    
    ## What changes
    
    - Keep `process/start` unchanged when no sandbox intent is present.
    - Convert sandbox `PathUri` values into native paths on the executor.
    - Bind symbolic `:workspace_roots` permissions to the executor's native
    sandbox cwd.
    - Select the sandbox implementation on the executor and wrap the
    original command immediately before spawning it.
    - Reject sandbox-required execution before spawning when the executor
    cannot enforce the intent.
    - Pass exec-server runtime paths into process creation so Linux can
    locate `codex-linux-sandbox`.
    
    The boundary is therefore:
    
    ```text
    orchestrator                         executor
    original argv + sandbox intent  ->  select and enforce local sandbox
    ```
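    
    The symbolic-binding step listed above can be sketched as follows. `WritableEntry` and `bind` are hypothetical; the real permission profile is richer, and this sketch covers only the expansion of `:workspace_roots` on the executor side.
    
    ```rust
    use std::path::PathBuf;
    
    /// Hypothetical permission entry as it arrives at the executor.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub enum WritableEntry {
        /// Symbolic `:workspace_roots`; stays symbolic until the executor binds it.
        WorkspaceRoots,
        /// An explicit path already converted to executor-native form.
        Path(PathBuf),
    }
    
    /// Expands symbolic entries against the executor's own roots.
    pub fn bind(entries: &[WritableEntry], executor_roots: &[PathBuf]) -> Vec<PathBuf> {
        let mut bound = Vec::new();
        for entry in entries {
            match entry {
                WritableEntry::WorkspaceRoots => bound.extend(executor_roots.iter().cloned()),
                WritableEntry::Path(path) => bound.push(path.clone()),
            }
        }
        bound
    }
    ```
    
    Because the orchestrator never resolves `WorkspaceRoots`, no orchestrator-local absolute path appears in the bound result.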
    
    This PR intentionally treats a denied remote command as an ordinary
    command failure. Draft follow-up #29424 carries a semantic
    `sandboxDenied` result back to unified exec for the existing approval
    and retry flow.
    
    ## Platform scope
    
    Linux and macOS use their existing direct-spawn sandbox transforms.
    
    Windows sandboxed remote process launch is intentionally unsupported in
    this PR. The current Windows direct-spawn wrapper does not correctly
    preserve arbitrary argv or TTY behavior, and it does not pass the full
    child environment out of band. The executor rejects the request instead of
    running it incorrectly or unsandboxed.
    
    ## Known follow-ups
    
    - The transported permission profile can still contain
    orchestrator-materialized helper or explicit paths. A `TODO(jif)` marks
    where the executor boundary should receive pre-host-materialization
    permission intent.
    - The sandbox wrapper currently replaces a requested custom inner
    `arg0`. A `TODO(jif)` marks where this must be preserved or rejected
    explicitly.
    - Draft PR #29424 contains the deferred sandbox-denial classification
    and approval/retry behavior.
    
    ## Rollout assumption
    
    This executor-sandbox stack is unreleased and its client and executor
    are expected to move together. This PR does not add mixed-version
    negotiation with older exec servers.
  • Carry sandbox intent to remote exec servers (#29108)
    ## What changed
    
    PR #29099 stopped sending the orchestrator's concrete sandbox wrapper to
    a remote exec-server. Remote commands now arrive as plain native argv.
    
    This PR adds the next piece: Codex also sends portable sandbox intent
    next to that plain argv.
    
    For a remote unified-exec command, the request can now include:
    
    - the canonical permission profile before local workspace-root
    materialization
    - the sandbox cwd and workspace roots as `PathUri` values
    - Windows sandbox settings
    - the legacy Landlock setting
    - whether managed networking must be enforced
    
    The important part is that symbolic entries such as `:workspace_roots`
    stay symbolic while crossing the boundary. The executor can then bind
    them to its own workspace-root paths instead of receiving
    orchestrator-local absolute paths.
    
    The data travels through `ExecRequest` into `ExecParams`. Older
    exec-servers can still deserialize requests because the new fields have
    defaults.
    
    ## Why
    
    The orchestrator should not decide how another machine implements
    sandboxing.
    
    For example:
    
    - a local macOS Codex would normally build a Seatbelt command
    - a remote Linux executor needs a Linux sandbox command instead
    
    The orchestrator now sends the plain command plus the policy it intended
    to enforce. A later PR can let the exec-server choose and build the
    correct sandbox for its own operating system.
    
    ## Important detail
    
    This keeps the portable intent separate from the local `SandboxType`.
    
    `SandboxType::None` is ambiguous:
    
    - it can mean the command was explicitly approved to run without a
    sandbox
    - it can also mean the orchestrator host has no concrete sandbox
    implementation available
    
    Those cases are different for remote execution. This PR adds
    `sandbox_requested` so an executor can still receive sandbox intent when
    the orchestrator cannot build a local wrapper. Explicit unsandboxed
    retries still send no sandbox context.
    
    ## Behavior today
    
    This PR only transports the intent. The exec-server accepts the new
    fields but does not apply them yet.
    
    Remote commands therefore remain unsandboxed after this PR, just as they
    are after PR #29099.
    
    ## Follow-up
    
    The next PR will make exec-server read this portable intent, bind
    symbolic workspace permissions to executor-native roots, choose the
    sandbox for its own operating system, build the wrapper locally, and
    then spawn the command.
  • Recover exec process stdin writes (#28895)
    ## Summary
    
    Remote stdio MCP servers send tool calls by writing JSON-RPC bytes
    through `process/write`.
    
    When the exec-server websocket drops at the wrong time, the remote
    process can survive session recovery, but the stdin write can still fail
    back to RMCP as a transport send error. RMCP then closes the stdio MCP
    transport, so tools like `node_repl` are lost even though the
    process/session recovery path is working.
    
    This changes `process/write` to be safe to retry across exec-server
    recovery:
    
    - adds a required `writeId` to `process/write`
    - retries remote `Session::write` with the same `writeId` after
    reconnect
    - remembers accepted write ids per process so duplicate retries return
    `Accepted` without writing the same bytes to child stdin again
    - covers both the client retry path and server-side write id dedupe with
    tests
    
    In simple terms:
    
    ```text
    before:
    write to MCP stdin -> websocket closes -> write errors -> RMCP closes node_repl
    
    after:
    write to MCP stdin -> websocket closes -> reconnect -> retry same writeId
    server either writes once or recognizes it already did
    ```
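    
    A minimal sketch of the server-side dedupe, assuming an in-memory set of accepted ids per process. `Server`, `ProcessStdin`, and `WriteStatus` are hypothetical types; a `Vec<u8>` stands in for the child's stdin.
    
    ```rust
    use std::collections::{HashMap, HashSet};
    
    #[derive(Debug, PartialEq, Eq)]
    pub enum WriteStatus {
        Accepted,
        UnknownProcess,
    }
    
    #[derive(Default)]
    pub struct ProcessStdin {
        accepted: HashSet<String>,
        stdin: Vec<u8>, // stand-in for the child's stdin pipe
    }
    
    #[derive(Default)]
    pub struct Server {
        processes: HashMap<String, ProcessStdin>,
    }
    
    impl Server {
        pub fn spawn(&mut self, process_id: &str) {
            self.processes.entry(process_id.to_string()).or_default();
        }
    
        pub fn write(&mut self, process_id: &str, write_id: &str, bytes: &[u8]) -> WriteStatus {
            let process = match self.processes.get_mut(process_id) {
                Some(process) => process,
                None => return WriteStatus::UnknownProcess,
            };
            // `insert` returns false for an id that was already accepted,
            // so a retried write is acknowledged without writing again.
            if process.accepted.insert(write_id.to_string()) {
                process.stdin.extend_from_slice(bytes);
            }
            WriteStatus::Accepted
        }
    
        pub fn stdin_of(&self, process_id: &str) -> Option<&[u8]> {
            self.processes.get(process_id).map(|p| p.stdin.as_slice())
        }
    }
    ```
    
    Retrying `write("p1", "w1", ...)` after a reconnect returns `Accepted` both times, and the bytes reach stdin once.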
  • Resume exec-server sessions after disconnect (#28512)
    Supersedes #28288 (closed).
    
    ## Why
    
    A short WebSocket interruption currently ends every client-side process
    handle, even though exec-server keeps the server session and its
    processes alive for a short time.
    
    This is especially visible for executor-backed stdio MCP servers: a
    temporary connection loss becomes a permanent `Transport closed` error.
    The server already has the information needed to resume the session, but
    the client opens a fresh session instead of using it.
    
    This change reconnects below the process and MCP layers. Existing
    process handles stay valid, missed output is recovered, and the same
    server-side processes continue running.
    
    ## State machine
    
    One logical `ExecServerClient` stays alive while its underlying RPC
    connection changes generations.
    
    ```text
                             transport closes
           +------------------------------------------------+
           |                                                v
    +-------------+                                  +-------------+
    |  Connected  |                                  | Recovering  |
    +-------------+                                  +-------------+
           ^                                                |
           | session resumed, processes caught up           | retryable error
           +------------------------------------------------+ loops until deadline
                                                            |
                                                            | deadline or permanent error
                                                            v
                                                      +-------------+
                                                      |   Failed    |
                                                      +-------------+
    ```
    
    ### `Connected`
    
    - New RPC calls use the current connection.
    - Process notifications are published in sequence order.
    - A disconnect only starts recovery if it came from the current
    connection generation. Late events from older generations cannot replace
    the active connection.
    
    ### `Recovering`
    
    - New calls wait instead of choosing a half-connected RPC client.
    - Existing process handles, wake subscriptions, and event subscriptions
    stay open.
    - Streaming HTTP response bodies fail immediately because their byte
    streams cannot be resumed safely.
    - Recovery first waits for process starts that were already in flight. A
    start whose result became ambiguous is cleaned up after reconnection
    instead of being silently adopted.
    - The client reconnects with the learned `session_id`. The server may
    briefly report that the old connection is still attached, so that error
    is retried until the detach finishes.
    - The notification consumer starts before the resume handshake
    completes. This prevents a busy process from filling the notification
    queue and blocking the initialize response.
    - Before installing the new connection, the client catches up every
    recoverable process with `process/read`.
    
    ### `Failed`
    
    - Recovery stops after 25 seconds or after a permanent error.
    - Waiting calls are released with one stable disconnect error.
    - Existing process sessions receive a terminal failure instead of
    waiting forever.
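    
    The three states and their transitions can be written as a small pure function. This is a sketch of the diagram only: `State`, `Event`, and `step` are hypothetical names, and bumping the generation on resume is an assumption about how generations are numbered.
    
    ```rust
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum State {
        Connected { generation: u64 },
        Recovering { generation: u64 },
        Failed,
    }
    
    pub enum Event {
        /// The transport of the given connection generation closed.
        TransportClosed { generation: u64 },
        RetryableError,
        /// Session resumed and every recoverable process caught up.
        Resumed,
        DeadlineOrPermanentError,
    }
    
    pub fn step(state: State, event: Event) -> State {
        match (state, event) {
            // Only a disconnect from the current generation starts recovery.
            (State::Connected { generation }, Event::TransportClosed { generation: closed })
                if closed == generation =>
            {
                State::Recovering { generation }
            }
            (State::Recovering { generation }, Event::RetryableError) => {
                State::Recovering { generation }
            }
            (State::Recovering { generation }, Event::Resumed) => {
                State::Connected { generation: generation + 1 }
            }
            (State::Recovering { .. }, Event::DeadlineOrPermanentError) => State::Failed,
            // Late events from older generations, and anything after `Failed`,
            // leave the state unchanged.
            (unchanged, _) => unchanged,
        }
    }
    ```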
    
    ## Recovering process events
    
    Output, exit, and close events share one sequence. During normal
    operation, the client buffers early events until every lower sequence
    has been published.
    
    After reconnection, the client reads each process starting after its
    last published sequence:
    
    1. Retained output chunks are inserted by sequence number.
    2. Exit and close state are reconstructed in their sequence positions.
    3. Events already received as live notifications are ignored as
    duplicates.
    4. Newly contiguous events are published in order.
    5. If the server no longer retains enough output to fill a sequence gap,
    only that process is terminated and failed. The recovered connection
    remains usable for other processes.
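    
    Steps 1 through 4 amount to a reorder buffer keyed by sequence number. A minimal sketch, assuming a hypothetical `Reorder` type; gap detection against the server's retained window (step 5) is left out.
    
    ```rust
    use std::collections::BTreeMap;
    
    /// Buffers out-of-order events and releases them in sequence order.
    pub struct Reorder<T> {
        next_seq: u64,
        pending: BTreeMap<u64, T>,
    }
    
    impl<T> Reorder<T> {
        pub fn new(first_seq: u64) -> Self {
            Reorder { next_seq: first_seq, pending: BTreeMap::new() }
        }
    
        /// Accepts a live or recovered event and returns the events that
        /// became contiguous as a result, in publication order.
        pub fn accept(&mut self, seq: u64, event: T) -> Vec<T> {
            // Already published or already buffered: ignore as a duplicate.
            if seq < self.next_seq || self.pending.contains_key(&seq) {
                return Vec::new();
            }
            self.pending.insert(seq, event);
            let mut ready = Vec::new();
            while let Some(next) = self.pending.remove(&self.next_seq) {
                ready.push(next);
                self.next_seq += 1;
            }
            ready
        }
    }
    ```
    
    Feeding sequence 2 and then sequence 1 publishes nothing at first and then both events in order; a later duplicate of sequence 2 is dropped.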
    
    The server reports its full next event sequence for unbounded reads,
    including exit and close events. Closed processes remain readable for
    the same 30-second window used to retain detached sessions.
    
    ## Other details
    
    - Detached server sessions are retained for 30 seconds, leaving margin
    around the client's 25-second recovery deadline.
    - Session attach and detach update the active notification sender under
    the same attachment lock, so an old connection cannot clear a newly
    attached sender.
    - A dedicated error code distinguishes the temporary "session is still
    attached" race from permanent initialization errors.
    - Process starts are identity-checked on both client and server. Cleanup
    from an older start cannot remove a newer process that reused the same
    ID.
    - Mutating requests that were already in flight when the transport
    closed are not replayed, because the client cannot know whether the
    server applied them. Requests started after recovery is known wait for
    the replacement connection.
    - We assume the client and server versions stay in sync: both run either
    the code from before this PR or the code from after it.
    
    ## User impact
    
    Long-running commands and stdio MCP servers can survive a temporary
    exec-server WebSocket interruption without changing process IDs or
    losing output produced during the outage.
  • [codex] Carry exec-server cwd as PathUri (#28032)
    ## Why
    
    This is the second-to-last place in the exec-server protocol that needs
    to migrate to URIs to support cross-OS operation.
    
    ## What
    
    - Change `ExecParams.cwd` to `PathUri`.
    - Keep the cwd URI-shaped through core and rmcp producers, converting it
    to `AbsolutePathBuf` only in `LocalProcess::start_process`.
    - Reject non-native cwd URIs before launch and update the affected
    protocol documentation and call sites.
  • [codex] Remove async_trait from first-party code (#27475)
    ## Why
    
    First-party async traits should expose their `Send` contracts explicitly
    without requiring `async_trait`. This completes the migration pattern
    established in #27303 and #27304.
    
    ## What changed
    
    - Replaced the remaining first-party `async_trait` traits with native
    return-position `impl Future + Send` where statically dispatched and
    explicit boxed `Send` futures where object safety is required.
    - Kept implementations behavior-preserving, outlining existing async
    bodies into inherent methods where that keeps the diff reviewable.
    - Removed all direct first-party `async-trait` dependencies and the
    workspace dependency declaration.
    - Added a cargo-deny policy that permits `async-trait` only through the
    remaining transitive wrapper crates.
    - Updated `rand` from 0.8.5 to 0.8.6 to resolve RUSTSEC-2026-0097 and
    keep the full cargo-deny check passing.
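    
    The two replacement shapes can be sketched as below. `Fetch`, `DynFetch`, and `LenFetcher` are hypothetical; `std::future::ready` keeps the sketch free of an async runtime, whereas real implementations would typically return an `async` block or call an outlined inherent `async fn`.
    
    ```rust
    use std::future::{ready, Future};
    use std::pin::Pin;
    use std::sync::Arc;
    use std::task::{Context, Poll, Wake, Waker};
    
    /// Statically dispatched: the `Send` contract is spelled out in the signature.
    pub trait Fetch {
        fn fetch(&self, key: &str) -> impl Future<Output = usize> + Send;
    }
    
    /// Object-safe variant: an explicitly boxed `Send` future.
    pub trait DynFetch {
        fn fetch_dyn<'a>(&'a self, key: &'a str) -> Pin<Box<dyn Future<Output = usize> + Send + 'a>>;
    }
    
    pub struct LenFetcher;
    
    impl Fetch for LenFetcher {
        fn fetch(&self, key: &str) -> impl Future<Output = usize> + Send {
            ready(key.len())
        }
    }
    
    impl DynFetch for LenFetcher {
        fn fetch_dyn<'a>(&'a self, key: &'a str) -> Pin<Box<dyn Future<Output = usize> + Send + 'a>> {
            Box::pin(ready(key.len()))
        }
    }
    
    struct NoopWake;
    
    impl Wake for NoopWake {
        fn wake(self: Arc<Self>) {}
    }
    
    /// Test helper: polls once and panics if the future is not ready yet.
    /// Sufficient here because both futures complete immediately.
    pub fn poll_ready_now<F: Future>(future: F) -> F::Output {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut future = std::pin::pin!(future);
        match future.as_mut().poll(&mut cx) {
            Poll::Ready(value) => value,
            Poll::Pending => panic!("future was not immediately ready"),
        }
    }
    ```
    
    `Fetch` cannot be used as `dyn Fetch` because of its `impl Future` return type, which is why the boxed form exists for traits that need object safety.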
    
    ## Validation
    
    - `just test -p codex-exec-server`: 216 passed, 2 skipped.
    - `just test -p codex-model-provider`: 39 passed.
    - `just test -p codex-core` and `just test`: changed tests passed;
    remaining failures are environment-sensitive suites unrelated to this
    migration.
    - `cargo deny check`
    - `just fix`
    - `just fmt`
    - `cargo shear`
    - `just bazel-lock-check`
  • [codex] Handle Ctrl-C for non-TTY unified exec (#26734)
    ## Why
    
    A long-running unified exec process started with `tty: false` could not
    be interrupted via `write_stdin`: ordinary non-TTY stdin writes are
    rejected once stdin is closed, but an exact U+0003 payload should still
    map to a process interrupt. The interrupt should flow through the same
    process lifecycle path as a real signal so Codex preserves
    process-reported output and exit metadata instead of fabricating a
    Ctrl-C exit code or tearing down the session early.
    
    ## What Changed
    
    - Add `process/signal` to exec-server with `ProcessSignal::Interrupt`
    and an empty response.
    - Add a non-consuming `ProcessHandle::signal` path for spawned
    processes; on Unix it sends SIGINT to the process group and leaves
    terminate/hard-kill unchanged.
    - Route non-TTY U+0003 `write_stdin` through `process.signal(...)`
    instead of `terminate`, then let the normal post-write collection path
    drain output and observe exit.
    - Add exec-server coverage where a shell `trap INT` handler prints the
    signal and exits with its own code.
    - Add unified exec coverage where a `tty: false` process traps SIGINT,
    emits output, and exits with its own code.
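    
    The routing decision for `write_stdin` can be sketched as a pure function. `StdinAction` and `route_write_stdin` are hypothetical names; the real path calls `process.signal(...)` and then runs the normal post-write collection.
    
    ```rust
    #[derive(Debug, PartialEq, Eq)]
    pub enum StdinAction<'a> {
        /// Send `ProcessSignal::Interrupt` instead of writing bytes.
        Interrupt,
        /// Forward the payload to the process.
        Write(&'a [u8]),
        /// Ordinary write to a process whose stdin is already closed.
        RejectStdinClosed,
    }
    
    pub fn route_write_stdin(tty: bool, stdin_open: bool, payload: &[u8]) -> StdinAction<'_> {
        // An exact U+0003 payload (the single byte 0x03) on a non-TTY process
        // maps to an interrupt, even when stdin is closed.
        if !tty && matches!(payload, [0x03]) {
            return StdinAction::Interrupt;
        }
        if stdin_open {
            StdinAction::Write(payload)
        } else {
            StdinAction::RejectStdinClosed
        }
    }
    ```
    
    A payload such as `"\x03\n"` is not an exact match and follows the ordinary write path.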
    
    ## Validation
    
    - `just test -p codex-exec-server
    exec_process_signal_interrupts_process`
    - `just test -p codex-exec-server`
    - `just test -p codex-core
    write_stdin_ctrl_c_interrupts_non_tty_session`
  • [codex] Move config loading into codex-config (#19487)
    ## Why
    
    Config loading had become split across crates: `codex-config` owned the
    config types and merge logic, while `codex-core` still owned the loader
    that assembled the layer stack. This change consolidates that
    responsibility in `codex-config`, so the crate that defines config
    behavior also owns how configs are discovered and loaded.
    
    To make that move possible without reintroducing the old dependency
    cycle, the shell-environment policy types and helpers that
    `codex-exec-server` needs now live in `codex-protocol` instead of
    flowing through `codex-config`.
    
    This also makes the migrated loader tests more deterministic on machines
    that already have managed or system Codex config installed by letting
    tests override the system config and requirements paths instead of
    reading the host's `/etc/codex`.
    
    ## What Changed
    
    - moved the config loader implementation from `codex-core` into
    `codex-config::loader` and deleted the old `core::config_loader` module
    instead of leaving a compatibility shim
    - moved shell-environment policy types and helpers into
    `codex-protocol`, then updated `codex-exec-server` and other downstream
    crates to import them from their new home
    - updated downstream callers to use loader/config APIs from
    `codex-config`
    - added test-only loader overrides for system config and requirements
    paths so loader-focused tests do not depend on host-managed config state
    - cleaned up now-unused dependency entries and platform-specific cfgs
    that were surfaced by post-push CI
    
    ## Testing
    
    - `cargo test -p codex-config`
    - `cargo test -p codex-core config_loader_tests::`
    - `cargo test -p codex-protocol -p codex-exec-server -p
    codex-cloud-requirements -p codex-rmcp-client --lib`
    - `cargo test --lib -p codex-app-server-client -p codex-exec`
    - `cargo test --no-run --lib -p codex-app-server`
    - `cargo test -p codex-linux-sandbox --lib`
    - `cargo shear`
    - `just bazel-lock-check`
    
    ## Notes
    
    - I did not chase unrelated full-suite failures outside the migrated
    loader surface.
    - `cargo test -p codex-core --lib` still hits unrelated proxy-sensitive
    failures on this machine, and Windows CI still shows unrelated
    long-running and timing-out test noise outside the loader migration itself.
  • fix(exec-server): retain output until streams close (#18946)
    ## Why
    
    A Mac Bazel run hit a flake in
    `server::handler::tests::output_and_exit_are_retained_after_notification_receiver_closes`
    where the read path observed process exit but lost the expected buffered
    stdout (`first\nsecond\n`). See the [GitHub Actions
    job](https://github.com/openai/codex/actions/runs/24758468552/job/72436716505)
    and [BuildBuddy
    invocation](https://app.buildbuddy.io/invocation/37475a12-4ef2-45fb-ab8a-e49a2aba1d59).
    
    The underlying race is that process exit is not the same thing as
    stdout/stderr closure. If a child or grandchild inherits the pipe write
    end, or a process duplicates it with `dup2`, the watched process can
    exit while the stream is still open and more output can still arrive.
    The exec-server was starting exited-process retention cleanup from the
    exit event, so the process entry could be removed before the output
    streams had actually closed.
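
    The race can be reproduced outside the exec-server. The following is a
    minimal sketch, assuming a Unix host with `sh` on `PATH`; the function
    name `output_after_parent_exit` is hypothetical and is not part of the
    codebase.

    ```rust
    use std::io::Read;
    use std::process::{Command, Stdio};

    /// Spawns a shell that exits immediately while a background child keeps
    /// the inherited stdout pipe open. Returns whether the parent exited
    /// successfully and everything read from the pipe until EOF.
    fn output_after_parent_exit() -> std::io::Result<(bool, String)> {
        let mut child = Command::new("sh")
            .arg("-c")
            // The parent prints, backgrounds a child, and exits at once.
            .arg("echo first; (sleep 1; echo second) &")
            .stdin(Stdio::null())
            .stdout(Stdio::piped())
            .spawn()?;
        let mut stdout = child.stdout.take().expect("stdout was piped");

        // The exit of the watched process is observed first...
        let status = child.wait()?;

        // ...but the pipe reaches EOF only once the background child has
        // closed its copy of the write end, so late output still arrives.
        let mut output = String::new();
        stdout.read_to_string(&mut output)?;
        Ok((status.success(), output))
    }
    ```

    Here `wait` returns before `second` is written, so any cleanup keyed on
    the exit alone would run while output is still pending.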
    
    While stress-testing the exec-server unit suite,
    `server::handler::tests::long_poll_read_fails_after_session_resume`
    exposed a separate test race: it started a short-lived command that
    could exit and wake the pending long-poll read before the session-resume
    assertion observed the resumed-session error. That test is intended to
    cover resume eviction, not process-exit delivery, so this change keeps
    the process alive and quiet while the second connection resumes the
    session.
    
    ## What changed
    
    - Keep exec-server process entries retained until stdout/stderr streams
    close, then start the post-exit retention timer from the closed event.
    - Wake long-poll readers when the closed event is emitted.
    - Add focused `local_process` unit coverage that proves late output is
    still retained after the short test retention interval has elapsed, and
    that closed process entries are eventually evicted.
    - Add a local and remote regression test where a parent exits while a
    child keeps inherited stdout open. The child waits on an explicit
    release file, so the test deterministically observes exit first,
    releases the child, then requires a nonzero-wait read from the exit
    sequence to receive the late output.
    - In `codex-rs/exec-server/src/server/handler/tests.rs`, make
    `long_poll_read_fails_after_session_resume` run a long-lived silent
    command instead of a short command that prints and exits. This isolates
    the test to session-resume behavior and prevents a normal process exit
    from satisfying the pending long-poll read first.
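
    The retention rule above can be sketched as a small state model. This is
    an illustration under stated assumptions, not the exec-server
    implementation: `RetainedProcess` and its methods are hypothetical names,
    and time is passed in explicitly so the behavior is easy to check.

    ```rust
    use std::time::{Duration, Instant};

    /// Hypothetical retained-process entry.
    #[derive(Debug, Default)]
    struct RetainedProcess {
        output: Vec<u8>,
        exit_code: Option<i32>,
        closed_at: Option<Instant>,
    }

    impl RetainedProcess {
        fn on_output(&mut self, chunk: &[u8]) {
            self.output.extend_from_slice(chunk);
        }

        /// Exit is recorded but does not start the retention timer.
        fn on_exit(&mut self, code: i32) {
            self.exit_code = Some(code);
        }

        /// The retention timer starts only when the streams have closed.
        fn on_closed(&mut self, now: Instant) {
            self.closed_at = Some(now);
        }

        fn has_exited(&self) -> bool {
            self.exit_code.is_some()
        }

        /// An entry is evictable only after `retention` has elapsed since close.
        fn should_evict(&self, now: Instant, retention: Duration) -> bool {
            match self.closed_at {
                Some(closed) => now.duration_since(closed) >= retention,
                None => false,
            }
        }
    }
    ```

    With this shape, an exited process whose streams are still open is never
    evicted, no matter how much time has passed since the exit.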
    
    ## Testing
    
    - `cargo test -p codex-exec-server
    exec_process_retains_output_after_exit_until_streams_close`
    - `cargo test -p codex-exec-server local_process::tests`
    - `cargo test -p codex-exec-server`
    - `just fix -p codex-exec-server`
    - `bazel test //codex-rs/exec-server:exec-server-unit-tests
    //codex-rs/exec-server:exec-server-exec_process-test
    //codex-rs/exec-server:exec-server-file_system-test
    //codex-rs/exec-server:exec-server-http_client-test
    //codex-rs/exec-server:exec-server-initialize-test
    //codex-rs/exec-server:exec-server-process-test
    //codex-rs/exec-server:exec-server-websocket-test`
    - `bazel test --runs_per_test=25
    //codex-rs/exec-server:exec-server-unit-tests`
    
    ## Documentation
    
    No docs update needed; this is an internal exec-server correctness fix.
  • exec-server: wait for close after observed exit (#19130)
    ## Why
    
    Windows CI can flake in
    `server::handler::tests::output_and_exit_are_retained_after_notification_receiver_closes`
    after a process has exited but before both output streams have closed.
    `exec/read` returned immediately whenever `exited` was true, so callers
    that had already observed the exit event could spin instead of
    long-polling for the later `closed` state.
    
    ## What Changed
    
    - Keep returning immediately when a terminal exit event is newly
    observable.
    - Allow later reads, after the caller has advanced past that event, to
    wait for `closed` or new output until `wait_ms` expires.
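
    The two rules above can be sketched as a pure decision function. The
    names `ProcessSnapshot`, `ReadAction`, `decide_read`, and `after_seq` are
    assumptions made for illustration; the real handler and its request
    fields may differ.

    ```rust
    #[derive(Debug, PartialEq, Eq)]
    enum ReadAction {
        ReturnNow,
        WaitUpTo(u64),
    }

    /// Hypothetical view of a process at the time of a read.
    struct ProcessSnapshot {
        last_output_seq: u64,
        exit_seq: Option<u64>,
        closed: bool,
    }

    /// `after_seq` is the last sequence number the caller has already seen.
    fn decide_read(snapshot: &ProcessSnapshot, after_seq: u64, wait_ms: u64) -> ReadAction {
        let has_new_output = snapshot.last_output_seq > after_seq;
        // The exit event is "newly observable" only if the caller has not
        // advanced past it yet.
        let exit_is_new = snapshot.exit_seq.map_or(false, |seq| seq > after_seq);

        if snapshot.closed || has_new_output || exit_is_new || wait_ms == 0 {
            ReadAction::ReturnNow
        } else {
            // Exited but not closed, and nothing new: long-poll instead of
            // returning immediately, so the caller does not spin.
            ReadAction::WaitUpTo(wait_ms)
        }
    }
    ```

    The previous behavior corresponds to treating `exit_seq.is_some()` as
    sufficient to return, which is what let callers spin between exit and
    close.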
    
    ## Verification
    
    - CI pending.
  • [3/6] Add pushed exec process events (#18020)
    ## Summary
    - Add a pushed `ExecProcessEvent` stream alongside retained
    `process/read` output.
    - Publish local and remote output, exit, close, and failure events.
    - Cover the event stream with shared local/remote exec process tests.
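
    A consumer of such a stream can be sketched with a standard channel. This
    is a simplified illustration: `ProcessEvent`, `Completed`, and
    `drain_events` are hypothetical stand-ins, and the variants of the real
    `ExecProcessEvent` may carry different fields.

    ```rust
    use std::sync::mpsc::Receiver;

    /// Hypothetical stand-in for the pushed event type.
    #[derive(Debug, Clone, PartialEq, Eq)]
    enum ProcessEvent {
        Output(Vec<u8>),
        Exited(i32),
        Closed,
        Failed(String),
    }

    #[derive(Debug, Default, PartialEq, Eq)]
    struct Completed {
        output: Vec<u8>,
        exit_code: Option<i32>,
        closed: bool,
        failure: Option<String>,
    }

    /// Folds pushed events into a final result, stopping at a terminal event
    /// (`Closed` or `Failed`) or when the sender goes away.
    fn drain_events(events: Receiver<ProcessEvent>) -> Completed {
        let mut done = Completed::default();
        for event in events {
            match event {
                ProcessEvent::Output(chunk) => done.output.extend(chunk),
                // Exit is not terminal: output may still follow it.
                ProcessEvent::Exited(code) => done.exit_code = Some(code),
                ProcessEvent::Closed => {
                    done.closed = true;
                    break;
                }
                ProcessEvent::Failed(message) => {
                    done.failure = Some(message);
                    break;
                }
            }
        }
        done
    }
    ```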
    
    ## Testing
    - `cargo check -p codex-exec-server`
    - `cargo check -p codex-rmcp-client`
    - Not run: `cargo test` per repo instruction; CI will cover.
    
    ## Stack
    ```text
    o  #18027 [6/6] Fail exec client operations after disconnect
    │
    o  #18212 [5/6] Wire executor-backed MCP stdio
    │
    o  #18087 [4/6] Abstract MCP stdio server launching
    │
    @  #18020 [3/6] Add pushed exec process events
    │
    o  #18086 [2/6] Support piped stdin in exec process API
    │
    o  #18085 [1/6] Add MCP server environment config
    │
    o  main
    ```
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • [2/8] Support piped stdin in exec process API (#18086)
    ## Summary
    - Add an explicit stdin mode to process/start.
    - Keep normal non-interactive exec stdin closed while allowing
    pipe-backed processes.
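
    The distinction can be sketched with the standard library. `StdinMode`
    and `run_with_stdin` are hypothetical names, the example assumes a Unix
    host, and `Stdio::null()` is used here as an approximation of "closed"
    stdin; the actual `process/start` field and its handling may differ.

    ```rust
    use std::io::Write;
    use std::process::{Command, Stdio};

    /// Hypothetical stdin mode for a start request.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum StdinMode {
        Closed,
        Piped,
    }

    /// Runs `program`, feeding `input` only when stdin is piped, and returns
    /// the captured stdout.
    fn run_with_stdin(program: &str, mode: StdinMode, input: &[u8]) -> std::io::Result<String> {
        let stdin = match mode {
            // The child sees immediate EOF on stdin.
            StdinMode::Closed => Stdio::null(),
            StdinMode::Piped => Stdio::piped(),
        };
        let mut child = Command::new(program)
            .stdin(stdin)
            .stdout(Stdio::piped())
            .spawn()?;

        // `child.stdin` is `Some` only in piped mode. Dropping the handle at
        // the end of this block closes the pipe so the child sees EOF.
        if let Some(mut pipe) = child.stdin.take() {
            pipe.write_all(input)?;
        }

        let output = child.wait_with_output()?;
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    }
    ```

    For example, `run_with_stdin("cat", StdinMode::Piped, b"hello\n")` echoes
    the input back, while the same call with `StdinMode::Closed` ignores it.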
    
    ## Stack
    ```text
    o  #18027 [8/8] Fail exec client operations after disconnect
    │
    o  #18025 [7/8] Cover MCP stdio tests with executor placement
    │
    o  #18089 [6/8] Wire remote MCP stdio through executor
    │
    o  #18088 [5/8] Add executor process transport for MCP stdio
    │
    o  #18087 [4/8] Abstract MCP stdio server launching
    │
    o  #18020 [3/8] Add pushed exec process events
    │
    @  #18086 [2/8] Support piped stdin in exec process API
    │
    o  #18085 [1/8] Add MCP server environment config
    │
    o  main
    ```
    
    Co-authored-by: Codex <noreply@openai.com>
  • Build remote exec env from exec-server policy (#17216)
    ## Summary
    - add an exec-server `envPolicy` field; when present, the server starts
    from its own process env and applies the shell environment policy there
    - keep `env` as the exact environment for local/embedded starts, but
    make it an overlay for remote unified-exec starts
    - move the shell-environment-policy builder into `codex-config` so Core
    and exec-server share the inherit/filter/set/include behavior
    - overlay only runtime/sandbox/network deltas from Core onto the
    exec-server-derived env
    
    ## Why
    Remote unified exec was materializing the shell env inside Core and
    forwarding the whole map to exec-server, so remote processes could
    inherit the orchestrator machine's `HOME`, `PATH`, etc. This keeps the
    base env on the executor while preserving Core-owned runtime additions
    like `CODEX_THREAD_ID`, unified-exec defaults, network proxy env, and
    sandbox marker env.
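
    The layering can be sketched as follows. This is a deliberately reduced
    model: `EnvPolicy` here has only an exclude list and a set map, whereas
    the shared policy builder also covers inherit and include behavior, and
    all names are hypothetical.

    ```rust
    use std::collections::BTreeMap;

    type Env = BTreeMap<String, String>;

    /// Hypothetical, reduced policy: drop some variables, force others.
    struct EnvPolicy {
        exclude: Vec<String>,
        set: Env,
    }

    /// Small helper for building an `Env` from string pairs.
    fn env_of(pairs: &[(&str, &str)]) -> Env {
        pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect()
    }

    /// Starts from the executor's own environment, applies the policy there,
    /// then overlays only the deltas sent by the orchestrator.
    fn build_remote_env(executor_env: &Env, policy: &EnvPolicy, overlay: &Env) -> Env {
        let mut env: Env = executor_env
            .iter()
            .filter(|(name, _)| !policy.exclude.contains(*name))
            .map(|(name, value)| (name.clone(), value.clone()))
            .collect();
        env.extend(policy.set.clone());
        // The overlay wins last, but it carries only runtime additions, so
        // the executor's `HOME` and `PATH` survive.
        env.extend(overlay.clone());
        env
    }
    ```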
    
    ## Validation
    - `just fmt`
    - `git diff --check`
    - `cargo test -p codex-exec-server --lib`
    - `cargo test -p codex-core --lib unified_exec::process_manager::tests`
    - `cargo test -p codex-core --lib exec_env::tests`
    - `cargo test -p codex-core --lib exec_env_tests` (compile-only; filter
    matched 0 tests)
    - `cargo test -p codex-config --lib shell_environment` (compile-only;
    filter matched 0 tests)
    - `just bazel-lock-update`
    
    ## Known local validation issue
    - `just bazel-lock-check` is not runnable in this checkout: it invokes
    `./scripts/check-module-bazel-lock.sh`, which is missing.
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
    Co-authored-by: pakrym-oai <pakrym@openai.com>
  • feat: move exec-server ownership (#16344)
    This introduces session-scoped ownership for exec-server so ws
    disconnects no longer immediately kill running remote exec processes,
    and it prepares the protocol for reconnect-based resume.
    - add session_id / resume_session_id to the exec-server initialize
    handshake
    - move process ownership under a shared session registry
    - detach sessions on websocket disconnect and expire them after a TTL
    instead of killing processes immediately (we will resume based on this)
    - allow a new connection to resume an existing session and take over
    notifications/ownership
    - use UUIDs as session IDs so they are not predictable, since we don't have auth for now
    - make detached-session expiry authoritative at resume time so teardown
    wins at the TTL boundary
    - reject long-poll process/read calls that get resumed out from under an
    older attachment
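
    The detach, expire, and resume rules above can be sketched as a small
    registry. This is an illustrative model only: `SessionRegistry`,
    `Attachment`, and `ResumeError` are hypothetical names, session IDs are
    plain strings rather than UUIDs, and time is passed in explicitly.

    ```rust
    use std::collections::HashMap;
    use std::time::{Duration, Instant};

    #[derive(Debug, PartialEq, Eq)]
    enum ResumeError {
        UnknownSession,
        Expired,
    }

    enum Attachment {
        Attached { connection: u64 },
        Detached { since: Instant },
    }

    struct SessionRegistry {
        ttl: Duration,
        sessions: HashMap<String, Attachment>,
    }

    impl SessionRegistry {
        fn new(ttl: Duration) -> Self {
            Self { ttl, sessions: HashMap::new() }
        }

        fn attach_new(&mut self, session_id: &str, connection: u64) {
            self.sessions
                .insert(session_id.to_string(), Attachment::Attached { connection });
        }

        /// A disconnect detaches the session instead of killing its processes.
        fn detach(&mut self, session_id: &str, now: Instant) {
            if let Some(attachment) = self.sessions.get_mut(session_id) {
                *attachment = Attachment::Detached { since: now };
            }
        }

        /// Expiry is checked at resume time, so teardown wins at the TTL boundary.
        fn resume(&mut self, session_id: &str, connection: u64, now: Instant) -> Result<(), ResumeError> {
            let expired = match self.sessions.get(session_id) {
                None => return Err(ResumeError::UnknownSession),
                Some(Attachment::Detached { since }) => now.duration_since(*since) >= self.ttl,
                Some(Attachment::Attached { .. }) => false,
            };
            if expired {
                self.sessions.remove(session_id);
                return Err(ResumeError::Expired);
            }
            self.attach_new(session_id, connection);
            Ok(())
        }

        /// Long-poll reads from an older attachment fail this check.
        fn is_current(&self, session_id: &str, connection: u64) -> bool {
            matches!(
                self.sessions.get(session_id),
                Some(Attachment::Attached { connection: owner }) if *owner == connection
            )
        }
    }
    ```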
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • feat: use ProcessId in exec-server (#15866)
    Use a full struct for the ProcessId to increase readability and make it
    easier to evolve in the future if needed.
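
    A minimal sketch of the newtype pattern this describes, assuming the
    identifier wraps a string (the actual field layout of `ProcessId` is not
    shown here, and `exit_code_for` is a hypothetical helper):

    ```rust
    use std::collections::HashMap;
    use std::fmt;

    /// A dedicated type keeps process IDs from being mixed up with other
    /// strings and leaves room to add fields later.
    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    struct ProcessId(String);

    impl ProcessId {
        fn new(raw: impl Into<String>) -> Self {
            Self(raw.into())
        }

        fn as_str(&self) -> &str {
            &self.0
        }
    }

    impl fmt::Display for ProcessId {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            f.write_str(&self.0)
        }
    }

    /// Signatures now say which kind of identifier they expect.
    fn exit_code_for(table: &HashMap<ProcessId, i32>, id: &ProcessId) -> Option<i32> {
        table.get(id).copied()
    }
    ```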
  • feat: exec-server prep for unified exec (#15691)
    This PR partially rebases `unified_exec` onto the `exec-server` and adapts
    the `exec-server` accordingly.
    
    ## What changed in `exec-server`
    
    1. Replaced the old broadcast-driven, process-global event model with
    process-scoped session events. The goal is to have a dedicated handler
    for each process.
    2. Extended the protocol contract to support explicit lifecycle status and
    stream ordering:
      - `WriteResponse` now returns `WriteStatus` (Accepted, UnknownProcess,
      StdinClosed, Starting) instead of a bool.
      - Added `seq` fields to output/exited notifications.
      - Added a terminal `process/closed` notification.
    3. Demultiplexed remote notifications into per-process channels, as was
    done for the event system.
    4. Local and remote backends now both implement ExecBackend.
    5. Local backend wraps internal process ID/operations into per-process
    ExecProcess objects.
    6. Remote backend registers a session channel before launch and
    unregisters on failed launch.
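
    Items 2, 3, and 6 can be sketched together. Only the four `WriteStatus`
    variant names come from the list above; `StdinState`, `Notification`,
    `Kind`, and `Demux` are hypothetical, and for brevity every notification
    carries a `seq` here.

    ```rust
    use std::collections::HashMap;
    use std::sync::mpsc::{self, Receiver, Sender};

    /// Outcome of a stdin write, replacing the earlier boolean.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum WriteStatus {
        Accepted,
        UnknownProcess,
        StdinClosed,
        Starting,
    }

    /// Hypothetical per-process stdin state used to pick a status.
    #[derive(Debug, Clone, Copy)]
    enum StdinState {
        Starting,
        Open,
        Closed,
    }

    fn write_status(state: Option<StdinState>) -> WriteStatus {
        match state {
            None => WriteStatus::UnknownProcess,
            Some(StdinState::Starting) => WriteStatus::Starting,
            Some(StdinState::Open) => WriteStatus::Accepted,
            Some(StdinState::Closed) => WriteStatus::StdinClosed,
        }
    }

    #[derive(Debug, Clone, PartialEq, Eq)]
    enum Kind {
        Output(Vec<u8>),
        Exited(i32),
        Closed,
    }

    #[derive(Debug, Clone, PartialEq, Eq)]
    struct Notification {
        process_id: String,
        seq: u64,
        kind: Kind,
    }

    /// Routes each notification to the channel of the process it belongs to.
    #[derive(Default)]
    struct Demux {
        channels: HashMap<String, Sender<Notification>>,
    }

    impl Demux {
        /// Called before launch so no early notification is lost.
        fn register(&mut self, process_id: &str) -> Receiver<Notification> {
            let (tx, rx) = mpsc::channel();
            self.channels.insert(process_id.to_string(), tx);
            rx
        }

        /// Called when a launch fails or the process is done.
        fn unregister(&mut self, process_id: &str) {
            self.channels.remove(process_id);
        }

        /// Returns false when no channel is registered or the receiver is gone.
        fn dispatch(&self, notification: Notification) -> bool {
            match self.channels.get(&notification.process_id) {
                Some(tx) => tx.send(notification).is_ok(),
                None => false,
            }
        }
    }
    ```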
    
    ## What changed in `unified_exec`
    
    1. Added a unified process-state model and a backend-neutral process
    wrapper. This will probably disappear in the future, but it makes it
    easier to keep the work flowing on both sides.
      - `UnifiedExecProcess` now handles both local PTY sessions and remote
      exec-server processes through a shared `ProcessHandle`.
      - Added `ProcessState` to track `has_exited`, `exit_code`, and the
      terminal failure message consistently across backends.
    2. Routed write and lifecycle handling through process-level methods.
    
    ## Rationale
    
    1. The change centralizes execution transport in exec-server while
    preserving policy and orchestration ownership in core, avoiding
    duplicated launch approval logic. This comes from internal discussion.
    2. Session-scoped events remove coupling/cross-talk between processes
    and make stream ordering and terminal state explicit (seq, closed,
    failed).
    3. Surfacing failure paths (remote launch failures, write failures,
    transport disconnects) makes command tool output and cleanup behavior
    deterministic.
    
    ## Follow-ups:
    * Unify the concept of thread ID behind an obfuscated struct
    * FD handling
    * Full zsh-fork compatibility
    * Full network sandboxing compatibility
    * Handle ws disconnection
  • Split exec process into local and remote implementations (#15233)
    ## Summary
    - match the exec-process structure to filesystem PR #15232
    - expose `ExecProcess` on `Environment`
    - make `LocalProcess` the real implementation and `RemoteProcess` a thin
    network proxy over `ExecServerClient`
    - make `ProcessHandler` a thin RPC adapter delegating to `LocalProcess`
    - add a shared local/remote process test
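
    The layering described above can be sketched as follows. This is a
    reduced illustration, assuming a Unix host with `sh`: the trait has a
    single hypothetical `run` method, `Client` and `Loopback` stand in for
    the network client, and the `sketch/run` method name is invented for the
    example.

    ```rust
    use std::process::Command;

    /// Backend-neutral interface (the real trait is richer than this).
    trait ExecProcess {
        fn run(&self, argv: &[String]) -> Result<i32, String>;
    }

    /// The real implementation: actually spawns the process.
    struct LocalProcess;

    impl ExecProcess for LocalProcess {
        fn run(&self, argv: &[String]) -> Result<i32, String> {
            let (program, args) = argv.split_first().ok_or("empty argv")?;
            let status = Command::new(program)
                .args(args)
                .status()
                .map_err(|e| e.to_string())?;
            status.code().ok_or_else(|| "terminated by signal".to_string())
        }
    }

    /// Thin RPC adapter: decodes the method and delegates to `LocalProcess`.
    struct ProcessHandler {
        local: LocalProcess,
    }

    impl ProcessHandler {
        fn handle(&self, method: &str, argv: &[String]) -> Result<i32, String> {
            match method {
                "sketch/run" => self.local.run(argv),
                other => Err(format!("unknown method: {other}")),
            }
        }
    }

    /// Hypothetical stand-in for the exec-server client.
    trait Client {
        fn call(&self, method: &str, argv: &[String]) -> Result<i32, String>;
    }

    /// In-process transport used here instead of a network connection.
    struct Loopback {
        handler: ProcessHandler,
    }

    impl Client for Loopback {
        fn call(&self, method: &str, argv: &[String]) -> Result<i32, String> {
            self.handler.handle(method, argv)
        }
    }

    /// Thin proxy: forwards every operation through the client.
    struct RemoteProcess<C: Client> {
        client: C,
    }

    impl<C: Client> ExecProcess for RemoteProcess<C> {
        fn run(&self, argv: &[String]) -> Result<i32, String> {
            self.client.call("sketch/run", argv)
        }
    }

    /// Shared helper that exercises either backend the same way.
    fn exit_code_via(backend: &dyn ExecProcess, script: &str) -> Result<i32, String> {
        backend.run(&["sh".to_string(), "-c".to_string(), script.to_string()])
    }
    ```

    Because both backends sit behind one trait, a single helper such as
    `exit_code_via` can serve as a shared local/remote test body.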
    
    ## Validation
    - `just fmt`
    - `CARGO_TARGET_DIR=~/.cache/cargo-target/codex cargo test -p
    codex-exec-server`
    - `just fix -p codex-exec-server`
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>