13 Commits

  • [codex] Trace exec-server JSON-RPC requests (#27466)
    ## Why
    
    Exec-server JSON-RPC calls can cross local and remote transports, but
    trace context stopped at the RPC boundary. That made client and server
    work difficult to correlate when diagnosing latency or failures.
    
    ## What changed
    
    - Propagate the current W3C trace context on outbound JSON-RPC requests.
    - Parent inbound request spans from received trace context.
    - Record the received JSON-RPC method on server spans and keep each span
    open through response enqueue.
    - Add only the OTEL dependencies required by the exec-server crate.
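    
    As a rough illustration of what propagating W3C trace context involves, the sketch below formats and parses a `traceparent` value using only the standard library. The `TraceParent` type and its methods are hypothetical names for this example; the PR uses OTEL crates, and how the value is attached to a JSON-RPC request is not shown here.
    
    ```rust
    /// Fields of a W3C `traceparent` value (version 00).
    /// Hypothetical type for illustration; not the exec-server's actual API.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct TraceParent {
        pub trace_id: u128,
        pub span_id: u64,
        pub flags: u8,
    }
    
    /// True when `s` is exactly `len` lowercase hex digits.
    fn is_lower_hex(s: &str, len: usize) -> bool {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    }
    
    impl TraceParent {
        /// Render the value a client would send with an outbound request.
        pub fn to_header(&self) -> String {
            format!("00-{:032x}-{:016x}-{:02x}", self.trace_id, self.span_id, self.flags)
        }
    
        /// Parse a received value. Returns `None` for malformed input or
        /// all-zero ids, in which case a server would start a new root span.
        pub fn parse(value: &str) -> Option<Self> {
            let mut parts = value.trim().split('-');
            let version = parts.next()?;
            let trace_id = parts.next()?;
            let span_id = parts.next()?;
            let flags = parts.next()?;
            if version != "00" || parts.next().is_some() {
                return None;
            }
            if !is_lower_hex(trace_id, 32) || !is_lower_hex(span_id, 16) || !is_lower_hex(flags, 2) {
                return None;
            }
            let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
            let span_id = u64::from_str_radix(span_id, 16).ok()?;
            let flags = u8::from_str_radix(flags, 16).ok()?;
            if trace_id == 0 || span_id == 0 {
                return None;
            }
            Some(Self { trace_id, span_id, flags })
        }
    }
    ```
    
    A server that parses a valid value can use `trace_id` and `span_id` as the parent of its request span, which is what lets client and server work appear in one trace.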
    
    ## Stack
    
    Review and land this stack in order:
    
    1. #27466 — trace exec-server JSON-RPC requests **(this PR)**
    2. #27467 — record bounded connection, request, and process lifecycle
    metrics
    3. #27470 — observe remote registration and Noise rendezvous lifecycle
    
    ## Validation
    
    - `just test -p codex-exec-server --lib` (153 passed)
    - `just bazel-lock-check`
    - `just fix -p codex-exec-server`
  • protocol: separate app and exec RPC ownership (#29714)
    ## Why
    
    The app-server and exec-server expose separate JSON-RPC APIs, but
    exec-server currently sources its serialized protocol and envelope types
    through app-server-oriented code. Giving each API an explicit owner
    makes the crate boundary legible without introducing shared generic
    envelopes.
    
    ## What changed
    
    - Added `codex-exec-server-protocol` to own exec DTOs, process IDs, and
    JSON-RPC envelopes.
    - Updated exec-server clients, transports, handlers, and tests to use
    the new crate.
    - Exposed app-server's existing JSON-RPC types through a public `rpc`
    module while retaining root re-exports.
    - Preserved existing wire shapes, including exec `PathUri` behavior.
    
    ## Stack
    
    This is PR 1 of 6. Next: [PR
    #29721](https://github.com/openai/codex/pull/29721), which moves auth
    mode below the app wire boundary.
    
    ## Validation
    
    - Exec-server protocol and server coverage passed in the focused
    protocol test runs.
    - App-server protocol schema fixtures passed.
  • Resume exec-server sessions after disconnect (#28512)
    Supersedes #28288 (closed).
    
    ## Why
    
    A short WebSocket interruption currently ends every client-side process
    handle, even though exec-server keeps the server session and its
    processes alive for a short time.
    
    This is especially visible for executor-backed stdio MCP servers: a
    temporary connection loss becomes a permanent `Transport closed` error.
    The server already has the information needed to resume the session, but
    the client opens a fresh session instead of using it.
    
    This change reconnects below the process and MCP layers. Existing
    process handles stay valid, missed output is recovered, and the same
    server-side processes continue running.
    
    ## State machine
    
    One logical `ExecServerClient` stays alive while its underlying RPC
    connection changes generations.
    
    ```text
                             transport closes
           +------------------------------------------------+
           |                                                v
    +-------------+                                  +-------------+
    |  Connected  |                                  | Recovering  |
    +-------------+                                  +-------------+
           ^                                                |
           | session resumed, processes caught up           | retryable error
           +------------------------------------------------+ loops until deadline
                                                            |
                                                            | deadline or permanent error
                                                            v
                                                      +-------------+
                                                      |   Failed    |
                                                      +-------------+
    ```
    
    ### `Connected`
    
    - New RPC calls use the current connection.
    - Process notifications are published in sequence order.
    - A disconnect only starts recovery if it came from the current
    connection generation. Late events from older generations cannot replace
    the active connection.
    
    ### `Recovering`
    
    - New calls wait instead of choosing a half-connected RPC client.
    - Existing process handles, wake subscriptions, and event subscriptions
    stay open.
    - Streaming HTTP response bodies fail immediately because their byte
    streams cannot be resumed safely.
    - Recovery first waits for process starts that were already in flight. A
    start whose result became ambiguous is cleaned up after reconnection
    instead of being silently adopted.
    - The client reconnects with the learned `session_id`. The server may
    briefly report that the old connection is still attached, so that error
    is retried until the detach finishes.
    - The notification consumer starts before the resume handshake
    completes. This prevents a busy process from filling the notification
    queue and blocking the initialize response.
    - Before installing the new connection, the client catches up every
    recoverable process with `process/read`.
    
    ### `Failed`
    
    - Recovery stops after 25 seconds or after a permanent error.
    - Waiting calls are released with one stable disconnect error.
    - Existing process sessions receive a terminal failure instead of
    waiting forever.
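    
    The three states above can be sketched as a pure transition function. This is a simplified model, not the client's real code: the `State`, `Event`, and `step` names are hypothetical, and the sketch assumes the generation counter increases by one on each successful resume.
    
    ```rust
    /// Connection state of one logical client (hypothetical model).
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum State {
        Connected { generation: u64 },
        Recovering { generation: u64 },
        Failed,
    }
    
    /// Inputs that can move the state machine (hypothetical model).
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum Event {
        /// The transport for the given connection generation closed.
        TransportClosed { generation: u64 },
        /// Session resumed and every recoverable process caught up.
        Resumed,
        /// A retryable failure, such as "session is still attached".
        RetryableError,
        PermanentError,
        DeadlineElapsed,
    }
    
    /// Apply one event. Events that do not apply to the current state,
    /// including disconnects from stale generations, leave it unchanged.
    pub fn step(state: State, event: Event) -> State {
        match (state, event) {
            (State::Connected { generation }, Event::TransportClosed { generation: closed })
                if closed == generation =>
            {
                State::Recovering { generation }
            }
            (State::Recovering { generation }, Event::Resumed) => State::Connected {
                generation: generation + 1,
            },
            (State::Recovering { generation }, Event::RetryableError) => {
                State::Recovering { generation }
            }
            (State::Recovering { .. }, Event::PermanentError | Event::DeadlineElapsed) => {
                State::Failed
            }
            (other, _) => other,
        }
    }
    ```
    
    Keeping the generation inside the state makes the stale-disconnect rule a plain equality check, and `Failed` is terminal because no arm leads out of it.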
    
    ## Recovering process events
    
    Output, exit, and close events share one sequence. During normal
    operation, the client buffers early events until every lower sequence
    has been published.
    
    After reconnection, the client reads each process starting after its
    last published sequence:
    
    1. Retained output chunks are inserted by sequence number.
    2. Exit and close state are reconstructed in their sequence positions.
    3. Events already received as live notifications are ignored as
    duplicates.
    4. Newly contiguous events are published in order.
    5. If the server no longer retains enough output to fill a sequence gap,
    only that process is terminated and failed. The recovered connection
    remains usable for other processes.
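    
    A minimal sketch of the ordering rules above, assuming events carry a `u64` sequence number. The `Reorder` type is a hypothetical name; the real client also reconstructs exit and close state, which this example treats as ordinary events.
    
    ```rust
    use std::collections::BTreeMap;
    
    /// Buffers out-of-order events and releases them in sequence order.
    /// Hypothetical type for illustration only.
    pub struct Reorder<T> {
        /// Lowest sequence number that has not been published yet.
        next_seq: u64,
        /// Events received early, keyed by sequence number.
        pending: BTreeMap<u64, T>,
    }
    
    impl<T> Reorder<T> {
        pub fn new(first_seq: u64) -> Self {
            Self { next_seq: first_seq, pending: BTreeMap::new() }
        }
    
        /// Offer one event from a live notification or a catch-up read.
        /// Returns the events that became contiguous, in order. Events
        /// already published or already buffered are dropped as duplicates.
        pub fn offer(&mut self, seq: u64, event: T) -> Vec<T> {
            if seq < self.next_seq || self.pending.contains_key(&seq) {
                return Vec::new();
            }
            self.pending.insert(seq, event);
            let mut ready = Vec::new();
            while let Some(event) = self.pending.remove(&self.next_seq) {
                ready.push(event);
                self.next_seq += 1;
            }
            ready
        }
    
        /// After every retained event has been offered, a publish position
        /// still below the server's next sequence means a gap that cannot
        /// be filled, so this process should be failed.
        pub fn has_gap_before(&self, server_next_seq: u64) -> bool {
            self.next_seq < server_next_seq
        }
    }
    ```
    
    Because each process would own its own buffer, an unfillable gap in one of them does not affect the others, which matches the per-process failure described in step 5.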
    
    The server reports its full next event sequence for unbounded reads,
    including exit and close events. Closed processes remain readable for
    the same 30-second window used to retain detached sessions.
    
    ## Other details
    
    - Detached server sessions are retained for 30 seconds, leaving margin
    around the client's 25-second recovery deadline.
    - Session attach and detach update the active notification sender under
    the same attachment lock, so an old connection cannot clear a newly
    attached sender.
    - A dedicated error code distinguishes the temporary "session is still
    attached" race from permanent initialization errors.
    - Process starts are identity-checked on both client and server. Cleanup
    from an older start cannot remove a newer process that reused the same
    ID.
    - Mutating requests that were already in flight when the transport
    closed are not replayed, because the client cannot know whether the
    server applied them. Requests started after recovery is known wait for
    the replacement connection.
    - We assume the client and server are upgraded together: both run code
    from before this PR, or both run code from after it.
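    
    To illustrate the identity check on process starts, here is a small sketch in which every start receives a fresh token and cleanup only removes an entry whose token still matches. `ProcessTable` and its token scheme are assumptions made for this example, not the PR's actual data structures.
    
    ```rust
    use std::collections::HashMap;
    
    /// Tracks which start currently owns each process ID.
    /// Hypothetical type for illustration only.
    #[derive(Default)]
    pub struct ProcessTable {
        next_token: u64,
        entries: HashMap<String, u64>,
    }
    
    impl ProcessTable {
        /// Register a start for `process_id` and return its identity token.
        /// A later start that reuses the ID replaces the earlier entry.
        pub fn start(&mut self, process_id: &str) -> u64 {
            let token = self.next_token;
            self.next_token += 1;
            self.entries.insert(process_id.to_string(), token);
            token
        }
    
        /// Remove the entry only if it still belongs to the start identified
        /// by `token`. Returns whether anything was removed.
        pub fn cleanup(&mut self, process_id: &str, token: u64) -> bool {
            if self.entries.get(process_id) == Some(&token) {
                self.entries.remove(process_id);
                true
            } else {
                false
            }
        }
    
        pub fn contains(&self, process_id: &str) -> bool {
            self.entries.contains_key(process_id)
        }
    }
    ```
    
    With this shape, cleanup for an ambiguous older start is a no-op once a newer process has taken the same ID.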
    
    ## User impact
    
    Long-running commands and stdio MCP servers can survive a temporary
    exec-server WebSocket interruption without changing process IDs or
    losing output produced during the outage.
  • [exec-server] serve websocket listener via HTTP upgrade (#21963)
    ## Why
    
    `codex exec-server` should keep the existing public `ws://IP:PORT` URL
    shape while serving that websocket connection through an HTTP upgrade
    path internally. That keeps the client-facing configuration simple and
    allows the listener to work through intermediate HTTP-aware
    infrastructure.
    
    ## What changed
    
    - keep the emitted and configured exec-server URL as `ws://IP:PORT`
    - serve that websocket endpoint through Axum HTTP upgrade handling on
    `/`
    - expose `GET /readyz` from the same listener for readiness checks
    - route upgraded Axum websocket streams through the shared JSON-RPC
    connection machinery
    - initialize the rustls crypto provider before websocket client
    connections
    - preserve inbound binary websocket JSON-RPC parsing for compatibility
    with the prior transport behavior
    
    ## Verification
    
    - `cargo test -p codex-exec-server --test health --test process --test
    websocket --test initialize --test exec_process`
  • exec-server: preserve fs helper runtime env (#18380)
    ## Summary
    - preserve a small fs-helper runtime env allowlist (`PATH`, temp vars)
    instead of launching the sandboxed helper with an empty env
    - add unit coverage for the allowlist and transformed sandbox request
    env
    - add a Linux smoke test that starts the test exec-server with a fake
    `bwrap` on `PATH`, runs a sandboxed fs write through the remote fs
    helper path, and asserts that bwrap path was exercised
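    
    The allowlist idea can be sketched with the standard library alone. The variable names in `HELPER_ENV_ALLOWLIST` are an assumption: the summary only says `PATH` and temp vars, so `TMPDIR`, `TMP`, and `TEMP` stand in for whatever the real list contains, and the function names are hypothetical.
    
    ```rust
    use std::collections::BTreeMap;
    use std::process::Command;
    
    /// Assumed allowlist: `PATH` plus common temp-directory variables.
    const HELPER_ENV_ALLOWLIST: &[&str] = &["PATH", "TMPDIR", "TMP", "TEMP"];
    
    /// Keep only allowlisted variables from the given environment.
    pub fn helper_env<I>(vars: I) -> BTreeMap<String, String>
    where
        I: IntoIterator<Item = (String, String)>,
    {
        vars.into_iter()
            .filter(|(key, _)| HELPER_ENV_ALLOWLIST.contains(&key.as_str()))
            .collect()
    }
    
    /// Build a command that starts from an empty environment and then adds
    /// back the allowlisted variables of the current process. Variables that
    /// are not valid Unicode are skipped.
    pub fn helper_command(program: &str) -> Command {
        let current = std::env::vars_os()
            .filter_map(|(key, value)| Some((key.into_string().ok()?, value.into_string().ok()?)));
        let mut command = Command::new(program);
        command.env_clear().envs(helper_env(current));
        command
    }
    ```
    
    Preserving `PATH` is what lets the helper locate tools such as `bwrap`, which the smoke test above exercises.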
    
    ## Validation
    - `cd /tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
    PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
    && bazel test --bes_backend= --bes_results_url=
    //codex-rs/exec-server:exec-server-file_system-test
    --test_filter=sandboxed_file_system_helper_finds_bwrap_on_preserved_path`
    - `cd /tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
    PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
    && bazel test --bes_backend= --bes_results_url=
    //codex-rs/exec-server:exec-server-unit-tests
    --test_filter="helper_env|sandbox_exec_request_carries_helper_env"`
    - earlier on this branch before the smoke-test harness adjustment: `cd
    /tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
    PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
    && bazel test --bes_backend= --bes_results_url=
    //codex-rs/exec-server:all`
    
    Co-authored-by: Codex <noreply@openai.com>
  • Stabilize exec-server filesystem tests in CI (#17671)
    ## Summary
    - add an exec-server package-local test helper binary that can run
    exec-server and fs-helper flows
    - route exec-server filesystem tests through that helper instead of
    cross-crate codex helper binaries
    - stop relying on Bazel-only extra binary wiring for these tests
    
    ## Testing
    - not run (per repo guidance for codex changes)
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • Stabilize exec-server process tests (#17605)
    Problem: After #17294 switched exec-server tests to launch the top-level
    `codex exec-server` command, parallel remote exec-process cases can
    flake while waiting for the child server's listen URL or transport
    shutdown.
    
    Solution: Serialize remote exec-server-backed process tests and harden
    the harness so spawned servers are killed on drop and shutdown waits for
    the child process to exit.
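    
    A minimal sketch of the kill-on-drop part of that hardening, using only `std::process`. `ServerGuard` is a hypothetical name, and the real harness may be async and differ in detail.
    
    ```rust
    use std::io;
    use std::process::{Child, ExitStatus};
    
    /// Owns a spawned server process and guarantees it is killed and reaped.
    /// Hypothetical type for illustration only.
    pub struct ServerGuard {
        child: Option<Child>,
    }
    
    /// Kill the child and wait for it to exit. `kill` can fail if the
    /// process has already exited; `wait` still reaps it in that case.
    fn terminate(child: &mut Child) -> io::Result<ExitStatus> {
        let _ = child.kill();
        child.wait()
    }
    
    impl ServerGuard {
        pub fn new(child: Child) -> Self {
            Self { child: Some(child) }
        }
    
        /// Explicit shutdown: returns only after the child has exited.
        pub fn shutdown(mut self) -> io::Result<ExitStatus> {
            let mut child = self
                .child
                .take()
                .expect("child is present until shutdown or drop");
            terminate(&mut child)
        }
    }
    
    impl Drop for ServerGuard {
        /// Runs when a test returns early or panics, so the server
        /// process does not outlive the test.
        fn drop(&mut self) {
            if let Some(mut child) = self.child.take() {
                let _ = terminate(&mut child);
            }
        }
    }
    ```
    
    Waiting after the kill matters because a later test may otherwise race with a server that is still shutting down.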
  • Run exec-server fs operations through sandbox helper (#17294)
    ## Summary
    - run exec-server filesystem RPCs requiring sandboxing through a
    `codex-fs` arg0 helper over stdin/stdout
    - keep direct local filesystem execution for `DangerFullAccess` and
    external sandbox policies
    - remove the standalone exec-server binary path in favor of top-level
    arg0 dispatch/runtime paths
    - add sandbox escape regression coverage for local and remote filesystem
    paths
    
    ## Validation
    - `just fmt`
    - `git diff --check`
    - remote devbox: `cd codex-rs && bazel test --bes_backend=
    --bes_results_url= //codex-rs/exec-server:all` (6/6 passed)
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • feat: move exec-server ownership (#16344)
    This introduces session-scoped ownership for exec-server so ws
    disconnects no longer immediately kill running remote exec processes,
    and it prepares the protocol for reconnect-based resume.
    - add session_id / resume_session_id to the exec-server initialize
    handshake
    - move process ownership under a shared session registry
    - detach sessions on websocket disconnect and expire them after a TTL
    instead of killing processes immediately (we will resume based on this)
    - allow a new connection to resume an existing session and take over
    notifications/ownership
    - use UUIDs as session IDs so they are hard to guess, since there is no auth for now
    - make detached-session expiry authoritative at resume time so teardown
    wins at the TTL boundary
    - reject long-poll process/read calls that get resumed out from under an
    older attachment
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • Refactor ExecServer filesystem split between local and remote (#15232)
    For each feature we have:
    1. Trait exposed on environment
    2. **Local Implementation** of the trait
    3. Remote implementation that uses the client to proxy via network
    4. Handler implementation that handles RPC requests and calls into
    **Local Implementation**
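    
    The four pieces can be sketched for a single read operation. Everything here is a stand-in chosen for the example: the `FileSystem` trait shape, the `fs/read` method name, the in-memory `LocalFs`, and a `RemoteFs` that calls the handler directly where the real one would go through the network client.
    
    ```rust
    use std::collections::HashMap;
    
    /// 1. Trait exposed on the environment (hypothetical shape).
    pub trait FileSystem {
        fn read(&self, path: &str) -> Result<Vec<u8>, String>;
    }
    
    /// 2. Local implementation; an in-memory map stands in for the disk.
    pub struct LocalFs {
        files: HashMap<String, Vec<u8>>,
    }
    
    impl LocalFs {
        pub fn new(files: HashMap<String, Vec<u8>>) -> Self {
            Self { files }
        }
    }
    
    impl FileSystem for LocalFs {
        fn read(&self, path: &str) -> Result<Vec<u8>, String> {
            self.files
                .get(path)
                .cloned()
                .ok_or_else(|| format!("not found: {path}"))
        }
    }
    
    /// 4. Handler that serves RPC requests by calling the local implementation.
    pub struct Handler<F: FileSystem> {
        local: F,
    }
    
    impl<F: FileSystem> Handler<F> {
        pub fn new(local: F) -> Self {
            Self { local }
        }
    
        /// Dispatch on a method name. "fs/read" is a made-up name.
        pub fn handle(&self, method: &str, path: &str) -> Result<Vec<u8>, String> {
            match method {
                "fs/read" => self.local.read(path),
                other => Err(format!("unknown method: {other}")),
            }
        }
    }
    
    /// 3. Remote implementation that proxies each call as a request.
    /// Here the "network" is a direct call into the handler.
    pub struct RemoteFs<F: FileSystem> {
        server: Handler<F>,
    }
    
    impl<F: FileSystem> RemoteFs<F> {
        pub fn new(server: Handler<F>) -> Self {
            Self { server }
        }
    }
    
    impl<F: FileSystem> FileSystem for RemoteFs<F> {
        fn read(&self, path: &str) -> Result<Vec<u8>, String> {
            self.server.handle("fs/read", path)
        }
    }
    ```
    
    Callers depend only on the trait, so the same code can run against `LocalFs` or `RemoteFs` without changes.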
  • Remove stdio transport from exec server (#15119)
    Summary
    - delete the deprecated stdio transport plumbing from the exec server
    stack
    - add a basic `exec_server()` harness plus test utilities to start a
    server, send requests, and await events
    - refresh exec-server dependencies, configs, and documentation to
    reflect the new flow
    
    Testing
    - Not run (not requested)
    
    ---------
    
    Co-authored-by: starr-openai <starr@openai.com>
    Co-authored-by: Codex <noreply@openai.com>