## Why
Exec-server JSON-RPC calls can cross local and remote transports, but
trace context stopped at the RPC boundary. That made client and server
work difficult to correlate when diagnosing latency or failures.
## What changed
- Propagate the current W3C trace context on outbound JSON-RPC requests.
- Parent inbound request spans from received trace context.
- Record the received JSON-RPC method on server spans and keep each span
open through response enqueue.
- Add only the OTEL dependencies required by the exec-server crate.
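To make the propagated value concrete, here is a minimal sketch of the W3C `traceparent` value format (`version-traceid-parentid-flags`, lowercase hex). It is illustrative only: the `TraceParent` type and its methods are hypothetical names, the PR presumably relies on the OTEL crates rather than hand-rolled parsing, and where the value is carried inside the JSON-RPC request is not shown here.

```rust
/// A parsed W3C `traceparent` value (version 00 only).
/// Hypothetical type for illustration; not the exec-server's actual code.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: [u8; 16],
    pub parent_id: [u8; 8],
    pub flags: u8,
}

/// Decode exactly `N` bytes from `2 * N` lowercase hex characters.
fn parse_hex<const N: usize>(s: &str) -> Option<[u8; N]> {
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if s.len() != N * 2 || !s.bytes().all(is_lower_hex) {
        return None;
    }
    let mut out = [0u8; N];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[i * 2..i * 2 + 2], 16).ok()?;
    }
    Some(out)
}

impl TraceParent {
    /// Parse a value received with an inbound request.
    pub fn parse(value: &str) -> Option<Self> {
        let mut parts = value.split('-');
        let version = parts.next()?;
        let trace_id = parse_hex::<16>(parts.next()?)?;
        let parent_id = parse_hex::<8>(parts.next()?)?;
        let flags = parse_hex::<1>(parts.next()?)?[0];
        if version != "00" || parts.next().is_some() {
            return None;
        }
        // All-zero trace and parent IDs are invalid per the W3C spec.
        if trace_id == [0; 16] || parent_id == [0; 8] {
            return None;
        }
        Some(Self { trace_id, parent_id, flags })
    }

    /// Serialize the value to attach to an outbound request.
    pub fn to_value(&self) -> String {
        let hex = |bytes: &[u8]| bytes.iter().map(|b| format!("{b:02x}")).collect::<String>();
        format!("00-{}-{}-{:02x}", hex(&self.trace_id), hex(&self.parent_id), self.flags)
    }
}
```

A server that receives a valid value would start its request span as a child of `parent_id` within `trace_id`; an invalid or missing value would simply start a new root span.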
## Stack
Review and land this stack in order:
1. #27466 — trace exec-server JSON-RPC requests **(this PR)**
2. #27467 — record bounded connection, request, and process lifecycle
metrics
3. #27470 — observe remote registration and Noise rendezvous lifecycle
## Validation
- `just test -p codex-exec-server --lib` (153 passed)
- `just bazel-lock-check`
- `just fix -p codex-exec-server`
## Why
The app-server and exec-server expose separate JSON-RPC APIs, but
exec-server currently sources its serialized protocol and envelope types
through app-server-oriented code. Giving each API an explicit owner
makes the crate boundary legible without introducing shared generic
envelopes.
## What changed
- Added `codex-exec-server-protocol` to own exec DTOs, process IDs, and
JSON-RPC envelopes.
- Updated exec-server clients, transports, handlers, and tests to use
the new crate.
- Exposed app-server's existing JSON-RPC types through a public `rpc`
module while retaining root re-exports.
- Preserved existing wire shapes, including exec `PathUri` behavior.
## Stack
This is PR 1 of 6. Next: [PR
#29721](https://github.com/openai/codex/pull/29721), which moves auth
mode below the app wire boundary.
## Validation
- Exec-server protocol and server coverage passed in the focused
protocol test runs.
- App-server protocol schema fixtures passed.
Supersedes #28288 (closed).
## Why
A short WebSocket interruption currently ends every client-side process
handle, even though exec-server keeps the server session and its
processes alive for a short time.
This is especially visible for executor-backed stdio MCP servers: a
temporary connection loss becomes a permanent `Transport closed` error.
The server already has the information needed to resume the session, but
the client opens a fresh session instead of using it.
This change reconnects below the process and MCP layers. Existing
process handles stay valid, missed output is recovered, and the same
server-side processes continue running.
## State machine
One logical `ExecServerClient` stays alive while its underlying RPC
connection changes generations.
```text
                  transport closes
       +------------------------------------------------+
       |                                                v
+-------------+                                  +-------------+
|  Connected  |                                  | Recovering  |
+-------------+                                  +-------------+
       ^                                                |
       | session resumed, processes caught up           | retryable error
       +------------------------------------------------+ loops until deadline
                                                        |
                                                        | deadline or permanent error
                                                        v
                                                 +-------------+
                                                 |   Failed    |
                                                 +-------------+
```
### `Connected`
- New RPC calls use the current connection.
- Process notifications are published in sequence order.
- A disconnect only starts recovery if it came from the current
connection generation. Late events from older generations cannot replace
the active connection.
### `Recovering`
- New calls wait instead of choosing a half-connected RPC client.
- Existing process handles, wake subscriptions, and event subscriptions
stay open.
- Streaming HTTP response bodies fail immediately because their byte
streams cannot be resumed safely.
- Recovery first waits for process starts that were already in flight. A
start whose result became ambiguous is cleaned up after reconnection
instead of being silently adopted.
- The client reconnects with the learned `session_id`. The server may
briefly report that the old connection is still attached, so that error
is retried until the detach finishes.
- The notification consumer starts before the resume handshake
completes. This prevents a busy process from filling the notification
queue and blocking the initialize response.
- Before installing the new connection, the client catches up every
recoverable process with `process/read`.
### `Failed`
- Recovery stops after 25 seconds or after a permanent error.
- Waiting calls are released with one stable disconnect error.
- Existing process sessions receive a terminal failure instead of
waiting forever.
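The transitions described in this section can be sketched as a small pure function. This is a simplified model, not the client's actual code: `ConnState`, `Event`, and `step` are hypothetical names, and the real client also manages waiting calls, subscriptions, and timers that are omitted here.

```rust
/// Simplified connection state; `generation` identifies the underlying RPC connection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnState {
    Connected { generation: u64 },
    Recovering { generation: u64 },
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// The transport of the given connection generation closed.
    TransportClosed { generation: u64 },
    /// A reconnect attempt failed with an error that is worth retrying.
    RetryableError,
    /// The session was resumed and every recoverable process was caught up.
    Resumed,
    /// The recovery deadline passed or a permanent error occurred.
    DeadlineOrPermanentError,
}

pub fn step(state: ConnState, event: Event) -> ConnState {
    match (state, event) {
        // Only a disconnect from the current generation starts recovery.
        (ConnState::Connected { generation }, Event::TransportClosed { generation: closed })
            if closed == generation =>
        {
            ConnState::Recovering { generation }
        }
        // Retryable errors keep the client in recovery.
        (ConnState::Recovering { generation }, Event::RetryableError) => {
            ConnState::Recovering { generation }
        }
        // A successful resume installs a new connection generation.
        (ConnState::Recovering { generation }, Event::Resumed) => {
            ConnState::Connected { generation: generation + 1 }
        }
        (ConnState::Recovering { .. }, Event::DeadlineOrPermanentError) => ConnState::Failed,
        // Late events from old generations, and anything after `Failed`, are ignored.
        (other, _) => other,
    }
}
```

Keeping the generation in the state is what lets a late `TransportClosed` from an older connection fall through to the final arm without disturbing the active connection.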
## Recovering process events
Output, exit, and close events share one sequence. During normal
operation, the client buffers early events until every lower sequence
has been published.
After reconnection, the client reads each process starting after its
last published sequence:
1. Retained output chunks are inserted by sequence number.
2. Exit and close state are reconstructed in their sequence positions.
3. Events already received as live notifications are ignored as
duplicates.
4. Newly contiguous events are published in order.
5. If the server no longer retains enough output to fill a sequence gap,
only that process is terminated and failed. The recovered connection
remains usable for other processes.
The server reports its full next event sequence for unbounded reads,
including exit and close events. Closed processes remain readable for
the same 30-second window used to retain detached sessions.
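The ordering and deduplication rules above can be sketched with an ordered map keyed by sequence number. This is an illustration under assumptions: `EventReorderBuffer`, `ProcessEvent`, and `SequenceGap` are hypothetical names, and gap detection is modeled as comparing the client's next expected sequence with the next sequence reported by the server after a catch-up read.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProcessEvent {
    Output(Vec<u8>),
    Exit(i32),
    Close,
}

/// The server no longer retains the event at sequence `missing`.
#[derive(Debug, PartialEq, Eq)]
pub struct SequenceGap {
    pub missing: u64,
}

pub struct EventReorderBuffer {
    /// Lowest sequence number that has not been published yet.
    next_seq: u64,
    /// Events that arrived early, waiting for lower sequences.
    pending: BTreeMap<u64, ProcessEvent>,
}

impl EventReorderBuffer {
    pub fn new(first_seq: u64) -> Self {
        Self { next_seq: first_seq, pending: BTreeMap::new() }
    }

    /// Accept an event from a live notification or a catch-up read and
    /// return the events that became publishable, in sequence order.
    pub fn accept(&mut self, seq: u64, event: ProcessEvent) -> Vec<ProcessEvent> {
        if seq < self.next_seq {
            return Vec::new(); // already published: duplicate
        }
        // Keep the first copy if the same sequence arrives twice.
        self.pending.entry(seq).or_insert(event);
        let mut ready = Vec::new();
        while let Some(event) = self.pending.remove(&self.next_seq) {
            ready.push(event);
            self.next_seq += 1;
        }
        ready
    }

    /// After a catch-up read, report a gap if the server has moved past
    /// a sequence that the client still could not fill.
    pub fn check_caught_up(&self, server_next_seq: u64) -> Result<(), SequenceGap> {
        if self.next_seq < server_next_seq {
            Err(SequenceGap { missing: self.next_seq })
        } else {
            Ok(())
        }
    }
}
```

In this model a `SequenceGap` would fail only the affected process, matching step 5 above; the buffer of every other process is independent.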
## Other details
- Detached server sessions are retained for 30 seconds, leaving margin
around the client's 25-second recovery deadline.
- Session attach and detach update the active notification sender under
the same attachment lock, so an old connection cannot clear a newly
attached sender.
- A dedicated error code distinguishes the temporary "session is still
attached" race from permanent initialization errors.
- Process starts are identity-checked on both client and server. Cleanup
from an older start cannot remove a newer process that reused the same
ID.
- Mutating requests that were already in flight when the transport
closed are not replayed, because the client cannot know whether the
server applied them. Requests started after recovery is known wait for
the replacement connection.
- This assumes the client and server versions stay in sync: both sides
run either the pre-PR or the post-PR build.
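As an illustration of the identity check on process starts, the sketch below tags each start with a token so that cleanup for an older start cannot remove a newer process that reused the same ID. `ProcessTable` and the token scheme are assumptions made for this example; the actual identity mechanism is not described here.

```rust
use std::collections::HashMap;

/// Maps a process ID to the token of the start that currently owns it.
/// Hypothetical structure for illustration only.
pub struct ProcessTable {
    next_token: u64,
    entries: HashMap<String, u64>,
}

impl ProcessTable {
    pub fn new() -> Self {
        Self { next_token: 0, entries: HashMap::new() }
    }

    /// Register a start and return the token that identifies it.
    pub fn start(&mut self, process_id: &str) -> u64 {
        let token = self.next_token;
        self.next_token += 1;
        self.entries.insert(process_id.to_string(), token);
        token
    }

    /// Remove the entry only if it still belongs to the given start.
    /// Returns whether anything was removed.
    pub fn cleanup(&mut self, process_id: &str, token: u64) -> bool {
        if self.entries.get(process_id) == Some(&token) {
            self.entries.remove(process_id);
            true
        } else {
            false
        }
    }

    pub fn contains(&self, process_id: &str) -> bool {
        self.entries.contains_key(process_id)
    }
}
```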
## User impact
Long-running commands and stdio MCP servers can survive a temporary
exec-server WebSocket interruption without changing process IDs or
losing output produced during the outage.
## Why
`codex exec-server` should keep the existing public `ws://IP:PORT` URL
shape while serving that websocket connection through an HTTP upgrade
path internally. That keeps the client-facing configuration simple and
allows the listener to work through intermediate HTTP-aware
infrastructure.
## What changed
- keep the emitted and configured exec-server URL as `ws://IP:PORT`
- serve that websocket endpoint through Axum HTTP upgrade handling on
`/`
- expose `GET /readyz` from the same listener for readiness checks
- route upgraded Axum websocket streams through the shared JSON-RPC
connection machinery
- initialize the rustls crypto provider before websocket client
connections
- preserve inbound binary websocket JSON-RPC parsing for compatibility
with the prior transport behavior
## Verification
- `cargo test -p codex-exec-server --test health --test process --test
websocket --test initialize --test exec_process`
## Summary
- preserve a small fs-helper runtime env allowlist (`PATH`, temp vars)
instead of launching the sandboxed helper with an empty env
- add unit coverage for the allowlist and transformed sandbox request
env
- add a Linux smoke test that starts the test exec-server with a fake
`bwrap` on `PATH`, runs a sandboxed fs write through the remote fs
helper path, and asserts that bwrap path was exercised
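A minimal sketch of the allowlist idea, assuming hypothetical names: `HELPER_ENV_ALLOWLIST` and `helper_env` are illustrative, and the specific temp variable names (`TMPDIR`, `TMP`, `TEMP`) are a guess at what "temp vars" covers rather than the list used by the change.

```rust
use std::collections::HashMap;

/// Variables the sandboxed helper is allowed to inherit.
/// `PATH` comes from the summary above; the temp variable names are assumptions.
const HELPER_ENV_ALLOWLIST: &[&str] = &["PATH", "TMPDIR", "TMP", "TEMP"];

/// Keep only allowlisted variables from the parent environment.
pub fn helper_env<I>(parent_env: I) -> HashMap<String, String>
where
    I: IntoIterator<Item = (String, String)>,
{
    parent_env
        .into_iter()
        .filter(|(key, _)| HELPER_ENV_ALLOWLIST.contains(&key.as_str()))
        .collect()
}
```

With the standard library, the result could be applied as `Command::new(helper).env_clear().envs(helper_env(std::env::vars()))`, so the helper starts from an empty environment plus the allowlisted entries. Preserving `PATH` is what lets the helper locate `bwrap`, which the smoke test exercises.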
## Validation
- `cd /tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
&& bazel test --bes_backend= --bes_results_url=
//codex-rs/exec-server:exec-server-file_system-test
--test_filter=sandboxed_file_system_helper_finds_bwrap_on_preserved_path`
- `cd /tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
&& bazel test --bes_backend= --bes_results_url=
//codex-rs/exec-server:exec-server-unit-tests
--test_filter="helper_env|sandbox_exec_request_carries_helper_env"`
- earlier on this branch before the smoke-test harness adjustment: `cd
/tmp/codex-worktrees/fs-helper-env-defaults/codex-rs && export
PATH=$HOME/code/openai/project/dotslash-gen/bin:$HOME/.local/bin:$PATH
&& bazel test --bes_backend= --bes_results_url=
//codex-rs/exec-server:all`
Co-authored-by: Codex <noreply@openai.com>
## Summary
- add an exec-server package-local test helper binary that can run
exec-server and fs-helper flows
- route exec-server filesystem tests through that helper instead of
cross-crate codex helper binaries
- stop relying on Bazel-only extra binary wiring for these tests
## Testing
- not run (per repo guidance for codex changes)
---------
Co-authored-by: Codex <noreply@openai.com>
Problem: After #17294 switched exec-server tests to launch the top-level
`codex exec-server` command, parallel remote exec-process cases can
flake while waiting for the child server's listen URL or transport
shutdown.
Solution: Serialize remote exec-server-backed process tests and harden
the harness so spawned servers are killed on drop and shutdown waits for
the child process to exit.
This introduces session-scoped ownership for exec-server so ws
disconnects no longer immediately kill running remote exec processes,
and it prepares the protocol for reconnect-based resume.
- add session_id / resume_session_id to the exec-server initialize
handshake
- move process ownership under a shared session registry
- detach sessions on websocket disconnect and expire them after a TTL
instead of killing processes immediately (we will resume based on this)
- allow a new connection to resume an existing session and take over
notifications/ownership
- use UUIDs as session IDs so they are not predictable, since there is
no auth for now
- make detached-session expiry authoritative at resume time so teardown
wins at the TTL boundary
- reject long-poll process/read calls that get resumed out from under an
older attachment
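The detach, expire, and resume rules in this list can be sketched as a small registry. This is a simplified model under assumptions: `SessionRegistry`, `ResumeError`, and the connection IDs are hypothetical, session IDs are plain strings instead of UUIDs, time is passed in explicitly so the TTL boundary is easy to see, and resuming a session that is still attached is modeled as an error.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum ResumeError {
    UnknownSession,
    StillAttached,
    Expired,
}

enum Attachment {
    Attached { connection_id: u64 },
    Detached { since: Instant },
}

pub struct SessionRegistry {
    ttl: Duration,
    sessions: HashMap<String, Attachment>,
}

impl SessionRegistry {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, sessions: HashMap::new() }
    }

    pub fn create(&mut self, session_id: String, connection_id: u64) {
        self.sessions.insert(session_id, Attachment::Attached { connection_id });
    }

    /// Detach on disconnect. Only the connection that currently owns the
    /// session may detach it; stale disconnects are ignored.
    pub fn detach(&mut self, session_id: &str, connection_id: u64, now: Instant) {
        if let Some(attachment) = self.sessions.get_mut(session_id) {
            if let Attachment::Attached { connection_id: owner } = *attachment {
                if owner == connection_id {
                    *attachment = Attachment::Detached { since: now };
                }
            }
        }
    }

    /// Resume from a new connection. Expiry is checked here, so teardown
    /// wins when the resume arrives exactly at the TTL boundary.
    pub fn resume(&mut self, session_id: &str, connection_id: u64, now: Instant) -> Result<(), ResumeError> {
        let since = match self.sessions.get(session_id) {
            None => return Err(ResumeError::UnknownSession),
            Some(Attachment::Attached { .. }) => return Err(ResumeError::StillAttached),
            Some(Attachment::Detached { since }) => *since,
        };
        if now.duration_since(since) >= self.ttl {
            self.sessions.remove(session_id);
            return Err(ResumeError::Expired);
        }
        self.sessions.insert(session_id.to_string(), Attachment::Attached { connection_id });
        Ok(())
    }
}
```

A real registry would also own the processes and run a background expiry task; the point here is only that the expiry decision is made authoritatively at resume time.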
---------
Co-authored-by: Codex <noreply@openai.com>
For each feature we have:
1. Trait exposed on environment
2. **Local Implementation** of the trait
3. Remote implementation that uses the client to proxy via network
4. Handler implementation that handles RPC requests and calls into
**Local Implementation**
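The four pieces above fit together roughly as follows. This is a toy sketch, not the crate's API: the `FileSystem` trait, the `fs/read` method name, the string-based request shape, and the in-process loopback used in place of a network client are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// 1. Trait exposed on the environment (hypothetical feature: reading a file).
pub trait FileSystem {
    fn read(&self, path: &str) -> Result<String, String>;
}

/// 2. Local implementation; an in-memory map stands in for the real file system.
pub struct LocalFileSystem {
    pub files: HashMap<String, String>,
}

impl FileSystem for LocalFileSystem {
    fn read(&self, path: &str) -> Result<String, String> {
        self.files.get(path).cloned().ok_or_else(|| format!("not found: {path}"))
    }
}

/// Minimal stand-in for the RPC client used by remote implementations.
pub trait RpcClient {
    fn call(&self, method: &str, param: &str) -> Result<String, String>;
}

/// 3. Remote implementation: proxies each trait call through the client.
pub struct RemoteFileSystem<C: RpcClient> {
    pub client: C,
}

impl<C: RpcClient> FileSystem for RemoteFileSystem<C> {
    fn read(&self, path: &str) -> Result<String, String> {
        self.client.call("fs/read", path)
    }
}

/// 4. Handler: decodes an RPC request and calls into the local implementation.
pub struct Handler<L: FileSystem> {
    pub local: L,
}

impl<L: FileSystem> Handler<L> {
    pub fn handle(&self, method: &str, param: &str) -> Result<String, String> {
        match method {
            "fs/read" => self.local.read(param),
            other => Err(format!("unknown method: {other}")),
        }
    }
}

/// In-process loopback so the sketch runs without a network transport.
impl<L: FileSystem> RpcClient for Handler<L> {
    fn call(&self, method: &str, param: &str) -> Result<String, String> {
        self.handle(method, param)
    }
}
```

Because callers depend only on the trait, the same code path works whether the environment hands back the local implementation or the remote proxy.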
Summary
- delete the deprecated stdio transport plumbing from the exec server
stack
- add a basic `exec_server()` harness plus test utilities to start a
server, send requests, and await events
- refresh exec-server dependencies, configs, and documentation to
reflect the new flow
Testing
- Not run (not requested)
---------
Co-authored-by: starr-openai <starr@openai.com>
Co-authored-by: Codex <noreply@openai.com>