Files
codex/codex-rs/exec-server
T
jif cf17e1bc20 Resume exec-server sessions after disconnect (#28512)
Supersedes #28288 (closed).

## Why

A short WebSocket interruption currently ends every client-side process
handle, even though exec-server keeps the server session and its
processes alive for a short time.

This is especially visible for executor-backed stdio MCP servers: a
temporary connection loss becomes a permanent `Transport closed` error.
The server already has the information needed to resume the session, but
the client opens a fresh session instead of using it.

This change reconnects below the process and MCP layers. Existing
process handles stay valid, missed output is recovered, and the same
server-side processes continue running.

## State machine

One logical `ExecServerClient` stays alive while its underlying RPC
connection changes generations.

```text
                         transport closes
       +------------------------------------------------+
       |                                                v
+-------------+                                  +-------------+
|  Connected  |                                  | Recovering  |
+-------------+                                  +-------------+
       ^                                                |
       | session resumed, processes caught up           | retryable error
       +------------------------------------------------+ loops until deadline
                                                        |
                                                        | deadline or permanent error
                                                        v
                                                  +-------------+
                                                  |   Failed    |
                                                  +-------------+
```

### `Connected`

- New RPC calls use the current connection.
- Process notifications are published in sequence order.
- A disconnect only starts recovery if it came from the current
connection generation. Late events from older generations cannot replace
the active connection.

### `Recovering`

- New calls wait instead of choosing a half-connected RPC client.
- Existing process handles, wake subscriptions, and event subscriptions
stay open.
- Streaming HTTP response bodies fail immediately because their byte
streams cannot be resumed safely.
- Recovery first waits for process starts that were already in flight. A
start whose result became ambiguous is cleaned up after reconnection
instead of being silently adopted.
- The client reconnects with the learned `session_id`. The server may
briefly report that the old connection is still attached, so that error
is retried until the detach finishes.
- The notification consumer starts before the resume handshake
completes. This prevents a busy process from filling the notification
queue and blocking the initialize response.
- Before installing the new connection, the client catches up every
recoverable process with `process/read`.

### `Failed`

- Recovery stops after 25 seconds or after a permanent error.
- Waiting calls are released with one stable disconnect error.
- Existing process sessions receive a terminal failure instead of
waiting forever.

## Recovering process events

Output, exit, and close events share one sequence. During normal
operation, the client buffers early events until every lower sequence
has been published.

After reconnection, the client reads each process starting after its
last published sequence:

1. Retained output chunks are inserted by sequence number.
2. Exit and close state are reconstructed in their sequence positions.
3. Events already received as live notifications are ignored as
duplicates.
4. Newly contiguous events are published in order.
5. If the server no longer retains enough output to fill a sequence gap,
only that process is terminated and failed. The recovered connection
remains usable for other processes.

The server reports its full next event sequence for unbounded reads,
including exit and close events. Closed processes remain readable for
the same 30-second window used to retain detached sessions.

## Other details

- Detached server sessions are retained for 30 seconds, leaving margin
around the client's 25-second recovery deadline.
- Session attach and detach update the active notification sender under
the same attachment lock, so an old connection cannot clear a newly
attached sender.
- A dedicated error code distinguishes the temporary "session is still
attached" race from permanent initialization errors.
- Process starts are identity-checked on both client and server. Cleanup
from an older start cannot remove a newer process that reused the same
ID.
- Mutating requests that were already in flight when the transport
closed are not replayed, because the client cannot know whether the
server applied them. Requests started after recovery is known wait for
the replacement connection.
- We assume the server/client version stays in sync (on the before/after
this PR)

## User impact

Long-running commands and stdio MCP servers can survive a temporary
exec-server WebSocket interruption without changing process IDs or
losing output produced during the outage.
cf17e1bc20 ยท 2026-06-17 10:20:39 +02:00
History
..

codex-exec-server

codex-exec-server is the library backing codex exec-server, a small JSON-RPC server for spawning and controlling subprocesses through codex-utils-pty.

It provides:

  • a CLI entrypoint: codex exec-server
  • a Rust client: ExecServerClient
  • a small protocol module with shared request/response types

This crate owns the transport, protocol, and filesystem/process handlers. The top-level codex binary owns hidden helper dispatch for sandboxed filesystem operations and codex-linux-sandbox.

Transport

The server speaks the shared codex-app-server-protocol message envelope on the wire.

The CLI entrypoint supports:

  • ws://IP:PORT (default)
  • --remote URL --environment-id ID [--name NAME]

Remote mode registers the local exec-server with the environment registry, then reconnects to the service-provided rendezvous websocket as the environment. Remote communication uses the Noise relay contract; the registry and harness must support it. It uses the standard Codex ChatGPT sign-in state; run codex login first when remote registration needs authentication. Containerized callers that receive an Agent Identity JWT in CODEX_ACCESS_TOKEN can opt into that auth path with --use-agent-identity-auth; Codex then registers an Agent task and sends the derived AgentAssertion headers on the registry request.

Alternatively, API users can instead use CODEX_API_KEY; Codex sends it as a bearer token on the registration request. For example:

CODEX_API_KEY="$OPENAI_API_KEY" \
codex exec-server \
  --remote ... \
  --environment-id "$ENVIRONMENT_ID"

Wire framing:

  • local websocket: one JSON-RPC message per websocket frame
  • Noise remote websocket: binary protobuf relay frames carrying encrypted payloads

Remote Relay Message Format

In remote mode, the harness and environment communicate through rendezvous using codex.exec_server.relay.v1.RelayMessageFrame; the checked-in schema is in src/proto/codex.exec_server.relay.v1.proto. The relay frame carries stream identity plus endpoint-owned reliability metadata:

version
stream_id
body              // handshake | data | ack_frame | resume | reset | heartbeat
ack               // highest contiguous peer segment seq received
ack_bits          // bitset for peer segment seqs after ack
seq               // data only: segment sequence number
segment_index     // data only: 0-based index within message
segment_count     // data only: number of segments in message
payload           // handshake bytes or encrypted data record
next_seq          // resume only: next sender seq
reason            // reset only: reset reason

stream_id identifies one virtual harness/environment JSON-RPC session on the environment websocket. The harness generates a UUIDv4 stream_id; the environment demuxes frames by stream_id and runs an independent ConnectionProcessor per stream.

Use segment-level sequence numbers for reliability:

seq = 0, 1, 2, 3, ...

Use contiguous segment sequence ranges to identify and stitch a segmented application message:

message_start_seq = seq - segment_index
segment_index = 0
segment_count = 1

message_start_seq is derived by the receiver, not sent on the wire. For unsplit messages, message_start_seq == seq, segment_index == 0, and segment_count == 1.

Use cumulative ack plus fixed-size ack_bits instead of variable ack ranges:

ack = highest contiguous received segment seq
bit i in ack_bits acknowledges seq = ack + 1 + i

Send ack and ack_bits redundantly on every outbound frame. Acks are not themselves acked. Acks, retries, duplicate suppression, segmentation, and reassembly are endpoint responsibilities; rendezvous only routes relay frames by stream_id.

Lifecycle

Each connection follows this sequence:

  1. Send initialize.
  2. Wait for the initialize response.
  3. Send initialized.
  4. Call process or filesystem RPCs.

If the server receives any notification other than initialized, it replies with an error using request id -1.

If the websocket connection closes, the server terminates any remaining managed processes for that client connection.

API

initialize

Initial handshake request.

Request params:

{
  "clientName": "my-client"
}

Response:

{}

initialized

Handshake acknowledgement notification sent by the client after a successful initialize response.

Params are currently ignored. Sending any other notification method is treated as an invalid request.

process/start

Starts a new managed process.

Request params:

{
  "processId": "proc-1",
  "argv": ["bash", "-lc", "printf 'hello\\n'"],
  "cwd": "file:///absolute/working/directory",
  "env": {
    "PATH": "/usr/bin:/bin"
  },
  "tty": true,
  "pipeStdin": false,
  "arg0": null
}

Field definitions:

  • processId: caller-chosen stable id for this process within the connection.
  • argv: command vector. It must be non-empty.
  • cwd: file: URI for the child process working directory.
  • env: environment variables passed to the child process.
  • tty: when true, spawn a PTY-backed interactive process.
  • pipeStdin: when true, keep non-PTY stdin writable via process/write.
  • arg0: optional argv0 override forwarded to codex-utils-pty.

Response:

{
  "processId": "proc-1"
}

Behavior notes:

  • Reusing an existing processId is rejected.
  • PTY-backed processes accept later writes through process/write.
  • Non-PTY processes reject writes unless pipeStdin is true.
  • Output is streamed asynchronously via process/output.
  • Exit is reported asynchronously via process/exited.

process/read

Reads buffered output and terminal state for a managed process.

Request params:

{
  "processId": "proc-1",
  "afterSeq": null,
  "maxBytes": 65536,
  "waitMs": 1000
}

Field definitions:

  • processId: managed process id returned by process/start.
  • afterSeq: optional sequence number cursor; when present, only newer chunks are returned.
  • maxBytes: optional response byte budget.
  • waitMs: optional long-poll timeout in milliseconds.

Response:

{
  "chunks": [],
  "nextSeq": 1,
  "exited": false,
  "exitCode": null,
  "closed": false,
  "failure": null
}

process/write

Writes raw bytes to a running process stdin.

Request params:

{
  "processId": "proc-1",
  "chunk": "aGVsbG8K"
}

chunk is base64-encoded raw bytes. In the example above it is hello\n.

Response:

{
  "status": "accepted"
}

Behavior notes:

  • Writes to an unknown processId are rejected.
  • Writes to a non-PTY process are rejected unless it started with pipeStdin.

process/terminate

Terminates a running managed process.

Request params:

{
  "processId": "proc-1"
}

Response:

{
  "running": true
}

If the process is already unknown or already removed, the server responds with:

{
  "running": false
}

Notifications

process/output

Streaming output chunk from a running process.

Params:

{
  "processId": "proc-1",
  "seq": 1,
  "stream": "stdout",
  "chunk": "aGVsbG8K"
}

Fields:

  • processId: process identifier
  • seq: per-process output sequence number
  • stream: "stdout", "stderr", or "pty"
  • chunk: base64-encoded output bytes

process/exited

Final process exit notification.

Params:

{
  "processId": "proc-1",
  "seq": 2,
  "exitCode": 0
}

process/closed

Notification emitted after process output is closed and the process handle is removed.

Params:

{
  "processId": "proc-1"
}

Filesystem RPCs

Filesystem methods use canonical file: URIs and return JSON-RPC errors for invalid or unavailable paths. For compatibility, requests also accept native absolute path strings and normalize them to file: URIs:

  • fs/readFile
  • fs/open, fs/readBlock, and fs/close (internal transport for ExecutorFileSystem::read_file_stream)
  • fs/writeFile
  • fs/createDirectory
  • fs/getMetadata
  • fs/canonicalize
  • fs/readDirectory
  • fs/remove
  • fs/copy

Each filesystem request accepts an optional sandbox object. When sandbox contains a ReadOnly or WorkspaceWrite policy, the operation runs in a hidden helper process launched from the top-level codex executable and prepared through the shared sandbox transform path. Helper requests and responses are passed over stdin/stdout.

Errors

The server returns JSON-RPC errors with these codes:

  • -32600: invalid request
  • -32602: invalid params
  • -32603: internal error

Typical error cases:

  • unknown method
  • malformed params
  • empty argv
  • duplicate processId
  • writes to unknown processes
  • writes to non-PTY processes
  • sandbox-denied filesystem operations

Rust surface

The crate exports:

  • ExecServerClient
  • ExecServerError
  • ExecServerClientConnectOptions
  • RemoteExecServerConnectArgs
  • protocol request/response structs for process and filesystem RPCs
  • DEFAULT_LISTEN_URL and ExecServerListenUrlParseError
  • ExecServerRuntimePaths
  • run_main() for embedding the websocket server
  • RemoteEnvironmentConfig and run_remote_environment() for embedding remote registration mode

Callers must pass ExecServerRuntimePaths to run_main(). The top-level codex exec-server command builds these paths from the codex arg0 dispatch state. RemoteEnvironmentConfig::new(...) also takes the auth provider that remote registration should use; the CLI builds that provider from Codex auth state before starting remote mode.

Example session

Initialize:

{"id":1,"method":"initialize","params":{"clientName":"example-client"}}
{"id":1,"result":{}}
{"method":"initialized","params":{}}

Start a process:

{"id":2,"method":"process/start","params":{"processId":"proc-1","argv":["bash","-lc","printf 'ready\\n'; while IFS= read -r line; do printf 'echo:%s\\n' \"$line\"; done"],"cwd":"file:///tmp","env":{"PATH":"/usr/bin:/bin"},"tty":true,"pipeStdin":false,"arg0":null}}
{"id":2,"result":{"processId":"proc-1"}}
{"method":"process/output","params":{"processId":"proc-1","seq":1,"stream":"stdout","chunk":"cmVhZHkK"}}

Write to the process:

{"id":3,"method":"process/write","params":{"processId":"proc-1","chunk":"aGVsbG8K"}}
{"id":3,"result":{"status":"accepted"}}
{"method":"process/output","params":{"processId":"proc-1","seq":2,"stream":"stdout","chunk":"ZWNobzpoZWxsbwo="}}

Terminate it:

{"id":4,"method":"process/terminate","params":{"processId":"proc-1"}}
{"id":4,"result":{"running":true}}
{"method":"process/exited","params":{"processId":"proc-1","seq":3,"exitCode":0}}
{"method":"process/closed","params":{"processId":"proc-1"}}