mirror of
https://github.com/pchuan98/codex.git
synced 2026-07-01 00:31:56 +08:00
cf17e1bc20
Supersedes #28288 (closed).

## Why

A short WebSocket interruption currently ends every client-side process handle, even though exec-server keeps the server session and its processes alive for a short time. This is especially visible for executor-backed stdio MCP servers: a temporary connection loss becomes a permanent `Transport closed` error.

The server already has the information needed to resume the session, but the client opens a fresh session instead of using it.

This change reconnects below the process and MCP layers. Existing process handles stay valid, missed output is recovered, and the same server-side processes continue running.

## State machine

One logical `ExecServerClient` stays alive while its underlying RPC connection changes generations.

```text
              transport closes
       +------------------------------------------------+
       |                                                v
+-------------+                                  +-------------+
|  Connected  |                                  | Recovering  |
+-------------+                                  +-------------+
       ^                                                |
       | session resumed, processes caught up           | retryable error
       +------------------------------------------------+ loops until deadline
                                                        |
                                                        | deadline or permanent error
                                                        v
                                                 +-------------+
                                                 |   Failed    |
                                                 +-------------+
```

### `Connected`

- New RPC calls use the current connection.
- Process notifications are published in sequence order.
- A disconnect starts recovery only if it came from the current connection generation. Late events from older generations cannot replace the active connection.

### `Recovering`

- New calls wait instead of choosing a half-connected RPC client.
- Existing process handles, wake subscriptions, and event subscriptions stay open.
- Streaming HTTP response bodies fail immediately because their byte streams cannot be resumed safely.
- Recovery first waits for process starts that were already in flight. A start whose result became ambiguous is cleaned up after reconnection instead of being silently adopted.
- The client reconnects with the learned `session_id`. The server may briefly report that the old connection is still attached, so that error is retried until the detach finishes.
- The notification consumer starts before the resume handshake completes. This prevents a busy process from filling the notification queue and blocking the initialize response.
- Before installing the new connection, the client catches up every recoverable process with `process/read`.

### `Failed`

- Recovery stops after 25 seconds or after a permanent error.
- Waiting calls are released with one stable disconnect error.
- Existing process sessions receive a terminal failure instead of waiting forever.

## Recovering process events

Output, exit, and close events share one sequence. During normal operation, the client buffers early events until every lower sequence has been published.

After reconnection, the client reads each process starting after its last published sequence:

1. Retained output chunks are inserted by sequence number.
2. Exit and close state are reconstructed in their sequence positions.
3. Events already received as live notifications are ignored as duplicates.
4. Newly contiguous events are published in order.
5. If the server no longer retains enough output to fill a sequence gap, only that process is terminated and failed. The recovered connection remains usable for other processes.

The server reports its full next event sequence for unbounded reads, including exit and close events. Closed processes remain readable for the same 30-second window used to retain detached sessions.

## Other details

- Detached server sessions are retained for 30 seconds, leaving margin around the client's 25-second recovery deadline.
- Session attach and detach update the active notification sender under the same attachment lock, so an old connection cannot clear a newly attached sender.
- A dedicated error code distinguishes the temporary "session is still attached" race from permanent initialization errors.
- Process starts are identity-checked on both client and server. Cleanup from an older start cannot remove a newer process that reused the same ID.
- Mutating requests that were already in flight when the transport closed are not replayed, because the client cannot know whether the server applied them. Requests issued after the client has entered recovery wait for the replacement connection.
- We assume the client and server versions stay in sync: both sides are either before or after this PR.

## User impact

Long-running commands and stdio MCP servers can survive a temporary exec-server WebSocket interruption without changing process IDs or losing output produced during the outage.
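
## Illustrative sketch: sequence-ordered replay

The event recovery steps described above (insert by sequence, drop duplicates, publish newly contiguous events, detect an unfillable gap) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SequenceBuffer`, `ProcessEvent`, `has_unfilled_gap`, and `replay_example` are hypothetical names, and the gap check assumes the server's reported next sequence is compared against the client's next unpublished sequence once the catch-up read has been applied.

```rust
use std::collections::BTreeMap;

/// Hypothetical event kinds sharing one sequence (not the real exec-server types).
#[derive(Debug, Clone, PartialEq)]
enum ProcessEvent {
    Output(Vec<u8>),
    Exit(i32),
    Close,
}

/// Buffers early events and releases them only in contiguous sequence order.
struct SequenceBuffer {
    /// Lowest sequence number that has not been published yet.
    next_seq: u64,
    /// Events that arrived ahead of `next_seq`, keyed by sequence number.
    pending: BTreeMap<u64, ProcessEvent>,
}

impl SequenceBuffer {
    fn new(first_seq: u64) -> Self {
        Self { next_seq: first_seq, pending: BTreeMap::new() }
    }

    /// Inserts one event and returns every event that became publishable.
    /// Events that were already published or already buffered are ignored.
    fn insert(&mut self, seq: u64, event: ProcessEvent) -> Vec<(u64, ProcessEvent)> {
        if seq < self.next_seq || self.pending.contains_key(&seq) {
            return Vec::new(); // duplicate of a live notification or earlier read
        }
        self.pending.insert(seq, event);

        let mut ready = Vec::new();
        while let Some(event) = self.pending.remove(&self.next_seq) {
            ready.push((self.next_seq, event));
            self.next_seq += 1;
        }
        ready
    }

    /// Assumed gap check: true if events below the server's next sequence
    /// are still missing after the catch-up read has been applied.
    fn has_unfilled_gap(&self, server_next_seq: u64) -> bool {
        self.next_seq < server_next_seq
    }
}

/// Replays a live notification that arrived early, followed by a catch-up read.
fn replay_example() -> (Vec<u64>, bool) {
    let mut buffer = SequenceBuffer::new(1);
    let mut published = Vec::new();

    // A live notification for sequence 3 arrives before 1 and 2 are known.
    published.extend(buffer.insert(3, ProcessEvent::Output(b"c".to_vec())));

    // The catch-up read returns retained events, including the duplicate 3.
    let catch_up = vec![
        (1, ProcessEvent::Output(b"a".to_vec())),
        (2, ProcessEvent::Output(b"b".to_vec())),
        (3, ProcessEvent::Output(b"c".to_vec())),
        (4, ProcessEvent::Exit(0)),
        (5, ProcessEvent::Close),
    ];
    for (seq, event) in catch_up {
        published.extend(buffer.insert(seq, event));
    }

    let sequences = published.iter().map(|(seq, _)| *seq).collect();
    (sequences, buffer.has_unfilled_gap(6))
}
```

In this sketch a process whose buffer still reports a gap after catch-up would be the one that is terminated and failed, while buffers for other processes are unaffected. A `BTreeMap` is used only because it keeps pending events keyed and ordered by sequence number; any keyed container would do.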
cf17e1bc20 · 2026-06-17 10:20:39 +02:00