mirror of
https://github.com/pchuan98/codex.git
synced 2026-07-01 00:31:56 +08:00
ccf1a18518
## Background Codex can use **Auto Review** for permission requests. Instead of asking the user immediately, Codex starts a separate locked-down reviewer session called **Guardian**, which returns a structured `allow` or `deny` assessment. The Guardian reviewer is itself a Codex session, so its model request can fail for transient infrastructure reasons such as model overload, HTTP connection failure, or response-stream disconnect. Today, any such failure immediately ends the Auto Review attempt and blocks the action. This PR adds bounded retries for failures that the existing protocol explicitly identifies as transient. Linear context: [CA-539](https://linear.app/openai/issue/CA-539/retry-auto-review-infrastructure-failures-and-fall-back-to-manual) ## What changes A Guardian review can now make at most **three total attempts**: 1. Run the review normally. 2. Retry after a jittered delay of roughly 180–220 ms if the first attempt fails with an eligible error. 3. Retry after a jittered delay of roughly 360–440 ms if the second attempt also fails with an eligible error. All attempts share the original review deadline. Jitter spreads retries from concurrent clients to reduce synchronized load during broader outages. The retries do not reset the user's maximum wait time, and the backoff waits terminate early if the review is cancelled or the deadline expires. Before retrying, the existing Guardian session lifecycle decides whether the session remains usable. Healthy trunks are reused, broken trunks are removed by the existing cleanup path, and ephemeral sessions continue to clean themselves up. The review still emits one logical lifecycle to clients. Recoverable intermediate failures do not produce warnings or terminal events. ## Retry policy ### Retried up to twice - model/server overload - HTTP connection failure - response-stream connection failure - response-stream disconnect - internal server error - a final reviewer message that cannot be parsed as the required Guardian assessment ### Not retried - bad or invalid requests - authentication failures - usage limits - cyber-policy failures - errors without a structured category - a request that already exhausted the lower-level Responses retry budget - a completed Guardian turn with no assessment payload - prompt-construction failures - Guardian review timeout - cancellation or abort - a valid `deny` assessment The session-error classification uses `ErrorEvent.codex_error_info`; it does not inspect error-message strings. ## Implementation notes - `wait_for_guardian_review` preserves the complete `ErrorEvent`, including structured `codex_error_info`. - Guardian session failures preserve the original message and optional structured `CodexErrorInfo`. - The retry policy classifies the explicitly transient `CodexErrorInfo` variants; unknown, absent, and deterministic categories are not retried. - The Guardian session manager receives the caller's deadline rather than creating a new timeout per attempt. - Analytics record the final `attempt_count`. - Retry orchestration does not add a separate session-cleanup protocol; it relies on the existing trunk and ephemeral lifecycle decisions. ## Automated testing Focused Guardian coverage verifies: - every supported transient `CodexErrorInfo` is classified as retryable, while absent and non-transient categories are not; - structured transient session failure -> retry -> approval with the healthy trunk reused; - two invalid Guardian responses -> third attempt -> approval, with exactly three requests; - three invalid responses -> existing fail-closed result, with exactly three requests and one terminal lifecycle; - valid denial, missing payload, invalid request, timeout, cancellation, and prompt/session construction failures are not retried; - retry eligibility ends after the third attempt; - retry delays use the shared exponential backoff helper and remain within the expected jitter bounds; - cancellation and deadline expiry interrupt the backoff wait; - healthy trunks are reused across retryable failures; - broken event streams remove the trunk through the existing lifecycle cleanup; - an ephemeral retry does not disturb a concurrent trunk review. Validation performed: - `just test -p codex-core guardian_review_ guardian_ephemeral_retry_preserves_parallel_trunk_and_fork_history run_review_removes_trunk_when_event_stream_is_broken` — **42 passed**; - `just test -p codex-analytics` — **71 passed**; - scoped Clippy fixes for `codex-core` and `codex-analytics` passed. A prior full `codex-core` run had unrelated environment-sensitive failures outside Guardian coverage. ## Manual QA The focused integration tests use the local mock Responses server to inspect exact request counts and emitted lifecycle events. They confirm that retries are internal, a successful later attempt supplies the final decision, non-retryable failures issue only one request, and exhausted retries emit only one terminal result.
ccf1a18518
·
2026-06-10 11:46:57 -07:00
History