Files
codex/codex-rs/analytics
T
kbazzi ccf1a18518 [codex] Retry transient Guardian review failures (#27062)
## Background

Codex can use **Auto Review** for permission requests. Instead of asking
the user immediately, Codex starts a separate locked-down reviewer
session called **Guardian**, which returns a structured `allow` or
`deny` assessment.

The Guardian reviewer is itself a Codex session, so its model request
can fail for transient infrastructure reasons such as model overload,
HTTP connection failure, or response-stream disconnect. Today, any such
failure immediately ends the Auto Review attempt and blocks the action.

This PR adds bounded retries for failures that the existing protocol
explicitly identifies as transient.

Linear context:
[CA-539](https://linear.app/openai/issue/CA-539/retry-auto-review-infrastructure-failures-and-fall-back-to-manual)

## What changes

A Guardian review can now make at most **three total attempts**:

1. Run the review normally.
2. Retry after a jittered delay of roughly 180–220 ms if the first
attempt fails with an eligible error.
3. Retry after a jittered delay of roughly 360–440 ms if the second
attempt also fails with an eligible error.

All attempts share the original review deadline. Jitter spreads retries
from concurrent clients to reduce synchronized load during broader
outages. The retries do not reset the user's maximum wait time, and the
backoff waits terminate early if the review is cancelled or the deadline
expires.

Before retrying, the existing Guardian session lifecycle decides whether
the session remains usable. Healthy trunks are reused, broken trunks are
removed by the existing cleanup path, and ephemeral sessions continue to
clean themselves up.

The review still emits one logical lifecycle to clients. Recoverable
intermediate failures do not produce warnings or terminal events.

## Retry policy

### Retried up to twice

- model/server overload
- HTTP connection failure
- response-stream connection failure
- response-stream disconnect
- internal server error
- a final reviewer message that cannot be parsed as the required
Guardian assessment

### Not retried

- bad or invalid requests
- authentication failures
- usage limits
- cyber-policy failures
- errors without a structured category
- a request that already exhausted the lower-level Responses retry
budget
- a completed Guardian turn with no assessment payload
- prompt-construction failures
- Guardian review timeout
- cancellation or abort
- a valid `deny` assessment

The session-error classification uses `ErrorEvent.codex_error_info`; it
does not inspect error-message strings.

## Implementation notes

- `wait_for_guardian_review` preserves the complete `ErrorEvent`,
including structured `codex_error_info`.
- Guardian session failures preserve the original message and optional
structured `CodexErrorInfo`.
- The retry policy classifies the explicitly transient `CodexErrorInfo`
variants; unknown, absent, and deterministic categories are not retried.
- The Guardian session manager receives the caller's deadline rather
than creating a new timeout per attempt.
- Analytics record the final `attempt_count`.
- Retry orchestration does not add a separate session-cleanup protocol;
it relies on the existing trunk and ephemeral lifecycle decisions.

## Automated testing

Focused Guardian coverage verifies:

- every supported transient `CodexErrorInfo` is classified as retryable,
while absent and non-transient categories are not;
- structured transient session failure -> retry -> approval with the
healthy trunk reused;
- two invalid Guardian responses -> third attempt -> approval, with
exactly three requests;
- three invalid responses -> existing fail-closed result, with exactly
three requests and one terminal lifecycle;
- valid denial, missing payload, invalid request, timeout, cancellation,
and prompt/session construction failures are not retried;
- retry eligibility ends after the third attempt;
- retry delays use the shared exponential backoff helper and remain
within the expected jitter bounds;
- cancellation and deadline expiry interrupt the backoff wait;
- healthy trunks are reused across retryable failures;
- broken event streams remove the trunk through the existing lifecycle
cleanup;
- an ephemeral retry does not disturb a concurrent trunk review.

Validation performed:

- `just test -p codex-core guardian_review_
guardian_ephemeral_retry_preserves_parallel_trunk_and_fork_history
run_review_removes_trunk_when_event_stream_is_broken` — **42 passed**;
- `just test -p codex-analytics` — **71 passed**;
- scoped Clippy fixes for `codex-core` and `codex-analytics` passed.

A prior full `codex-core` run had unrelated environment-sensitive
failures outside Guardian coverage.

## Manual QA

The focused integration tests use the local mock Responses server to
inspect exact request counts and emitted lifecycle events. They confirm
that retries are internal, a successful later attempt supplies the final
decision, non-retryable failures issue only one request, and exhausted
retries emit only one terminal result.
ccf1a18518 · 2026-06-10 11:46:57 -07:00
History
..