127 Commits

  • [codex] Propagate traces through exec-server HTTP (#30117)
    Fixes distributed trace continuity across exec-server JSON-RPC HTTP
    egress by adding an executor client span and injecting its W3C context
    through a reusable `codex-otel` helper.
    
    This preserves the caller trace across core/tool → executor →
    provider/MCP instead of dropping parentage at raw reqwest.
    
    Note that this doesn't include the websocket path, which is needed to
    really get the full story but at least we cover the basic http path with
    this change.
  • [codex] Record exec-server lifecycle metrics (#27467)
    ## Summary
    
    - Record bounded connection, request, and process lifecycle metrics.
    - Report active gauges from callbacks on every collection, including
    delta exports.
    - Serialize active-count updates so concurrent starts and finishes
    cannot publish stale values.
    - Serialize process exit, explicit termination, and shutdown through the
    process registry so exactly one completion result wins.
    - Keep the implementation small with single-owner RAII guards and one
    real OTLP/HTTP integration test using the existing `wiremock`
    dependency.
    
    ## Root cause
    
    Process exit and session shutdown previously used cloned completion
    state. That avoided duplicate emission, but it duplicated lifecycle
    ownership and made the ordering harder to reason about. The process
    registry mutex already defines the lifecycle ordering, so the final
    implementation stores the metric guard and termination flag directly on
    the process entry. Whichever path claims the entry first owns the
    completion result.
    
    Production metric export uses delta temporality. Event-only synchronous
    gauge recordings disappear after the next collection when no count
    changes, so active counts now use observable callbacks that report
    current state on every collection.
    
    The cleanup also removes the constant `result="accepted"` connection
    tag, redundant route and response assertions, a custom HTTP collector,
    and fallback initialization machinery that did not add behavior.
    
    ## Stack
    
    Review and land this stack in order:
    
    1. #27466 — trace exec-server JSON-RPC requests
    2. #27467 — record bounded connection, request, and process lifecycle
    metrics **(this PR)**
    3. #27470 — observe remote registration and Noise rendezvous lifecycle
    
    ## Validation
    
    - `just test -p codex-exec-server --lib` (158 passed)
    - `just test -p codex-cli --test exec_server` (3 passed)
    - `just test -p codex-otel
    observable_gauge_is_collected_on_every_delta_snapshot` (1 passed)
    - `CARGO_BUILD_JOBS=1 just fix -p codex-otel -p codex-exec-server`
    - `just fmt`
    - `git diff --check`
  • feat: use run agent task auth for inference (#19051)
    ## Stack
    
    This is PR 3 of the simplified HAI single-run-task stack:
    
    - [#19047](https://github.com/openai/codex/pull/19047) Agent Identity
    assertion and task-registration primitives, including the shared
    run-task helper used by existing Agent Identity JWT auth.
    - [#19049](https://github.com/openai/codex/pull/19049)
    Disabled-by-default ChatGPT auth opt-in that provisions/reuses persisted
    Agent Identity runtime auth and its single run task.
    - [#19051](https://github.com/openai/codex/pull/19051) Run-scoped
    provider auth that uses one backend-owned task id for first-party
    inference and compaction requests.
    
    [#19054](https://github.com/openai/codex/pull/19054) collapsed out of
    the active stack because the simplified design no longer needs a
    separate background/control-plane task helper.
    
    ## Summary
    
    This PR moves Agent Identity usage into provider auth resolution. That
    keeps `AgentAssertion` auth tied to first-party OpenAI provider requests
    instead of applying a late session-wide override that could affect
    local, custom, Bedrock, API-key, or external-bearer providers.
    
    What changed:
    
    - adds a small `ProviderAuthScope` struct carrying the run auth policy
    and session source needed by provider-scoped auth resolution
    - lets `Session` opt the existing `ModelClient` into `ChatGptAuth`
    policy when `use_agent_identity` is enabled, without adding a second
    model-client constructor
    - resolves Agent Identity only for first-party OpenAI provider auth
    paths
    - uses the persisted run task id from the `AgentIdentityAuth` record to
    build `AgentAssertion` auth for Responses requests
    - routes shared request setup through scoped provider auth so unary
    compact requests use the same run-task assertion path as inference turns
    - keeps local/custom/Bedrock/env-key/external-bearer provider auth
    unchanged
    - lets missing run-task state surface through the existing model-request
    error path instead of silently falling back to bearer auth
    
    This PR intentionally does not create thread-scoped, target-scoped, or
    background-scoped task identities. The run task is the only task Codex
    registers in this POC shape.
    
    ## Testing
    
    - `just test -p codex-model-provider`
    - `just test -p codex-core client::tests::provider_auth_scope_uses`
    - `just test -p codex-core remote_compact_uses_agent_identity_assertion`
  • auth: move domain mode below app wire types (#29721)
    ## Why
    
    Authentication mode is a domain concept used by login, model selection,
    telemetry, and transports. Keeping the canonical type in app-server
    protocol forces those lower-level crates to depend on an unrelated wire
    API.
    
    ## What changed
    
    - Added canonical `codex_protocol::auth::AuthMode` domain values.
    - Kept the app-server wire DTO unchanged and added an explicit app-side
    conversion.
    - Removed production app-server-protocol dependencies from login,
    model-provider-info, models-manager, and otel call paths.
    
    ## Stack
    
    This is PR 2 of 6, stacked on [PR
    #29714](https://github.com/openai/codex/pull/29714). Review only the
    delta from `codex/split-json-rpc-protocols`. Next: [PR
    #29722](https://github.com/openai/codex/pull/29722).
    
    ## Validation
    
    - Auth and login coverage passed in the focused protocol/domain test
    run.
    - App-server account and auth conversion coverage passed.
  • [codex] Use input items for Responses Lite tools (#27946)
    When using Responses Lite, we should all use `additional_tools` and a
    developer item instead of the top level tools array & instructions
    field. This keeps things 1-to-1.
    
    Forced namespacing for _all_ tools will land in a following PR after
    some coordination & fixes in Responses API (around collisions & return
    items).
    
    The goal is to eventually expand the scope of this to _all_ requests
    from codex, but that will require larger coordination across providers &
    slower rollout.
  • [codex] Expose service tier and reasoning effort in OTEL (#29155)
    ## Summary
    
    NVIDIA asked to measure Fast mode usage and reasoning effort from Codex
    CLI OTEL logs. Add the finalized `service_tier` and
    `model_reasoning_effort` to the existing `codex.sse_event`
    `response.completed` record.
    
    This intentionally reuses the existing completion event and leaves
    transport APIs and shared telemetry plumbing unchanged.
    
    ## Testing
    
    - `cargo build -p codex-cli --bin codex`
    - `just test -p codex-core responses_api_emits_api_request_event`
    - End-to-end with the built CLI and a local OTLP/HTTP collector:
    - Fast/high emitted `service_tier=priority` and
    `model_reasoning_effort=high` with token usage.
    - Standard/low omitted `service_tier` and emitted
    `model_reasoning_effort=low` with token usage.
  • Stop logging every Responses WebSocket event (#29432)
    ## Why
    
    Every successful Responses WebSocket event currently produces three
    local log records: the full payload at TRACE, an OpenTelemetry log
    event, and an OpenTelemetry trace event.
    
    On busy threads these records fill the 1,000-row log partition in
    seconds and cause continuous SQLite insert-and-prune churn.
    
    Related to
    https://openai.slack.com/archives/C095U48JNL9/p1782128972644209
    
    ## What changed
    
    - Stop logging each successful Responses WebSocket payload at TRACE.
    - Stop emitting `codex.websocket_event` as OpenTelemetry log and trace
    events.
    - Keep WebSocket event counters, duration metrics, response timing
    metrics, parsing, and error handling.
  • Propagate safety buffering events to app-server clients (#29371)
    Responses API safety buffering metadata currently stops at the transport
    boundary, so app-server clients cannot render the in-progress safety
    review state.
    
    This change:
    - decodes and deduplicates `safety_buffering` metadata from Responses
    API SSE and WebSocket events without suppressing the original response
    event
    - emits a typed core event containing the requested model plus backend
    use cases and reasons
    - forwards that event as `turn/safetyBuffering/updated` through
    app-server v2 and updates generated protocol schemas
    - keeps the side-channel event out of persisted rollouts and turn timing
    
    This supports the Codex Apps buffering UX and depends on the Responses
    API backend work in https://github.com/openai/openai/pull/1044569 and
    https://github.com/openai/openai/pull/1044571.
    
    Validation:
    - focused `codex-core` safety-buffering integration test passes
    - `cargo check -p codex-core -p codex-app-server -p
    codex-app-server-protocol`
    - `just fix -p codex-api -p codex-protocol -p codex-core -p
    codex-app-server-protocol -p codex-app-server -p codex-rollout -p
    codex-rollout-trace -p codex-otel`
    - `just fmt`
    - broad package test run: 4,430/4,492 passed; 62 unrelated
    local-environment/concurrency failures involved unavailable test
    binaries, MCP subprocess setup, and app-server timeouts
  • [codex] Use expect in integration tests (#28441)
    The workspace denies `clippy::expect_used` in production. Although
    `clippy.toml` allows `expect` in tests, Bazel Clippy compiles
    integration-test helper code in a way that does not receive that
    exemption, which encouraged verbose `unwrap_or_else(... panic!(...))`
    and equivalent `match`/`let else` forms.
    
    This allows `clippy::expect_used` once at each integration-test crate
    root (including aggregated suites and test-support libraries), then
    replaces manual panic-based Result and Option unwraps with
    `expect`/`expect_err`. Standalone `tests/*.rs` files remain their own
    crate roots. Intentional assertion and unexpected-variant panics remain
    unchanged, and the production `expect_used = "deny"` lint remains in
    place.
    
    The cleanup is mechanical and net-negative in line count.
  • [codex] Add second-based OTEL duration histograms (#27058)
    ## Why
    
    Exec-server request and connection latencies need fractional-second
    histograms. The existing duration API records integer milliseconds and
    uses millisecond-scale buckets.
    
    ## What changed
    
    - Adds a described duration API that records `Duration` values as
    fractional seconds.
    - Uses second-scale explicit histogram boundaries.
    - Caches duration histograms by name, unit, and description, matching
    the existing instrument caching model.
    - Covers exact boundaries, representative bucket placement, fractional
    sums, and exported metadata.
    
    This PR only adds the duration primitive. It does not add exec-server
    adoption.
    
    ## Stack
    
    1. #26091: counter descriptions
    2. #27057: gauge instruments
    3. **#27058: second-based duration histograms**
    4. #25019: initialize exec-server OpenTelemetry at startup
    
    Related independent coverage: #27059 tests OTLP HTTP log and trace event
    export.
    
    ## Validation
    
    - `just test -p codex-otel`
  • feat(core): add metadata field to ResponseItem (#28355)
    ## Description
    
    This PR adds an optional `metadata` field to `ResponseItem` for
    Responses API calls. Only mechanical plumbing, no actual values
    populated and sent yet. Turns out just adding a new field to
    `ResponseItem` has quite a large blast radius already.
    
    This change is backwards compatible because `metadata` is optional and
    omitted when absent, so existing response items and rollout history
    without it still deserialize and requests that do not set it keep the
    same wire shape. For provider compatibility, we strip out `metadata`
    before non-OpenAI Responses requests so Azure and AWS Bedrock never see
    this field.
    
    My followup PR here will actually make use of it to start storing and
    passing along `turn_id`: https://github.com/openai/codex/pull/28360
    
    ## What changed
    
    - Added `ResponseItemMetadata` with optional `turn_id`, plus optional
    `metadata` on Responses API item variants and inter-agent communication.
    - Preserved item metadata through response-item rewrites such as
    truncation, missing tool-output synthesis, compaction history
    rebuilding, visible-history conversion, rollout/resume, and generated
    app-server schemas/types.
    - Strip item metadata from non-OpenAI Responses requests while
    preserving it for OpenAI-shaped requests.
    - Updated the mechanical fixture/test construction churn required by the
    new optional field.
  • [codex] Cover OTLP HTTP log and trace event export (#27059)
    ## Why
    
    The generic OTLP HTTP paths for log events and trace events need
    end-to-end coverage before exec-server relies on them.
    
    ## What changed
    
    - Adds loopback coverage for exporting `codex_otel.log_only` events to
    `/v1/logs`.
    - Verifies `codex_otel.trace_safe` events are present in the exported
    trace payload.
    
    This is a test-only PR. It does not change OTEL runtime behavior or
    metric APIs.
    
    ## Related work
    
    - #26091: counter descriptions
    - #27057: gauge instruments
    - #27058: second-based duration histograms
    
    This PR is independent and can land directly on `main`.
    
    ## Validation
    
    - `just test -p codex-otel`
    - `just fix -p codex-otel`
    - `just fmt`
  • feat: add Bedrock API key as a managed auth mode (#27443)
    ## Why
    
    Codex needs to manage Amazon Bedrock API key credentials through the
    existing auth lifecycle instead of introducing a separate auth manager
    or provider-specific credential file. Treating Bedrock API key login as
    a primary auth mode gives it the same persistence, keyring, reload, and
    logout behavior as the existing OpenAI API key and ChatGPT modes.
    
    The credential is valid only for the `amazon-bedrock` model provider.
    OpenAI-compatible providers must reject this auth mode rather than
    treating the Bedrock key as an OpenAI bearer token.
    
    ## What changed
    
    - Added `bedrockApiKey` as an app-server `AuthMode` and
    `CodexAuth::BedrockApiKey` as a primary `AuthManager` mode.
    - Added `BedrockApiKeyAuth`, containing the API key and AWS region, to
    the existing `AuthDotJson` payload stored in `$CODEX_HOME/auth.json` or
    the configured keyring backend.
    - Added `login_with_bedrock_api_key(...)`, parallel to
    `login_with_api_key(...)`, which replaces the current stored login with
    Bedrock credentials.
    - Reused generic auth reload and logout behavior instead of adding a
    Bedrock-specific auth manager or logout path.
    - Updated login restrictions, status reporting, diagnostics, telemetry
    classification, generated app-server schemas, and auth fixtures for the
    new mode.
    - Added explicit errors when Bedrock API key auth is selected with an
    OpenAI-compatible model provider.
    
    This PR establishes managed storage and auth-mode behavior. Routing the
    managed key and region into Amazon Bedrock requests will be in follow-up
    PRs.
  • [codex] Add reusable OTEL gauge instruments (#27057)
    ## Why
    
    Exec-server observability needs current-value measurements in addition
    to counters. The reusable OTEL client should expose that primitive
    without coupling it to exec-server runtime behavior.
    
    ## What changed
    
    - Adds integer gauge instruments, with optional descriptions.
    - Caches gauges by name and description so instrument metadata remains
    part of the declaration identity.
    - Covers gauge values, descriptions, merged attributes, and OTLP HTTP
    export.
    
    This PR only adds the gauge primitive. It does not add second-based
    duration histograms or exec-server adoption.
    
    ## Stack
    
    1. #26091: counter descriptions
    2. **#27057: gauge instruments**
    3. #27058: second-based duration histograms
    
    Related independent coverage: #27059 tests OTLP HTTP log and trace event
    export.
    
    ## Validation
    
    - `just test -p codex-otel`
    - `just fix -p codex-otel`
    - `just fmt`
  • [codex] Add OTEL counter descriptions (#26091)
    ## Why
    
    Metric descriptions should be declared with reusable OTEL instruments
    instead of being coupled to individual consumers. Counter descriptions
    are the smallest API primitive needed by the exec-server observability
    work.
    
    ## What changed
    
    - Adds `counter_with_description` while preserving the existing counter
    API.
    - Caches counters by name and description so instrument metadata remains
    part of the declaration identity.
    - Covers the exported description together with the existing value and
    attribute contract.
    
    This PR only adds counter descriptions. It does not add gauges,
    second-based durations, or exec-server adoption.
    
    ## Stack
    
    1. **#26091: counter descriptions**
    2. #27057: gauge instruments
    3. #27058: second-based duration histograms
    
    Related independent coverage: #27059 tests OTLP HTTP log and trace event
    export.
    
    The `codex-exec-server` bounded service tag now stays with the
    exec-server adoption change instead of this reusable infrastructure
    stack.
    
    ## Validation
    
    - `just test -p codex-otel`
    - `just fix -p codex-otel`
    - `just fmt`
  • [codex-rs] support v2 personal access tokens (#25731)
    ## Summary
    
    - add v2 personal access token support for `codex login
    --with-access-token` and `CODEX_ACCESS_TOKEN`
    - classify opaque `at-` tokens separately from legacy Agent Identity
    JWTs
    - hydrate required ChatGPT account metadata through AuthAPI
    `/v1/user-auth-credential/whoami`
    - use PATs directly as bearer tokens while preserving existing ChatGPT
    account surfaces
    - expose PAT-backed auth as the explicit `personalAccessToken`
    app-server auth mode
    
    ## Implementation
    
    PAT auth is intentionally small and stateless. Loading a PAT performs
    one AuthAPI metadata request, stores the hydrated metadata in the
    in-memory auth object, and redacts the secret from debug output. Legacy
    Agent Identity JWT handling remains unchanged. The shared access-token
    classifier lives in a private neutral module because it dispatches
    between both credential types.
    
    PAT hydration fails closed when AuthAPI omits any required metadata,
    including email. Hydrated metadata is intentionally not persisted:
    startup performs a live `whoami` preflight so revoked tokens or changed
    account metadata are not accepted from a stale cache.
    
    ## Workspace restriction scope
    
    This change intentionally does **not** apply
    `forced_chatgpt_workspace_id` to PAT authentication. The setting is a
    client-side config guardrail, not an authorization boundary, and PAT
    does not currently require workspace-ID parity. The PAT login and
    `CODEX_ACCESS_TOKEN` paths therefore validate through AuthAPI without
    threading workspace-restriction state through access-token loading.
    Existing workspace checks for non-PAT auth remain on their established
    paths.
    
    ## App-server compatibility
    
    The public app-server `AuthMode` is shared across v1 and v2, and
    PAT-backed auth reports `personalAccessToken` through both APIs.
    Following human review, this intentionally removes the temporary v1
    compatibility mapping that reported PATs as `chatgpt`; the deprecated v1
    API is kept in parity with v2 rather than maintaining a separate closed
    enum. Clients with exhaustive auth-mode handling in either API version
    must add the new case and should generally treat it as ChatGPT-backed
    unless they need PAT-specific behavior.
    
    The v1 auth-status response still omits the raw PAT when `includeToken`
    is requested because that response cannot carry the account metadata
    needed to reuse the credential safely. Persisted PAT auth also omits the
    new enum value so older Codex builds can deserialize `auth.json` and
    infer PAT auth from the credential field after a rollback.
    
    ## Validation
    
    Latest review-fix validation:
    
    - `CARGO_INCREMENTAL=0 just test -p codex-login` (126 passed)
    - `CARGO_INCREMENTAL=0 just test -p codex-cli` (263 passed)
    - `CARGO_INCREMENTAL=0 just test -p codex-cli
    stored_auth_validation_handles_personal_access_token`
    - `CARGO_INCREMENTAL=0 just test -p codex-app-server-protocol` (226
    passed)
    - `CARGO_INCREMENTAL=0 just test -p codex-models-manager
    refresh_available_models_uses_remote_only_catalog_for_chatgpt_auth`
    - `CARGO_INCREMENTAL=0 just test -p codex-tui
    existing_non_oauth_chatgpt_login_counts_as_signed_in`
    - `CARGO_INCREMENTAL=0 just fix -p codex-login -p
    codex-app-server-protocol -p codex-models-manager -p codex-tui -p
    codex-cli`
    - `just fmt`
    - `git diff --check`
    
    The broader `codex-tui` suite previously compiled and ran 2,834 tests.
    Three unrelated environment-sensitive guardian/IDE-socket tests failed
    after retries; the PAT-relevant TUI coverage passed.
  • [codex] Forward turn moderation metadata through app-server (#25710)
    ## Why
    First-party backends can supply turn-scoped moderation metadata that
    app-server clients need for client-side presentation. Exposing this as
    an experimental typed notification lets opted-in clients consume it
    without interpreting raw Responses API events.
    
    ## What changed
    - forward `response.metadata.openai_chatgpt_moderation_metadata` from
    Responses API SSE and WebSocket streams as turn-scoped moderation
    metadata
    - emit the experimental app-server v2 `turn/moderationMetadata`
    notification with `{ threadId, turnId, metadata }`
    - add app-server integration coverage for the typed moderation metadata
    notification
    
    ## Testing
    - `just test -p codex-core
    build_ws_client_metadata_includes_window_lineage_and_turn_metadata`
    - `just test -p codex-core` (fails locally: 46 failures and 1 timeout,
    primarily missing `test_stdio_server` and shell snapshot timeouts)
    - `just test -p codex-app-server-protocol`
    - `just test -p codex-app-server
    turn_moderation_metadata_emits_typed_notification_v2`
    - `just test -p codex-app-server` (fails locally: 792 passed, 10 failed,
    and 5 timed out; failures are in existing environment-sensitive tests,
    primarily because nested macOS `sandbox-exec` is not permitted)
    - `just write-app-server-schema --experimental --schema-root
    /tmp/codex-app-server-schema-experimental`
  • Encrypt multi-agent v2 message payloads (#26210)
    ## Why
    
    Multi-agent v2 currently routes agent instructions through normal tool
    arguments and inter-agent context. That means the parent model can emit
    plaintext task text, Codex can persist it in history/rollouts, and the
    recipient can receive it as ordinary assistant-message JSON.
    
    This changes the v2 path so agent instructions stay encrypted between
    model calls: Responses encrypts the `message` argument returned by the
    model, Codex forwards only that ciphertext, and Responses decrypts it
    internally for the recipient model.
    
    ## What changed
    
    - Mark the v2 `message` parameter as encrypted for `spawn_agent`,
    `send_message`, and `followup_task`.
    - Treat multi-agent v2 tool `message` values as ciphertext
    unconditionally.
    - Store v2 inter-agent task text in
    `InterAgentCommunication.encrypted_content` with empty plaintext
    `content`.
    - Convert encrypted inter-agent communications into the Responses
    `agent_message` input item before sending the child request.
    - Preserve `agent_message` items across history, rollout, compaction,
    telemetry, and app-server schema paths.
    - Leave multi-agent v1 unchanged.
    
    ## Message shape
    
    The model still calls the v2 tools with a `message` argument, but that
    value is now ciphertext:
    
    ```json
    {
      "name": "spawn_agent",
      "arguments": {
        "task_name": "worker",
        "message": "<ciphertext>"
      }
    }
    ```
    
    Codex stores the task as encrypted inter-agent communication:
    
    ```json
    {
      "author": "/root",
      "recipient": "/root/worker",
      "content": "",
      "encrypted_content": "<ciphertext>",
      "trigger_turn": true
    }
    ```
    
    When Codex builds the recipient request, it forwards the ciphertext
    using the new Responses input item:
    
    ```json
    {
      "type": "agent_message",
      "author": "/root",
      "recipient": "/root/worker",
      "content": [
        {
          "type": "encrypted_content",
          "encrypted_content": "<ciphertext>"
        }
      ]
    }
    ```
    
    Responses decrypts that item internally for the recipient model.
    
    ## Context impact
    
    - Parent context no longer carries plaintext v2 agent task instructions
    from these tool arguments.
    - Codex rollout/history stores ciphertext for v2 agent instructions.
    - Recipient requests receive an `agent_message` item instead of
    assistant commentary JSON for encrypted task delivery.
    - Plaintext completion/status notifications are still plaintext because
    they are Codex-generated status messages, not encrypted model tool
    arguments.
    
    ## Validation
    
    - `just test -p codex-tools`
    - `just test -p codex-protocol`
    - `just test -p codex-rollout`
    - `just test -p codex-rollout-trace`
    - `just test -p codex-otel`
    - `just write-app-server-schema`
  • [codex] Emit sandbox outcome telemetry event (#25955)
    ## Summary
    
    Adds a dedicated `codex.sandbox_outcome` telemetry event so we can query
    sandbox edge outcomes without threading sandbox metadata through
    tool-result output types.
    
    This is meant to make sandbox failures and approved escalation retries
    visible in OTEL while keeping the existing `codex.tool_result` event
    shape focused on tool completion data.
    
    ## What changed
    
    - Adds `SessionTelemetry::sandbox_outcome(...)`, which emits
    `codex.sandbox_outcome` as both a log and trace event.
    - Records the tool name, call id, sandbox outcome, initial attempt
    duration, and escalated attempt duration when a retry runs.
    - Emits `denied` when the sandbox blocks execution and no retry is run.
    - Emits `timed_out` and `signal` when those sandbox errors surface from
    tool execution.
    - Emits `escalated` when the initial sandboxed attempt fails and the
    approved unsandboxed retry succeeds.
    - Adds OTEL coverage for the new event payload, including timing fields.
    
    ## Validation
    
    - `RUST_MIN_STACK=8388608 just test -p codex-core
    sandbox_outcome_event_records_outcome
    handle_sandbox_error_user_approves_retry_records_tool_decision`
    - `just test -p codex-otel
    otel_export_routing_policy_routes_tool_result_log_and_trace_events
    runtime_metrics_summary_collects_tool_api_and_streaming_metrics`
    - `just fix -p codex-core`
    - `just fix -p codex-otel`
  • [codex] Support model-defined reasoning efforts (#26444)
    ## Summary
    - accept non-empty model-defined reasoning effort values while
    preserving built-in effort behavior
    - propagate the non-Copy effort type through core, app-server, TUI,
    telemetry, and persistence call sites
    - preserve string wire encoding and expose an open-string schema for
    clients
    - update model selection and shortcut behavior for model-advertised
    effort values
    
    ## Root cause
    `ReasoningEffort` gained a string-backed custom variant, so it could no
    longer implement `Copy` or rely on derived closed-enum serialization.
    Existing consumers still moved effort values from shared references and
    assumed a fixed built-in value set.
    
    ## Validation
    - `just fmt`
    - Local tests and compilation were not run per request; relying on CI.
  • Add Guardian review metrics (#24897)
    ## Why
    
    Guardian reviews already emit analytics events, but we do not expose
    aggregate OpenTelemetry metrics for review volume, latency, token usage,
    or terminal outcomes. That makes it harder to monitor Guardian behavior
    during rollouts and to compare review outcomes by source, action type,
    session kind, model, and failure mode.
    
    ## What Changed
    
    - Added Guardian review metric names for count, total duration, time to
    first token, and token usage in `codex-rs/otel`.
    - Added `core/src/guardian/metrics.rs` to convert
    `GuardianReviewAnalyticsResult` into sanitized metric tags covering
    decision, terminal status, failure reason, approval request source,
    reviewed action, session kind, risk/outcome, model, reasoning effort,
    and context/truncation state.
    - Emitted the new metrics from `track_guardian_review` for each terminal
    Guardian review result.
    
    ## Testing
    
    - Added
    `guardian_review_metrics_record_counts_durations_and_token_usage`, which
    verifies the emitted count, duration, TTFT, token usage histograms, and
    tag set through the in-memory metrics exporter.
  • otel: drop legacy profile usage telemetry (#24061)
    ## Summary
    - drop the dead legacy profile usage metric and active-profile
    conversation-start fields
    - update role comments so they describe provider and service-tier
    preservation without legacy config-profile wording
    - pair the code cleanup with the file-backed profile docs update in
    openai/developers-website#1476
    
    ## Testing
    - `just fmt`
    - `cargo test -p codex-otel`
    - `cargo test -p codex-core` *(fails: existing stack overflow in
    `mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message`)*
    - `cargo test -p codex-core --lib
    mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message`
    *(fails with the same stack overflow)*
  • Split plugin install discovery into list and request tools (#23372)
    ## Summary
    - Add `list_available_plugins_to_install` as the inventory step for
    plugin and connector install suggestions.
    - Slim `request_plugin_install` so it only handles the actual
    elicitation, instead of carrying the full discoverable list in its
    prompt.
    - Emit send-time telemetry when an install elicitation is dispatched,
    including requested tool identity in the event payload.
    - Emit install-result telemetry through `SessionTelemetry`, including
    tool type, user response action, and completion status.
    - Update registration and tests to cover the new two-step flow while
    keeping the existing `tool_suggest` feature gate unchanged.
    
    ## Testing
    - `just fmt`
    - `cargo test -p codex-tools`
    - `cargo test -p codex-core request_plugin_install`
    - `cargo test -p codex-core list_available_plugins_to_install`
    - `cargo test -p codex-core
    install_suggestion_tools_can_be_registered_without_search_tool`
    - `cargo test -p codex-otel
    manager_records_plugin_install_suggestion_metric`
    - `cargo test -p codex-otel
    manager_records_plugin_install_elicitation_sent_metric`
    - `just fix -p codex-core`
    - `just fix -p codex-tools`
    - `just fix -p codex-otel`
    - `cargo check -p codex-core`
  • goal: pause continuation loops on usage limits and blockers (#23094)
    Addresses #22833, #22245, #23067
    
    ## Why
    `/goal` can keep synthesizing turns even when the next turn cannot make
    meaningful progress. Hard usage exhaustion can replay failing turns, and
    repeated permission or external-resource blockers can keep burning
    tokens while waiting for user or system intervention.
    
    ## What changed
    - Add resumable `blocked` and `usageLimited` goal states. As with
    `paused`, goal continuation stops with these states.
    - Move to `usageLimited` after usage-limit failures.
    - Allow the built-in `update_goal` tool to set `blocked` only under
    explicit repeated-impasse guidance. Updated goal continuation prompt to
    specify that agent should use `blocked` only when it has made at least
    three attempts to get past an impasse.
    
    Most of the files touched by this PR are because of the small app server
    protocol update.
    
    ## Validation
    
    I manually reproduced a number of situations where an agent can run into
    a true impasse and verified that it properly enters `blocked` state. I
    then resumed and verified that it once again entered `blocked` state
    several turns later if the impasse still exists.
    
    I also manually reproduced the usage-limit condition by creating a
    simulated responses API endpoint that returns 429 errors with the
    appropriate error message. Verified that the goal runtime properly moves
    the goal into `usageLimited` state and TUI UI updates appropriately.
    Verified that `/goal resume` resumes (and immediately goes back into
    `ussageLImited` state if appropriate).
    
    
    ## Follow-up PRs
    
    Small changes will be needed to the GUI clients to properly handle the
    two new states.
  • chore: goal resumed metrics (#23301)
    Add metrics for goal resume
  • Preserve image detail in app-server inputs (#20693)
    ## Summary
    
    - Add optional image detail to user image inputs across core, app-server
    v2, thread history/event mapping, and the generated app-server
    schemas/types.
    - Preserve requested detail when serializing Responses image inputs:
    omitted detail stays on the existing `high` default, while explicit
    `original` keeps local images on the original-resolution path.
    - Support `high`/`original` consistently for tool image outputs,
    including MCP `codex/imageDetail`, code-mode image helpers, and
    `view_image`.
  • [codex] Use compaction_trigger item for remote compaction v2 (#22809)
    ## Why
    
    Remote compaction v2 was still using `context_compaction` as both the
    request trigger and the compacted output shape. The Responses API now
    has the landed contract for this flow: Codex sends a dedicated `{
    "type": "compaction_trigger" }` input item, and the backend returns the
    standard `compaction` output item with encrypted content.
    
    This aligns the v2 path with that wire contract while preserving the
    existing local compacted-history post-processing behavior.
    
    ## What changed
    
    - Add `ResponseItem::CompactionTrigger` and regenerate the app-server
    protocol schema fixtures.
    - Send `compaction_trigger` from `remote_compaction_v2` instead of a
    payload-less `context_compaction`.
    - Collect exactly one backend `compaction` output item, then reuse the
    existing compacted-history rebuilding path.
    - Treat the trigger item as a transient request marker rather than model
    output or persisted rollout/memory content.
    
    ## Verification
    
    - `cargo test -p codex-protocol compaction_trigger`
    - `cargo test -p codex-core remote_compact_v2`
    - `cargo test -p codex-core compact_remote_v2`
    - `cargo test -p codex-core
    responses_websocket_sends_response_processed_after_remote_compaction_v2`
    - `just write-app-server-schema`
    - `cargo test -p codex-app-server-protocol schema_fixtures`
  • Simplify MCP tool handler plumbing (#21595)
    ## Why
    The MCP tool path had accumulated a few core-owned special cases: a
    dedicated payload variant, resolver plumbing, a legacy `AfterToolUse`
    translation path, and a side channel for parallel-call metadata. That
    made `ToolRegistry` and the spec builder know more about MCP than they
    needed to.
    
    This change moves MCP-specific execution details back onto `ToolInfo`
    and `McpHandler` so `codex-core` can treat MCP calls like normal
    function calls while still preserving MCP-specific dispatch and
    telemetry behavior where it belongs.
    
    ## What changed
    - removed `resolve_mcp_tool_info`, `ToolPayload::Mcp`, `ToolKind`, and
    the remaining registry-side MCP resolver path
    - stored MCP routing metadata directly on `McpHandler` and `ToolInfo`,
    including `supports_parallel_tool_calls`
    - deleted the legacy `AfterToolUse` consumer in `core`, which removes
    the need for handler-specific `after_tool_use_payload` implementations
    - switched tool-result telemetry to handler-provided tags and kept
    MCP-specific dispatch payload construction inside the handler
    - simplified tool spec planning/building by passing `ToolInfo` directly
    and dropping the direct/deferred MCP wrapper structs and the
    parallel-server side table
    
    ## Testing
    - `cargo check -p codex-core -p codex-mcp -p codex-otel`
    - `cargo test -p codex-core
    mcp_parallel_support_uses_exact_payload_server`
    - `cargo test -p codex-core
    direct_mcp_tools_register_namespaced_handlers`
    - `cargo test -p codex-core
    search_tool_description_lists_each_mcp_source_once`
    - `cargo test -p codex-mcp
    list_all_tools_uses_startup_snapshot_while_client_is_pending`
    - `just fix -p codex-core -p codex-mcp -p codex-otel`
  • Add production startup and TTFT telemetry (#22198)
    ## Why
    
    While investigating `codex exec hi` startup latency, the useful
    questions were not "is startup slow?" but "which durable bucket is slow
    in production?"
    
    The path we observed has a few distinct stages:
    
    1. `thread/start` creates the session
    2. startup prewarm builds the turn context, tools, and prompt
    3. startup prewarm warms the websocket
    4. the first real turn resolves the prewarm
    5. the model produces the first token
    
    Before this PR, production telemetry had some of the raw measurements
    already:
    
    - aggregate startup-prewarm duration / age-at-first-turn metrics
    - TTFT as a metric
    - websocket request telemetry
    
    But there was no coherent production event stream for the startup
    breakdown itself, and TTFT was metric-only. That made it hard to answer
    the same latency questions from OpenTelemetry-backed logs without adding
    one-off local instrumentation.
    
    ## What changed
    
    Add durable production telemetry on the existing `SessionTelemetry`
    path:
    
    - new `codex.startup_phase` OTel log/trace events plus
    `codex.startup.phase.duration_ms`
    - new `codex.turn_ttft` OTel log/trace events while preserving the
    existing TTFT metric
    
    The startup phase event is emitted for the coarse buckets we actually
    observed while running `exec hi`:
    
    - `thread_start_create_thread`
    - `startup_prewarm_total`
    - `startup_prewarm_create_turn_context`
    - `startup_prewarm_build_tools`
    - `startup_prewarm_build_prompt`
    - `startup_prewarm_websocket_warmup`
    - `startup_prewarm_resolve`
    
    These phases are intentionally low-cardinality so they remain safe as
    production telemetry tags.
    
    ## Why this shape
    
    This keeps the instrumentation on the same production path as the rest
    of the session telemetry instead of adding a local debug-only trace
    mode. It also avoids changing startup behavior:
    
    - prewarm still runs
    - no control flow changes
    - no extra remote calls
    - no user-visible behavior changes
    
    One boundary is intentional: very early process bootstrap that happens
    before a session exists is not included here, because this PR uses
    session-scoped production telemetry. The expensive buckets we were
    trying to understand after `thread/start` are now covered durably.
    
    ## Verification
    
    - `cargo test -p codex-otel`
    - `cargo test -p codex-core turn_timing`
    - `cargo test -p codex-core
    regular_turn_emits_turn_started_without_waiting_for_startup_prewarm`
    - `cargo test -p codex-core
    interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted`
    - `cargo test -p codex-app-server thread_start`
    - `just fix -p codex-otel -p codex-core -p codex-app-server`
    
    I also ran `cargo test -p codex-core`; it built successfully and then
    hit an existing unrelated stack overflow in
    `tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.
  • Add process-scoped SQLite telemetry (#22154)
    ## Summary
    - add SQLite init, backfill-gate, and fallback telemetry without
    introducing a cross-cutting state-db access wrapper
    - install one process-scoped telemetry sink after OTEL startup and let
    low-level state/rollout paths emit through it directly
    - add process-start metrics for the process owners that initialize
    SQLite
    
    ---------
    
    Co-authored-by: Owen Lin <owen@openai.com>
  • codex-otel: validate provider span attributes consistently (#21749)
    Provider initialization installs process-global OTEL state, so invalid
    trace metadata needs to fail before setup begins.
    
    Use the same span attribute validator as config loading when traces are
    exported so provider startup enforces the config contract without
    duplicating validation logic.
  • codex-otel: add configurable trace metadata (#21556)
    Add Codex config for static trace span attributes and structured W3C
    tracestate field upserts. The config flows through OtelSettings so
    callers can attach trace metadata without touching every span call site.
    
    Apply span attributes with an SDK span processor so every exported
    trace span carries the configured metadata. Model tracestate as nested
    member fields so configured keys can be upserted while unrelated
    propagated state in the same member is preserved.
    
    Validate configured tracestate before installing provider-global state,
    including header-unsafe values the SDK does not reject by itself. This
    keeps Codex from propagating malformed trace context from config.
    
    Update the config schema, public docs, and OTLP loopback coverage for
    config parsing, span export, propagation, and invalid-header rejection.
  • revert legacy notify deprecation (#21152)
    # Why
    
    Revert #20524 for now because the computer use plugin has not migrated
    off legacy `notify` yet. Keeping the deprecation in place today would
    show users a warning before the plugin path is ready to move, so this
    rolls the change back until that migration is complete.
    
    # What
    
    - revert the legacy `notify` deprecation change from #20524
    - restore the prior `notify` behavior and remove the temporary
    deprecation metrics/docs from that change
    
    Once the computer use plugin has migrated, we can land the same
    deprecation again.
  • Add goal lifecycle metrics (#20799)
    ## Why
    
    Adding goal metrics makes it possible to track how often goals are
    created, completed, and stopped by budget limits, plus the final token
    and wall-clock usage for terminal outcomes.
    
    ## What Changed
    
    - Added OpenTelemetry metric constants for goal lifecycle tracking:
    - `codex.goal.created`: increments each time a new persisted goal is
    created or an existing goal is replaced with a new objective.
    - `codex.goal.completed`: increments when a goal transitions to
    `complete`.
    - `codex.goal.budget_limited`: increments when a goal transitions to
    `budget_limited` because its token budget has been reached.
    - `codex.goal.token_count`: records the final persisted token count when
    a goal transitions to `complete` or `budget_limited`.
    - `codex.goal.duration_s`: records the final persisted elapsed
    wall-clock time, in seconds, when a goal transitions to `complete` or
    `budget_limited`.
    - Emitted creation metrics when a goal is created or replaced.
    - Emitted terminal outcome counters and final usage histograms when a
    goal transitions to `complete` or `budget_limited`, avoiding
    double-counting later in-flight accounting for already budget-limited
    goals.
    - Added focused `codex-core` tests for create/complete metrics and
    one-time budget-limit metrics.
  • feat: add remote compaction v2 Responses client path (#20773)
    ## Why
    
    This adds the `remote_compaction_v2` client path so remote compaction
    can run through the normal Responses stream and install a
    `context_compaction` item that trigger a compaction.
    
    The goal is to migrate some of the compaction logic on the client side
    
    We keeps the v2 transport behind a feature flag while letting follow-up
    requests reuse the compacted context instead of falling back to the
    legacy compaction item shape.
    
    ## What changed
    
    - add `ResponseItem::ContextCompaction` and refresh the generated
    app-server / schema / TypeScript fixtures that expose response items on
    the wire
    - add `core/src/compact_remote_v2.rs` to send compaction through the
    standard streamed Responses client, require exactly one
    `context_compaction` output item, and install that item into compacted
    history
    - route manual compact and auto-compaction through the v2 path when
    `remote_compaction_v2` is enabled, while keeping the existing remote
    compaction path as the fallback
    - preserve the new item type across history retention, follow-up request
    construction, telemetry, rollout persistence, and rollout-trace
    normalization
    - add targeted coverage for the feature flag, `context_compaction`
    serialization, rollout-trace normalization, and remote-compaction
    follow-up behavior
    
    ## Verification
    
    - added protocol tests for `context_compaction`
    serialization/deserialization in `protocol/src/models.rs`
    - added rollout-trace coverage for `context_compaction` normalization in
    `rollout-trace/src/reducer/conversation_tests.rs`
    - added remote compaction integration coverage for v2 follow-up reuse
    and mixed compaction output streams in
    `core/tests/suite/compact_remote.rs`
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • deprecate legacy notify (#20524)
    # Why
    
    `notify` is the remaining compatibility surface from the legacy hook
    implementation. The newer lifecycle hook engine now owns the active hook
    system, so we should start steering users away from adding new `notify`
    configs before removing the old path entirely. This also adds a
    lightweight watchpoint for the deprecation so we can see how much legacy
    usage remains before the clean drop.
    
    # What
    
    - emit a startup deprecation notice when a non-empty `notify` command is
    configured
    - emit `codex.notify.configured` when a session starts with legacy
    `notify` configured
    - emit `codex.notify.run` when the legacy notify path fires after a
    completed turn
    - mark `notify` as deprecated in the config schema and repo docs
    - remove the orphaned `codex-rs/hooks/src/user_notification.rs` file
    that is no longer compiled
    - add regression coverage for the new deprecation notice
    
    # Next steps
    
    A follow-up PR can remove the legacy notify path entirely once we are
    ready for the clean drop. Before then, we can watch
    `codex.notify.configured` and `codex.notify.run` to understand the
    deprecation impact and remaining active usage. The cleanup PR should
    then delete the `notify` config field, the `legacy_notify`
    implementation, the old compatibility dispatch types and callsites that
    only exist for the legacy path, and the remaining compatibility
    docs/tests.
    
    # Testing
    
    - `cargo test -p codex-hooks`
    - `cargo test -p codex-config`
    - `cargo test -p codex-core emits_deprecation_notice_for_notify`
  • install WFP filters for Windows sandbox setup (#20101)
    ## Summary
    
    This PR installs a first wave of WFP (Windows Filtering Platform)
    filters that reduce the surface area of network egress vulnerabilities
    for the Windows Sandbox.
    
    - Add persistent Windows Filtering Platform provider, sublayer, and
    filters for the Windows sandbox offline account.
    - Install WFP filters during elevated full setup, log failures
    non-fatally, and emit setup metrics when analytics are enabled.
    - Bump the Windows sandbox setup version so existing users rerun full
    setup and receive the new filters.
    
    ## What WFP is
    Windows Filtering Platform (WFP) is the low-level Windows networking
    policy engine underneath things like Windows Firewall. It lets
    privileged code install persistent filtering rules at specific network
    stack layers, with conditions like "only traffic from this Windows
    account" or "only this remote port," and an action like block.
    
    In this change, we create a Codex-owned persistent WFP provider and
    sublayer, then install block filters scoped to the Windows sandbox's
    offline user account via `ALE_USER_ID`. That means the filters are
    targeted at sandboxed processes running as that account, rather than
    globally affecting the host.
    
    ## Initial filter set
    We are starting with 12 concrete WFP filters across a few high-value
    bypass surfaces. The table below describes the filter families rather
    than one filter per row:
    
    | Area | Concrete filters | Purpose |
    | --- | --- | --- |
    | ICMP | 4 filters: ICMP v4/v6 on `ALE_AUTH_CONNECT` and
    `ALE_RESOURCE_ASSIGNMENT` | Block direct ping-style network reachability
    checks from the offline account. |
    | DNS | 2 filters: remote port `53` on `ALE_AUTH_CONNECT_V4/V6` | Block
    direct DNS queries that bypass our intended proxy/offline path. |
    | DNS-over-TLS | 2 filters: remote port `853` on
    `ALE_AUTH_CONNECT_V4/V6` | Block encrypted DNS attempts that could
    bypass ordinary DNS interception. |
    | SMB / NetBIOS | 4 filters: remote ports `445` and `139` on
    `ALE_AUTH_CONNECT_V4/V6` | Block Windows file-sharing/network share
    traffic from sandboxed processes. |
    
    For IPv4/IPv6 coverage, the port-based filters are installed on both
    `ALE_AUTH_CONNECT_V4` and `ALE_AUTH_CONNECT_V6`. ICMP also gets both
    connect-layer and resource-assignment-layer coverage because ICMP
    traffic is shaped differently from ordinary TCP/UDP port traffic.
    
    ## Validation
    - `cargo fmt -p codex-windows-sandbox` (completed with existing
    stable-rustfmt warnings about `imports_granularity = Item`)
    - `cargo test -p codex-windows-sandbox wfp::tests`
    - `cargo test -p codex-windows-sandbox` (fails in existing legacy
    PowerShell sandbox tests because `Microsoft.PowerShell.Utility` could
    not be loaded; WFP tests passed before that failure)
  • [codex] Add token usage to turn tracing spans (#19432)
    ## Why
    
    Slow Codex turns are easier to debug when token usage is visible in the
    trace itself, without joining against separate analytics. This adds
    token usage to existing turn-handling spans for regular user turns only.
    
    [Example
    turn](https://openai.datadoghq.com/apm/trace/9d353efa2cb5de1f4c5b93dc33c3df04?colorBy=service&graphType=flamegraph&shouldShowLegend=true&sort=time&spanID=3555541504891512675&spanViewType=metadata&traceQuery=)
    <img width="1447" height="967" alt="Screenshot 2026-04-24 at 3 03 07 PM"
    src="https://github.com/user-attachments/assets/ab7bb187-e7fc-41f0-a366-6c44610b2b2c"
    />
    
    ## What Changed
    
    Added response-level token fields on completed handle_responses spans:
    
    gen_ai.usage.input_tokens
    gen_ai.usage.cache_read.input_tokens
    gen_ai.usage.output_tokens
    codex.usage.reasoning_output_tokens
    codex.usage.total_tokens
    Added aggregate token fields on regular turn spans:
    
    codex.turn.token_usage.*
    Added an explicit regular-turn opt-in via
    SessionTask::records_turn_token_usage_on_span() so this is not coupled
    to span-name strings.
    
    ## Testing
    
    - `cargo test -p codex-otel`
    - `cargo test -p codex-core
    turn_and_completed_response_spans_record_token_usage`
    - `just fmt`
    - `just fix -p codex-core`
    - `just fix -p codex-otel`
    - Manual local Electron/app-server smoke test: regular user turn emits
    the new span fields
    
    Known status: `cargo test -p codex-core` was attempted and failed in
    unrelated existing areas: config approvals, request-permissions,
    git-info ordering, and subagent metadata persistence.
  • Remove ghost snapshots (#19481)
    ## Summary
    - Remove `ghost_snapshot` / `GhostCommit` from the Responses API surface
    and generated SDK/schema artifacts.
    - Keep legacy config loading compatible, but make undo a no-op that
    reports the feature is unavailable.
    - Clean up core history, compaction, telemetry, rollout, and tests to
    stop carrying ghost snapshot items.
    
    ## Testing
    - Unit tests passed for `codex-protocol`, `codex-core` targeted undo and
    compaction flows, `codex-rollout`, and `codex-app-server-protocol`.
    - Regenerated config and app-server schemas plus Python SDK artifacts
    and verified they match the checked-in outputs.
  • Add safety check notification and error handling (#19055)
    Adds a new app-server notification that fires when a user account has
    been flagged for potential safety reasons.
  • feat: Fairly trim skill descriptions within context budget (#18925)
    Preserve skill name/path entries whenever possible and trim descriptions
    first, using round-robin character allocation so short descriptions do
    not waste budget.
  • feat: add explicit AgentIdentity auth mode (#18785)
    ## Summary
    
    This PR adds `CodexAuth::AgentIdentity` as an explicit auth mode.
    
    An AgentIdentity auth record is a standalone `auth.json` mode. When
    `AuthManager::auth().await` loads that mode, it registers one
    process-scoped task and stores it in runtime-only state on the auth
    value. Header creation stays synchronous after that because the task is
    initialized before callers receive the auth object.
    
    This PR also removes the old feature flag path. AgentIdentity is
    selected by explicit auth mode, not by a hidden flag or lazy mutation of
    ChatGPT auth records.
    
    Reference old stack: https://github.com/openai/codex/pull/17387/changes
    
    ## Design Decisions
    
    - AgentIdentity is a real auth enum variant because it can be the only
    credential in `auth.json`.
    - The process task is ephemeral runtime state. It is not serialized and
    is not stored in rollout/session data.
    - Account/user metadata needed by existing Codex backend checks lives on
    the AgentIdentity record for now.
    - `is_chatgpt_auth()` remains token-specific.
    - `uses_codex_backend()` is the broader predicate for ChatGPT-token auth
    and AgentIdentity auth.
    
    ## Stack
    
    1. https://github.com/openai/codex/pull/18757: full revert
    2. https://github.com/openai/codex/pull/18871: isolated Agent Identity
    crate
    3. This PR: explicit AgentIdentity auth mode and startup task allocation
    4. https://github.com/openai/codex/pull/18811: migrate Codex backend
    auth callsites through AuthProvider
    5. https://github.com/openai/codex/pull/18904: accept AgentIdentity JWTs
    and load `CODEX_AGENT_IDENTITY`
    
    ## Testing
    
    Tests: targeted Rust checks, cargo-shear, Bazel lock check, and CI.
  • feat: add metric to track the number of turns with memory usage (#18662)
    Add a metric `codex.turn.memory` to know if a turn used memories or not.
    This is not part of the other turn metrics as a label to limit
    cardinality
  • feat: Budget skill metadata and surface trimming as a warning (#18298)
    Cap the model-visible skills section to a small share of the context
    window, with a fallback character budget, and keep only as many implicit
    skills as fit within that budget.
    
    Emit a non-fatal warning when enabled skills are omitted, and add a new
    app-server warning notification
    
    Record thread-start skill metrics for total enabled skills, kept skills,
    and whether truncation happened
    
    ---------
    
    Co-authored-by: Matthew Zeng <mzeng@openai.com>
    Co-authored-by: Codex <noreply@openai.com>
  • Stream apply_patch changes (#17862)
    Adds new events for streaming apply_patch changes from responses api.
    This is to enable clients to show progress during file writes.
    
    Caveat: This does not work with apply_patch in function call mode, since
    that required adding streaming json parsing.
  • Add OTEL metrics for hook runs (#18026)
    # Why
    We already emit analytics for completed hook runs, but we don't have
    matching OTEL metrics to track hook volume and latency.
    
    # What
    - add `codex.hooks.run` and `codex.hooks.run.duration_ms`
    - tag both metrics with `hook_name`, `source`, and `status`
    - emit the metrics from the completed hook path
    
    Verified locally against a dummy OTLP collector
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • [codex] reduce module visibility (#16978)
    ## Summary
    - reduce public module visibility across Rust crates, preferring private
    or crate-private modules with explicit crate-root public exports
    - update external call sites and tests to use the intended public crate
    APIs instead of reaching through module trees
    - add the module visibility guideline to AGENTS.md
    
    ## Validation
    - `cargo check --workspace --all-targets --message-format=short` passed
    before the final fix/format pass
    - `just fix` completed successfully
    - `just fmt` completed successfully
    - `git diff --check` passed
  • Codex/windows bazel rust test coverage no rs (#16528)
    # Why this PR exists
    
    This PR is trying to fix a coverage gap in the Windows Bazel Rust test
    lane.
    
    Before this change, the Windows `bazel test //...` job was nominally
    part of PR CI, but a non-trivial set of `//codex-rs/...` Rust test
    targets did not actually contribute test signal on Windows. In
    particular, targets such as `//codex-rs/core:core-unit-tests`,
    `//codex-rs/core:core-all-test`, and `//codex-rs/login:login-unit-tests`
    were incompatible during Bazel analysis on the Windows gnullvm platform,
    so they never reached test execution there. That is why the
    Cargo-powered Windows CI job could surface Windows-only failures that
    the Bazel-powered job did not report: Cargo was executing those tests,
    while Bazel was silently dropping them from the runnable target set.
    
    The main goal of this PR is to make the Windows Bazel test lane execute
    those Rust test targets instead of skipping them during analysis, while
    still preserving `windows-gnullvm` as the target configuration for the
    code under test. In other words: use an MSVC host/exec toolchain where
    Bazel helper binaries and build scripts need it, but continue compiling
    the actual crate targets with the Windows gnullvm cfgs that our current
    Bazel matrix is supposed to exercise.
    
    # Important scope note
    
    This branch intentionally removes the non-resource-loading `.rs` test
    and production-code changes from the earlier
    `codex/windows-bazel-rust-test-coverage` branch. The only Rust source
    changes kept here are runfiles/resource-loading fixes in TUI tests:
    
    - `codex-rs/tui/src/chatwidget/tests.rs`
    - `codex-rs/tui/tests/manager_dependency_regression.rs`
    
    That is deliberate. Since the corresponding tests already pass under
    Cargo, this PR is meant to test whether Bazel infrastructure/toolchain
    fixes alone are enough to get a healthy Windows Bazel test signal,
    without changing test behavior for Windows timing, shell output, or
    SQLite file-locking.
    
    # How this PR changes the Windows Bazel setup
    
    ## 1. Split Windows host/exec and target concerns in the Bazel test lane
    
    The core change is that the Windows Bazel test job now opts into an MSVC
    host platform for Bazel execution-time tools, but only for `bazel test`,
    not for the Bazel clippy build.
    
    Files:
    
    - `.github/workflows/bazel.yml`
    - `.github/scripts/run-bazel-ci.sh`
    - `MODULE.bazel`
    
    What changed:
    
    - `run-bazel-ci.sh` now accepts `--windows-msvc-host-platform`.
    - When that flag is present on Windows, the wrapper appends
    `--host_platform=//:local_windows_msvc` unless the caller already
    provided an explicit `--host_platform`.
    - `bazel.yml` passes that wrapper flag only for the Windows `bazel test
    //...` job.
    - The Bazel clippy job intentionally does **not** pass that flag, so
    clippy stays on the default Windows gnullvm host/exec path and continues
    linting against the target cfgs we care about.
    - `run-bazel-ci.sh` also now forwards `CODEX_JS_REPL_NODE_PATH` on
    Windows and normalizes the `node` executable path with `cygpath -w`, so
    tests that need Node resolve the runner's Node installation correctly
    under the Windows Bazel test environment.
    
    Why this helps:
    
    - The original incompatibility chain was mostly on the **exec/tool**
    side of the graph, not in the Rust test code itself. Moving host tools
    to MSVC lets Bazel resolve helper binaries and generators that were not
    viable on the gnullvm exec platform.
    - Keeping the target platform on gnullvm preserves cfg coverage for the
    crates under test, which is important because some Windows behavior
    differs between `msvc` and `gnullvm`.
    
    ## 2. Teach the repo's Bazel Rust macro about Windows link flags and
    integration-test knobs
    
    Files:
    
    - `defs.bzl`
    - `codex-rs/core/BUILD.bazel`
    - `codex-rs/otel/BUILD.bazel`
    - `codex-rs/tui/BUILD.bazel`
    
    What changed:
    
    - Replaced the old gnullvm-only linker flag block with
    `WINDOWS_RUSTC_LINK_FLAGS`, which now handles both Windows ABIs:
      - gnullvm gets `-C link-arg=-Wl,--stack,8388608`
    - MSVC gets `-C link-arg=/STACK:8388608`, `-C
    link-arg=/NODEFAULTLIB:libucrt.lib`, and `-C link-arg=ucrt.lib`
    - Threaded those Windows link flags into generated `rust_binary`,
    unit-test binaries, and integration-test binaries.
    - Extended `codex_rust_crate(...)` with:
      - `integration_test_args`
      - `integration_test_timeout`
    - Used those new knobs to:
    - mark `//codex-rs/core:core-all-test` as a long-running integration
    test
      - serialize `//codex-rs/otel:otel-all-test` with `--test-threads=1`
    - Added `src/**/*.rs` to `codex-rs/tui` test runfiles, because one
    regression test scans source files at runtime and Bazel does not expose
    source-tree directories unless they are declared as data.
    
    Why this helps:
    
    - Once host-side MSVC tools are available, we still need the generated
    Rust test binaries to link correctly on Windows. The MSVC-side
    stack/UCRT flags make those binaries behave more like their Cargo-built
    equivalents.
    - The integration-test macro knobs avoid hardcoding one-off test
    behavior in ad hoc BUILD rules and make the generated test targets more
    expressive where Bazel and Cargo have different runtime defaults.
    
    ## 3. Patch `rules_rs` / `rules_rust` so Windows MSVC exec-side Rust and
    build scripts are actually usable
    
    Files:
    
    - `MODULE.bazel`
    - `patches/rules_rs_windows_exec_linker.patch`
    - `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch`
    - `patches/rules_rust_windows_build_script_runner_paths.patch`
    - `patches/rules_rust_windows_exec_msvc_build_script_env.patch`
    - `patches/rules_rust_windows_msvc_direct_link_args.patch`
    - `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch`
    - `patches/BUILD.bazel`
    
    What these patches do:
    
    - `rules_rs_windows_exec_linker.patch`
    - Adds a `rust-lld` filegroup for Windows Rust toolchain repos,
    symlinked to `lld-link.exe` from `PATH`.
      - Marks Windows toolchains as using a direct linker driver.
      - Supplies Windows stdlib link flags for both gnullvm and MSVC.
    - `rules_rust_windows_bootstrap_process_wrapper_linker.patch`
    - For Windows MSVC Rust targets, prefers the Rust toolchain linker over
    an inherited C++ linker path like `clang++`.
    - This specifically avoids the broken mixed-mode command line where
    rustc emits MSVC-style `/NOLOGO` / `/LIBPATH:` / `/OUT:` arguments but
    Bazel still invokes `clang++.exe`.
    - `rules_rust_windows_build_script_runner_paths.patch`
    - Normalizes forward-slash execroot-relative paths into Windows path
    separators before joining them on Windows.
    - Uses short Windows paths for `RUSTC`, `OUT_DIR`, and the build-script
    working directory to avoid path-length and quoting issues in third-party
    build scripts.
    - Exposes `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER=1` to build scripts so
    crate-local patches can detect "this is running under Bazel's
    build-script runner".
    - Fixes the Windows runfiles cleanup filter so generated files with
    retained suffixes are actually retained.
    - `rules_rust_windows_exec_msvc_build_script_env.patch`
    - For exec-side Windows MSVC build scripts, stops force-injecting
    Bazel's `CC`, `CXX`, `LD`, `CFLAGS`, and `CXXFLAGS` when that would send
    GNU-flavored tool paths/flags into MSVC-oriented Cargo build scripts.
    - Rewrites or strips GNU-only `--sysroot`, MinGW include/library paths,
    stack-protector, and `_FORTIFY_SOURCE` flags on the MSVC exec path.
    - The practical effect is that build scripts can fall back to the Visual
    Studio toolchain environment already exported by CI instead of crashing
    inside Bazel's hermetic `clang.exe` setup.
    - `rules_rust_windows_msvc_direct_link_args.patch`
    - When using a direct linker on Windows, stops forwarding GNU driver
    flags such as `-L...` and `--sysroot=...` that `lld-link.exe` does not
    understand.
    - Passes non-`.lib` native artifacts as explicit `-Clink-arg=<path>`
    entries when needed.
    - Filters C++ runtime libraries to `.lib` artifacts on the Windows
    direct-driver path.
    - `rules_rust_windows_process_wrapper_skip_temp_outputs.patch`
    - Excludes transient `*.tmp*` and `*.rcgu.o` files from process-wrapper
    dependency search-path consolidation, so unstable compiler outputs do
    not get treated as real link search-path inputs.
    
    Why this helps:
    
    - The host-platform split alone was not enough. Once Bazel started
    analyzing/running previously incompatible Rust tests on Windows, the
    next failures were in toolchain plumbing:
    - MSVC-targeted Rust tests were being linked through `clang++` with
    MSVC-style arguments.
    - Cargo build scripts running under Bazel's Windows MSVC exec platform
    were handed Unix/GNU-flavored path and flag shapes.
    - Some generated paths were too long or had path-separator forms that
    third-party Windows build scripts did not tolerate.
    - These patches make that mixed Bazel/Cargo/Rust/MSVC path workable
    enough for the test lane to actually build and run the affected crates.
    
    ## 4. Patch third-party crate build scripts that were not robust under
    Bazel's Windows MSVC build-script path
    
    Files:
    
    - `MODULE.bazel`
    - `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch`
    - `patches/ring_windows_msvc_include_dirs.patch`
    - `patches/zstd-sys_windows_msvc_include_dirs.patch`
    
    What changed:
    
    - `aws-lc-sys`
    - Detects Bazel's Windows MSVC build-script runner via
    `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER` or a `bazel-out` manifest-dir
    path.
    - Uses `clang-cl` for Bazel Windows MSVC builds when no explicit
    `CC`/`CXX` is set.
    - Allows prebuilt NASM on the Bazel Windows MSVC path even when `nasm`
    is not available directly in the runner environment.
    - Avoids canonicalizing `CARGO_MANIFEST_DIR` in the Bazel Windows MSVC
    case, because that path may point into Bazel output/runfiles state where
    preserving the given path is more reliable than forcing a local
    filesystem canonicalization.
    - `ring`
    - Under the Bazel Windows MSVC build-script runner, copies the
    pregenerated source tree into `OUT_DIR` and uses that as the
    generated-source root.
    - Adds include paths needed by MSVC compilation for
    Fiat/curve25519/P-256 generated headers.
    - Rewrites a few relative includes in C sources so the added include
    directories are sufficient.
    - `zstd-sys`
    - Adds MSVC-only include directories for `compress`, `decompress`, and
    feature-gated dictionary/legacy/seekable sources.
    - Skips `-fvisibility=hidden` on MSVC targets, where that
    GCC/Clang-style flag is not the right mechanism.
    
    Why this helps:
    
    - After the `rules_rust` plumbing started running build scripts on the
    Windows MSVC exec path, some third-party crates still failed for
    crate-local reasons: wrong compiler choice, missing include directories,
    build-script assumptions about manifest paths, or Unix-only C compiler
    flags.
    - These crate patches address those crate-local assumptions so the
    larger toolchain change can actually reach first-party Rust test
    execution.
    
    ## 5. Keep the only `.rs` test changes to Bazel/Cargo runfiles parity
    
    Files:
    
    - `codex-rs/tui/src/chatwidget/tests.rs`
    - `codex-rs/tui/tests/manager_dependency_regression.rs`
    
    What changed:
    
    - Instead of asking `find_resource!` for a directory runfile like
    `src/chatwidget/snapshots` or `src`, these tests now resolve one known
    file runfile first and then walk to its parent directory.
    
    Why this helps:
    
    - Bazel runfiles are more reliable for explicitly declared files than
    for source-tree directories that happen to exist in a Cargo checkout.
    - This keeps the tests working under both Cargo and Bazel without
    changing their actual assertions.
    
    # What we tried before landing on this shape, and why those attempts did
    not work
    
    ## Attempt 1: Force `--host_platform=//:local_windows_msvc` for all
    Windows Bazel jobs
    
    This did make the previously incompatible test targets show up during
    analysis, but it also pushed the Bazel clippy job and some unrelated
    build actions onto the MSVC exec path.
    
    Why that was bad:
    
    - Windows clippy started running third-party Cargo build scripts with
    Bazel's MSVC exec settings and crashed in crates such as `tree-sitter`
    and `libsqlite3-sys`.
    - That was a regression in a job that was previously giving useful
    gnullvm-targeted lint signal.
    
    What this PR does instead:
    
    - The wrapper flag is opt-in, and `bazel.yml` uses it only for the
    Windows `bazel test` lane.
    - The clippy lane stays on the default Windows gnullvm host/exec
    configuration.
    
    ## Attempt 2: Broaden the `rules_rust` linker override to all Windows
    Rust actions
    
    This fixed the MSVC test-lane failure where normal `rust_test` targets
    were linked through `clang++` with MSVC-style arguments, but it broke
    the default gnullvm path.
    
    Why that was bad:
    
    -
    `@@rules_rs++rules_rust+rules_rust//util/process_wrapper:process_wrapper`
    on the gnullvm exec platform started linking with `lld-link.exe` and
    then failed to resolve MinGW-style libraries such as `-lkernel32`,
    `-luser32`, and `-lmingw32`.
    
    What this PR does instead:
    
    - The linker override is restricted to Windows MSVC targets only.
    - The gnullvm path keeps its original linker behavior, while MSVC uses
    the direct Windows linker.
    
    ## Attempt 3: Keep everything on pure Windows gnullvm and patch the V8 /
    Python incompatibility chain instead
    
    This would have preserved a single Windows ABI everywhere, but it is a
    much larger project than this PR.
    
    Why that was not the practical first step:
    
    - The original incompatibility chain ran through exec-side generators
    and helper tools, not only through crate code.
    - `third_party/v8` is already special-cased on Windows gnullvm because
    `rusty_v8` only publishes Windows prebuilts under MSVC names.
    - Fixing that path likely means deeper changes in
    V8/rules_python/rules_rust toolchain resolution and generator execution,
    not just one local CI flag.
    
    What this PR does instead:
    
    - Keep gnullvm for the target cfgs we want to exercise.
    - Move only the Windows test lane's host/exec platform to MSVC, then
    patch the build-script/linker boundary enough for that split
    configuration to work.
    
    ## Attempt 4: Validate compatibility with `bazel test --nobuild ...`
    
    This turned out to be a misleading local validation command.
    
    Why:
    
    - `bazel test --nobuild ...` can successfully analyze targets and then
    still exit 1 with "Couldn't start the build. Unable to run tests"
    because there are no runnable test actions after `--nobuild`.
    
    Better local check:
    
    ```powershell
    bazel build --nobuild --keep_going --host_platform=//:local_windows_msvc //codex-rs/login:login-unit-tests //codex-rs/core:core-unit-tests //codex-rs/core:core-all-test
    ```
    
    # Which patches probably deserve upstream follow-up
    
    My rough take is that the `rules_rs` / `rules_rust` patches are the
    highest-value upstream candidates, because they are fixing generic
    Windows host/exec + MSVC direct-linker behavior rather than
    Codex-specific test logic.
    
    Strong upstream candidates:
    
    - `patches/rules_rs_windows_exec_linker.patch`
    - `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch`
    - `patches/rules_rust_windows_build_script_runner_paths.patch`
    - `patches/rules_rust_windows_exec_msvc_build_script_env.patch`
    - `patches/rules_rust_windows_msvc_direct_link_args.patch`
    - `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch`
    
    Why these seem upstreamable:
    
    - They address general-purpose problems in the Windows MSVC exec path:
      - missing direct-linker exposure for Rust toolchains
      - wrong linker selection when rustc emits MSVC-style args
    - Windows path normalization/short-path issues in the build-script
    runner
      - forwarding GNU-flavored CC/link flags into MSVC Cargo build scripts
      - unstable temp outputs polluting process-wrapper search-path state
    
    Potentially upstreamable crate patches, but likely with more care:
    
    - `patches/zstd-sys_windows_msvc_include_dirs.patch`
    - `patches/ring_windows_msvc_include_dirs.patch`
    - `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch`
    
    Notes on those:
    
    - The `zstd-sys` and `ring` include-path fixes look fairly generic for
    MSVC/Bazel build-script environments and may be straightforward to
    propose upstream after we confirm CI stability.
    - The `aws-lc-sys` patch is useful, but it includes a Bazel-specific
    environment probe and CI-specific compiler fallback behavior. That
    probably needs a cleaner upstream-facing shape before sending it out, so
    upstream maintainers are not forced to adopt Codex's exact CI
    assumptions.
    
    Probably not worth upstreaming as-is:
    
    - The repo-local Starlark/test target changes in `defs.bzl`,
    `codex-rs/*/BUILD.bazel`, and `.github/scripts/run-bazel-ci.sh` are
    mostly Codex-specific policy and CI wiring, not generic rules changes.
    
    # Validation notes for reviewers
    
    On this branch, I ran the following local checks after dropping the
    non-resource-loading Rust edits:
    
    ```powershell
    cargo test -p codex-tui
    just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc -- fix -p codex-tui
    python .\tools\argument-comment-lint\run-prebuilt-linter.py -p codex-tui
    just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc fmt
    ```
    
    One local caveat:
    
    - `just argument-comment-lint` still fails on this Windows machine for
    an unrelated Bazel toolchain-resolution issue in
    `//codex-rs/exec:exec-all-test`, so I used the direct prebuilt linter
    for `codex-tui` as the local fallback.
    
    # Expected reviewer takeaway
    
    If this PR goes green, the important conclusion is that the Windows
    Bazel test coverage gap was primarily a Bazel host/exec toolchain
    problem, not a need to make the Rust tests themselves Windows-specific.
    That would be a strong signal that the deleted non-resource-loading Rust
    test edits from the earlier branch should stay out, and that future work
    should focus on upstreaming the generic `rules_rs` / `rules_rust`
    Windows fixes and reducing the crate-local patch surface.
  • otel: remove the last workspace crate feature (#16469)
    ## Why
    
    `codex-otel` still carried `disable-default-metrics-exporter`, which was
    the last remaining workspace crate feature.
    
    We are removing workspace crate features because they do not fit our
    current build model well:
    
    - our Bazel setup does not honor crate features today, which can let
    feature-gated issues go unnoticed
    - they create extra crate build permutations that we want to avoid
    
    For this case, the feature was only being used to keep the built-in
    Statsig metrics exporter off in test and debug-oriented contexts. This
    repo already treats `debug_assertions` as the practical proxy for that
    class of behavior, so OTEL should follow the same convention instead of
    keeping a dedicated crate feature alive.
    
    ## What changed
    
    - removed `disable-default-metrics-exporter` from
    `codex-rs/otel/Cargo.toml`
    - removed the `codex-otel` dev-dependency feature activation from
    `codex-rs/core/Cargo.toml`
    - changed `codex-rs/otel/src/config.rs` so the built-in
    `OtelExporter::Statsig` default resolves to `None` when
    `debug_assertions` is enabled, with a focused unit test covering that
    behavior
    - removed the final feature exceptions from
    `.github/scripts/verify_cargo_workspace_manifests.py`, so workspace
    crate features are now hard-banned instead of temporarily allowlisted
    - expanded the verifier error message to explain the Bazel mismatch and
    build-permutation cost behind that policy
    
    ## How tested
    
    - `python3 .github/scripts/verify_cargo_workspace_manifests.py`
    - `cargo test -p codex-otel`
    - `cargo test -p codex-core
    metrics_exporter_defaults_to_statsig_when_missing`
    - `cargo test -p codex-app-server app_server_default_analytics_`
    - `just bazel-lock-check`
  • chore: clean up argument-comment lint and roll out all-target CI on macOS (#16054)
    ## Why
    
    `argument-comment-lint` was green in CI even though the repo still had
    many uncommented literal arguments. The main gap was target coverage:
    the repo wrapper did not force Cargo to inspect test-only call sites, so
    examples like the `latest_session_lookup_params(true, ...)` tests in
    `codex-rs/tui_app_server/src/lib.rs` never entered the blocking CI path.
    
    This change cleans up the existing backlog, makes the default repo lint
    path cover all Cargo targets, and starts rolling that stricter CI
    enforcement out on the platform where it is currently validated.
    
    ## What changed
    
    - mechanically fixed existing `argument-comment-lint` violations across
    the `codex-rs` workspace, including tests, examples, and benches
    - updated `tools/argument-comment-lint/run-prebuilt-linter.sh` and
    `tools/argument-comment-lint/run.sh` so non-`--fix` runs default to
    `--all-targets` unless the caller explicitly narrows the target set
    - fixed both wrappers so forwarded cargo arguments after `--` are
    preserved with a single separator
    - documented the new default behavior in
    `tools/argument-comment-lint/README.md`
    - updated `rust-ci` so the macOS lint lane keeps the plain wrapper
    invocation and therefore enforces `--all-targets`, while Linux and
    Windows temporarily pass `-- --lib --bins`
    
    That temporary CI split keeps the stricter all-targets check where it is
    already cleaned up, while leaving room to finish the remaining Linux-
    and Windows-specific target-gated cleanup before enabling
    `--all-targets` on those runners. The Linux and Windows failures on the
    intermediate revision were caused by the wrapper forwarding bug, not by
    additional lint findings in those lanes.
    
    ## Validation
    
    - `bash -n tools/argument-comment-lint/run.sh`
    - `bash -n tools/argument-comment-lint/run-prebuilt-linter.sh`
    - shell-level wrapper forwarding check for `-- --lib --bins`
    - shell-level wrapper forwarding check for `-- --tests`
    - `just argument-comment-lint`
    - `cargo test` in `tools/argument-comment-lint`
    - `cargo test -p codex-terminal-detection`
    
    ## Follow-up
    
    - Clean up remaining Linux-only target-gated callsites, then switch the
    Linux lint lane back to the plain wrapper invocation.
    - Clean up remaining Windows-only target-gated callsites, then switch
    the Windows lint lane back to the plain wrapper invocation.