7919 Commits

  • [codex] disable Nagle on Rendezvous WebSockets (#30269)
    ## Summary
    
    Disable Nagle unconditionally for both exec-server Rendezvous WebSocket
    connections.
    
    - pass `disable_nagle=true` at the executor and harness connection call
    sites
    - keep the existing signed URL, protocol, and connection flow unchanged
    - add no feature flag, rollout schema, path variant, or
    experiment-specific telemetry
    
    The companion internal PR enables `TCP_NODELAY` on accepted Rendezvous
    sockets: https://github.com/openai/openai/pull/1082463
    
    ## Why
    
    Rendezvous carries small, latency-sensitive relay and JSON-RPC frames.
    Three staging runs of 30 steady-state `process/read` calls per
    configuration measured p50 improving from 139.1 ms to 81.5 ms and p95
    from 162.0 ms to 95.8 ms with Nagle disabled.
    
    The expected packet overhead is small at the current connection scale.
    We will use existing latency, error, packet, and CPU monitoring and
    revert normally if production regresses.
    
    ## Rollout and rollback
    
    The client and accepted-socket changes can deploy independently. New
    connections receive the setting as each side deploys. Rollback is a
    normal code revert; there is no persisted assignment or gate state to
    unwind.
    
    ## Validation
    
    - `just test -p codex-exec-server --lib`: 164 passed
    - `just fix -p codex-exec-server`: passed
    - `just fmt`: passed
    - independent final review found no actionable issue
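    
    For readers unfamiliar with the socket option, the sketch below shows what
    disabling Nagle amounts to on both sides of a connection, using only the
    Rust standard library. The helper names `connect` and `accept_nodelay` are
    hypothetical; the real call sites pass `disable_nagle=true` through the
    WebSocket connection code, which is not reproduced here.
    
    ```rust
    use std::io;
    use std::net::{TcpListener, TcpStream};
    
    /// Hypothetical client helper: connect, then optionally disable Nagle.
    fn connect(addr: &str, disable_nagle: bool) -> io::Result<TcpStream> {
        let stream = TcpStream::connect(addr)?;
        if disable_nagle {
            // TCP_NODELAY: send small writes immediately instead of coalescing them.
            stream.set_nodelay(true)?;
        }
        Ok(stream)
    }
    
    /// Hypothetical server helper: set TCP_NODELAY on an accepted socket.
    fn accept_nodelay(listener: &TcpListener) -> io::Result<TcpStream> {
        let (stream, _peer) = listener.accept()?;
        stream.set_nodelay(true)?;
        Ok(stream)
    }
    ```
    
    Because each side sets the option on its own socket, either side can ship
    first, which matches the independent rollout described above.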
  • [codex] auto-label AWS Bedrock issues (#30607)
    ## Summary
    
    AWS Bedrock issues currently fall under broader labels, which makes
    provider-specific reports harder to find. The issue tracker now has an
    `aws-bedrock` label, but the automated labeler does not know to apply
    it.
    
    Teach the issue labeler to select `aws-bedrock` for Amazon Bedrock
    provider or Bedrock Mantle issues while excluding generic AWS
    references.
  • Update safety check links (#30491)
    ## Summary
    
    Bio/Cyber safety surfaces in the TUI could send users to stale Trusted
    Access pages, and safety buffering did not always expose the Help
    Center.
    
    This follow-up to #30317 adds the missing Learn more action, refreshes
    the Bio access URL and block copy, and updates the affected snapshots
    while preserving the existing retry and wait behavior.
  • [codex] Treat max as a first-class reasoning effort (#30467)
    ## Why
    
    The Bedrock GPT-5.6 catalog advertises `max`, but Codex treated it as an
    opaque custom effort. That made the reasoning picker render it as
    lowercase `max` while known efforts use productized labels.
    
    Making `max` a known effort aligns catalog data, parsing, and UI
    presentation without changing the `max` wire value or persisted
    representation.
    
    ## What changed
    
    - Add first-class `ReasoningEffort::Max` parsing and serialization.
    - Use the typed effort in the Bedrock catalog and render it as `Max` in
    the TUI.
    - Preserve forward-compatible custom-effort coverage with a genuinely
    unknown `future` value.
    
    ### Before
    <img width="559" height="124" alt="Screenshot 2026-06-28 at 12 08 47 PM"
    src="https://github.com/user-attachments/assets/7c43cf4f-020b-4605-9239-0a9c97eb7364"
    />
    
    ### After
    <img width="558" height="107" alt="Screenshot 2026-06-28 at 12 09 10 PM"
    src="https://github.com/user-attachments/assets/b9cc5ded-c940-43b4-b024-bba25abe0a17"
    />
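    
    The parsing and presentation change can be pictured with a small enum that
    keeps a catch-all for unknown values. This is a sketch only: `Max` is the
    variant named in this change, while the other variants, the `Custom`
    fallback, and the method names are assumptions made for illustration.
    
    ```rust
    /// Sketch of a reasoning-effort type with a forward-compatible fallback.
    #[derive(Debug, Clone, PartialEq, Eq)]
    enum ReasoningEffort {
        Low,
        Medium,
        High,
        Max,
        /// Any value this build does not recognize, kept verbatim.
        Custom(String),
    }
    
    impl ReasoningEffort {
        /// Parse the wire value; unknown strings stay representable.
        fn parse(wire: &str) -> Self {
            match wire {
                "low" => Self::Low,
                "medium" => Self::Medium,
                "high" => Self::High,
                "max" => Self::Max,
                other => Self::Custom(other.to_string()),
            }
        }
    
        /// The wire value is unchanged by promoting `max` to a known variant.
        fn wire_value(&self) -> &str {
            match self {
                Self::Low => "low",
                Self::Medium => "medium",
                Self::High => "high",
                Self::Max => "max",
                Self::Custom(raw) => raw,
            }
        }
    
        /// Known efforts get a productized label; custom ones render as-is.
        fn label(&self) -> String {
            match self {
                Self::Low => "Low".to_string(),
                Self::Medium => "Medium".to_string(),
                Self::High => "High".to_string(),
                Self::Max => "Max".to_string(),
                Self::Custom(raw) => raw.clone(),
            }
        }
    }
    ```
    
    With this shape, `max` round-trips to the same wire string but renders as
    `Max`, while a value such as `future` still exercises the custom path.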
  • [codex] Restore v1 delegation guidance (#30511)
    ## Summary
    
    - restore the v1 clarification that requests for depth, research, or
    investigation do not authorize subagent spawning
    - restore guidance for keeping critical-path, urgent, tightly coupled,
    or difficult work local
    - update the focused v1 tool-search and spawn-description coverage
    
    ## Why
    
    PR #27919 simplified the v1 `spawn_agent` prompt by removing its
    delegation decision guidance. That left the authorization rule intact,
    but removed the instructions that constrained what should be delegated
    after spawning was authorized.
    
    Restore those guardrails while preserving later support for explicit
    delegation authorization from applicable AGENTS.md and skill
    instructions. Multi-agent v2 prompts are unchanged.
    
    ## User impact
    
    Models using the v1 multi-agent tool surface receive clearer guidance to
    delegate independent side work while keeping blocking work on the main
    rollout.
    
    ## Validation
    
    - `just fmt`
    - `git diff --check`
    - tests not run locally per repository guidance; CI will validate the
    focused coverage
  • [codex] Use model metadata for skills usage instructions (#29740)
    ## Summary
    
    - add a false-by-default `include_skills_usage_instructions` model
    metadata field
    - enable the field for the bundled `gpt-5.5` model metadata
    - consume the metadata in both core and extension skill rendering
    - remove hardcoded legacy-model matching and its marker plumbing
  • fix(tui): clear completed safety buffering prompt (#30490)
    ## Why
    
    The safety-buffering prompt is a modal TUI view, but the normal
    successful-turn path only hid the running status indicator. If the turn
    completed while the prompt was open, the stale modal remained over the
    composer until the user dismissed it or another turn started.
    
    This aligns the TUI with the app behavior: keep the safety notice
    visible while the turn is active, then remove it when the turn becomes
    terminal. It also prevents the stale retry action from changing the
    model and reasoning effort for a future turn after the buffered turn has
    already completed.
    
    | New copy |
    |---|
    | <img width="1014" height="313" alt="CleanShot 2026-06-28 at 20 27 18"
    src="https://github.com/user-attachments/assets/f0f37359-5d77-442f-add2-9d1874bdc422"
    /> |
    
    ## What changed
    
    - Clear the active safety-buffering view and retry state when a turn
    completes successfully.
    - Update the retry-capable message to say “Hang tight or retry with a
    faster model”.
    - Extend the safety-buffering regression coverage to verify that the
    prompt remains visible after assistant output starts and disappears when
    the turn completes.
    - Update the TUI snapshot for the revised copy.
    
    This is a follow-up to #29919.
    
    ## How to Test
    
    1. Start a TUI turn that receives `model/safetyBuffering/updated` with
    `showBufferingUi: true` and a `fasterModel`.
    2. Confirm the prompt says “Hang tight or retry with a faster model”.
    3. Let the turn continue and confirm the prompt remains visible while
    the turn is active.
    4. Let the turn finish successfully and confirm the prompt disappears
    and the composer is restored without requiring an extra keypress.
    5. Confirm a buffering update without a faster model still shows the
    shorter non-retry message.
    
    Targeted automated coverage:
    
    - `just test -p codex-tui safety_buffering` — 4 passed.
    - `just test -p codex-tui` — 2,951 passed; two unrelated Guardian
    feature-flag tests failed identically on `main` in this environment.
    
    The argument-comment lint was also audited manually. The workspace Bazel
    invocation was blocked by a missing external LLVM `compiler-rt` BUILD
    file, and the packaged per-crate fallback uses a nightly older than the
    current `sqlx` minimum Rust version.
  • [codex] Enable remote plugins by default (#30297)
    ## Summary
    
    - enable the remote plugin feature by default
    - promote the remote plugin feature from under development to stable
    - preserve the existing `features.remote_plugin` override for explicitly
    disabling it
    - keep legacy disabled-path coverage explicit in TUI and app-server
    tests
    
    ## Impact
    
    Remote plugin functionality is enabled by default for configurations
    that do not set the feature flag. The existing Codex backend
    authentication gate still applies.
    
    ## Validation
    
    - `just fmt`
    - `just test -p codex-features`
    - `just test -p codex-tui
    plugins_popup_remote_section_fallback_states_snapshot`
    - targeted `codex-app-server` plugin-list and skills-list tests
    - `git diff --check`
    
    The full TUI and app-server suites were also exercised locally. All
    remote-plugin-related coverage passed; unrelated local
    sandbox/test-binary failures remain outside this change.
  • [app-server] increase currentTime/read timeout (#30384)
    ## Summary
    
    Increase the external currentTime/read request timeout from 5 seconds to
    10 seconds.
    
    ## Validation
    
    - just fmt
    - Focused app-server test build was stopped to defer validation to CI.
  • [plugins] Enforce marketplace source policy at runtime (#29691)
    ## Summary
    
    - project effective marketplace/plugin config through the enterprise
    source policy so blocked installed plugins become inactive
    - filter plugin list/read/discovery and CLI marketplace source/snapshot
    reporting using the same policy
    - enforce source admission for background marketplace cache refreshes
    - continue refreshing/upgrading independent marketplaces and plugins
    when one entry fails, returning per-entry errors
    - include policy-projected plugin state in cache and refresh keys so
    requirement changes invalidate stale results
    
    ## Stack
    
    This is PR 2 of 2 and is based on #29690. Review the admission model and
    source matcher in #29690 first; this PR contains only runtime
    enforcement.
    
    ## Test plan
    
    - `just test -p codex-core-plugins` (287 tests)
    - `just test -p codex-cli
    plugin_list_ignores_implicit_system_marketplace_roots_without_manifests`
    - `cargo check -p codex-cli -p codex-app-server --tests`
  • [app-server] expose environment info RPC (#30291)
    ## Why
    
    App-server clients that configure named execution environments need to
    discover an environment's shell and working directory before selecting
    it for a thread or turn. Because the environment can run on a different
    operating system than app-server, its working directory is represented
    as a canonical `file:` URI rather than a host-local path string. The
    probe also needs a bounded response time: an exec-server that completes
    initialization but never answers `environment/info` must not hold the
    environment serialization queue indefinitely.
    
    ## What changed
    
    - Add an experimental `environment/info` app-server RPC for named
    environments.
    - Route the probe through the managed environment connection and return
    target-native shell metadata plus the default working directory as a
    `PathUri`.
    - Return connection and protocol failures as JSON-RPC errors.
    - Bound the exec-server probe response to 30 seconds and remove
    timed-out calls from the pending-request table so later environment
    mutations can proceed.
    - Cover successful responses, omitted working directories, unknown
    environments, connection failures, and pending-call cleanup.
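    
    The bounded wait and pending-call cleanup can be sketched with a
    single-threaded table built on standard-library channels. The `Pending`
    type and its methods are hypothetical and much simpler than the real
    exec-server client; the point is only that a timed-out call removes its own
    entry so nothing waits on it afterwards.
    
    ```rust
    use std::collections::HashMap;
    use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
    use std::time::Duration;
    
    /// Hypothetical pending-request table keyed by request ID.
    struct Pending {
        next_id: u64,
        waiting: HashMap<u64, Sender<String>>,
    }
    
    impl Pending {
        fn new() -> Self {
            Self { next_id: 0, waiting: HashMap::new() }
        }
    
        /// Register a call and return its ID plus the receiver for its response.
        fn register(&mut self) -> (u64, Receiver<String>) {
            let id = self.next_id;
            self.next_id += 1;
            let (tx, rx) = mpsc::channel();
            self.waiting.insert(id, tx);
            (id, rx)
        }
    
        /// Deliver a response; returns false if the call is unknown or timed out.
        fn resolve(&mut self, id: u64, response: String) -> bool {
            match self.waiting.remove(&id) {
                Some(tx) => tx.send(response).is_ok(),
                None => false,
            }
        }
    
        /// Wait up to `timeout`; on timeout, drop the table entry before returning.
        fn wait(&mut self, id: u64, rx: &Receiver<String>, timeout: Duration) -> Result<String, String> {
            match rx.recv_timeout(timeout) {
                Ok(response) => Ok(response),
                Err(RecvTimeoutError::Timeout) => {
                    self.waiting.remove(&id);
                    Err(format!("timed out after {timeout:?}"))
                }
                Err(RecvTimeoutError::Disconnected) => Err("connection closed".to_string()),
            }
        }
    }
    ```
    
    In this sketch a late response for a timed-out call is simply rejected by
    `resolve`, since its entry is already gone.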
    
    ## Protocol examples
    
    Request:
    
    ```json
    {
      "id": 42,
      "method": "environment/info",
      "params": {
        "environmentId": "remote-a"
      }
    }
    ```
    
    Successful response:
    
    ```json
    {
      "id": 42,
      "result": {
        "shell": {
          "name": "zsh",
          "path": "/bin/zsh"
        },
        "cwd": "file:///workspace"
      }
    }
    ```
    
    If the exec-server initializes but does not answer the probe within 30
    seconds:
    
    ```json
    {
      "id": 42,
      "error": {
        "code": -32603,
        "message": "failed to get info for environment `remote-a`: exec-server protocol error: timed out waiting for exec-server `environment/info` response after 30s"
      }
    }
    ```
    
    ## Testing
    
    - App-server integration coverage for successful info (including omitted
    `cwd`), unknown environments, and connection failures.
    - Exec-server RPC coverage verifying a timed-out call is removed from
    the pending-request table.
    
    ---------
    
    Co-authored-by: Michael Bolin <mbolin@openai.com>
  • core: stabilize synthesized call output IDs (#30327)
    ## Why
    
    Response item IDs represent stable conversation identity.
    `ContextManager::for_prompt` repairs an unmatched call by synthesizing
    an `"aborted"` output in the disposable prompt projection, but that
    output previously had no ID. Assigning a fresh ID on every prompt build
    would make retries and resumes change otherwise identical model context
    and reduce prompt-cache reuse.
    
    The concrete bug is that these normalization-created outputs bypass the
    regular item-ID allocation path. Even with item IDs enabled, a prompt
    could therefore contain an identified call paired with a synthetic
    output whose `id` was missing. This change closes that gap by deriving
    the output ID from the source call's item ID. For legacy calls that have
    no item ID, the output remains ID-less because there is no stable source
    identity to derive from.
    
    The originating call already has a stable item ID under the item-ID
    model introduced in #28814. A prompt-only output can therefore derive
    stable identity from that call without mutating canonical history or
    persisted rollouts. This addresses the failure exposed by #30311 while
    keeping normalization read-only outside its detached prompt snapshot.
    
    UUIDv5 is intentional here because it is the standard namespaced,
    deterministic UUID construction. Using the output kind and source call
    ID as the name produces the same UUID on every projection while keeping
    output kinds in separate name domains. UUIDv7 would introduce randomness
    and time, so keeping it stable would require persisting the synthetic
    repair. UUIDv5 uses SHA-1 internally, but this is only an identity
    mapping—not an authenticity or security boundary.
    
    ## What changed
    
    - Derive a deterministic UUIDv5 ID for each synthesized call output from
    the source call item ID.
    - Use the Responses API prefix appropriate for function, custom-tool,
    tool-search, and local-shell outputs.
    - Preserve the existing insertion position immediately after the
    unmatched call.
    - Keep synthesized outputs prompt-only; no rollout, task-lifecycle,
    compaction, or raw-response behavior changes.
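    
    To make the determinism argument concrete, here is a standard-library-only
    sketch of name-based UUIDv5 derivation. Several details are assumptions for
    illustration: the DNS namespace, the `kind:call_id` name layout, and the
    function names are not taken from the change, and the Responses API prefix
    is not modeled. SHA-1 is written out by hand only to avoid a dependency.
    
    ```rust
    /// RFC 4122 DNS namespace; which namespace Codex uses is not stated (assumption).
    const NAMESPACE_DNS: [u8; 16] = [
        0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
        0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
    ];
    
    /// Plain SHA-1, written out so the sketch needs only std.
    fn sha1(data: &[u8]) -> [u8; 20] {
        let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];
        let mut msg = data.to_vec();
        let bit_len = (data.len() as u64) * 8;
        msg.push(0x80);
        while msg.len() % 64 != 56 {
            msg.push(0);
        }
        msg.extend_from_slice(&bit_len.to_be_bytes());
        for chunk in msg.chunks(64) {
            let mut w = [0u32; 80];
            for i in 0..16 {
                w[i] = u32::from_be_bytes([chunk[4 * i], chunk[4 * i + 1], chunk[4 * i + 2], chunk[4 * i + 3]]);
            }
            for i in 16..80 {
                w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
            }
            let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
            for i in 0..80 {
                let (f, k) = match i {
                    0..=19 => ((b & c) | (!b & d), 0x5A827999u32),
                    20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                    40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                    _ => (b ^ c ^ d, 0xCA62C1D6),
                };
                let temp = a.rotate_left(5).wrapping_add(f).wrapping_add(e).wrapping_add(k).wrapping_add(w[i]);
                e = d;
                d = c;
                c = b.rotate_left(30);
                b = a;
                a = temp;
            }
            for (slot, value) in h.iter_mut().zip([a, b, c, d, e]) {
                *slot = slot.wrapping_add(value);
            }
        }
        let mut out = [0u8; 20];
        for (i, word) in h.iter().enumerate() {
            out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
        }
        out
    }
    
    /// UUIDv5: SHA-1 over namespace || name, truncated to 16 bytes, then stamped
    /// with version 5 and the RFC 4122 variant bits.
    fn uuid_v5(namespace: &[u8; 16], name: &[u8]) -> String {
        let mut input = namespace.to_vec();
        input.extend_from_slice(name);
        let digest = sha1(&input);
        let mut bytes = [0u8; 16];
        bytes.copy_from_slice(&digest[..16]);
        bytes[6] = (bytes[6] & 0x0f) | 0x50;
        bytes[8] = (bytes[8] & 0x3f) | 0x80;
        let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
        format!("{}-{}-{}-{}-{}", &hex[..8], &hex[8..12], &hex[12..16], &hex[16..20], &hex[20..])
    }
    
    /// Hypothetical name layout: the output kind and the source call's item ID.
    fn synthetic_output_id(output_kind: &str, call_item_id: &str) -> String {
        uuid_v5(&NAMESPACE_DNS, format!("{output_kind}:{call_item_id}").as_bytes())
    }
    ```
    
    The same `(output_kind, call_item_id)` pair always maps to the same ID, and
    a different output kind lands in a different name domain, so rebuilding the
    prompt projection does not change the model-visible context.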
    
    ## Testing
    
    - `just test -p codex-core
    for_prompt_assigns_stable_id_to_synthetic_output_without_reordering_history`
    - `just test -p codex-core
    synthetic_call_output_id_is_stable_across_resumes`
    - `just test -p codex-core normalize_adds_missing_output`
    - `just test -p codex-core response_item_ids`
  • Preserve namespaces on custom tool calls (#30302)
    ## Summary
    
    - Preserve the optional namespace on custom tool calls during response
    deserialization and app-server replay.
    - Use the namespaced tool identifier for streaming argument handling and
    tool dispatch.
    - Regenerate app-server protocol schemas.
    - Add regression tests covering namespace serialization and routing.
    
    ## Testing
    
    - Ran affected protocol and app-server test suites.
    - Ran the full core test suite; two load-sensitive timing tests passed
    when rerun individually.
    - Ran Clippy and formatting checks.
    - Verified with a local end-to-end app-server replay that the namespace
    is preserved through the complete request/response flow.
  • app-server: structure and test JSON shutdown logs (#30314)
    ## Why
    
    `LOG_FORMAT=json` and `RUST_LOG` are supported by app-server, but the
    behavior was only covered indirectly. We should verify the actual JSONL
    written by both user-facing entry points: `codex app-server` and the
    standalone `codex-app-server` binary.
    
    The existing processor shutdown message also always said the channel
    closed, even though the processor can exit for several different
    reasons. Structured fields make that event more accurate and useful to
    log consumers.
    
    ## What changed
    
    - Record the processor `exit_reason`, remaining connection count, and
    forced-shutdown state as structured tracing fields.
    - Add a shared process-test helper that enables JSON logging, validates
    every stderr line as JSON, and verifies the top-level timestamp is RFC
    3339.
    - Cover both `codex app-server` and `codex-app-server`, asserting the
    stable `level`, `fields`, and `target` payload.
    
    ## Test plan
    
    - `just test -p codex-app-server
    standalone_app_server_emits_json_info_events`
    - `just test -p codex-cli app_server_emits_json_info_events`
  • core: overlap diff root discovery with world state (#30286)
    ## Why
    
    Remote diff-root discovery is independent of world-state construction,
    but it ran afterward and added filesystem metadata latency before the
    first model request. Overlap the independent work so thread-cold turns
    do not pay those waits serially.
    
    ## What
    
    - Run `record_context_updates_and_set_reference_context_item` and
    `turn_diff_display_roots` with `tokio::join!`.
    - Reuse the same resolved display roots when constructing
    `TurnDiffTracker`; no cache or behavior lifecycle changes are
    introduced.
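    
    The overlap itself is the familiar "run two independent tasks, then wait
    for both" pattern. The change uses `tokio::join!`; the sketch below swaps
    in scoped OS threads from the standard library purely so it has no
    dependencies. `join2` is a hypothetical helper, not part of the codebase.
    
    ```rust
    use std::thread;
    
    /// Run two independent pieces of work concurrently and return both results.
    /// Stand-in for `tokio::join!`, using a scoped thread instead of async tasks.
    fn join2<A, B, FA, FB>(first: FA, second: FB) -> (A, B)
    where
        FA: FnOnce() -> A + Send,
        FB: FnOnce() -> B,
        A: Send,
    {
        thread::scope(|scope| {
            // Start the first task on its own thread...
            let handle = scope.spawn(first);
            // ...and run the second on the current thread in the meantime.
            let b = second();
            let a = handle.join().expect("first task panicked");
            (a, b)
        })
    }
    ```
    
    When both tasks are dominated by waiting, the elapsed time approaches the
    slower of the two rather than their sum, which is the effect the benchmark
    below measures.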
    
    ## Validation
    
    In a synthetic executor-skill benchmark with artificial network delay,
    thread-cold model-request p50 improved from about 1.79 s to about 1.58 s.
  • [codex] consume pushed exec-server process events (#30273)
    ## Summary
    
    - complete unified-exec processes from the ordered event stream instead
    of issuing a final zero-wait `process/read`
    - add optional executor sandbox-denial state to `process/exited`
    - retain `process/read` as a retained-output and compatibility fallback
    for receiver lag, sequence gaps, and legacy servers
    - recover sandbox-denial state across transport reconnection
    - cover the real `TestCodex` remote-exec path without adding a public
    test-only event constructor
    
    ## Why
    
    A successful one-shot tool call currently receives its output and
    terminal notifications, then pays another wide-area `process/read` round
    trip before returning. Staging traces showed that remote response wait
    accounted for more than 99.8% of RPC time; local serialization,
    queueing, and deserialization were below 0.6 ms.
    
    ## Measured impact
    
    A direct staging A/B used the same build and route and changed only
    completion mode. Each arm ran three times with 30 one-shot
    `/usr/bin/true` calls per run. The table reports the median of the three
    per-run percentiles.
    
    | Metric | Final `process/read` | Pushed events | Change |
    | --- | ---: | ---: | ---: |
    | End-to-end completion p50 | 159.5 ms | 118.7 ms | -40.8 ms (-25.6%) |
    | End-to-end completion p95 | 182.4 ms | 131.7 ms | -50.6 ms (-27.8%) |
    | Completion-wait p50 | 80.1 ms | 41.5 ms | -38.5 ms (-48.1%) |
    | Final `process/read` RPC p50 | 79.9 ms | eliminated | -79.9 ms |
    
    TCP_NODELAY was enabled in both A/B arms, so its effect cancels out. The
    successful, complete, in-order event path issued zero final
    `process/read` calls.
    
    ## Compatibility and recovery
    
    - new servers send `sandboxDenied` on `process/exited`
    - legacy servers omit it, which triggers one compatibility
    `process/read`
    - broadcast lag or a sequence gap triggers a retained-output read
    - recovery remains bounded by the server's existing 1 MiB
    retained-output window
    - complete, in-order event streams issue no completion read
    - sandbox denial is attached to the exit event before consumers can
    observe process completion
    - server-first and client-first rollouts remain wire-compatible;
    server-first realizes the latency win immediately
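    
    The decision between "complete from events" and "fall back to a read" can
    be sketched as a pure function over the received events. The event and
    result types here are hypothetical simplifications; only `process/exited`,
    `sandboxDenied`, and `process/read` are names from this change, and
    sequence numbers starting at 1 is an assumption.
    
    ```rust
    /// Hypothetical, simplified event shapes.
    #[derive(Debug, Clone, PartialEq)]
    enum Event {
        Output { seq: u64, chunk: String },
        /// `sandbox_denied` is `None` when a legacy server omits the field.
        Exited { seq: u64, exit_code: i32, sandbox_denied: Option<bool> },
    }
    
    #[derive(Debug, PartialEq)]
    enum Completion {
        /// Complete, in-order stream: no `process/read` is needed.
        FromEvents { output: String, exit_code: i32, sandbox_denied: bool },
        /// Gap or legacy server: recover with a `process/read`.
        NeedsRead { reason: &'static str },
    }
    
    fn complete(events: &[Event]) -> Completion {
        let mut expected_seq = 1; // assumption: sequences start at 1
        let mut output = String::new();
        for event in events {
            let seq = match event {
                Event::Output { seq, .. } | Event::Exited { seq, .. } => *seq,
            };
            if seq != expected_seq {
                return Completion::NeedsRead { reason: "sequence gap" };
            }
            expected_seq += 1;
            match event {
                Event::Output { chunk, .. } => output.push_str(chunk),
                Event::Exited { exit_code, sandbox_denied, .. } => {
                    return match sandbox_denied {
                        Some(denied) => Completion::FromEvents {
                            output,
                            exit_code: *exit_code,
                            sandbox_denied: *denied,
                        },
                        None => Completion::NeedsRead { reason: "legacy server omitted sandboxDenied" },
                    };
                }
            }
        }
        Completion::NeedsRead { reason: "stream ended before exit" }
    }
    ```
    
    Only the first branch avoids the extra round trip; every other outcome
    degrades to the pre-existing read path rather than to an error.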
    
    ## Integration coverage
    
    The `TestCodex` suite exercises four distinct remote-exec contracts:
    
    - complete pushed output/exit/close with zero reads
    - direct pushed sandbox denial with zero reads
    - legacy missing denial metadata with exactly one compatibility read
    - count-bounded replay eviction recovered from retained output without
    duplication
    
    ## Validation
    
    - `just test -p codex-core
    exec_command_consumes_pushed_remote_process_events`: 4 passed
    - `just test -p codex-core unified_exec::process_tests::`: 4 passed
    - `just test -p codex-exec-server`: 294 passed, 2 skipped
    - `just test -p codex-exec-server-protocol`: 5 passed
    - `just test -p codex-rmcp-client`: 89 passed, 2 skipped
    - focused Bazel `//codex-rs/core:core-all-test`: passed across 16 shards
    - scoped `just fix` passed for core and exec-server
    - `just fmt` passed
    
    The complete workspace suite was not rerun; focused Cargo and Bazel
    coverage passed for the changed behavior.
  • fix(remote-control): avoid server token refresh retry storms (#30201)
    ## Why
    
    Remote-control websocket reconnects and pairing requests proactively
    refresh their server token. When `/server/refresh` returns a transient
    error such as `502`, the still-valid token was discarded as a usable
    connection path, causing reconnect failures and repeated refresh
    attempts that could amplify an upstream incident.
    
    ## What Changed
    
    - Start proactive refresh five minutes before token expiry and
    distinguish it from a required refresh for missing or expired tokens.
    - Continue websocket and pairing operations with the existing valid
    token after `429`, `5xx`, or timeout failures.
    - Share an in-memory `next_refresh_at` throttle across websocket and
    pairing callers, honoring both `Retry-After` formats and otherwise using
    a jittered 24–36 second delay.
    - Keep required refreshes strict, preserve `404` enrollment replacement,
    and clear token/throttle state for `401` and `403` auth recovery.
    - Preserve refresh response metadata internally and add focused
    wire-level and integration coverage.
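    
    The rules above reduce to a small decision table. The sketch below encodes
    one reading of it; the type and function names are hypothetical, the
    jitter is passed in as a fraction so the example stays deterministic, and
    treating every other failure as fatal is an assumption.
    
    ```rust
    use std::time::Duration;
    
    #[derive(Debug, PartialEq)]
    enum RefreshFailure {
        Status(u16),
        Timeout,
    }
    
    #[derive(Debug, PartialEq)]
    enum Action {
        /// Keep going with the still-valid token and throttle the next refresh.
        UseExistingToken { next_refresh_in: Duration },
        ReplaceEnrollment,
        ClearTokenAndReauth,
        Fail,
    }
    
    /// Proactive refresh starts five minutes before expiry.
    const PROACTIVE_WINDOW: Duration = Duration::from_secs(5 * 60);
    
    fn should_refresh_proactively(time_to_expiry: Duration) -> bool {
        time_to_expiry <= PROACTIVE_WINDOW
    }
    
    /// Honor `Retry-After` when present, else wait 24 to 36 seconds.
    /// `jitter` is a caller-supplied fraction in [0, 1].
    fn throttle_delay(retry_after: Option<Duration>, jitter: f64) -> Duration {
        retry_after.unwrap_or_else(|| Duration::from_secs_f64(24.0 + 12.0 * jitter.clamp(0.0, 1.0)))
    }
    
    fn on_refresh_failure(
        failure: RefreshFailure,
        token_still_valid: bool,
        retry_after: Option<Duration>,
        jitter: f64,
    ) -> Action {
        match failure {
            RefreshFailure::Status(401) | RefreshFailure::Status(403) => Action::ClearTokenAndReauth,
            RefreshFailure::Status(404) => Action::ReplaceEnrollment,
            RefreshFailure::Status(429) | RefreshFailure::Status(500..=599) | RefreshFailure::Timeout
                if token_still_valid =>
            {
                Action::UseExistingToken { next_refresh_in: throttle_delay(retry_after, jitter) }
            }
            // Required refreshes stay strict: without a valid token there is no fallback.
            _ => Action::Fail,
        }
    }
    ```
    
    The `token_still_valid` guard is what separates a proactive refresh, which
    may fall back, from a required one, which may not.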
    
    ## Verification
    
    Added behavioral coverage proving that:
    
    - a valid near-expiry token still completes websocket and pairing
    requests after transient refresh failures;
    - `Retry-After` suppresses a subsequent refresh across websocket and
    pairing callers;
    - request and response-body timeouts are classified as transient;
    - an expired token, including one that expires during refresh, cannot
    proceed to websocket connection;
    - auth failures clear the attempted token without overwriting a
    concurrently rotated token.
  • feat(protocol): define missing rollout turn items (#30282)
    ## Description
    
    This PR adds canonical core `TurnItem` shapes for command execution,
    dynamic tool calls, collab agent tool calls, and sub-agent activity, to
    be stored in the rollout file soon.
    
    It also teaches app-server protocol / `ThreadHistoryBuilder` how to
    render those items, and adds the small legacy fanout helpers needed for
    existing event-based consumers. No core producer or rollout persistence
    behavior changes here; that will be done in a follow-up.
    
    ## Making ThreadHistoryBuilder stateless
    
    This is the first PR in a stack to make `ThreadHistoryBuilder` stateless
    enough that we can materialize app-server `ThreadItem`s from only a
    given slice of `RolloutItem` history, without ever needing to replay the
    whole thread from the beginning.
    
    The persisted legacy `RolloutItem::EventMsg` records are mostly shaped
    like live UI events, not like materialized `ThreadItem`s. They work if
    we replay the full rollout in order, but they often do not contain
    enough stable identity or complete item state to project an arbitrary
    suffix on its own.
    
    A few examples:
    
    - `UserMessageEvent` and `AgentMessageEvent` have content, but
    historically do not carry the persisted app-server item ID that should
    become the SQLite primary key.
    - `AgentReasoningEvent` and `AgentReasoningRawContentEvent` are
    fragments. `ThreadHistoryBuilder` currently merges them into the last
    reasoning item, which means a slice starting in the middle of reasoning
    cannot know whether to append to an earlier item or create a new one.
    - `WebSearchEndEvent`, `McpToolCallEndEvent`, collab end events, and
    similar legacy events can often render a final-looking item, but they
    usually rely on prior replay state to know which turn owns the item.
    - Begin/end legacy events are partial views of one logical item. The
    builder correlates them by `call_id` and mutates prior state to
    synthesize the final `ThreadItem`.
    
    That is the problem this direction fixes. A persisted canonical
    lifecycle record looks much closer to the read model we actually want
    later:
    
    ```rust
    ItemCompletedEvent {
        turn_id,
        item: TurnItem { id, ...full snapshot... },
        completed_at_ms,
    }
    ```
    
    Once rollout has explicit `turn_id`, stable `item.id`, and a canonical
    completed item snapshot, the future SQLite projector can reduce only the
    new rollout suffix and upsert the affected `thread_items` rows. It no
    longer needs to synthesize `item-N`, infer item ownership from the
    active turn, or replay earlier events just to reconstruct the current
    item snapshot.
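    
    As a rough illustration of that suffix-only reduction, the sketch below
    upserts completed-item snapshots into an in-memory map keyed by item ID.
    The structs are hypothetical, flattened stand-ins for the real event and
    row types, and a `BTreeMap` stands in for the future SQLite table.
    
    ```rust
    use std::collections::BTreeMap;
    
    /// Hypothetical, flattened stand-in for a persisted completed-item record.
    #[derive(Debug, Clone, PartialEq)]
    struct ItemCompleted {
        turn_id: String,
        item_id: String,
        snapshot: String,
        completed_at_ms: u64,
    }
    
    /// Hypothetical row shape for a `thread_items`-style read model.
    #[derive(Debug, Clone, PartialEq)]
    struct ThreadItemRow {
        turn_id: String,
        snapshot: String,
        completed_at_ms: u64,
    }
    
    /// Apply only the new rollout suffix; each record upserts one row by item ID.
    fn apply_suffix(rows: &mut BTreeMap<String, ThreadItemRow>, suffix: &[ItemCompleted]) {
        for event in suffix {
            rows.insert(
                event.item_id.clone(),
                ThreadItemRow {
                    turn_id: event.turn_id.clone(),
                    snapshot: event.snapshot.clone(),
                    completed_at_ms: event.completed_at_ms,
                },
            );
        }
    }
    ```
    
    Because every record carries its own turn and item identity, applying the
    history in two slices yields the same rows as replaying it in one pass.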
    
    ## What changed
    
    - Added core `TurnItem` variants and item structs for command execution,
    dynamic tool calls, collab agent tool calls, and sub-agent activity.
    - Added conversions from those canonical items back into the legacy
    event shapes where current consumers still need them.
    - Added app-server v2 `ThreadItem` conversion for the new core item
    variants.
    - Taught `ThreadHistoryBuilder` and rollout persistence metrics to
    recognize the new item variants.
    
    ## Follow-up
    
    The next PR, https://github.com/openai/codex/pull/30283, switches the live
    core producers for these item families onto canonical `ItemStarted` /
    `ItemCompleted` events.
  • [codex] group blocking and postmerge CI workflows (#30146)
    ## Why
    
    It's hard to change the set of required jobs when they're managed in the
    GitHub UI, and when each workflow is responsible for choosing its own
    scheduling, it's easy to end up with skew between what we enforce on PRs
    vs. on main.
    
    ## What
    
    - add a `blocking-ci` caller workflow, triggered by pull requests and
    pushes to `main`, for Bazel, blob size, cargo-deny, Codespell,
    `repo-checks`, rust CI, and SDK CI
    - add an `always()` terminal job named `CI required` that fails unless
    every called workflow succeeds
    - add a `postmerge-ci` caller workflow for `rust-ci-full` and
    `v8-canary`, with a terminal `Postmerge CI results` job
    - centralize V8 relevance detection in `v8_canary_changes.py`; unrelated
    PR and postmerge runs execute metadata only and skip the expensive build
    matrices
    - leave `v8-canary` outside the blocking gate and leave the external
    `cla` check independent
    
    ## Rollout
    
    A repository admin must replace the existing required GitHub Actions
    contexts with `CI required` in the main-branch ruleset. Retain `cla` as
    a separate required check. Until that change is coordinated, this PR
    cannot satisfy the old standalone check names. In-flight PRs will need
    to be rebased after this lands.
  • [codex] Support npm marketplace plugin sources (#29375)
    ## Why
    
    Marketplace source deserialization treated `{"source":"npm", ...}` as
    unsupported. The loader logged and skipped the entry, so npm-backed
    plugins never appeared in `plugin list --available` and `plugin add`
    returned "plugin not found".
    
    Codex plugins are installed from a plugin root, not from an npm
    dependency tree. For npm-backed marketplace entries, Codex should fetch
    the published package contents without running package scripts or
    installing unrelated dependencies.
    
    ## What changed
    
    - Add `npm` marketplace plugin sources with `package`, optional semver
    `version` or version range, and optional HTTPS `registry`.
    - Reject unsafe npm source fields before materialization, including
    invalid package names, non-semver version selectors, plaintext or
    credential-bearing registry URLs, and registry query/fragment data.
    - Materialize npm plugins with `npm pack --ignore-scripts`, then unpack
    the resulting tarball through the existing hardened plugin bundle
    extractor.
    - Enforce npm archive and extracted-size limits, require the standard
    npm `package/` archive root, and verify the extracted `package.json`
    name matches the requested package before installing.
    - Keep plugin listings, install-source descriptions, CLI JSON/human
    output, app-server v2 `PluginSource`, TUI source summaries, regenerated
    schema fixtures, and app-server documentation in sync.
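    
    The kind of pre-materialization checks listed above might look like the
    following sketch. It is deliberately simplified and hypothetical: the
    function names are invented, the URL is inspected with plain string
    operations rather than a URL parser, and the package-name rule covers only
    a subset of npm's naming constraints.
    
    ```rust
    /// Hypothetical registry check: HTTPS only, no credentials, no query or fragment.
    fn validate_registry(url: &str) -> Result<(), &'static str> {
        let rest = url.strip_prefix("https://").ok_or("registry must use https")?;
        if rest.contains('?') || rest.contains('#') {
            return Err("registry must not carry query or fragment data");
        }
        let authority = rest.split('/').next().unwrap_or("");
        if authority.is_empty() {
            return Err("registry must name a host");
        }
        if authority.contains('@') {
            return Err("registry must not embed credentials");
        }
        Ok(())
    }
    
    /// Hypothetical, simplified package-name check: `name` or `@scope/name`,
    /// lowercase URL-safe characters, no leading `.` or `_`, at most 214 bytes.
    fn validate_package(name: &str) -> Result<(), &'static str> {
        let ok_part = |part: &str| {
            !part.is_empty()
                && !part.starts_with('.')
                && !part.starts_with('_')
                && part
                    .chars()
                    .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '-' | '_' | '.'))
        };
        let valid = match name.strip_prefix('@') {
            Some(scoped) => match scoped.split_once('/') {
                Some((scope, package)) => ok_part(scope) && ok_part(package),
                None => false,
            },
            None => ok_part(name),
        };
        if valid && name.len() <= 214 {
            Ok(())
        } else {
            Err("invalid npm package name")
        }
    }
    ```
    
    Rejecting these inputs before invoking `npm pack` keeps path-like names and
    credential-bearing URLs out of the command line entirely.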
    
    ## Impact
    
    Marketplaces can distribute Codex plugins from public or configured
    private HTTPS npm registries using the same install flow as existing
    materialized plugin sources. `npm` must be available on `PATH` when an
    npm-backed plugin is installed.
    
    Fixes #27831
    
    ## Validation
    
    - `just write-app-server-schema`
    - `just test -p codex-core-plugins -p codex-app-server-protocol -p
    codex-app-server -p codex-cli`
      - npm/schema/core-plugin coverage passed in the run.
    - The full focused command finished with `1739 passed`, `11 failed`, and
    `6 timed out`; the failures were unrelated local app-server environment
    failures from `sandbox-exec: sandbox_apply: Operation not permitted`
    plus one missing `test_stdio_server` helper binary.
    - Installed an npm-published Codex plugin package through a throwaway
    local marketplace and throwaway `CODEX_HOME` to exercise the real npm
    materialization path end to end.
  • [codex] Classify nested MCP authentication startup errors (#30257)
    ## Summary
    
    - classify authentication-required RMCP startup failures, including
    errors nested inside `ClientInitializeError::TransportError`
    - let `codex-mcp` consume that classification so the existing
    `reauthenticationRequired` startup failure reason is emitted
    - add a regression test that performs real startup with an expired
    persisted OAuth token and no refresh token
    
    ## Why
    
    Follow-up to #29877.
    
    RMCP stores streamable HTTP initialization failures inside a dynamic
    transport error whose payload is not exposed through the standard Rust
    error source chain. The original `anyhow::Error::chain()` check
    therefore missed the nested `AuthError::AuthorizationRequired` seen
    during real MCP startup and emitted `failureReason: null`.
    
    The transport-specific inspection now lives in `codex-rmcp-client`,
    while `codex-mcp` consumes only the domain-level authentication-required
    result. This classifier does not distinguish first-time login from
    reauthentication; the existing auth-state logic remains responsible for
    that distinction.
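    
    The failure mode is easy to reproduce in miniature. In the sketch below,
    `TransportError` is a hypothetical stand-in that stores its payload as
    `dyn Any` and does not implement `source()`, so a plain source-chain walk
    cannot see the nested authentication error; a transport-aware classifier
    that inspects the payload can. None of these types are the real RMCP ones.
    
    ```rust
    use std::any::Any;
    use std::error::Error;
    use std::fmt;
    
    /// Stand-in for the nested authentication error.
    #[derive(Debug)]
    struct AuthorizationRequired;
    
    impl fmt::Display for AuthorizationRequired {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "authorization required")
        }
    }
    
    impl Error for AuthorizationRequired {}
    
    /// Hypothetical transport error whose payload is hidden from `source()`.
    #[derive(Debug)]
    struct TransportError {
        payload: Box<dyn Any + Send + Sync>,
    }
    
    impl fmt::Display for TransportError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(f, "transport error")
        }
    }
    
    // No `source()` override: the default returns `None`.
    impl Error for TransportError {}
    
    /// Walks only the standard source chain, as a chain-based check would.
    fn chain_has_auth_error(err: &(dyn Error + 'static)) -> bool {
        let mut current = Some(err);
        while let Some(e) = current {
            if e.is::<AuthorizationRequired>() {
                return true;
            }
            current = e.source();
        }
        false
    }
    
    /// Transport-aware classifier: also looks inside the dynamic payload.
    fn is_auth_required(err: &(dyn Error + 'static)) -> bool {
        let mut current = Some(err);
        while let Some(e) = current {
            if e.is::<AuthorizationRequired>() {
                return true;
            }
            if let Some(transport) = e.downcast_ref::<TransportError>() {
                if transport.payload.is::<AuthorizationRequired>() {
                    return true;
                }
            }
            current = e.source();
        }
        false
    }
    ```
    
    Keeping the payload inspection next to the transport type mirrors the split
    described above: callers only ever ask the domain-level question.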
    
    ## User impact
    
    When stored MCP OAuth credentials are expired and cannot be refreshed,
    app clients now receive `failureReason: "reauthenticationRequired"` on
    the failed startup update and can show the reconnect action. First-time
    login and unrelated startup failures remain unchanged.
    
    ## Validation
    
    - `just test -p codex-rmcp-client --test streamable_http_oauth_startup
    identifies_expired_unrefreshable_token_startup_error`
    - `just test -p codex-mcp
    startup_outcome_error_identifies_authentication_required`
    - `just test -p codex-mcp
    mcp_startup_failure_reason_requires_existing_oauth_and_auth_failure`
    - `cargo build -p codex-cli --bin codex`
    - local app-server probe emitted `failureReason:
    "reauthenticationRequired"`
    - manual end-to-end reconnect flow confirmed
    - `just fmt`
  • Close thread persistence when submission channel closes (#30173)
    ### Summary
    
    Release live thread persistence when a session ends because its
    submission channel closes. This prevents a later same-process resume
    from failing with `thread ... already has a live local writer`.
    
    ### Details
    
    The issue is in the `codex-core` session teardown path used by Codex
    hosts, rather than in Managed Agents API or exec-server itself.
    
    Explicit shutdown already closes the `LiveThread`, which releases the
    process-scoped writer held by `LocalThreadStore`. The
    submission-channel-close fallback ran runtime and extension teardown but
    skipped that persistence shutdown, leaving the thread ID registered as
    having a live writer.
    
    This change:
    
    - closes the `LiveThread` on the channel-close fallback path;
    - preserves the existing teardown order used by explicit shutdowns;
    - extends the lifecycle regression test to assert that the thread store
    receives `shutdown_thread`.
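
    The invariant can be sketched with a toy writer registry. Everything
    below is an illustrative assumption (`WriterRegistry`, `SessionEnd`, and
    `end_session` are made-up names); the real `LocalThreadStore` and
    teardown code are more involved.

    ```rust
    use std::collections::HashSet;

    // Hypothetical model of a process-scoped live-writer registry.
    #[derive(Default)]
    struct WriterRegistry {
        live: HashSet<String>,
    }

    impl WriterRegistry {
        fn open(&mut self, thread_id: &str) -> Result<(), String> {
            if !self.live.insert(thread_id.to_string()) {
                return Err(format!("thread {thread_id} already has a live local writer"));
            }
            Ok(())
        }

        fn shutdown_thread(&mut self, thread_id: &str) {
            self.live.remove(thread_id);
        }
    }

    enum SessionEnd {
        ExplicitShutdown,
        ChannelClosed,
    }

    fn end_session(registry: &mut WriterRegistry, thread_id: &str, end: SessionEnd) {
        match end {
            SessionEnd::ExplicitShutdown | SessionEnd::ChannelClosed => {
                // Runtime and extension teardown would run here. The point of
                // the fix is that both arms reach persistence shutdown.
                registry.shutdown_thread(thread_id);
            }
        }
    }

    fn main() {
        let mut registry = WriterRegistry::default();
        registry.open("t1").unwrap();
        end_session(&mut registry, "t1", SessionEnd::ChannelClosed);
        // A same-process resume can now reopen the writer.
        println!("resume ok: {}", registry.open("t1").is_ok());
        end_session(&mut registry, "t1", SessionEnd::ExplicitShutdown);
    }
    ```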
    
    Context: [original
    report](https://openai.slack.com/archives/C0B4NBHQGTV/p1782136364948039),
    [recent occurrence
    1](https://openai.slack.com/archives/C0B4NBHQGTV/p1782434817895839?thread_ts=1782136364.948039&cid=C0B4NBHQGTV),
    [recent occurrence
    2](https://openai.slack.com/archives/C0B4NBHQGTV/p1782335107474429?thread_ts=1782136364.948039&cid=C0B4NBHQGTV)
    
    ### Testing
    
    - `just test -p codex-core
    submission_loop_channel_close_runs_full_thread_teardown`
    - `just test -p codex-core --lib` (1,989 passed; 3 skipped)
    - `just fix -p codex-core`
    - `just fmt`
    - Native code review: no findings
    
    I also attempted `just test -p codex-core`. The new regression passed;
    79 unrelated integration tests failed in the local harness, primarily
    because helper binaries such as `test_stdio_server` were unavailable,
    plus local proxy/shell timing failures.
  • feat: add GPT-5.6 variants to Bedrock catalog (#30285)
    ## Summary
    
    - add Sol (`openai.gpt-5.6-sol`), Terra (`openai.gpt-5.6-terra`), and
    Luna (`openai.gpt-5.6-luna`) to the Amazon Bedrock static model catalog
    - derive all three entries from the bundled GPT-5.5 metadata and add the
    Bedrock-only `max` reasoning effort
    - keep the new entries below the current GPT-5.5 and GPT-5.4 models at
    priorities 2, 3, and 4, preserving GPT-5.5 as the default
    - add deep-equality coverage for inherited model configuration, catalog
    ordering, context windows, and service-tier behavior
  • Let Codex consult user-level code-review-* skills. (#30143)
    ## Why
    
    I use the `$code-review` skill a lot and it'd be nice to add my own
    additional review criteria in `$CODEX_HOME/skills/code-review-*`.
    
    ## What
    
    Removes phrasing about "code-review-* skills in this repository" which
    in practice seems like enough to get Codex to consult my user-level code
    review skills in addition to the repo-level ones.
  • feat(app-server): add optional turn_id to thread/fork (#30277)
    ## Description
    
    This adds stable optional `turnId` support to `thread/fork`. When
    supplied, the fork copies persisted history through that terminal turn,
    inclusive, and drops later turns from the new thread.
    
    Omitting or passing `null` preserves the existing full-history fork
    behavior, including the interruption marker when the stored source
    history ends mid-turn.
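
    A rough sketch of the inclusive truncation, assuming a flat list of
    turns. `Turn` and `fork_history` are hypothetical names, the error for
    an unknown ID is an assumption, and the sketch does not model the
    terminal-turn requirement or the interruption marker.

    ```rust
    #[derive(Clone, Debug, PartialEq)]
    struct Turn {
        id: String,
    }

    // Returns the turns copied into the fork. `None` keeps the full history.
    fn fork_history(turns: &[Turn], turn_id: Option<&str>) -> Result<Vec<Turn>, String> {
        match turn_id {
            None => Ok(turns.to_vec()),
            Some(id) => {
                let idx = turns
                    .iter()
                    .position(|t| t.id == id)
                    .ok_or_else(|| format!("unknown turn id: {id}"))?;
                // Inclusive: keep everything through the requested turn.
                Ok(turns[..=idx].to_vec())
            }
        }
    }

    fn main() {
        let turns: Vec<Turn> = ["t1", "t2", "t3"]
            .iter()
            .map(|id| Turn { id: id.to_string() })
            .collect();
        let fork = fork_history(&turns, Some("t2")).unwrap();
        println!("{:?}", fork);
    }
    ```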
    
    ## Why
    
    We're deprecating `thread/rollback` and this will help certain UX use
    cases work around it by using `thread/fork` + `turn_id` instead.
  • ensure thread.history_mode is immutable (#30261)
    ## Description
    
    This PR makes `thread.history_mode` immutable after the thread's
    canonical first `SessionMeta` has been written. Later same-thread
    `SessionMeta` lines are compatibility metadata writes, not a new thread
    definition.
    
    Without this, an older binary could append a `SessionMeta` that omits
    `history_mode`; when a newer binary replays it, serde defaults that
    missing field to `legacy` and SQLite could downgrade a paginated thread.
    
    ## Why
    
    `history_mode` is the persisted thread storage contract.
    Paginated-thread fail-closed behavior and SQLite memory filtering depend
    on it staying aligned with canonical rollout metadata, especially when
    multiple Codex binary versions can touch the same local rollout.
    
    ## What changed
    
    - Stop generic rollout metadata replay from overwriting `history_mode`
    from later `SessionMeta` items.
    - Remove `history_mode` from `ThreadMetadataPatch`, so mutable metadata
    sync and app-server metadata updates cannot rewrite it.
    - When local metadata sync has to recreate a missing SQLite row, recover
    `history_mode` from the rollout's canonical first `SessionMeta` instead
    of from a mutable patch.
    - Keep the in-memory thread store using the created thread's canonical
    `history_mode` instead of metadata patches.
    - Fill the one remaining core test `CreateThreadParams` initializer with
    the new `history_mode` field; Bazel CI caught this after the parent
    history-mode PR landed.
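
    The replay rule can be sketched as "first `SessionMeta` wins". The types
    below are trimmed-down, hypothetical versions of the real ones; `None`
    models an older binary that omitted the field.

    ```rust
    #[derive(Clone, Copy, Debug, Default, PartialEq)]
    enum HistoryMode {
        #[default]
        Legacy,
        Paginated,
    }

    // Hypothetical, minimal `SessionMeta`.
    struct SessionMeta {
        history_mode: Option<HistoryMode>,
    }

    // Only the canonical first `SessionMeta` defines the mode; later
    // same-thread lines are ignored for this field.
    fn replayed_history_mode(metas: &[SessionMeta]) -> HistoryMode {
        metas
            .first()
            .and_then(|meta| meta.history_mode)
            .unwrap_or_default()
    }

    fn main() {
        let rollout = [
            SessionMeta { history_mode: Some(HistoryMode::Paginated) },
            SessionMeta { history_mode: None }, // appended by an older binary
        ];
        println!("{:?}", replayed_history_mode(&rollout));
    }
    ```

    A "last write wins" replay over the same input would fall back to
    `Legacy`, which is the downgrade this change prevents.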
    
    ## Validation
    
    - `just fmt`
    - `just test -p codex-thread-store`
    - `just test -p codex-state
    session_meta_does_not_set_model_or_reasoning_effort`
  • [codex] Use managed defaults for TUI threads (#30147)
    ## Why
    
    #29683 exposes managed defaults for new-thread model settings through
    `configRequirements/read` without applying them server-wide. The TUI is
    an app-server client, so it should explicitly consume those defaults
    when it creates a fresh thread.
    
    This lets plain `codex` start on the managed model while preserving the
    existing ability to change model settings within the thread.
    
    ## What changed
    
    - Read `requirements.models.newThread` during TUI app-server bootstrap.
    - Apply the managed model, reasoning effort, and service tier to the
    initial fresh thread and subsequent `/new` or `/clear` threads.
    - Keep explicit launch overrides above the managed defaults.
    - Normalize the managed `fast` service tier to the `priority` request
    value.
    - Leave resumed and forked threads unchanged.
    
    The application logic lives in a small TUI-only module; app-server
    `thread/start` behavior remains unchanged for other clients.
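
    The precedence described above could look roughly like this.
    `ModelSettings`, `resolve_new_thread`, and the model names are
    hypothetical and do not reflect the actual TUI module.

    ```rust
    #[derive(Clone, Debug, Default, PartialEq)]
    struct ModelSettings {
        model: Option<String>,
        reasoning_effort: Option<String>,
        service_tier: Option<String>,
    }

    // The managed `fast` tier maps to the `priority` request value.
    fn normalize_managed_tier(tier: &str) -> String {
        if tier == "fast" {
            "priority".to_string()
        } else {
            tier.to_string()
        }
    }

    // Explicit launch overrides win; managed defaults fill the gaps.
    fn resolve_new_thread(launch: &ModelSettings, managed: &ModelSettings) -> ModelSettings {
        ModelSettings {
            model: launch.model.clone().or_else(|| managed.model.clone()),
            reasoning_effort: launch
                .reasoning_effort
                .clone()
                .or_else(|| managed.reasoning_effort.clone()),
            service_tier: launch
                .service_tier
                .clone()
                .or_else(|| managed.service_tier.as_deref().map(normalize_managed_tier)),
        }
    }

    fn main() {
        let managed = ModelSettings {
            model: Some("managed-model".to_string()),
            reasoning_effort: Some("high".to_string()),
            service_tier: Some("fast".to_string()),
        };
        let launch = ModelSettings {
            model: Some("cli-model".to_string()),
            ..ModelSettings::default()
        };
        println!("{:?}", resolve_new_thread(&launch, &managed));
    }
    ```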
    
    ## User experience
    
    - Plain `codex` starts with the managed new-thread settings.
    - A user can still change settings with `/model` or the existing
    service-tier controls.
    - Starting another fresh thread reapplies the managed defaults.
    - Explicit launch choices such as `codex -m <model>` continue to win.
    
    ## Validation
    
    - `just test -p codex-tui managed_new_thread_defaults`
    - `just fix -p codex-tui`
    
    Depends on #29683.
  • [codex] allow AGENTS.md and skills to authorize delegation (#30274)
    Updates the MAv2 prompt to include AGENTS.md and skills more explicitly.
    
    Should mimic: https://github.com/openai/codex/pull/27919
  • Overlap executor skill reads with namespace discovery (#30225)
    ## Why
    
    Environment skill discovery needs two independent pieces of information:
    
    - plugin namespaces from `plugin.json` files; and
    - skill metadata from each `SKILL.md` file.
    
    Today these happen in sequence. Codex waits for every plugin namespace
    lookup to finish before it starts reading any skill files. On a remote
    executor, that creates an avoidable network-latency barrier.
    
    ```text
    before: walk -> namespace lookups -> skill reads -> build catalog
    after:  walk -> namespace lookups ─┐
                 -> skill reads ───────┴-> build catalog
    ```
    
    ## What changes
    
    - Read and parse skill files without waiting for plugin namespace
    discovery.
    - Resolve root and nested plugin namespaces concurrently.
    - Join both results only when constructing the final qualified skill
    names.
    - Keep the existing 64-skill concurrency bound, output ordering,
    warnings, metadata behavior, and namespace rules.
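
    A toy version of the overlapped pipeline, using scoped threads instead of
    the real async executor calls. The namespace lookup deliberately waits
    until a skill read has started, so a serialized "namespaces first"
    pipeline would hit the timeout. The names, the `:` separator, and the
    2-second timeout are illustrative assumptions.

    ```rust
    use std::sync::mpsc;
    use std::thread;
    use std::time::Duration;

    fn build_catalog() -> Result<Vec<String>, String> {
        // Signals that a skill read has started.
        let (started_tx, started_rx) = mpsc::channel::<()>();

        let (namespace, skills) = thread::scope(|s| {
            // Namespace lookup: blocks until a skill read is underway.
            let namespaces = s.spawn(move || -> Result<String, String> {
                started_rx
                    .recv_timeout(Duration::from_secs(2))
                    .map_err(|_| "skill read never started".to_string())?;
                Ok("my-plugin".to_string())
            });
            // Skill reads start without waiting for namespace discovery.
            let skills = s.spawn(move || {
                let _ = started_tx.send(());
                vec!["deploy".to_string(), "review".to_string()]
            });
            (
                namespaces.join().expect("namespace thread panicked"),
                skills.join().expect("skill thread panicked"),
            )
        });

        // Join both results only when building the qualified names.
        let namespace = namespace?;
        Ok(skills
            .into_iter()
            .map(|skill| format!("{namespace}:{skill}"))
            .collect())
    }

    fn main() {
        println!("{:?}", build_catalog());
    }
    ```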
    
    ## Testing
    
    The regression test makes plugin manifest lookup wait until a `SKILL.md`
    read has started. The old serialized pipeline would time out; the new
    pipeline completes and still returns the correctly namespaced skill.
    
    `just test -p codex-core-skills` passes all 111 tests.
    
    ## Out of scope
    
    This does not add an exec-server endpoint, batch filesystem calls, or
    reduce the number of files transferred. A frontmatter-only read or
    server-side skill catalog can remain a separate follow-up if benchmarks
    show that transferred bytes are the next bottleneck.
  • [codex] Add managed new-thread model settings (#29683)
    ## Why
    
    Admins need persistent defaults for the model, reasoning effort, and
    service tier shown when the Desktop App creates a new thread. These are
    initialization defaults rather than runtime constraints: the App should
    use them to initialize its draft while still allowing a user to make an
    explicit selection.
    
    The app-server therefore needs to expose the managed values before
    thread creation without changing `thread/start` behavior for other
    clients.
    
    ## What changed
    
    - Parse `model`, `model_reasoning_effort`, and `service_tier` from
    `[models.new_thread]` in `requirements.toml`.
    - Compose the `models` requirements through the existing
    requirements-layer precedence rules.
    - Expose the resolved values through `configRequirements/read` as
    `requirements.models.newThread`.
    - Add the corresponding app-server protocol types and regenerate the
    JSON and TypeScript schema fixtures.
    - Document the new `configRequirements/read` fields in the app-server
    README.
    
    ## Scope
    
    This PR is data plumbing only. It does not apply these values during
    `thread/start` and does not change thread creation for existing
    app-server clients, resumed or forked sessions, internal or subagent
    sessions, `codex exec`, or the TUI. A companion Desktop App change owns
    draft initialization, sends the effective settings for ordinary and
    prewarmed starts, and preserves explicit user changes.
    
    ## Validation
    
    - Requirements deserialization coverage for `[models.new_thread]`
    - Requirements-layer precedence coverage
    - App-server API mapping coverage
    - `configRequirements/read` integration coverage
    - Regenerated app-server JSON and TypeScript schema fixtures
  • fix main (#30276)
    Fixes breakage on `main` introduced by a merge race around
    `thread.history_mode`.
  • feat(app-server): add history_mode to thread (#29927)
    ## Description
    
    This PR adds a new `historyMode` field (`"legacy" | "paginated"`) to `Thread`.
    This will be stored in `SessionMeta` in the JSONL rollout file and as a
    new column in the SQLite thread_metadata table, and exposed on
    `thread/start` and on the `Thread` object in app-server.
    
    ## What changed
    
    - Added canonical `ThreadHistoryMode` with `legacy` and `paginated`,
    defaulting old and new SessionMeta to `legacy`.
    - Carried `history_mode` through core session config, ThreadStore stored
    metadata, local/in-memory stores, rollout metadata extraction, and the
    existing SQLite `threads` table.
    - Added experimental `historyMode` to app-server v2 `Thread` and
    `thread/start`.
    - Made paginated stored threads metadata-discoverable but unsupported
    for legacy full-history reads, `load_history`, live resume, and create
    paths.
    - Regenerated app-server schema fixtures and added
    protocol/state/thread-store/app-server coverage for persistence and
    fail-closed behavior.
    
    ## Compatibility floor

    Because users may be running various versions of Codex binaries on the
    same machine (TUI, Codex App, etc.), we will need to establish a
    compatibility floor for upcoming paginated threads, which will change
    how thread storage reads and writes work.
    
    The overall plan here:

    ```
    Release N:
    - Add historyMode to SessionMeta / Thread / SQLite metadata.
    - Teach binaries to understand paginated threads.
    - If a binary sees `historyMode="paginated"` but does not support the paginated contract, it refuses to resume/mutate the thread.
    - Default remains `"legacy"`.
    
    Release N+1:
    - First-party clients start opting into paginated threads where appropriate.
    - Internal dogfood / staged rollout.
    - Measure old-client usage and paginated-thread unsupported errors.
    
    Release N+2:
    - Only after Release N+ is overwhelmingly deployed, make paginated the default.
    - Accept that a small tail of N-1-or-older binaries may not understand paginated threads.
    ```
    
    The important behavior change is fail-closed handling for a binary that
    encounters a persisted `paginated` thread before it knows how to fully
    support paginated history. In app-server, if a thread is `paginated`, we
    will:
    
    - allow metadata-only discovery paths like `thread/list` and
    `thread/read(includeTurns=false)`, so clients can still see the thread
    and inspect its `historyMode`
    - reject legacy full-history/live-thread paths like
    `thread/read(includeTurns=true)` and `thread/resume` with an unsupported
    JSON-RPC error
    - avoid silently treating an unknown or future `historyMode` as `legacy`
    
    Under the hood, the ThreadStore layer also rejects legacy operations
    that would need to load or replay the full thread history for a
    paginated thread. That gives us the behavior we want for Release N:
    future paginated threads are visible, but this binary fails closed
    instead of trying to operate on them as if they were legacy threads.
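
    A minimal sketch of the fail-closed gate, under stated assumptions: the
    `Operation` enum and `check_operation` are hypothetical, and allowing
    metadata-only discovery for an unknown mode is a guess rather than
    confirmed behavior.

    ```rust
    #[derive(Debug, PartialEq)]
    enum HistoryMode {
        Legacy,
        Paginated,
        // Kept distinct so a future value is never treated as `Legacy`.
        Unknown(String),
    }

    fn parse_history_mode(raw: &str) -> HistoryMode {
        match raw {
            "legacy" => HistoryMode::Legacy,
            "paginated" => HistoryMode::Paginated,
            other => HistoryMode::Unknown(other.to_string()),
        }
    }

    #[derive(Clone, Copy, Debug)]
    enum Operation {
        List,
        ReadMetadata,
        ReadWithTurns,
        Resume,
    }

    fn check_operation(mode: &HistoryMode, op: Operation) -> Result<(), String> {
        let metadata_only = matches!(op, Operation::List | Operation::ReadMetadata);
        match mode {
            HistoryMode::Legacy => Ok(()),
            _ if metadata_only => Ok(()),
            other => Err(format!("{op:?} is unsupported for history mode {other:?}")),
        }
    }

    fn main() {
        let mode = parse_history_mode("paginated");
        for op in [
            Operation::List,
            Operation::ReadMetadata,
            Operation::ReadWithTurns,
            Operation::Resume,
        ] {
            println!("{op:?}: {:?}", check_operation(&mode, op));
        }
    }
    ```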
  • Relax hooks.json top-level metadata validation (#30229)
    ## Summary
    - Allow a top-level `description` string in `hooks.json`.
    - Continue rejecting unknown top-level keys and root-level hook events;
    events must remain under `hooks`.
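
    Roughly, the top-level check behaves like an allowlist. This sketch only
    looks at key names (not that `description` is a string), and both the
    exact allowed set and the `PreToolUse` event name are assumptions used
    for illustration.

    ```rust
    // Assumed allowlist for illustration; the real schema may permit more keys.
    const ALLOWED_TOP_LEVEL_KEYS: [&str; 2] = ["hooks", "description"];

    fn validate_top_level_keys(keys: &[&str]) -> Result<(), String> {
        for key in keys {
            if !ALLOWED_TOP_LEVEL_KEYS.contains(key) {
                return Err(format!("unknown top-level key in hooks.json: {key}"));
            }
        }
        Ok(())
    }

    fn main() {
        println!("{:?}", validate_top_level_keys(&["description", "hooks"]));
        // A hook event at the root (hypothetical name) is still rejected.
        println!("{:?}", validate_top_level_keys(&["PreToolUse"]));
    }
    ```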
    
    ## Testing
    - `just test -p codex-config`
  • [codex] narrow unused skills intro export (#29991)
    ## Summary
    
    - stop publicly re-exporting the internally used
    `SKILLS_INTRO_WITH_ALIASES` constant
    - keep the constant and all skills rendering behavior unchanged
    - preserve every integration helper, API, fixture, assertion, and module
    used by tests
    
    ## Scope guardrails
    
    This revision keeps all remote/network-facing functionality and every
    line introduced by `jif <jif@openai.com>`.
    
    Following the test-preservation audit, it also restores the in-process
    RMCP test transport, the original `codex-mcp` fixture,
    `PluginLoadOutcome::effective_skill_roots` and its assertions, the
    `EffectiveSkillRoots` API family, the test-only apps renderer, and the
    TUI dead-code annotation. Those files now match the PR base exactly.
    
    No test imports or directly references the remaining public skills
    export being narrowed.
    
    ## Validation
    
    - repository-wide test-reference audit: no test-used code remains
    deleted or narrowed
    - deleted-line `git blame` audit: zero Jif-authored deletions
    - `cargo test -p codex-core-plugins -p codex-mcp -p codex-rmcp-client
    --lib`: 467 passed
    - `cargo test -p codex-core --lib apps::render`: 2 passed
    - `cargo test -p codex-core-skills --lib render::tests`: 19 passed
    - `cargo check -p codex-core-skills --all-targets`: passed
    - `just fix -p codex-core-skills`: passed
    - `just fmt`: passed
    - `git diff --check`: passed
    
    The full local `codex-core-skills` suite passed 106/108 tests; two
    loader tests detected an ambient repository skills root outside the
    package and failed their isolation assertions. The scoped renderer suite
    and all-target compile pass, and CI runs in an isolated environment.
    
    Final code delta: 1 insertion, 2 deletions across 2 files.
  • Test selected capabilities across unavailable resume (#30215)
    ## Why
    
    The selected-capability integration test already covers initial
    attachment and cold resume, but it resumes while the selected executor
    is still reachable.
    
    That leaves an important World State transition untested: a thread
    remembers its selected capability root, resumes while that environment
    is unavailable, and later sees the same stable environment return.
    
    ## What this tests
    
    This extends the existing end-to-end scenario:
    
    ```text
    selected executor available
            ↓
    app-server stops and the executor goes away
            ↓
    thread resumes with the executor unavailable
            ↓
    skills, selected MCP tools, and connector attribution are absent
            ↓
    the same environment ID is attached again
            ↓
    skills, MCP tools, and connector attribution return
    ```
    
    The test also checks that the unavailable snapshot explicitly tells the
    model that no selected-environment skills are currently available. After
    reattachment, it invokes the selected skill again and verifies that a
    new executor-owned MCP process starts.
    
    ## Scope
    
    This is test-only. It keeps the existing assumption that an environment
    ID refers to stable capability contents. It does not add package-file
    invalidation or live transport reconnect behavior.
  • Reuse MCP runtimes when selected availability changes nothing (#30148)
    ## Why
    
    MCP runtime reuse was keyed by every ready selected-capability
    environment, even when an environment contributed no MCP servers or
    connectors.
    
    For example:
    
    1. a global stdio MCP is running;
    2. a selected remote environment contains only a skill;
    3. that environment becomes ready;
    4. the MCP and connector projection stays exactly the same;
    5. Codex nevertheless rebuilds the MCP manager and restarts the global
    stdio process.
    
    That restart can interrupt active calls and discard process-local state
    even though nothing about MCP changed.
    
    ## What changes
    
    When selected-environment availability changes, Codex now resolves the
    candidate MCP and connector projection before deciding whether to
    replace the runtime:
    
    - if the winning MCP servers or their ownership change, rebuild as
    before;
    - if the selected connector snapshot changes, rebuild as before;
    - if an enabled MCP is explicitly bound to an environment whose
    availability changed, rebuild as before;
    - otherwise, keep the exact live manager and processes, and update only
    the availability input remembered by the snapshot.
    
    ```text
    ready selected environments:  [] -> [skills-env]
    resolved MCP servers:          {global_probe} -> {global_probe}
    resolved connectors:           {} -> {}
    result:                         reuse manager; keep the same process
    ```
    
    The comparison uses the resolved winning servers and their sources, so
    plugin/config ownership remains part of the runtime identity.
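
    The decision can be sketched as a comparison of resolved projections.
    `Projection` and `must_rebuild` are hypothetical names, and the sketch
    treats every entry in `bound_env` as an enabled server.

    ```rust
    use std::collections::{BTreeMap, BTreeSet};

    #[derive(Clone, Debug, Default, PartialEq)]
    struct Projection {
        // server name -> source that won resolution (config, a plugin, ...)
        servers: BTreeMap<String, String>,
        connectors: BTreeSet<String>,
        // enabled server name -> environment ID it is explicitly bound to
        bound_env: BTreeMap<String, String>,
    }

    fn must_rebuild(old: &Projection, new: &Projection, changed_envs: &BTreeSet<String>) -> bool {
        old.servers != new.servers
            || old.connectors != new.connectors
            || new.bound_env.values().any(|env| changed_envs.contains(env))
    }

    fn main() {
        let mut before = Projection::default();
        before.servers.insert("global_probe".into(), "config".into());
        let after = before.clone();
        let changed = BTreeSet::from(["skills-env".to_string()]);
        // Nothing about MCP changed, so the live manager can be reused.
        println!("rebuild: {}", must_rebuild(&before, &after, &changed));
    }
    ```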
    
    ## Existing stack coverage
    
    The integration PR directly below this one already covers both rebuild
    boundaries: a selected MCP becomes callable and a selected connector
    tool becomes model-visible when their environment becomes available. It
    also verifies that an unchanged selected MCP runtime keeps its process.
    
    This PR does not add another remote-attachment integration scenario for
    the no-change optimization. `environment/add` returns before readiness,
    and app-server does not currently expose a deterministic readiness
    signal for an environment that contributes only skills. Keeping a
    fixed-delay test would add flake risk; adding a new readiness API would
    be outside this fix.
    
    ## Scope and assumptions
    
    - This does not change skill discovery, World State rendering, or plugin
    metadata caching.
    - This does not add file watching or hot reload behavior.
    - This does not change disconnect/reconnect handling.
    - Selected environment IDs and their capability contents retain the
    stack's existing stability assumption.
    - Delayed `required = true` executor MCP behavior remains out of scope.
  • [codex] fix CreateThreadParams test initializer (#30198)
    ## Summary
    
    - initialize `selected_capability_roots` in the new
    `attach_in_memory_thread_store` test helper
    - restore `codex-core` test compilation on `main`
    
    ## Root cause
    
    [#30144](https://github.com/openai/codex/pull/30144) added the helper
    from commit `0c3d0742`, whose parent was `c38b2e9b`. That branch was
    based before [#29856](https://github.com/openai/codex/pull/29856) added
    `selected_capability_roots` as a required field on `CreateThreadParams`.
    
    The PR's Rust and Bazel workflows both passed against the stale branch
    head `0c3d0742`. When #30144 was squashed onto newer `main`, its
    initializer was integrated alongside the required field from #29856,
    producing `E0063` in `core/src/session/tests.rs`. Because those
    workflows tested the branch head rather than the integrated merge
    result, they did not see the version-skew failure before merge.
    
    ## Impact
    
    Any job that compiles the `codex-core` library tests fails, which turned
    the main-branch `rust-ci-full` and `Bazel` workflows red across
    platforms and blocks unrelated focused core tests. This change only
    completes the test initializer; it does not alter production behavior or
    workflow configuration.
    
    ## Validation
    
    - `just fmt`
    - `just test -p codex-core
    turn_complete_flushes_terminal_event_after_delivery` (1 passed, 2909
    skipped)
    - `git diff --check`
  • [codex] wire process-owned code mode host into core (#30142)
    ## Summary
    
    - add the `code_mode_host` feature flag and select
    `ProcessOwnedCodeModeSessionProvider` in `CodeModeService` when enabled
    - initialize code-mode sessions lazily so a missing host reports a tool
    error without failing thread startup
    - resolve `codex-code-mode-host` beside the running Codex binary by
    default while preserving `CODEX_CODE_MODE_HOST_PATH` as an override
    - add unit and end-to-end coverage for host resolution and graceful
    missing-host behavior
    
    ## Why
    
    This wires the process-owned session client from #30112 into the core
    service behind an opt-in rollout gate. Packaged Codex installations can
    place the helper in the same `bin` directory as the main executable
    without relying on `PATH`, while development and custom installations
    can continue to override the helper path.
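
    A sketch of the lookup order, assuming the override is a plain path
    string and ignoring platform details such as a Windows `.exe` suffix.
    `resolve_host_path` is a hypothetical helper, not the function added by
    this PR.

    ```rust
    use std::path::{Path, PathBuf};

    const HOST_BINARY: &str = "codex-code-mode-host";

    // The override wins; otherwise look beside the running binary.
    fn resolve_host_path(override_path: Option<&str>, current_exe: &Path) -> Option<PathBuf> {
        if let Some(path) = override_path.filter(|p| !p.is_empty()) {
            return Some(PathBuf::from(path));
        }
        current_exe.parent().map(|dir| dir.join(HOST_BINARY))
    }

    fn main() {
        let override_path = std::env::var("CODEX_CODE_MODE_HOST_PATH").ok();
        if let Ok(exe) = std::env::current_exe() {
            println!("{:?}", resolve_host_path(override_path.as_deref(), &exe));
        }
    }
    ```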
    
    ## Stack
    
    - Depends on #30112
    - Base branch: `cconger/process-owned-session-runtime-4-client`
    
    ## Validation
    
    - Build `codex` and `codex-code-mode-host`.
    - Run
    `CODEX_CODE_MODE_HOST_PATH="$PWD/target/debug/codex-code-mode-host"
    ./target/debug/codex --enable code_mode_host`.
  • [codex] add process-owned code-mode session client (#30112)
    ## Summary
    
    - add `ProcessOwnedCodeModeSessionProvider` and logical session
    generation/rebinding state
    - add the supervised child-process connection, reader/writer tasks, and
    driver state machine
    - make dropped execute/wait/open callers cancellation-safe with explicit
    ownership handoff and durable cleanup
    - validate cell/delegate lifecycle state and reject invalid protocol
    transitions
    - add end-to-end stdio coverage for delegates, cancellation, frame
    limits, child loss, stale generations, replacement, and long-lived
    sessions
    
    ## Why
    
    This final stage exposes the process-owned client only after the wire
    protocol, host-safe runtime, and standalone host are independently in
    place. Transport failure is fail-stop: the client closes local state,
    cancels callbacks, reaps the child, and lazily rebuilds a fresh host
    generation rather than transactionally recovering the old connection.
    
    ## Stack
    
    This is **4 of 4** in the process-owned code-mode session stack.
    
    - Depends on #30111
    - Full stack: #30108 → #30110 → #30111 → this PR
    
    ## Validation
    
    - `just test -p codex-code-mode -p codex-code-mode-host` — 86 passed
    - `just fix -p codex-code-mode`
    - `just fix -p codex-code-mode-host`
    - `just bazel-lock-update`
    - `just bazel-lock-check`
    - `bazel test //codex-rs/code-mode:code-mode-unit-tests
    //codex-rs/code-mode-host:code-mode-host-unit-tests
    //codex-rs/code-mode-host:code-mode-host-stdio-test
    //codex-rs/code-mode-protocol:code-mode-protocol-unit-tests` — 4/4
    passed
    - `just fmt`
  • Persist Cloudflare affinity cookies for MCP HTTP (#29516)
    [Codex Thread
    019ef1f9-36e2-7e91-9337-504f097b9dc1](https://codex-thread-link.openai.chatgpt-team.site/thread/019ef1f9-36e2-7e91-9337-504f097b9dc1)
    
    ## Why
    
    Hosted plugin-service Streamable HTTP MCP traffic uses
    `https://chatgpt.com/backend-api/ps/mcp` and depends on Cloudflare's
    `__cflb` cookie for load-balancer affinity. The local and exec-server
    `http/request` path built a fresh reqwest client for each request
    without installing Codex's existing shared ChatGPT Cloudflare cookie
    store, so affinity could be lost between calls.
    
    This is an affinity-hardening change motivated by an incident
    investigation. It does not establish the broader connector-cache
    incident RCA or claim to fix that incident in full.
    
    ## What changed
    
    - Install the existing process-local, strictly allowlisted ChatGPT
    Cloudflare cookie store on the reqwest client used by
    `ReqwestHttpClient`.
    - Fresh clients now share allowed Cloudflare infrastructure cookies
    within the process that originates the local or exec-server network
    request.
    - Keep the existing HTTPS ChatGPT-host and Cloudflare-cookie-name
    restrictions. This does not introduce a general cookie jar or send
    ChatGPT Cloudflare cookies to unrelated hosts.
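
    The restriction amounts to a three-part allowlist check. The host and
    cookie-name lists below are assumptions based on the examples in this
    description; the real store's lists may differ, and `cookie_allowed` is a
    hypothetical name.

    ```rust
    // Assumed allowlists for illustration only.
    const ALLOWED_HOSTS: [&str; 1] = ["chatgpt.com"];
    const ALLOWED_COOKIE_NAMES: [&str; 1] = ["__cflb"];

    // A cookie is stored or sent only for HTTPS, an allowed host, and an
    // allowed Cloudflare cookie name.
    fn cookie_allowed(scheme: &str, host: &str, cookie_name: &str) -> bool {
        scheme == "https"
            && ALLOWED_HOSTS.contains(&host)
            && ALLOWED_COOKIE_NAMES.contains(&cookie_name)
    }

    fn main() {
        println!("{}", cookie_allowed("https", "chatgpt.com", "__cflb"));
        println!("{}", cookie_allowed("https", "example.com", "__cflb"));
    }
    ```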
    
    ## Test coverage
    
    - `codex-client` unit coverage verifies that the existing strict store
    accepts and returns `__cflb` for HTTPS ChatGPT URLs.
    - The exec-server HTTPS integration test sends four independent
    `http/request` calls through a local TLS-intercepting proxy and verifies
    that:
      - `Set-Cookie: __cflb=west` is sent on the next plugin-service request;
      - a later `Set-Cookie: __cflb=central` replaces the stored value;
      - non-Cloudflare session cookies are discarded;
      - no stored ChatGPT Cloudflare cookie is sent to a non-ChatGPT host.
    - `just test -p codex-client` — 38 passed.
    - `just test -p codex-exec-server --test chatgpt_cloudflare_affinity` —
    1 passed.
    - `just bazel-lock-check` — passed.
    
    ## Non-goals
    
    - No persistence of ChatGPT auth, account, session, residency, or
    arbitrary cookies.
    - No cookie persistence for third-party MCP servers.
    - No special composition of caller-provided `Cookie` headers.
    - No plugin-service, connector-cache, Habitat/habicache, routing,
    redirect, or API-contract changes.
    - No broader incident RCA conclusions.
  • Retry failed Codex Apps MCP startup (#29920)
    ## Problem
    
    The built-in Codex Apps MCP client shares a future for the full startup
    operation: connect, complete `initialize`, fetch the initial tools, and
    return a usable client. Sharing deduplicates startup work, but it also
    memoizes terminal errors.
    
    After a transient connection, handshake, or initial `tools/list`
    failure, later tool builds observe the same failed future. The thread
    cannot reconnect after the backend recovers and continues serving its
    startup-time cached tool snapshot, which may be empty or stale.
    
    ## Fix
    
    When Apps MCP startup ends in an error, Codex starts bounded recovery
    without putting startup latency on tool-router construction:
    
    1. The current tool build immediately continues with the cached startup
    snapshot.
    2. After the initial failure is reported, Codex starts one fresh full
    startup attempt in the background.
    3. Concurrent tool builds share that in-flight attempt and also continue
    with cached tools.
    4. On success, the recovered client becomes active, refreshes the Apps
    tools cache, emits a `Ready` startup status, and is reused by later
    operations.
    5. On failure, the cache remains unchanged and later tool builds may
    start another background attempt after exponential cooldown: 1s, 2s, 4s,
    8s, 16s, then 30s maximum.
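
    The cooldown schedule in step 5 can be written as a capped doubling.
    `cooldown` is a hypothetical helper, and counting consecutive failures
    from 1 is an assumption about indexing.

    ```rust
    use std::time::Duration;

    // 1s, 2s, 4s, 8s, 16s, then capped at 30s.
    fn cooldown(consecutive_failures: u32) -> Duration {
        let shift = consecutive_failures.saturating_sub(1);
        let secs = 1u64.checked_shl(shift).unwrap_or(u64::MAX).min(30);
        Duration::from_secs(secs)
    }

    fn main() {
        for failures in 1..=7 {
            println!("after failure {failures}: wait {:?}", cooldown(failures));
        }
    }
    ```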
    
    Each recreated startup performs a fresh MCP `initialize` and uncached
    `tools/list`. The MCP client retains its existing bounded retries for
    retryable `initialize` and `tools/list` failures.
    
    This avoids adding the Apps startup timeout to every request during a
    sustained outage.
    
    ## Scope
    
    This is limited to the built-in Codex Apps MCP client:
    
    - no reconnects for user-configured MCP servers;
    - no cache deletion; and
    - no proactive refresh for a healthy client with stale tools.
    
    ## Tests
    
    Coverage verifies:
    
    - tool builds return cached tools without waiting for a blocked
    reconnect;
    - concurrent tool builds start only one background reconnect;
    - failed reconnects preserve cached tools and respect exponential
    cooldown;
    - a recovered client is retained and reused; and
    - a long-lived thread exposes recovered app tools on a later follow-up.
    
    Validation:
    
    - `just test -p codex-mcp` — 95 passed
    - `just test -p codex-core
    later_follow_up_uses_background_recovered_apps_after_mid_thread_startup_failures
    --no-capture` — passed
    - `just fix -p codex-mcp`
    - `just fmt`
  • [codex] fix terminal rollout event durability (#30144)
    Currently, session code does not flush the thread store after appending
    the `TurnComplete` / `TurnAborted` events.
    
    This isn't a problem in practice for local storage because `append_items`
    itself effectively blocks, but thread stores that buffer in
    `append_items` and commit only on flush effectively never persist these
    events.
    
    The fix adds explicit rollout flushes at the terminal emitters after
    normal completion and interruption.
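
    A toy buffering store shows why the explicit flush matters.
    `BufferingStore` and `complete_turn` are hypothetical; the real thread
    store API is async and richer than this.

    ```rust
    // Hypothetical store that buffers appends and commits only on flush.
    #[derive(Default)]
    struct BufferingStore {
        pending: Vec<String>,
        committed: Vec<String>,
    }

    impl BufferingStore {
        fn append_items(&mut self, items: &[&str]) {
            self.pending.extend(items.iter().map(|item| item.to_string()));
        }

        fn flush(&mut self) {
            self.committed.append(&mut self.pending);
        }
    }

    // Terminal emitter: append the terminal event, then flush explicitly.
    fn complete_turn(store: &mut BufferingStore) {
        store.append_items(&["TurnComplete"]);
        store.flush();
    }

    fn main() {
        let mut store = BufferingStore::default();
        store.append_items(&["TurnComplete"]);
        println!("committed without flush: {:?}", store.committed);
        complete_turn(&mut store);
        println!("committed after flush: {:?}", store.committed);
    }
    ```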
    
    Added test cases that assert the number of flushes when completing or
    aborting turns. These are admittedly a little brittle and I'm open to
    better ideas on how to add automated testing.
  • Test selected capabilities across availability and resume (#30157)
    ## Why
    
    This stack crosses World State, executor skills, selected plugin
    metadata, MCP processes, connectors, dynamic environments, and resume.
    This PR adds two end-to-end scenarios that validate those pieces
    together.
    
    Both tests enable `deferred_executor`, so they exercise the real
    delayed-environment path.
    
    ## Scenario 1: availability across turns and resume
    
    ```text
    1. Start a thread with one selected plugin root bound to E1.
    2. E1 is unavailable.
       - executor skill is absent
       - selected MCP is absent
       - connector has no selected-plugin attribution
    3. Start E1 and register the same stable environment ID.
    4. Start a new turn.
       - the executor skill appears through World State
       - its body beats a colliding host skill
       - the selected MCP tool is advertised and executes inside E1
       - the connector is attributed to the selected plugin
    5. Start another turn without changing E1.
       - the MCP PID stays the same, proving runtime reuse
    6. Restart app-server and resume the thread.
       - durable selected-root intent is restored
       - skills, MCP, and connector attribution are restored
       - a new MCP PID proves ephemeral process state was rebuilt
    ```
    
    ## Scenario 2: availability changes inside one turn
    
    ```text
    1. Start a turn while E1 is unavailable.
    2. The first model sample sees no executor skill, MCP, or selected connector.
    3. The turn pauses on request_user_input.
    4. Start E1 and register it while that same turn is still active.
    5. Continue the turn.
    6. The very next model sample sees:
       - the executor skill catalog
       - the selected MCP tool
       - selected-plugin connector attribution
    7. The model calls the MCP, and its output proves execution happened inside E1.
    ```
    
    This second scenario specifically protects the aeon-style behavior:
    capability state is captured again for every sampling step, not only at
    the next user turn.
    
    ## Scope
    
    These are integration tests only. They do not add a combinatorial matrix
    for unsupported plugin-file mutation, environment generations, transport
    disconnects, or delayed `required = true` executor MCPs.
  • [codex] allow CCA image generation and web search extensions (#29909)
    ## Summary
    
    - allow the standalone image-generation and web-search extensions for
    the actor-authorized provider shape used by CCA
    - preserve builtin `image_generation` and `web_search` for older models
    and existing flows
    - keep ordinary non-OpenAI providers excluded from both extensions
    - remove only the image extension's local managed-AuthManager
    requirement, which CCA cannot satisfy
    - share actor-authorization detection through `ModelProviderInfo`
    - keep Core tests focused on routing behavior and cover header-shape
    edge cases in `model-provider-info`
    - add a Responses Lite regression that verifies both
    `image_gen.imagegen` and `web.run`
    
    ## Why
    
    CCA uses a provider named `local` with `requires_openai_auth: false` and
    a non-empty `x-openai-actor-authorization` header. Core accepts that
    provider shape, but both extension provider-name gates rejected it;
    image generation additionally required a Codex-managed login.
    
    The standalone paths must coexist with existing builtin tools. New
    Responses Lite models can receive `image_gen.imagegen` and `web.run`,
    while older models continue using builtin tools.
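    
    The provider shape described above can be sketched as follows. `ProviderInfo` is a simplified, hypothetical stand-in for `ModelProviderInfo`, and the gate in `allows_standalone_extensions` is an assumption about how the check might be combined, not the actual routing logic.
    
    ```rust
    use std::collections::HashMap;
    
    /// Hypothetical, simplified stand-in for the provider info type.
    struct ProviderInfo {
        requires_openai_auth: bool,
        http_headers: HashMap<String, String>,
    }
    
    impl ProviderInfo {
        /// True when a non-empty `x-openai-actor-authorization` header is set.
        fn is_actor_authorized(&self) -> bool {
            self.http_headers.iter().any(|(name, value)| {
                name.eq_ignore_ascii_case("x-openai-actor-authorization")
                    && !value.trim().is_empty()
            })
        }
    
        /// Assumed gate: providers using OpenAI auth or actor authorization
        /// qualify; ordinary non-OpenAI providers stay excluded.
        fn allows_standalone_extensions(&self) -> bool {
            self.requires_openai_auth || self.is_actor_authorized()
        }
    }
    ```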
    
    ## Impact
    
    This enables both standalone extensions for CCA once installed
    downstream, without removing or changing builtin-tool compatibility for
    older models.
    
    ## Validation
    
    - `just test -p codex-core
    responses_lite_exposes_standalone_tools_for_actor_authorized_provider`
    - `just test -p codex-core
    responses_lite_uses_standalone_web_search_and_image_generation`
    - `just test -p codex-core
    hosted_tools_follow_provider_auth_model_and_config_gates`
    - `just test -p codex-image-generation-extension`
    - `just test -p codex-web-search-extension`
    - `just test -p codex-model-provider-info`
    - `just fmt`
    - `git diff --check`
  • Expose MCP app identity in app context (#29934)
    ## Why
    
    MCP tool-call events need to expose trusted app identity and action
    metadata directly so v2 clients do not have to infer it from tool names
    or resource URIs.
    
    ## What changed
    
    - Add optional `appName`, `templateId`, and `actionName` fields to MCP
    tool-call `appContext`.
    - Populate `appName` and `templateId` from trusted Codex Apps metadata,
    and derive `actionName` from the trusted app resource metadata.
    - Preserve all three fields through core events, legacy protocol events,
    persisted thread history, resume redaction, and app-server v2 responses.
    - Document the public `appContext` fields in
    `codex-rs/app-server/README.md`.
    - Regenerate app-server JSON and TypeScript schemas and add coverage for
    serialization, persistence, redaction, and metadata propagation.
    
    ## Validation
    
    - `just test -p codex-app-server-protocol mcp_tool_call`
    - `just test -p codex-core
    mcp_tool_call_item_metadata_only_trusts_codex_apps_identity
    mcp_tool_call_item_includes_app_identity`
    - `just write-app-server-schema`
    
    ---------
    
    Co-authored-by: Martin Au-Yeung <280153141+martinauyeung-oai@users.noreply.github.com>
  • Keep MCP elicitation routable across runtime refreshes (#30127)
    ## Why
    
    An MCP tool call can still be waiting for an elicitation response when
    an environment update replaces the thread's MCP runtime.
    
    Before this change:
    
    ```text
    runtime A starts a tool call and asks the user
    environment becomes ready, so runtime B is published
    client answers the prompt through runtime B
    runtime B cannot find runtime A's pending responder
    ```
    
    The response is lost and the original tool call stays blocked.
    
    ## What changed
    
    All MCP runtimes for one thread now share a small elicitation router:
    
    ```text
    runtime A ---\
                   shared router: response token -> exact pending responder
    runtime B ---/
    ```
    
    When Codex surfaces an MCP elicitation, it assigns a unique opaque
    response token. The router records which pending request owns that
    token. A replacement runtime reuses the same router, so the latest
    runtime can deliver a response to a request started by the previous
    runtime.
    
    The Codex-owned token also prevents two runtime connections that reuse
    the same MCP server request ID from receiving each other's responses.
    
    This does not retain or search old MCP managers. Only the pending
    responder map is shared.
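    
    A minimal sketch of the shared-router idea, using only the standard library. `ElicitationRouter`, its methods, and the counter-based `u64` token are assumptions for illustration; the real token is opaque and the real responder type is not shown in this description.
    
    ```rust
    use std::collections::HashMap;
    use std::sync::mpsc::{channel, Receiver, Sender};
    use std::sync::{Arc, Mutex};
    
    #[derive(Default)]
    struct RouterState {
        next_token: u64,
        pending: HashMap<u64, Sender<String>>,
    }
    
    /// Hypothetical router; clones share one pending-responder map, so every
    /// runtime of a thread can hold its own handle.
    #[derive(Clone, Default)]
    struct ElicitationRouter {
        inner: Arc<Mutex<RouterState>>,
    }
    
    impl ElicitationRouter {
        /// Registers a pending request and returns its token plus the
        /// receiver that the waiting tool call blocks on.
        fn register(&self) -> (u64, Receiver<String>) {
            let (tx, rx) = channel();
            let mut state = self.inner.lock().unwrap();
            let token = state.next_token;
            state.next_token += 1;
            state.pending.insert(token, tx);
            (token, rx)
        }
    
        /// Delivers a response to the exact request that owns `token`.
        /// Returns false when the token is unknown or already answered.
        fn respond(&self, token: u64, response: String) -> bool {
            let sender = self.inner.lock().unwrap().pending.remove(&token);
            match sender {
                Some(tx) => tx.send(response).is_ok(),
                None => false,
            }
        }
    }
    ```
    
    Because the token is assigned by the router rather than taken from the MCP server request ID, two runtimes that reuse the same request ID still get distinct entries.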
    
    ## Covered scenario
    
    The integration test exercises the complete failure mode:
    
    1. A thread starts while its selected environment is still unavailable.
    2. A configured MCP server starts a tool call and asks the client for
    input.
    3. The environment becomes ready, causing Codex to publish a replacement
    MCP runtime.
    4. The client answers the original prompt after the replacement.
    5. The original tool call receives that answer and completes.
    
    A focused routing test also creates two runtimes with the same server
    request ID and verifies that each response reaches the exact request
    that emitted its token.
    
    ## Scope
    
    This PR changes only elicitation response routing across MCP runtime
    replacement. It does not change when runtimes are rebuilt, which
    environments contribute MCP configuration, or how environment
    availability is detected.
  • Reinject missing World State fragments on resume (#30152)
    ## Why
    
    World State restores its structured snapshot on resume so unchanged
    sections do not have to be rendered again. That is safe only when the
    model-visible fragment represented by the snapshot is still present in
    retained history.
    
    For selected executor skills, the failing selected-capability scenario
    exposed this state:
    
    ```text
    persisted World State: selected skill catalog is known
    retained model history: selected skill catalog message is missing
    next diff: unchanged, so emit nothing
    ```
    
    The model resumes without being told about the selected skill catalog.
    
    ## What changed
    
    World State contributions may now optionally describe the concrete
    model-visible fragment that must remain in retained history.
    
    When a persisted snapshot is present:
    
    ```text
    matching retained fragment exists -> trust snapshot, emit nothing
    matching retained fragment missing -> treat section as absent, render current state once
    ```
    
    The skills extension uses this for non-empty selected-environment
    catalogs by matching its exact rendered catalog body. Empty or hidden
    catalogs do not require a fragment.
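    
    The decision above can be sketched as a small pure function. `ResumeAction` and `resume_action` are hypothetical names, and modeling retained history as a slice of message bodies is a simplification for this example.
    
    ```rust
    #[derive(Debug, PartialEq)]
    enum ResumeAction {
        /// Snapshot and retained history agree; emit nothing.
        TrustSnapshot,
        /// Treat the section as absent and render current state once.
        RenderOnce,
    }
    
    /// `required_fragment` is `None` for sections that need no retained
    /// fragment, such as empty or hidden catalogs.
    fn resume_action(required_fragment: Option<&str>, retained_history: &[&str]) -> ResumeAction {
        match required_fragment {
            None => ResumeAction::TrustSnapshot,
            // Exact match on the rendered body, as the skills extension does.
            Some(fragment) if retained_history.iter().any(|body| *body == fragment) => {
                ResumeAction::TrustSnapshot
            }
            Some(_) => ResumeAction::RenderOnce,
        }
    }
    ```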
    
    ## Scope
    
    This does not clear or rebuild the whole World State baseline. It does
    not change skill discovery, cache invalidation, environment
    availability, or MCP runtime behavior. It only keeps a persisted section
    snapshot and its retained model context consistent across resume/history
    reconstruction.
    
    ## Coverage
    
    A focused World State regression test verifies both sides:
    
    - a missing retained fragment is rendered again
    - a matching retained fragment avoids duplicate injection
  • [codex] Attribute app-server analytics by thread originator (#29935)
    ## Why
    
    Desktop Work threads and regular Codex threads can share the same
    app-server connection. App-server analytics currently copy
    `product_client_id` from connection metadata for every thread-scoped
    event, so Work thread activity is attributed to the Desktop connection
    instead of the thread's resolved originator. This prevents analytics
    from distinguishing the two products on a shared connection.
    
    ## What changed
    
    - Publish the resolved originator after a thread is materialized,
    covering new, resumed, forked, and subagent threads.
    - Store that originator in the analytics reducer's existing per-thread
    state.
    - Override only `app_server_client.product_client_id` for thread, turn,
    tool, review, goal, guardian, and compaction events while preserving the
    connection's client name, version, and transport metadata.
    - Fall back to the connection-wide product client ID when a thread has
    no originator override.
    - Preserve persisted originators in thread initialization analytics for
    resume and fork flows.
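    
    The override-with-fallback rule can be sketched as a lookup over per-thread state. `AnalyticsState` and its field names are hypothetical and only illustrate the rule; the real reducer state in `codex-analytics` is not shown here.
    
    ```rust
    use std::collections::HashMap;
    
    /// Hypothetical reducer state: one connection-wide ID plus optional
    /// per-thread originator overrides.
    #[derive(Default)]
    struct AnalyticsState {
        connection_product_client_id: String,
        thread_originators: HashMap<String, String>,
    }
    
    impl AnalyticsState {
        /// Returns the thread's resolved originator when one was published,
        /// otherwise the connection-wide product client ID.
        fn product_client_id_for(&self, thread_id: &str) -> &str {
            self.thread_originators
                .get(thread_id)
                .map(String::as_str)
                .unwrap_or(self.connection_product_client_id.as_str())
        }
    }
    ```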
    
    ## Validation
    
    - `just test -p codex-analytics
    thread_originator_overrides_shared_connection_across_thread_events
    subagent_events_keep_thread_originator_with_explicit_turn_connection`
    - `just test -p codex-app-server
    turn_start_tracks_thread_originator_in_analytics
    thread_start_tracks_thread_initialized_analytics
    thread_fork_tracks_thread_initialized_analytics
    thread_resume_tracks_thread_initialized_analytics`
    - `just test -p codex-core thread_manager`