Commit Graph

3846 Commits

  • app-server: improve thread resume rejoin flow (#11776)
    The `thread/resume` response now includes the latest turn with all of
    its items in band, so no events are stale or lost.
    
    Testing
    - e2e tested with app-server-test-client using the flow described in
    "Testing Thread Rejoin Behavior" in
    codex-rs/app-server-test-client/README.md
    - e2e tested in Codex desktop by reconnecting to a running turn
  • app-server: fix flaky list_apps_returns_connectors_with_accessible_flags test (#12286)
    ## Why
    
    `app/list` emits `app/list/updated` after whichever async load finishes
    first (directory connectors or accessible tools). This test assumed the
    directory-backed update always arrived first because it injected a tools
    delay, but that assumption is not stable when the process-global Codex
    Apps tools cache is already warm. In that case the accessible-tools path
    can return immediately and the first notification shape flips, which
    makes the assertion flaky.
    
    Relevant code paths:
    
    - [`codex-rs/app-server/src/codex_message_processor.rs`](https://github.com/openai/codex/blob/13ec97d72e3482f16c62e0a22025a0542133e623/codex-rs/app-server/src/codex_message_processor.rs#L4949-L5034)
      (concurrent loads + per-load `app/list/updated` notifications)
    - [`codex-rs/core/src/mcp_connection_manager.rs`](https://github.com/openai/codex/blob/13ec97d72e3482f16c62e0a22025a0542133e623/codex-rs/core/src/mcp_connection_manager.rs#L1182-L1197)
      (Codex Apps tools cache hit path)
    
    ## What Changed
    
    Updated
    `suite::v2::app_list::list_apps_returns_connectors_with_accessible_flags`
    in `codex-rs/app-server/tests/suite/v2/app_list.rs` to accept either
    valid first `app/list/updated` payload:
    
    - the directory-first snapshot
    - the accessible-tools-first snapshot
    
    The test still keeps the later assertions strict:
    
    - the second `app/list/updated` notification must be the fully merged
    result
    - the final `app/list` response must match the same merged result
    
    I also added an inline comment explaining why the first notification is
    intentionally order-insensitive.
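
    Illustrative sketch (not the test code from this PR): the relaxed first
    assertion and the strict later assertion could look roughly like this.
    `AppListSnapshot` and both helper functions are hypothetical stand-ins
    for the real payload types and test helpers.

    ```rust
    /// Hypothetical stand-in for an `app/list/updated` payload.
    #[derive(Debug, Clone, PartialEq)]
    struct AppListSnapshot {
        app_ids: Vec<String>,
        accessible: Vec<bool>,
    }

    /// The first notification may come from whichever load finished first,
    /// so either valid snapshot is accepted.
    fn first_update_is_valid(
        first: &AppListSnapshot,
        directory_first: &AppListSnapshot,
        tools_first: &AppListSnapshot,
    ) -> bool {
        first == directory_first || first == tools_first
    }

    /// The second notification must be exactly the merged result.
    fn second_update_is_valid(second: &AppListSnapshot, merged: &AppListSnapshot) -> bool {
        second == merged
    }
    ```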
    
    ## Verification
    
    - `cargo test -p codex-app-server`
  • tests: centralize in-flight turn cleanup helper (#12271)
    ## Why
    
    Several tests intentionally exercise behavior while a turn is still
    active. The cleanup sequence for those tests (`turn/interrupt` + waiting
    for `codex/event/turn_aborted`) was duplicated across files, which made
    the rationale easy to lose and the pattern easy to apply inconsistently.
    
    This change centralizes that cleanup in one place with a single
    explanatory doc comment.
    
    ## What Changed
    
    ### Added shared helper
    
    In `codex-rs/app-server/tests/common/mcp_process.rs`:
    
    - Added `McpProcess::interrupt_turn_and_wait_for_aborted(...)`.
    - Added a doc comment explaining why explicit interrupt + terminal wait
    is required for tests that intentionally leave a turn in-flight.
    
    ### Migrated call sites
    
    Replaced duplicated interrupt/aborted blocks with the helper in:
    
    - `codex-rs/app-server/tests/suite/v2/thread_resume.rs`
      - `thread_resume_rejects_history_when_thread_is_running`
      - `thread_resume_rejects_mismatched_path_when_thread_is_running`
    - `codex-rs/app-server/tests/suite/v2/turn_start_zsh_fork.rs`
      - `turn_start_shell_zsh_fork_executes_command_v2`
      - `turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2`
    - `codex-rs/app-server/tests/suite/v2/turn_steer.rs`
      - `turn_steer_returns_active_turn_id`
    
    ### Existing cleanup retained
    
    In `codex-rs/app-server/tests/suite/v2/turn_start.rs`:
    
    - `turn_start_accepts_local_image_input` continues to explicitly wait
    for `turn/completed` so the turn lifecycle is fully drained before test
    exit.
    
    ## Verification
    
    - `cargo test -p codex-app-server`
  • skill-creator: lazy-load PyYAML in frontmatter parsing (#12080)
    init-skill should work even without PyYAML
  • Update pnpm versions to fix cve-2026-24842 (#12009)
    Update pnpm versions to resolve CVE-2026-24842
  • tests(thread_resume): interrupt running turns in resume error-path tests (#12269)
    ## Why
    
    `thread_resume` tests can intentionally create an in-flight turn, assert
    a `thread/resume` error path, and return immediately. That leaves turn
    work active during teardown, which can surface as intermittent `LEAK`
    failures.
    
    Sample output that motivated this investigation (reported during test
    runs):
    
    ```text
    LEAK ... codex-app-server::all suite::v2::thread_resume::thread_resume_rejoins_running_thread_even_with_override_mismatch
    ```
    
    ## What Changed
    
    Updated only `codex-rs/app-server/tests/suite/v2/thread_resume.rs`:
    
    - `thread_resume_rejects_history_when_thread_is_running`
    - `thread_resume_rejects_mismatched_path_when_thread_is_running`
    
    Both tests now:
    
    1. capture the running turn id from `TurnStartResponse`
    2. assert the expected `thread/resume` error
    3. call `turn/interrupt` for that running turn
    4. wait for `codex/event/turn_aborted` before returning
    
    ## Why This Is The Correct Fix
    
    These tests are specifically validating resume behavior while a turn is
    active. They should also own cleanup of that active turn before exiting.
    Explicitly interrupting and waiting for the terminal abort notification
    removes teardown races and avoids relying on process-drop behavior to
    clean up in-flight work.
    
    ## Repro / Verification
    
    Repro command used for investigation:
    
    ```bash
    cargo nextest run -p codex-app-server -j 2 --no-fail-fast --stress-count 50 --status-level leak --final-status-level fail -E 'test(suite::v2::thread_resume::thread_resume_rejoins_running_thread_even_with_override_mismatch) | test(suite::v2::thread_resume::thread_resume_rejects_history_when_thread_is_running) | test(suite::v2::thread_resume::thread_resume_rejects_mismatched_path_when_thread_is_running) | test(suite::v2::thread_resume::thread_resume_keeps_in_flight_turn_streaming)'
    ```
    
    Observed before this change: intermittent `LEAK` in
    `thread_resume_rejects_history_when_thread_is_running`.
    
    Also verified with:
    
    - `cargo test -p codex-app-server`
    
    
  • feat(config): add permissions.network proxy config wiring (#12054)
    ## Summary
    
    Implements `ConfigToml.permissions.network` and uses it to populate
    `NetworkProxyConfig`. We now parse a new nested permissions/network
    config shape, which is converted into the proxy’s runtime config.
    
    When managed requirements exist, we still apply those constraints on top
    of user settings (so managed policy still wins).
    
    * Cleaned up the old constructor path so it now accepts both user config
    and managed constraints directly.
    * Updated the reload path so live proxy config reloads respect
    `[permissions.network]` too, while still supporting the existing
    top-level `[network]` format.
    
    ### Behavior
    - User-defined `[permissions.network]` values are now honored.
    - Managed constraints still take effect and are validated against the
    resulting policy.
  • [bazel] Fix proc_macro_dep libs (#12274)
    If a first-party proc_macro crate has tests/binaries that would get
    autogenerated by the macro, it was being handled incorrectly. Found by
    an external OS contributor!
  • Add configurable MCP OAuth callback URL for MCP login (#11382)
    ## Summary
    
    Implements a configurable MCP OAuth callback URL override for `codex mcp
    login` and app-server OAuth login flows, including support for non-local
    callback endpoints (for example, devbox ingress URLs).
    
    ## What changed
    
    - Added new config key: `mcp_oauth_callback_url` in
    `~/.codex/config.toml`.
    - OAuth authorization now uses `mcp_oauth_callback_url` as
    `redirect_uri` when set.
    - Callback handling validates the callback path against the configured
    redirect URI path.
    - Listener bind behavior is now host-aware:
      - local callback URL hosts (`localhost`, `127.0.0.1`, `::1`) bind to
        `127.0.0.1`
      - non-local callback URL hosts bind to `0.0.0.0`
    - `mcp_oauth_callback_port` remains supported and is used for the
    listener port.
    - Wired through:
      - CLI MCP login flow
      - App-server MCP OAuth login flow
      - Skill dependency OAuth login flow
    - Updated config schema and config tests.
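
    Illustrative sketch of the host-aware bind rule described above (not the
    code from this PR). The function names are hypothetical, and the sketch
    assumes the host has already been extracted from the callback URL as a
    bare string without brackets.

    ```rust
    use std::net::{IpAddr, Ipv4Addr};

    /// Hosts treated as local, per the list in this description.
    fn is_local_callback_host(host: &str) -> bool {
        matches!(host, "localhost" | "127.0.0.1" | "::1")
    }

    /// Pick the listener bind address from the callback URL host:
    /// loopback for local hosts, all interfaces otherwise.
    fn listener_bind_addr(callback_host: &str) -> IpAddr {
        if is_local_callback_host(callback_host) {
            IpAddr::V4(Ipv4Addr::LOCALHOST)
        } else {
            IpAddr::V4(Ipv4Addr::UNSPECIFIED)
        }
    }
    ```

    The listener port still comes from `mcp_oauth_callback_port`; only the
    bind host depends on the callback URL.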
    
    ## Why
    
    Some environments need OAuth callbacks to land on a specific reachable
    URL (for example ingress in remote devboxes), not loopback. This change
    allows that while preserving local defaults for existing users.
    
    ## Backward compatibility
    
    - No behavior change when `mcp_oauth_callback_url` is unset.
    - Existing `mcp_oauth_callback_port` behavior remains intact.
    - Local callback flows continue binding to loopback by default.
    
    ## Testing
    
    - `cargo test -p codex-rmcp-client callback -- --nocapture`
    - `cargo test -p codex-core --lib mcp_oauth_callback -- --nocapture`
    - `cargo check -p codex-cli -p codex-app-server -p codex-rmcp-client`
    
    ## Example config
    
    ```toml
    mcp_oauth_callback_port = 5555
    mcp_oauth_callback_url = "https://<devbox>-<namespace>.gateway.<cluster>.internal.api.openai.org/callback"
    ```
  • fix(bazel): replace askama templates with include_str! in memories (#11778)
    ## Summary
    
    - The experimental Bazel CI builds fail on all platforms because askama
    resolves template paths relative to `CARGO_MANIFEST_DIR`, which points
    outside the Bazel sandbox. This produces errors like:
      ```
      error: couldn't read `codex-rs/core/src/memories/../../../../../../../../../../../work/codex/codex/codex-rs/core/templates/memories/consolidation.md`: No such file or directory
      ```
    - Replaced `#[derive(Template)]` + `#[template(path = "...")]` with
    `include_str!` + `str::replace()` for the three affected templates
    (`consolidation.md`, `stage_one_input.md`, `read_path.md`).
    `include_str!` resolves paths relative to the source file, which works
    correctly in both Cargo and Bazel builds.
    - The templates only use simple `{{ variable }}` substitution with no
    control flow or filters, so no askama functionality is lost.
    - Removes the `askama` dependency from `codex-core` since it was the
    only crate using it. The workspace-level dependency definition is left
    in place.
    - This matches the existing pattern used throughout the codebase — e.g.
    `codex-rs/core/src/memories/mod.rs` already uses
    `include_str!("../../templates/memories/stage_one_system.md")` for the
    fourth template file.
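
    Illustrative sketch of the `str::replace()` substitution pattern (not
    the code from this PR). In the real change the template text comes from
    `include_str!`; an inline constant is used here so the sketch is
    self-contained. The template text and the variable names `memory_root`
    and `rollout_count` are hypothetical.

    ```rust
    /// Stand-in for `include_str!("...")`; the real templates live on disk.
    const TEMPLATE: &str = "Memory root: {{ memory_root }}\nRollouts: {{ rollout_count }}\n";

    /// Fill each `{{ variable }}` placeholder with plain string replacement.
    /// This only covers simple substitution: no control flow, no filters.
    fn render(memory_root: &str, rollout_count: usize) -> String {
        TEMPLATE
            .replace("{{ memory_root }}", memory_root)
            .replace("{{ rollout_count }}", &rollout_count.to_string())
    }
    ```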
    
    ## Test plan
    
    - [ ] Verify Bazel (experimental) CI passes on all platforms
    - [ ] Verify rust-ci (Cargo) builds and tests continue to pass
    - [ ] Verify `cargo test -p codex-core` passes locally
  • app-server tests: reduce intermittent nextest LEAK via graceful child shutdown (#12266)
    ## Why
    `cargo nextest` was intermittently reporting `LEAK` for
    `codex-app-server` tests even when assertions passed. This adds noise
    and flakiness to local/CI signals.
    
    Sample output used as the basis of this investigation:
    
    ```text
    LEAK [   7.578s] ( 149/3663) codex-app-server::all suite::output_schema::send_user_turn_output_schema_is_per_turn_v1
    LEAK [   7.383s] ( 210/3663) codex-app-server::all suite::v2::dynamic_tools::dynamic_tool_call_round_trip_sends_text_content_items_to_model
    LEAK [   7.768s] ( 213/3663) codex-app-server::all suite::v2::dynamic_tools::thread_start_injects_dynamic_tools_into_model_requests
    LEAK [   8.841s] ( 224/3663) codex-app-server::all suite::v2::output_schema::turn_start_accepts_output_schema_v2
    LEAK [   8.151s] ( 225/3663) codex-app-server::all suite::v2::plan_item::plan_mode_uses_proposed_plan_block_for_plan_item
    LEAK [   8.230s] ( 232/3663) codex-app-server::all suite::v2::safety_check_downgrade::openai_model_header_mismatch_emits_model_rerouted_notification_v2
    LEAK [   6.472s] ( 273/3663) codex-app-server::all suite::v2::turn_start::turn_start_accepts_collaboration_mode_override_v2
    LEAK [   6.107s] ( 275/3663) codex-app-server::all suite::v2::turn_start::turn_start_accepts_personality_override_v2
    ```
    
    ## How I Reproduced
    I focused on the suspect tests and ran them under `nextest` stress mode
    with leak reporting enabled.
    
    ```bash
    cargo nextest run -p codex-app-server -j 2 --no-fail-fast --stress-count 25 --status-level leak --final-status-level fail -E 'test(suite::output_schema::send_user_turn_output_schema_is_per_turn_v1) | test(suite::v2::dynamic_tools::dynamic_tool_call_round_trip_sends_text_content_items_to_model) | test(suite::v2::dynamic_tools::thread_start_injects_dynamic_tools_into_model_requests) | test(suite::v2::output_schema::turn_start_accepts_output_schema_v2) | test(suite::v2::plan_item::plan_mode_uses_proposed_plan_block_for_plan_item) | test(suite::v2::safety_check_downgrade::openai_model_header_mismatch_emits_model_rerouted_notification_v2) | test(suite::v2::turn_start::turn_start_accepts_collaboration_mode_override_v2) | test(suite::v2::turn_start::turn_start_accepts_personality_override_v2)'
    ```
    
    This reproduced intermittent `LEAK` statuses while tests still passed.
    
    ## What Changed
    In `codex-rs/app-server/tests/common/mcp_process.rs`:
    
    - Changed `stdin: ChildStdin` to `stdin: Option<ChildStdin>` so teardown
    can explicitly close stdin.
    - In `Drop`, close stdin first to trigger EOF-based graceful shutdown.
    - Wait briefly for graceful exit.
    - If still running, fall back to `start_kill()` and the existing bounded
    `try_wait()` loop.
    - Updated send-path handling to bail if stdin is already closed.
    
    ## Why This Is the Right Fix
    The leak signal was caused by child-process teardown timing, not
    test-logic assertion failure. The helper previously relied mostly on
    force-kill timing in `Drop`; that can race with nextest leak detection.
    
    Closing stdin first gives `codex-app-server` a deterministic, graceful
    shutdown path before force-kill. Keeping the force-kill fallback
    preserves robustness if graceful shutdown does not complete in time.
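
    Illustrative sketch of the teardown order described above, written
    against `std::process` for self-containment (not the helper from this
    PR, whose use of `start_kill()` suggests a different process API).
    `ChildGuard` and the grace period are hypothetical.

    ```rust
    use std::process::{Child, ChildStdin, Command, Stdio};
    use std::thread::sleep;
    use std::time::{Duration, Instant};

    struct ChildGuard {
        child: Child,
        // `Option` so teardown can take and drop stdin, which closes the pipe.
        stdin: Option<ChildStdin>,
    }

    impl ChildGuard {
        fn spawn(mut cmd: Command) -> std::io::Result<Self> {
            let mut child = cmd.stdin(Stdio::piped()).spawn()?;
            let stdin = child.stdin.take();
            Ok(Self { child, stdin })
        }

        /// Returns true if the child exited on its own after EOF on stdin,
        /// false if the force-kill fallback was needed.
        fn shutdown(&mut self, grace: Duration) -> bool {
            // 1. Close stdin so the child sees EOF and can exit gracefully.
            drop(self.stdin.take());
            // 2. Wait briefly for a graceful exit.
            let deadline = Instant::now() + grace;
            loop {
                if let Ok(Some(_)) = self.child.try_wait() {
                    return true;
                }
                if Instant::now() >= deadline {
                    break;
                }
                sleep(Duration::from_millis(10));
            }
            // 3. Fall back to force-kill and reap the process.
            let _ = self.child.kill();
            let _ = self.child.wait();
            false
        }
    }

    impl Drop for ChildGuard {
        fn drop(&mut self) {
            self.shutdown(Duration::from_millis(500));
        }
    }
    ```

    On a Unix-like system, spawning `cat` through `ChildGuard::spawn` and
    calling `shutdown` should take the graceful path, because `cat` exits
    once its stdin reaches EOF.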
    
    ## Verification
    - `cargo test -p codex-app-server`
    - Re-ran the stress repro above after this change: no `LEAK` statuses
    observed.
    - Additional high-signal stress run also showed no leaks:
    
    ```bash
    cargo nextest run -p codex-app-server -j 2 --no-fail-fast --stress-count 100 --status-level leak --final-status-level fail -E 'test(suite::output_schema::send_user_turn_output_schema_is_per_turn_v1) | test(suite::v2::dynamic_tools::dynamic_tool_call_round_trip_sends_text_content_items_to_model)'
    ```
  • Clarify cumulative proposed_plan behavior in Plan mode (#12265)
    ## Summary
    - Require revised `<proposed_plan>` blocks in the same planning session
    to be complete replacements, not partial/delta plans.
    - Scope that cumulative replacement rule to the current planning session
    only.
    - Clarify that after leaving Plan mode (for example switching to Default
    mode to implement) or when explicitly asked for a new plan, the model
    should produce a new self-contained plan without inheriting prior plan
    blocks unless requested.
    
    ## Testing
    - Not run (prompt/template text-only change).
  • Skip removed features during metrics emission (#12253)
    Summary
    - avoid emitting metrics for features marked as `Stage::Removed`
    - keep feature metrics aligned with active and planned states only
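
    Illustrative sketch of the filter (not the code from this PR). Only
    `Stage::Removed` is named in this description; the other variants, the
    `Feature` struct, and the feature keys are hypothetical.

    ```rust
    #[derive(Debug, Clone, Copy, PartialEq)]
    enum Stage {
        Experimental, // hypothetical variant
        Stable,       // hypothetical variant
        Removed,
    }

    struct Feature {
        key: &'static str,
        stage: Stage,
    }

    /// Keys of the features that should still produce metrics.
    fn features_to_report(features: &[Feature]) -> Vec<&'static str> {
        features
            .iter()
            .filter(|f| f.stage != Stage::Removed)
            .map(|f| f.key)
            .collect()
    }
    ```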
    
    Testing
    - Not run (not requested)
  • feat: add Reject approval policy with granular prompt rejection controls (#12087)
    ## Why
    
    We need a way to auto-reject specific approval prompt categories without
    switching all approvals off.
    
    The goal is to let users independently control:
    - sandbox escalation approvals,
    - execpolicy `prompt` rule approvals,
    - MCP elicitation prompts.
    
    ## What changed
    
    - Added a new primary approval mode in `protocol/src/protocol.rs`:
    
    ```rust
    pub enum AskForApproval {
        // ...
        Reject(RejectConfig),
        // ...
    }
    
    pub struct RejectConfig {
        pub sandbox_approval: bool,
        pub rules: bool,
        pub mcp_elicitations: bool,
    }
    ```
    
    - Wired `RejectConfig` semantics through approval paths in `core`:
      - `core/src/exec_policy.rs`
        - rejects rule-driven prompts when `rules = true`
        - rejects sandbox/escalation prompts when `sandbox_approval = true`
        - preserves rule priority when both rule and sandbox prompt conditions
          are present
      - `core/src/tools/sandboxing.rs`
        - applies `sandbox_approval` to default exec approval decisions and
          sandbox-failure retry gating
      - `core/src/safety.rs`
        - keeps `Reject { all false }` behavior aligned with `OnRequest` for
          patch safety
        - rejects out-of-root patch approvals when `sandbox_approval = true`
      - `core/src/mcp_connection_manager.rs`
        - auto-declines MCP elicitations when `mcp_elicitations = true`
    
    - Ensured approval policy used by MCP elicitation flow stays in sync
    with constrained session policy updates.
    
    - Updated app-server v2 conversions and generated schema/TypeScript
    artifacts for the new `Reject` shape.
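
    Illustrative sketch of how the three flags gate prompt categories (not
    the code from this PR). `PromptKind`, `rejects`, and `classify` are
    hypothetical, and `classify` reflects one reading of "rule priority":
    when both conditions are present, the prompt is treated as a rule prompt.

    ```rust
    #[derive(Debug, Clone, Copy, Default)]
    pub struct RejectConfig {
        pub sandbox_approval: bool,
        pub rules: bool,
        pub mcp_elicitations: bool,
    }

    /// Hypothetical prompt categories, one per `RejectConfig` flag.
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum PromptKind {
        SandboxApproval,
        Rule,
        McpElicitation,
    }

    impl RejectConfig {
        /// True if a prompt of this kind should be auto-rejected.
        pub fn rejects(&self, kind: PromptKind) -> bool {
            match kind {
                PromptKind::SandboxApproval => self.sandbox_approval,
                PromptKind::Rule => self.rules,
                PromptKind::McpElicitation => self.mcp_elicitations,
            }
        }
    }

    /// Assumed precedence: a rule-driven prompt wins over a sandbox prompt.
    pub fn classify(rule_prompt: bool, sandbox_prompt: bool) -> Option<PromptKind> {
        if rule_prompt {
            Some(PromptKind::Rule)
        } else if sandbox_prompt {
            Some(PromptKind::SandboxApproval)
        } else {
            None
        }
    }
    ```

    With all three flags false, nothing is auto-rejected, which matches the
    note above that `Reject { all false }` stays aligned with `OnRequest`.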
    
    ## Verification
    
    Added focused unit coverage for the new behavior in:
    - `core/src/exec_policy.rs`
    - `core/src/tools/sandboxing.rs`
    - `core/src/mcp_connection_manager.rs`
    - `core/src/safety.rs`
    - `core/src/tools/runtimes/apply_patch.rs`
    
    Key cases covered include rule-vs-sandbox prompt precedence, MCP
    auto-decline behavior, and patch/sandbox retry behavior under
    `RejectConfig`.
  • chore: consolidate new() and initialize() for McpConnectionManager (#12255)
    ## Why
    `McpConnectionManager` used a two-phase setup (`new()` followed by
    `initialize()`), which forced call sites to construct placeholder state
    and then mutate it asynchronously. That made MCP startup/refresh flows
    harder to follow and easier to misuse, especially around cancellation
    token ownership.
    
    ## What changed
    - Replaced the two-phase initialization flow with a single async
    constructor: `McpConnectionManager::new(...) -> (Self,
    CancellationToken)`.
    - Added `McpConnectionManager::new_uninitialized()` for places that need
    an empty manager before async startup begins.
    - Added `McpConnectionManager::new_mcp_connection_manager_for_tests()`
    for test-only construction.
    - Updated MCP startup and refresh call sites in
    `codex-rs/core/src/codex.rs` to build a fresh manager via `new(...)`,
    swap it in, and update the startup cancellation token consistently.
    - Updated MCP snapshot/connector call sites in
    `codex-rs/core/src/mcp/mod.rs` and `codex-rs/core/src/connectors.rs` to
    use the consolidated constructor.
    - Removed the now-obsolete `reset_mcp_startup_cancellation_token()`
    helper in favor of explicit token replacement at the call sites.
    
    ## Testing
    - Not run (refactor-only change; no new behavior was intended).
  • Add configurable agent spawn depth (#12251)
    Summary
    - expose `agents.max_depth` in config schema and toml parsing, with
    defaults and validation
    - thread-spawn depth guards and multi-agent handler now respect the
    configured limit instead of a hardcoded value
    - ensure documentation and helpers account for agent depth limits
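
    Illustrative sketch of a depth guard driven by a configured limit (not
    the code from this PR). `SpawnError` and `child_depth` are hypothetical,
    and treating the limit as inclusive is an assumption.

    ```rust
    #[derive(Debug, PartialEq)]
    enum SpawnError {
        DepthLimitExceeded { depth: u32, max_depth: u32 },
    }

    /// Depth a child agent would run at, or an error if that would exceed
    /// the configured `agents.max_depth`.
    fn child_depth(parent_depth: u32, max_depth: u32) -> Result<u32, SpawnError> {
        let depth = parent_depth + 1;
        if depth > max_depth {
            Err(SpawnError::DepthLimitExceeded { depth, max_depth })
        } else {
            Ok(depth)
        }
    }
    ```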
  • client side modelinfo overrides (#12101)
    TL;DR
    Add top-level `model_catalog_json` config support so users can supply a
    local model catalog override from a JSON file path (including adding new
    models) without backend changes.
    
    ### Problem
    Codex previously had no clean client-side way to replace/overlay model
    catalog data for local testing of model metadata and new model entries.
    
    ### Fix
    - Add top-level `model_catalog_json` config field (JSON file path).
    - Apply catalog entries when resolving `ModelInfo`:
      1. Base resolved model metadata (remote/fallback)
      2. Catalog overlay from `model_catalog_json`
      3. Existing global top-level overrides (`model_context_window`,
         `model_supports_reasoning_summaries`, etc.)
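
    Illustrative sketch of the three-layer resolution order (not the code
    from this PR). The struct shapes are hypothetical, and the sketch
    assumes a catalog entry with a matching slug replaces the base entry
    wholesale.

    ```rust
    #[derive(Debug, Clone, PartialEq)]
    struct ModelInfo {
        slug: String,
        context_window: Option<u64>,
        supports_reasoning_summaries: bool,
    }

    /// Hypothetical holder for the existing top-level config overrides.
    struct GlobalOverrides {
        model_context_window: Option<u64>,
        model_supports_reasoning_summaries: Option<bool>,
    }

    fn resolve(base: ModelInfo, catalog: &[ModelInfo], overrides: &GlobalOverrides) -> ModelInfo {
        // 1. Start from the base resolved metadata (remote/fallback).
        let mut info = base;
        // 2. Overlay the catalog entry for this model, if one exists.
        if let Some(entry) = catalog.iter().find(|m| m.slug == info.slug) {
            info = entry.clone();
        }
        // 3. Apply global top-level overrides last, so they win.
        if let Some(window) = overrides.model_context_window {
            info.context_window = Some(window);
        }
        if let Some(supported) = overrides.model_supports_reasoning_summaries {
            info.supports_reasoning_summaries = supported;
        }
        info
    }
    ```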
    
    ### Note
    Will revisit per-field overrides in a follow-up
    
    ### Tests
    Added tests
  • Move previous turn context tracking into ContextManager history (#12179)
    ## Summary
    - add `previous_context_item: Option<TurnContextItem>` to
    `ContextManager`
    - expose session/state accessors for reading and updating the stored
    previous context item
    - switch settings diffing to use `TurnContextItem` instead of
    `TurnContext`
    - remove submission-loop local `previous_context` and persist the
    previous context item in history
    
    ## Testing
    - `just fmt`
    - `just fix -p codex-core`
    - `cargo test -p codex-core --test all model_switching::`
    - `cargo test -p codex-core --test all collaboration_instructions::`
    - `cargo test -p codex-core --test all personality::`
    - `cargo test -p codex-core --test all permissions_messages::permissions_message_not_added_when_no_change`
  • Adjust MCP tool approval handling for custom servers (#11787)
    Summary
    This PR expands MCP client-side approval behavior beyond codex_apps and
    tightens elicitation capability signaling.
    
    - Removed the codex_apps-only gate in MCP tool approval checks, so
    local/custom MCP servers are now eligible for the same client-side
    approval prompt flow when tool annotations indicate side effects.
    - Updated approval memory keying to support tools without a connector ID
    (`connector_id: Option<String>`), allowing “Approve this Session” to be
    remembered even when connector metadata is missing.
    - Updated prompt text for non-codex_apps tools to identify origin as “The
    <server> MCP server” instead of “This app”.
    - Added MCP initialization capability policy so only codex_apps
    advertises MCP elicitation capability; other servers advertise no
    elicitation support.
    - Added regression tests for:
      - server-specific prompt copy behavior
      - codex-apps-only elicitation capability advertisement
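
    Illustrative sketch of approval memory keyed with an optional connector
    ID (not the code from this PR). Only `connector_id: Option<String>`
    comes from this description; `ApprovalKey`, its other fields, and
    `SessionApprovals` are hypothetical.

    ```rust
    use std::collections::HashSet;

    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    struct ApprovalKey {
        server: String,
        // `None` for local/custom MCP servers without connector metadata.
        connector_id: Option<String>,
        tool_name: String,
    }

    #[derive(Default)]
    struct SessionApprovals {
        remembered: HashSet<ApprovalKey>,
    }

    impl SessionApprovals {
        /// Record an "approve for this session" decision.
        fn remember(&mut self, key: ApprovalKey) {
            self.remembered.insert(key);
        }

        /// True if this exact tool was already approved for the session.
        fn is_remembered(&self, key: &ApprovalKey) -> bool {
            self.remembered.contains(key)
        }
    }
    ```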
    
    Testing
    - Not run (not requested)
  • feat: add configurable write_stdin timeout (#12228)
    Add a configurable maximum timeout for `write_stdin`. It applies only to
    empty `write_stdin` calls.
    
    Also increases the default value from 30s to 5 minutes.
  • docs: add codex security policy (#12193)
    ## Summary
    Adds SECURITY.MD with the Codex security policy and Bugcrowd reporting
    guidance.
  • feat: sub-agent injection (#12152)
    This PR adds parent-thread sub-agent completion notifications and changes
    the model's prompt to prevent it from being confused.
  • Adjust memories rollout defaults (#12231)
    Summary
    - raise `DEFAULT_MEMORIES_MAX_ROLLOUTS_PER_STARTUP` to 16 so more
    rollouts are allowed per startup
    - lower `DEFAULT_MEMORIES_MIN_ROLLOUT_IDLE_HOURS` to 6 to make rollouts
    eligible sooner
    
    Testing
    - Not run (not requested)
  • Update docs links for feature flag notice (#12164)
    Summary
    - replace the stale `docs/config.md#feature-flags` reference in the
    legacy feature notice with the canonical published URL
    - align the deprecation notice test to expect the new link
    
    This addresses #12123
  • fix(linux-sandbox): mount /dev in bwrap sandbox (#12081)
    ## Summary
    - Updates the Linux bubblewrap sandbox args to mount a minimal `/dev`
    using `--dev /dev` instead of only binding `/dev/null`. With only
    `/dev/null` bound, tools needing entropy (git, crypto libs, etc.) can
    fail.
    
    - Changed mount order so `--dev /dev` is added before writable-root
    `--bind` mounts, preserving writable `/dev/*` submounts like `/dev/shm`.
    
    ## Why
    Fixes sandboxed command failures when reading `/dev/urandom` (and
    similar standard device-node access).
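
    Illustrative sketch of the mount ordering (not the code from this PR).
    `bwrap_mount_args` is hypothetical and only covers the two kinds of
    mounts discussed here. bubblewrap applies mount operations in argument
    order, so a `--bind` that comes after `--dev /dev` lands on top of the
    fresh `/dev` instead of being hidden by it.

    ```rust
    /// Build the `/dev` and writable-root portion of a bwrap command line.
    fn bwrap_mount_args(writable_roots: &[&str]) -> Vec<String> {
        // Mount the minimal /dev first.
        let mut args: Vec<String> = vec!["--dev".to_string(), "/dev".to_string()];
        // Then bind writable roots, so entries such as /dev/shm stay writable.
        for root in writable_roots {
            args.push("--bind".to_string());
            args.push(root.to_string());
            args.push(root.to_string());
        }
        args
    }
    ```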
    
    
    Fixes https://github.com/openai/codex/issues/12056
  • [apps] Update apps allowlist. (#12211)
    - [x] Update apps allowlist.
  • Stabilize app-server detached review and running-resume tests (#12203)
    ## Summary
    - stabilize
    `thread_resume_rejoins_running_thread_even_with_override_mismatch` by
    using a valid delayed second SSE response instead of an intentionally
    truncated stream
    - set `RUST_MIN_STACK=4194304` for spawned app-server test processes in
    `McpProcess` to avoid stack-sensitive CI overflows in detached review
    tests
    
    ## Why
    - the thread-resume assertion could race with a mocked stream-disconnect
    error and intermittently observe `systemError`
    - detached review startup is stack-sensitive in some CI environments;
    pinning a larger stack in the test harness removes that flake without
    changing product behavior
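
    Illustrative sketch of pinning the stack size for a spawned test process
    (not the `McpProcess` code from this PR). It uses `std::process::Command`
    for self-containment, and `app_server_command` is a hypothetical helper.
    `4194304` bytes is 4 MiB.

    ```rust
    use std::process::Command;

    /// Build a command whose child process gets a larger default Rust
    /// thread stack via `RUST_MIN_STACK`.
    fn app_server_command(program: &str) -> Command {
        let mut cmd = Command::new(program);
        cmd.env("RUST_MIN_STACK", "4194304");
        cmd
    }
    ```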
    
    ## Validation
    - `just fmt`
    - `cargo test -p codex-app-server --test all suite::v2::thread_resume::thread_resume_rejoins_running_thread_even_with_override_mismatch`
    - `cargo test -p codex-app-server --test all suite::v2::review::review_start_with_detached_delivery_returns_new_thread_id`
  • state: enforce 10 MiB log caps for thread and threadless process logs (#12038)
    ## Summary
    - enforce a 10 MiB cap per `thread_id` in state log storage
    - enforce a 10 MiB cap per `process_uuid` for threadless (`thread_id IS
    NULL`) logs
    - scope pruning to only keys affected by the current insert batch
    - add a cheap per-key `SUM(...)` precheck so windowed prune queries only
    run for keys that are currently over the cap
    - add SQLite indexes used by the pruning queries
    - add focused runtime tests covering both pruning behaviors
    
    ## Why
    This keeps log growth bounded by the intended partition semantics while
    preserving a small, readable implementation localized to the existing
    insert path.
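
    Illustrative in-memory analogue of the per-key cap (not the SQLite code
    from this PR). `LogStore` is hypothetical, and dropping the oldest
    entries first is an assumption about the pruning order. The cap is a
    parameter so small values can be used for testing; the merged cap is
    10 MiB (`10 * 1024 * 1024` bytes).

    ```rust
    use std::collections::{HashMap, VecDeque};

    #[derive(Default)]
    struct LogStore {
        // One queue per partition key (`thread_id` or `process_uuid`).
        by_key: HashMap<String, VecDeque<String>>,
    }

    impl LogStore {
        fn insert_batch(&mut self, batch: Vec<(String, String)>, cap_bytes: usize) {
            // Track only the keys affected by this batch.
            let mut touched: Vec<String> = Vec::new();
            for (key, line) in batch {
                if !touched.contains(&key) {
                    touched.push(key.clone());
                }
                self.by_key.entry(key).or_default().push_back(line);
            }
            for key in touched {
                let logs = self.by_key.get_mut(&key).expect("key was just inserted");
                // Cheap precheck: sum the bytes stored for this key.
                let mut total: usize = logs.iter().map(|l| l.len()).sum();
                // Prune oldest entries only while the key is over the cap.
                while total > cap_bytes {
                    match logs.pop_front() {
                        Some(old) => total -= old.len(),
                        None => break,
                    }
                }
            }
        }

        fn total_bytes(&self, key: &str) -> usize {
            self.by_key
                .get(key)
                .map(|logs| logs.iter().map(|l| l.len()).sum())
                .unwrap_or(0)
        }
    }
    ```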
    
    ## Local Latency Snapshot (No Truncation-Pressure Run)
    Collected from session `019c734f-1d16-7002-9e00-c966c9fbbcae` using
    local-only (uncommitted) instrumentation, while not specifically
    benchmarking the truncation-heavy regime.
    
    ### Percentiles By Query (ms)
    | query | count | p50 | p90 | p95 | p99 | max |
    |---|---:|---:|---:|---:|---:|---:|
    | `insert_logs.insert_batch` | 110 | 0.332 | 0.999 | 1.811 | 2.978 | 3.493 |
    | `insert_logs.precheck.process` | 106 | 0.074 | 0.152 | 0.206 | 0.258 | 0.426 |
    | `insert_logs.precheck.thread` | 73 | 0.118 | 0.206 | 0.253 | 1.025 | 1.025 |
    | `insert_logs.prune.process` | 58 | 0.291 | 0.576 | 0.607 | 1.088 | 1.088 |
    | `insert_logs.prune.thread` | 44 | 0.318 | 0.467 | 0.728 | 0.797 | 0.797 |
    | `insert_logs.prune_total` | 110 | 0.488 | 0.976 | 1.237 | 1.593 | 1.684 |
    | `insert_logs.total` | 110 | 1.315 | 2.889 | 3.623 | 5.739 | 5.961 |
    | `insert_logs.tx_begin` | 110 | 0.133 | 0.235 | 0.282 | 0.412 | 0.546 |
    | `insert_logs.tx_commit` | 110 | 0.259 | 0.689 | 0.772 | 1.065 | 1.080 |
    
    ### `insert_logs.total` Histogram (ms)
    | bucket | count |
    |---|---:|
    | `<= 0.100` | 0 |
    | `<= 0.250` | 0 |
    | `<= 0.500` | 7 |
    | `<= 1.000` | 33 |
    | `<= 2.000` | 40 |
    | `<= 5.000` | 28 |
    | `<= 10.000` | 2 |
    | `<= 20.000` | 0 |
    | `<= 50.000` | 0 |
    | `<= 100.000` | 0 |
    | `> 100.000` | 0 |
    
    ## Local Latency Snapshot (Truncation-Heavy / Cap-Hit Regime)
    Collected from a run where cap-hit behavior was frequent (`135/180`
    insert calls), using local-only (uncommitted) instrumentation and a
    temporary local cap of `10_000` bytes for stress testing (not the merged
    `10 MiB` cap).
    
    ### Percentiles By Query (ms)
    | query | count | p50 | p90 | p95 | p99 | max |
    |---|---:|---:|---:|---:|---:|---:|
    | `insert_logs.insert_batch` | 180 | 0.524 | 1.645 | 2.163 | 3.424 | 3.777 |
    | `insert_logs.precheck.process` | 171 | 0.086 | 0.235 | 0.373 | 0.758 | 1.147 |
    | `insert_logs.precheck.thread` | 100 | 0.105 | 0.251 | 0.291 | 1.176 | 1.622 |
    | `insert_logs.prune.process` | 109 | 0.386 | 0.839 | 1.146 | 1.548 | 2.588 |
    | `insert_logs.prune.thread` | 56 | 0.253 | 0.550 | 1.148 | 2.484 | 2.484 |
    | `insert_logs.prune_total` | 180 | 0.511 | 1.221 | 1.695 | 4.548 | 5.512 |
    | `insert_logs.total` | 180 | 1.631 | 3.902 | 5.103 | 8.901 | 9.095 |
    | `insert_logs.total_cap_hit` | 135 | 1.876 | 4.501 | 5.547 | 8.902 | 9.096 |
    | `insert_logs.total_no_cap_hit` | 45 | 0.520 | 1.700 | 2.079 | 3.294 | 3.294 |
    | `insert_logs.tx_begin` | 180 | 0.109 | 0.253 | 0.287 | 1.088 | 1.406 |
    | `insert_logs.tx_commit` | 180 | 0.267 | 0.813 | 1.170 | 2.497 | 2.574 |
    
    ### `insert_logs.total` Histogram (ms)
    | bucket | count |
    |---|---:|
    | `<= 0.100` | 0 |
    | `<= 0.250` | 0 |
    | `<= 0.500` | 16 |
    | `<= 1.000` | 39 |
    | `<= 2.000` | 60 |
    | `<= 5.000` | 54 |
    | `<= 10.000` | 11 |
    | `<= 20.000` | 0 |
    | `<= 50.000` | 0 |
    | `<= 100.000` | 0 |
    | `> 100.000` | 0 |
    
    ### `insert_logs.total` Histogram When Cap Was Hit (ms)
    | bucket | count |
    |---|---:|
    | `<= 0.100` | 0 |
    | `<= 0.250` | 0 |
    | `<= 0.500` | 0 |
    | `<= 1.000` | 22 |
    | `<= 2.000` | 51 |
    | `<= 5.000` | 51 |
    | `<= 10.000` | 11 |
    | `<= 20.000` | 0 |
    | `<= 50.000` | 0 |
    | `<= 100.000` | 0 |
    | `> 100.000` | 0 |
    
    ### Performance Takeaways
    - Even in a cap-hit-heavy run (`75%` cap-hit calls), `insert_logs.total`
    stays sub-10ms at p99 (`8.901ms`) and max (`9.095ms`).
    - Calls that did **not** hit the cap are materially cheaper
    (`insert_logs.total_no_cap_hit` p95 `2.079ms`) than cap-hit calls
    (`insert_logs.total_cap_hit` p95 `5.547ms`).
    - Compared to the earlier non-truncation-pressure run, overall
    `insert_logs.total` rose from p95 `3.623ms` to p95 `5.103ms`
    (+`1.48ms`), indicating bounded overhead when pruning is active.
    - This truncation-heavy run used an intentionally low local cap for
    stress testing; with the real 10 MiB cap, cap-hit frequency should be
    much lower in normal sessions.
    
    ## Testing
    - `just fmt` (in `codex-rs`)
    - `cargo test -p codex-state` (in `codex-rs`)
  • app-server: expose loaded thread status via read/list and notifications (#11786)
    Motivation
    - Today, a newly connected client has no direct way to determine the
    current runtime status of threads from read/list responses alone.
    - This forces clients to infer state from transient events, which can
    lead to stale or inconsistent UI when reconnecting or attaching late.
    
    Changes
    - Add `status` to `thread/read` responses.
    - Add `statuses` to `thread/list` responses.
    - Emit `thread/status/changed` notifications with `threadId` and the new
    status.
    - Track runtime status for all loaded threads and default unknown
    threads to `idle`.
    - Update protocol/docs/tests/schema fixtures for the revised API.
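
    Illustrative sketch of runtime status tracking with an `idle` default
    (not the code from this PR). `StatusTracker` is hypothetical, and
    `Active` is a placeholder variant; only `idle` is named in this
    description.

    ```rust
    use std::collections::HashMap;

    #[derive(Debug, Clone, Copy, PartialEq, Default)]
    enum ThreadStatus {
        #[default]
        Idle,
        Active, // placeholder for any non-idle status
    }

    #[derive(Default)]
    struct StatusTracker {
        loaded: HashMap<String, ThreadStatus>,
    }

    impl StatusTracker {
        /// Status for read/list responses; unknown threads report `Idle`.
        fn get(&self, thread_id: &str) -> ThreadStatus {
            self.loaded.get(thread_id).copied().unwrap_or_default()
        }

        /// Store a status. Returns `Some(new_status)` when it changed, which
        /// is when a `thread/status/changed` notification would be emitted.
        fn set(&mut self, thread_id: &str, status: ThreadStatus) -> Option<ThreadStatus> {
            let previous = self.get(thread_id);
            self.loaded.insert(thread_id.to_string(), status);
            if previous == status {
                None
            } else {
                Some(status)
            }
        }
    }
    ```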
    
    Testing
    - Validated protocol API changes with automated protocol tests and
    regenerated schema/type fixtures.
    - Validated app-server behavior with unit and integration test suites,
    including status transitions and notifications.
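
    The tracking behavior described under Changes can be sketched roughly as follows. This is a minimal illustration, not the app-server implementation: `ThreadStatus`, `StatusChanged`, and `StatusTracker` are hypothetical names, and the real protocol may define more status values than the two shown.

    ```rust
    use std::collections::HashMap;

    /// Hypothetical status values; the real protocol may define more.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    enum ThreadStatus {
        Idle,
        Running,
    }

    /// Assumed payload shape for a `thread/status/changed` notification.
    #[derive(Debug, PartialEq, Eq)]
    struct StatusChanged {
        thread_id: String,
        status: ThreadStatus,
    }

    /// Tracks the runtime status of loaded threads (hypothetical type).
    #[derive(Default)]
    struct StatusTracker {
        loaded: HashMap<String, ThreadStatus>,
    }

    impl StatusTracker {
        /// Threads without a tracked status are reported as idle.
        fn status(&self, thread_id: &str) -> ThreadStatus {
            self.loaded
                .get(thread_id)
                .copied()
                .unwrap_or(ThreadStatus::Idle)
        }

        /// Records a status and returns a notification only on an actual change.
        fn set_status(&mut self, thread_id: &str, status: ThreadStatus) -> Option<StatusChanged> {
            if self.status(thread_id) == status {
                return None;
            }
            self.loaded.insert(thread_id.to_string(), status);
            Some(StatusChanged {
                thread_id: thread_id.to_string(),
                status,
            })
        }
    }
    ```

    Returning `Option<StatusChanged>` keeps the notification decision next to the state update, so a late-joining client can read `status` while live clients receive only real transitions.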
  • [apps] Temporary app block. (#12180)
    - [x] Temporary app block.
  • fix: Remove citation (#12187)
    Remove citation requirement until we figure out a better visualization
  • app-server support for Windows sandbox setup. (#12025)
    Add app-server support for initiating Windows sandbox setup. The
    server responds quickly to the setup request and makes a later RPC call
    back to the client when setup finishes.
    
    The TUI implementation is unaffected, but in a future PR I'll update the
    TUI to use the shared setup helper
    (`windows_sandbox.run_windows_sandbox_setup`).
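
    The respond-now, notify-later flow might look like the sketch below. It is an illustration under assumptions: `SetupStarted`, `SetupCompleted`, and `start_setup` are hypothetical names, a channel stands in for the RPC connection to the client, and the setup work is passed in as a closure.

    ```rust
    use std::sync::mpsc::Sender;
    use std::thread;

    /// Hypothetical immediate reply to the setup request.
    #[derive(Debug, PartialEq, Eq)]
    struct SetupStarted;

    /// Hypothetical server-to-client message sent once setup has finished.
    #[derive(Debug, PartialEq, Eq)]
    struct SetupCompleted {
        success: bool,
    }

    /// Acknowledges right away, runs `setup` on a background thread, and
    /// reports the outcome through `to_client` when it finishes.
    fn start_setup<F>(to_client: Sender<SetupCompleted>, setup: F) -> SetupStarted
    where
        F: FnOnce() -> bool + Send + 'static,
    {
        thread::spawn(move || {
            let success = setup();
            // The client may have disconnected; ignore a failed send.
            let _ = to_client.send(SetupCompleted { success });
        });
        SetupStarted
    }
    ```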
  • js_repl: canonicalize paths for node_modules boundary checks (#12177)
    ## Summary
    
    Fix `js_repl` package-resolution boundary checks for macOS temp
    directory path aliasing (`/var` vs `/private/var`).
    
    ## Problem
    
    `js_repl` verifies that resolved bare-package imports stay inside a
    configured `node_modules` root.
    On macOS, temp directories are commonly exposed as `/var/...` but
    canonicalize to `/private/var/...`.
    Because the boundary check compared raw paths with `path.relative(...)`,
    valid resolutions under temp dirs could be misclassified as escaping the
    allowed base, causing false `Module not found` errors.
    
    ## Changes
    
    - Add `fs` import in the JS kernel.
    - Add `canonicalizePath()` using `fs.realpathSync.native(...)` (with
    safe fallback).
    - Canonicalize both `base` and `resolvedPath` before running the
    `node_modules` containment check.
    
    ## Impact
    
    - Fixes false-negative boundary checks for valid package resolutions in
    macOS temp-dir scenarios.
    - Keeps the existing security boundary behavior intact.
    - Scope is limited to `js_repl` kernel module path validation logic.
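
    The actual fix lives in the JS kernel; the same canonicalize-then-compare idea can be sketched in Rust as below. The function names are hypothetical, and the rejection of leftover `..` components is an assumption about what a safe fallback could look like, not a description of the kernel's behavior.

    ```rust
    use std::fs;
    use std::path::{Component, Path, PathBuf};

    /// Resolves symlinks and aliases (for example `/var` vs `/private/var`);
    /// falls back to the raw path if the path cannot be resolved.
    fn canonicalize_path(path: &Path) -> PathBuf {
        fs::canonicalize(path).unwrap_or_else(|_| path.to_path_buf())
    }

    /// Returns true when `resolved` stays inside `base` after both sides
    /// have been canonicalized.
    fn is_within_base(base: &Path, resolved: &Path) -> bool {
        let base = canonicalize_path(base);
        let resolved = canonicalize_path(resolved);
        // A successfully canonicalized path has no `..` left; if the fallback
        // was used, treat any remaining `..` as a potential escape.
        if resolved
            .components()
            .any(|c| matches!(c, Component::ParentDir))
        {
            return false;
        }
        // `starts_with` compares whole components, so `node_modules_x`
        // does not count as being inside `node_modules`.
        resolved.starts_with(&base)
    }
    ```

    Canonicalizing both sides matters: comparing a canonical `resolved` against a raw `base` would reintroduce the same aliasing mismatch in the other direction.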
    
    
    
    #### [git stack](https://github.com/magus/git-stack-cli)
    - 👉 `1` https://github.com/openai/codex/pull/12177
    -  `2` https://github.com/openai/codex/pull/10673
  • memories: bump rollout summary slug cap to 60 (#12167)
    ## Summary
    Increase the rollout summary filename slug cap from 20 to 60 characters
    in memory storage.
    
    ## What changed
    - Updated `ROLLOUT_SLUG_MAX_LEN` from `20` to `60` in:
      - `codex-rs/core/src/memories/storage.rs`
    - Updated slug truncation test to verify 60-char behavior.
    
    ## Why
    This preserves more semantic context in rollout summary filenames while
    keeping existing normalization behavior unchanged.
    
    ## Testing
    - `just fmt`
    - `cargo test -p codex-core memories::storage::tests::rollout_summary_file_stem_sanitizes_and_truncates_slug -- --exact`
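
    For illustration, a slug helper with the new cap could look like this. Only `ROLLOUT_SLUG_MAX_LEN = 60` comes from the change itself; the `slugify` name and its normalization rules (lowercase ASCII alphanumerics, other characters collapsed to `-`) are assumptions and may differ from the rules in `storage.rs`.

    ```rust
    /// Mirrors the new cap from this change.
    const ROLLOUT_SLUG_MAX_LEN: usize = 60;

    /// Assumed normalization: ASCII alphanumerics are lowercased and kept,
    /// and every other run of characters becomes a single `-`.
    fn slugify(input: &str) -> String {
        let mut slug = String::new();
        for ch in input.chars() {
            if ch.is_ascii_alphanumeric() {
                slug.push(ch.to_ascii_lowercase());
            } else if !slug.is_empty() && !slug.ends_with('-') {
                slug.push('-');
            }
        }
        // The slug is ASCII-only here, so taking chars is a safe truncation;
        // afterwards drop any dangling separator.
        let truncated: String = slug.chars().take(ROLLOUT_SLUG_MAX_LEN).collect();
        truncated.trim_end_matches('-').to_string()
    }
    ```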
  • fix: file watcher (#12105)
    The issue was that the file_watcher never unsubscribed a file watch. All
    watches live under the ownership of the ThreadManager. As a result, each
    newly created thread creates a new file watcher, but that watcher is never
    deleted, even after the thread is closed. On Unix systems, a file watcher
    uses an `inotify` instance, and after some time all of them are consumed.
    
    This PR adds a mechanism to unsubscribe a file watcher when a thread is
    dropped.
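
    One common way to tie a watch to a thread's lifetime is a guard that unsubscribes in `Drop`. The sketch below shows that pattern under assumptions: `WatchRegistry` and `WatchSubscription` are hypothetical names, a set of ids stands in for real OS watches, and the PR's actual mechanism may be structured differently.

    ```rust
    use std::collections::HashSet;
    use std::sync::{Arc, Mutex};

    /// Stand-in for the set of active OS-level watches.
    #[derive(Clone, Default)]
    struct WatchRegistry {
        active: Arc<Mutex<HashSet<u64>>>,
    }

    /// Guard owned by a thread; dropping it releases the watch.
    struct WatchSubscription {
        id: u64,
        registry: WatchRegistry,
    }

    impl WatchRegistry {
        /// Registers a watch and hands back the guard that will release it.
        fn subscribe(&self, id: u64) -> WatchSubscription {
            self.active.lock().unwrap().insert(id);
            WatchSubscription {
                id,
                registry: self.clone(),
            }
        }

        fn active_count(&self) -> usize {
            self.active.lock().unwrap().len()
        }
    }

    impl Drop for WatchSubscription {
        fn drop(&mut self) {
            // Skip cleanup on a poisoned lock so that drop never panics.
            if let Ok(mut active) = self.registry.active.lock() {
                active.remove(&self.id);
            }
        }
    }
    ```

    With this shape, a thread that stores its `WatchSubscription` as a field releases the watch automatically when the thread value is dropped, without an explicit close call.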
  • Fixed a hole in token refresh logic for app server (#11802)
    We've continued to receive reports from users that they're seeing the
    error message "Your access token could not be refreshed because your
    refresh token was already used. Please log out and sign in again." This
    PR fixes two holes in the token refresh logic that lead to this
    condition.
    
    Background: A previous change in token refresh introduced the
    `UnauthorizedRecovery` object. It implements a state machine in the core
    agent loop that first performs a load of the on-disk auth information
    guarded by a check for matching account ID. If it finds that the on-disk
    version has been updated by another instance of codex, it uses the
    reloaded auth tokens. If the on-disk version hasn't been updated, it
    issues a refresh request from the token authority.
    
    There are two problems that this PR addresses:
    
    Problem 1: We weren't doing the same thing for the code path used by the
    app server interface. This PR effectively replicates the
    `UnauthorizedRecovery` logic for that code path.
    
    Problem 2: The `UnauthorizedRecovery` logic contained a hole in the
    `ReloadOutcome::Skipped` case. Here's the scenario. A user starts two
    instances of the CLI. Instance 1 is active (working on a task), instance
    2 is idle. Both instances have the same in-memory cached tokens. The
    user then runs `codex logout` or `codex login` to log in to a separate
    account, which overwrites the `auth.json` file. Instance 1 receives a
    401 and refreshes its token, but it doesn't write the new token to the
    `auth.json` file because the account ID doesn't match. Instance 2 is
    later activated and presented with a new task. It immediately hits a 401
    and attempts to refresh its token but fails because its cached refresh
    token is now invalid. To avoid this situation, I've changed the logic to
    immediately fail a token refresh if the user has since logged out or
    logged in to another account. This will still be seen as an error by the
    user, but the cause will be clearer.
    
    I also took this opportunity to clean up the names of existing functions
    to make their roles clearer.
    * `try_refresh_token` is renamed `request_chatgpt_token_refresh`
    * the existing `refresh_token` is renamed `refresh_token_from_authority`
    (there's a new higher-level function named `refresh_token` now)
    * `refresh_tokens` is renamed `refresh_and_persist_chatgpt_token`, and
    it now implicitly reloads
    * `update_tokens` is renamed `persist_tokens`
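
    The recovery decision described above can be summarized as a small pure function. This is a sketch under assumptions, not the code in this PR: `OnDiskAuth`, `RecoveryAction`, and `decide_recovery` are hypothetical names, and comparing access tokens is an assumed way of detecting that another instance has already updated `auth.json`.

    ```rust
    /// Hypothetical snapshot of what `auth.json` currently holds.
    #[derive(Debug, PartialEq, Eq)]
    enum OnDiskAuth {
        LoggedOut,
        LoggedIn {
            account_id: String,
            access_token: String,
        },
    }

    /// Hypothetical outcomes of the recovery step after a 401.
    #[derive(Debug, PartialEq, Eq)]
    enum RecoveryAction {
        /// Another instance already refreshed; reuse what is on disk.
        UseReloadedTokens,
        /// Nothing changed on disk; ask the token authority for new tokens.
        RefreshFromAuthority,
        /// The user logged out or switched accounts; fail immediately.
        FailAccountChanged,
    }

    fn decide_recovery(
        cached_account_id: &str,
        cached_access_token: &str,
        on_disk: &OnDiskAuth,
    ) -> RecoveryAction {
        match on_disk {
            OnDiskAuth::LoggedOut => RecoveryAction::FailAccountChanged,
            OnDiskAuth::LoggedIn { account_id, .. }
                if account_id.as_str() != cached_account_id =>
            {
                RecoveryAction::FailAccountChanged
            }
            OnDiskAuth::LoggedIn { access_token, .. }
                if access_token.as_str() != cached_access_token =>
            {
                RecoveryAction::UseReloadedTokens
            }
            OnDiskAuth::LoggedIn { .. } => RecoveryAction::RefreshFromAuthority,
        }
    }
    ```

    Checking the account before anything else is what closes the hole from Problem 2: a stale instance never spends its refresh token once the on-disk account no longer matches.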
  • Disable collab tools during review delegation (#12157)
    Summary
    - prevent delegated review agents from re-enabling blocked tools by
    explicitly disabling the Collab feature alongside web search and view
    image controls
    
    Testing
    - Not run (not requested)
  • Stop filtering model tools in js_repl_tools_only mode (#12069)
    ## Summary
    This change removes tool-list filtering in `js_repl_tools_only` mode and
    relies on the normal model tool descriptions, while still enforcing that
    tool execution must go through `js_repl` + `codex.tool(...)`.
    
    ## Motivation
    The previous `js_repl_tools_only` filtering hid most tools from the
    model request, which diverged from standard tool-list behavior and made
    signatures less discoverable. I tested that this filtering is not
    needed, and the model can follow the prompt to only call tools via
    `js_repl`.
    
    ## What Changed
    - `filter_tools_for_model(...)` in `core/src/tools/spec.rs` is now a
    pass-through (no filtering when `js_repl_tools_only` is enabled).
    - Updated tests to assert that model tools are not filtered in
    `js_repl_tools_only` mode.
    - Updated dynamic-tool test to assert dynamic tools remain visible in
    model tool specs.
    - Removed obsolete test helper used only by the old filtering
    assertions.
    
    ## Safety / Behavior
    - This commit does **not** relax execution policy.
    - Direct model tool calls remain blocked in `js_repl_tools_only` mode
    (except internal `js_repl` tools), and callers are instructed to use
    `js_repl` + `codex.tool(...)`.
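
    The split between what the model sees and what it may call directly can be sketched as two separate functions. The signatures below are assumptions made for illustration: the real `filter_tools_for_model(...)` in `core/src/tools/spec.rs` may take different arguments, and `ToolSpec`, `direct_call_allowed`, and the tool names used with them are hypothetical.

    ```rust
    /// Minimal stand-in for a model tool description.
    #[derive(Clone, Debug, PartialEq, Eq)]
    struct ToolSpec {
        name: String,
    }

    /// Pass-through: the model sees every tool regardless of the mode.
    fn filter_tools_for_model(tools: Vec<ToolSpec>, _js_repl_tools_only: bool) -> Vec<ToolSpec> {
        tools
    }

    /// Hypothetical execution gate: in `js_repl_tools_only` mode only the
    /// internal `js_repl` tools may be called directly by the model.
    fn direct_call_allowed(
        tool_name: &str,
        js_repl_tools_only: bool,
        internal_js_repl_tools: &[&str],
    ) -> bool {
        !js_repl_tools_only || internal_js_repl_tools.contains(&tool_name)
    }
    ```

    Keeping visibility and execution policy in separate functions is what lets the tool list stay unfiltered without relaxing enforcement.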
    
    ## Testing
    - `cargo test -p codex-core js_repl_tools_only`
    - Manual rollout validation showed the model can follow the `js_repl`
    routing instructions without needing filtered tool lists.
    
    
    
    #### [git stack](https://github.com/magus/git-stack-cli)
    - 👉 `1` https://github.com/openai/codex/pull/12069
    -  `2` https://github.com/openai/codex/pull/10673
    -  `3` https://github.com/openai/codex/pull/10670