Commit Graph

49 Commits

  • Add remote env CI matrix and integration test (#14869)
    `CODEX_TEST_REMOTE_ENV` makes `test_codex` start the executor
    "remotely" (inside a Docker container), turning any integration test
    into a remote test.
  • Split features into codex-features crate (#15253)
    - Split the feature system into a new `codex-features` crate.
    - Cut `codex-core` and workspace consumers over to the new config and
    warning APIs.
    
    Co-authored-by: Ahmed Ibrahim <219906144+aibrahim-oai@users.noreply.github.com>
    Co-authored-by: Codex <noreply@openai.com>
  • Prefer websockets when providers support them (#13592)
    Remove all flags and model settings.
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • Stabilize Windows cmd-based shell test harnesses (#14958)
    ## What is flaky
    The Windows shell-driven integration tests in `codex-rs/core` were
    intermittently unstable, especially:
    
    - `apply_patch_cli_can_use_shell_command_output_as_patch_input`
    - `websocket_test_codex_shell_chain`
    - `websocket_v2_test_codex_shell_chain`
    
    ## Why it was flaky
    These tests were exercising real shell-tool flows through whichever
    shell Codex selected on Windows, and the `apply_patch` test also nested
    a PowerShell read inside `cmd /c`.
    
    There were multiple independent sources of nondeterminism in that setup:
    
    - The test harness depended on the model-selected Windows shell instead
    of pinning the shell it actually meant to exercise.
    - `cmd.exe /c powershell.exe -Command "..."` is quoting-sensitive; on CI
    that could leave the read command wrapped as a literal string instead of
    executing it.
    - Even after getting the quoting right, PowerShell could emit CLIXML
    progress records like module-initialization output onto stdout.
    - The `apply_patch` test was building a patch directly from shell
    stdout, so any quoting artifact or progress noise corrupted the patch
    input.
    
    So the failures were driven by shell startup and output-shape variance,
    not by the `apply_patch` or websocket logic themselves.
    
    ## How this PR fixes it
    - Add a test-only `user_shell_override` path so Windows integration
    tests can pin `cmd.exe` explicitly.
    - Use that override in the websocket shell-chain tests and in the
    `apply_patch` harness.
    - Change the nested Windows file read in
    `apply_patch_cli_can_use_shell_command_output_as_patch_input` to a UTF-8
    PowerShell `-EncodedCommand` script.
    - Run that nested PowerShell process with `-NonInteractive`, set
    `$ProgressPreference = 'SilentlyContinue'`, and read the file with
    `[System.IO.File]::ReadAllText(...)`.
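
    For reference, here is a minimal sketch of how an `-EncodedCommand`
    argument can be built. PowerShell expects Base64 over the UTF-16LE bytes
    of the script text. The helper names and the `patch.txt` path are
    hypothetical and are not taken from this PR's harness; how the harness
    itself handles UTF-8 file contents is not shown here.

    ```rust
    /// Base64-encode bytes with the standard alphabet and `=` padding.
    fn base64_encode(bytes: &[u8]) -> String {
        const ALPHABET: &[u8; 64] =
            b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
        for chunk in bytes.chunks(3) {
            // Pack up to three bytes into a 24-bit group, zero-filling the tail.
            let b0 = chunk[0] as u32;
            let b1 = *chunk.get(1).unwrap_or(&0) as u32;
            let b2 = *chunk.get(2).unwrap_or(&0) as u32;
            let n = (b0 << 16) | (b1 << 8) | b2;
            out.push(ALPHABET[(n >> 18) as usize & 63] as char);
            out.push(ALPHABET[(n >> 12) as usize & 63] as char);
            out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
            out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
        }
        out
    }

    /// Encode a script the way `powershell.exe -EncodedCommand` expects it:
    /// Base64 over the UTF-16LE representation of the script.
    fn encode_powershell_command(script: &str) -> String {
        let utf16le: Vec<u8> = script
            .encode_utf16()
            .flat_map(|unit| unit.to_le_bytes())
            .collect();
        base64_encode(&utf16le)
    }

    fn main() {
        // Hypothetical script: silence progress records, then print a file.
        let script = "$ProgressPreference = 'SilentlyContinue'; \
                      [Console]::Out.Write([System.IO.File]::ReadAllText('patch.txt'))";
        let encoded = encode_powershell_command(script);
        println!("powershell.exe -NonInteractive -EncodedCommand {encoded}");
    }
    ```

    Because the script travels as a single Base64 token, `cmd /c` has no
    quotes or metacharacters left to misinterpret.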
    
    ## Why this fix fixes the flakiness
    The outer harness now runs under a deterministic shell, and the inner
    PowerShell read no longer depends on fragile `cmd` quoting or on
    progress output staying quiet by accident. The shell tool returns only
    the file contents, so patch construction and websocket assertions depend
    on stable test inputs instead of on runner-specific shell behavior.
    
    ---------
    
    Co-authored-by: Ahmed Ibrahim <219906144+aibrahim-oai@users.noreply.github.com>
    Co-authored-by: Codex <noreply@openai.com>
  • Apply argument comment lint across codex-rs (#14652)
    ## Why
    
    Once the repo-local lint exists, `codex-rs` needs to follow the
    checked-in convention and CI needs to keep it from drifting. This commit
    applies the fallback `/*param*/` style consistently across existing
    positional literal call sites without changing those APIs.
    
    The longer-term preference is still to avoid APIs that require comments
    by choosing clearer parameter types and call shapes. This PR is
    intentionally the mechanical follow-through for the places where the
    existing signatures stay in place.
    
    After rebasing onto newer `main`, the rollout also had to cover newly
    introduced `tui_app_server` call sites. That made it clear the first cut
    of the CI job was too expensive for the common path: it was spending
    almost as much time installing `cargo-dylint` and re-testing the lint
    crate as a representative test job spends running product tests. The CI
    update keeps the full workspace enforcement but trims that extra
    overhead from ordinary `codex-rs` PRs.
    
    ## What changed
    
    - keep a dedicated `argument_comment_lint` job in `rust-ci`
    - mechanically annotate remaining opaque positional literals across
    `codex-rs` with exact `/*param*/` comments, including the rebased
    `tui_app_server` call sites that now fall under the lint
    - keep the checked-in style aligned with the lint policy by using
    `/*param*/` and leaving string and char literals uncommented
    - cache `cargo-dylint`, `dylint-link`, and the relevant Cargo
    registry/git metadata in the lint job
    - split changed-path detection so the lint crate's own `cargo test` step
    runs only when `tools/argument-comment-lint/*` or `rust-ci.yml` changes
    - continue to run the repo wrapper over the `codex-rs` workspace, so
    product-code enforcement is unchanged
    
    Most of the code changes in this commit are intentionally mechanical
    comment rewrites or insertions driven by the lint itself.
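
    As an illustration of the convention, here is a small sketch. The
    function and its parameters are hypothetical, and it assumes the comment
    is placed directly before the literal it names; the authoritative rules
    are whatever `tools/argument-comment-lint` enforces.

    ```rust
    // Hypothetical function, used only to illustrate the comment style.
    fn schedule_retry(max_attempts: u32, backoff_ms: u64, jitter: bool, label: &str) -> String {
        format!("{label}: attempts={max_attempts} backoff={backoff_ms}ms jitter={jitter}")
    }

    fn main() {
        // Opaque positional literals get a `/*param*/` comment naming the
        // parameter. The string literal is left uncommented, matching the
        // policy described above.
        let summary = schedule_retry(
            /*max_attempts*/ 3,
            /*backoff_ms*/ 250,
            /*jitter*/ true,
            "upload",
        );
        println!("{summary}");
    }
    ```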
    
    ## Verification
    
    - `./tools/argument-comment-lint/run.sh --workspace`
    - `cargo test -p codex-tui-app-server -p codex-tui`
    - parsed `.github/workflows/rust-ci.yml` locally with PyYAML
    
    ---
    
    * -> #14652
    * #14651
  • Add openai_base_url config override for built-in provider (#12031)
    We regularly get bug reports from users who mistakenly have the
    `OPENAI_BASE_URL` environment variable set. This PR deprecates this
    environment variable in favor of a top-level config key
    `openai_base_url` that is used for the same purpose. By making it a
    config key, it will be more visible to users. It will also participate
    in all of the infrastructure we've added for layered and managed
    configs.
    
    Summary
    - introduce the `openai_base_url` top-level config key, update
    schema/tests, and route the built-in openai provider through it
    - fall back to deprecated `OPENAI_BASE_URL` env var but warn user of
    deprecation when no `openai_base_url` config key is present
    - update CLI, SDK, and TUI code to prefer the new config path (with a
    deprecated env-var fallback) and document the SDK behavior change
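
    The precedence described above can be sketched roughly as follows. The
    function, the enum, and the default URL constant are assumptions made
    for illustration and are not the actual `codex-core` implementation.

    ```rust
    /// Where the effective base URL came from.
    #[derive(Debug, PartialEq)]
    enum BaseUrlSource {
        Config,
        DeprecatedEnv,
        Default,
    }

    // Assumed default for illustration; not taken from this PR.
    const DEFAULT_BASE_URL: &str = "https://api.openai.com/v1";

    /// Prefer the config key, fall back to the deprecated env var (with a
    /// warning), and otherwise use the default.
    fn resolve_openai_base_url(
        config_value: Option<&str>,
        env_value: Option<&str>,
    ) -> (String, BaseUrlSource) {
        if let Some(url) = config_value {
            return (url.to_string(), BaseUrlSource::Config);
        }
        if let Some(url) = env_value {
            eprintln!("warning: OPENAI_BASE_URL is deprecated; set `openai_base_url` in config");
            return (url.to_string(), BaseUrlSource::DeprecatedEnv);
        }
        (DEFAULT_BASE_URL.to_string(), BaseUrlSource::Default)
    }

    fn main() {
        let env_value = std::env::var("OPENAI_BASE_URL").ok();
        let (url, source) = resolve_openai_base_url(None, env_value.as_deref());
        println!("{url} ({source:?})");
    }
    ```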
  • feat(app-server): propagate traces across tasks and core ops (#14387)
    ## Summary
    
    This PR keeps app-server RPC request trace context alive for the full
    lifetime of the work that request kicks off (e.g. for `thread/start`,
    this is `app-server rpc handler -> tokio background task -> core op
    submissions`). Previously, we lost trace lineage once the request handler
    returns or hands work off to background tasks.
    
    This approach is especially relevant for `thread/start` and other RPC
    handlers that run in a non-blocking way. In the near future we'll most
    likely want to make all app-server handlers run in a non-blocking way by
    default, and only queue operations that must operate in order (e.g.
    thread RPCs per thread?), so we want to make sure tracing in app-server
    just generally works.
    
    Depends on https://github.com/openai/codex/pull/14300
    
    **Before**
    <img width="155" height="207" alt="image"
    src="https://github.com/user-attachments/assets/c9487459-36f1-436c-beb7-fafeb40737af"
    />
    
    
    **After**
    <img width="299" height="337" alt="image"
    src="https://github.com/user-attachments/assets/727392b2-d072-4427-9dc4-0502d8652dea"
    />
    
    ## What changed
    
    - Keep request-scoped trace context around until we send the final
    response or error, or the connection closes.
    - Thread that trace context through detached `thread/start` work so
    background startup stays attached to the originating request.
    - Pass request trace context through to downstream core operations,
    including:
      - thread creation
      - resume/fork flows
      - turn submission
      - review
      - interrupt
      - realtime conversation operations
    - Add tracing tests that verify:
      - remote W3C trace context is preserved for `thread/start`
      - remote W3C trace context is preserved for `turn/start`
      - downstream core spans stay under the originating request span
      - request-scoped tracing state is cleaned up correctly
    - Clean up shutdown behavior so detached background tasks and spawned
    threads are drained before process exit.
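
    The core idea, carrying an owned copy of the request's trace context
    into detached work, can be sketched with the standard library alone.
    This is a simplified illustration: `TraceContext`, `handle_request`, and
    the use of `std::thread` in place of Tokio tasks and real tracing spans
    are all assumptions, and only the `traceparent` layout comes from the
    W3C Trace Context format.

    ```rust
    use std::thread;

    /// Minimal request-scoped trace context (hypothetical type).
    #[derive(Clone, Debug, PartialEq)]
    struct TraceContext {
        trace_id: String,
        parent_span_id: String,
    }

    /// Parse a W3C `traceparent` header: `version-traceid-parentid-flags`.
    fn parse_traceparent(header: &str) -> Option<TraceContext> {
        let parts: Vec<&str> = header.split('-').collect();
        if parts.len() != 4 || parts[1].len() != 32 || parts[2].len() != 16 {
            return None;
        }
        let is_hex = |s: &str| s.chars().all(|c| c.is_ascii_hexdigit());
        if !is_hex(parts[1]) || !is_hex(parts[2]) {
            return None;
        }
        Some(TraceContext {
            trace_id: parts[1].to_string(),
            parent_span_id: parts[2].to_string(),
        })
    }

    /// Detached work receives an owned copy of the context, so it can still
    /// report the originating trace after the handler has returned.
    fn handle_request(ctx: TraceContext) -> thread::JoinHandle<String> {
        thread::spawn(move || format!("background work under trace {}", ctx.trace_id))
    }

    fn main() {
        let header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        let ctx = parse_traceparent(header).expect("valid traceparent");
        let handle = handle_request(ctx);
        // Joining before exit mirrors the shutdown draining described above.
        println!("{}", handle.join().unwrap());
    }
    ```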
  • feat: support disabling bundled system skills (#13792)
    Support disabling bundled system skills via config:
    
    [skills.bundled]
    enabled = false
  • core: box wrapper futures to reduce stack pressure (#13429)
    Follow-up to [#13388](https://github.com/openai/codex/pull/13388). This
    uses the same general fix pattern as
    [#12421](https://github.com/openai/codex/pull/12421), but in the
    `codex-core` compact/resume/fork path.
    
    ## Why
    
    `compact_resume_after_second_compaction_preserves_history` started
    overflowing the stack on Windows CI after `#13388`.
    
    The important part is that this was not a compaction-recursion bug. The
    test exercises a path with several thin `async fn` wrappers around much
    larger thread-spawn, resume, and fork futures. When one `async fn`
    awaits another inline, the outer future stores the callee future as part
    of its own state machine. In a long wrapper chain, that means a caller
    can accidentally inline a lot more state than the source code suggests.
    
    That is exactly what was happening here:
    
    - `ThreadManager` convenience methods such as `start_thread`,
    `resume_thread_from_rollout`, and `fork_thread` were inlining the larger
    spawn/resume futures beneath them.
    - `core_test_support::test_codex` added another wrapper layer on top of
    those same paths.
    - `compact_resume_fork` adds a few more helpers, and this particular
    test drives the resume/fork path multiple times.
    
    On Windows, that was enough to push both the libtest thread and Tokio
    worker threads over the edge. The previous 8 MiB test-thread workaround
    proved the failure was stack-related, but it did not address the
    underlying future size.
    
    ## How This Was Debugged
    
    The useful debugging pattern here was to turn the CI-only failure into a
    local low-stack repro.
    
    1. First, remove the explicit large-stack harness so the test runs on
    the normal `#[tokio::test]` path.
    2. Build the test binary normally.
    3. Re-run the already-built `tests/all` binary directly with
    progressively smaller `RUST_MIN_STACK` values.
    
    Running the built binary directly matters: it keeps the reduced stack
    size focused on the test process instead of also applying it to `cargo`
    and `rustc`.
    
    That made it possible to answer two questions quickly:
    
    - Does the failure still reproduce without the workaround? Yes.
    - Does boxing the wrapper futures actually buy back stack headroom? Also
    yes.
    
    After this change, the built test binary passes with
    `RUST_MIN_STACK=917504` and still overflows at `786432`, which is enough
    evidence to justify removing the explicit 8 MiB override while keeping a
    deterministic low-stack repro for future debugging.
    
    If we hit a similar issue again, the first places to inspect are thin
    `async fn` wrappers that mostly forward into a much larger async
    implementation.
    
    ## `Box::pin()` Primer
    
    `async fn` compiles into a state machine. If a wrapper does this:
    
    ```rust
    async fn wrapper() {
        inner().await;
    }
    ```
    
    then `wrapper()` stores the full `inner()` future inline as part of its
    own state.
    
    If the wrapper instead does this:
    
    ```rust
    async fn wrapper() {
        Box::pin(inner()).await;
    }
    ```
    
    then the child future lives on the heap, and the outer future only
    stores a pinned pointer to it. That usually trades one allocation for a
    substantially smaller outer future, which is exactly the tradeoff we
    want when the problem is stack pressure rather than raw CPU time.
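
    One way to see the difference is to compare future sizes directly with
    `std::mem::size_of_val`. This is a standalone sketch with made-up
    function names, not code from this PR; exact sizes depend on the
    compiler, so it prints them rather than asserting specific numbers.

    ```rust
    use std::future::{ready, Future};
    use std::mem::size_of_val;

    /// A future that keeps a large buffer alive across an await point.
    async fn inner(seed: u8) -> u8 {
        let buf = [seed; 4096];
        ready(()).await;
        buf[buf.len() - 1]
    }

    /// Stores the whole `inner` future inline.
    async fn wrapper_inline(seed: u8) -> u8 {
        inner(seed).await
    }

    /// Stores only a pinned heap pointer to the `inner` future.
    async fn wrapper_boxed(seed: u8) -> u8 {
        Box::pin(inner(seed)).await
    }

    /// Size in bytes of a future's state machine, without polling it.
    fn future_size<F: Future>(fut: &F) -> usize {
        size_of_val(fut)
    }

    fn main() {
        let inline = future_size(&wrapper_inline(1));
        let boxed = future_size(&wrapper_boxed(1));
        println!("inline wrapper: {inline} bytes, boxed wrapper: {boxed} bytes");
    }
    ```

    The inline wrapper is at least as large as the 4 KiB buffer it embeds,
    while the boxed wrapper stays close to pointer size.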
    
    Useful references:
    
    - [`Box::pin`](https://doc.rust-lang.org/std/boxed/struct.Box.html#method.pin)
    - [Async book: Pinning](https://rust-lang.github.io/async-book/04_pinning/01_chapter.html)
    
    ## What Changed
    
    - Boxed the wrapper futures in `core/src/thread_manager.rs` around
    `start_thread`, `resume_thread_from_rollout`, `fork_thread`, and the
    corresponding `ThreadManagerState` spawn helpers so callers no longer
    inline the full spawn/resume state machine through multiple layers.
    - Boxed the matching test-only wrapper futures in
    `core/tests/common/test_codex.rs` and
    `core/tests/suite/compact_resume_fork.rs`, which sit directly on top of
    the same path.
    - Restored `compact_resume_after_second_compaction_preserves_history` in
    `core/tests/suite/compact_resume_fork.rs` to a normal `#[tokio::test]`
    and removed the explicit `TEST_STACK_SIZE_BYTES` thread/runtime sizing.
    - Simplified a tiny helper in `compact_resume_fork` by making
    `fetch_conversation_path()` synchronous, which removes one more
    unnecessary future layer from the test path.
    
    ## Verification
    
    - `cargo test -p codex-core --test all
    suite::compact_resume_fork::compact_resume_after_second_compaction_preserves_history
    -- --exact --nocapture`
    - `cargo test -p codex-core --test all suite::compact_resume_fork --
    --nocapture`
    - Re-ran the built `codex-core` `tests/all` binary directly with reduced
    stack sizes:
      - `RUST_MIN_STACK=917504` passes
      - `RUST_MIN_STACK=786432` still overflows
    - `cargo test -p codex-core`
    - Still fails locally in unrelated existing integration areas that
    expect the `codex` / `test_stdio_server` binaries or hit the existing
    `search_tool` wiremock mismatches.
  • config: enforce enterprise feature requirements (#13388)
    ## Why
    
    Enterprises can already constrain approvals, sandboxing, and web search
    through `requirements.toml` and MDM, but feature flags were still only
    configurable as managed defaults. That meant an enterprise could suggest
    feature values, but it could not actually pin them.
    
    This change closes that gap and makes enterprise feature requirements
    behave like the other constrained settings. The effective feature set
    now stays consistent with enterprise requirements during config load,
    when config writes are validated, and when runtime code mutates feature
    flags later in the session.
    
    It also tightens the runtime API for managed features. `ManagedFeatures`
    now follows the same constraint-oriented shape as `Constrained<T>`
    instead of exposing panic-prone mutation helpers, and production code
    can no longer construct it through an unconstrained `From<Features>`
    path.
    
    The PR also hardens the `compact_resume_fork` integration coverage on
    Windows. After the feature-management changes,
    `compact_resume_after_second_compaction_preserves_history` was
    overflowing the libtest/Tokio thread stacks on Windows, so the test now
    uses an explicit larger-stack harness as a pragmatic mitigation. That
    may not be the ideal root-cause fix, and it merits a parallel
    investigation into whether part of the async future chain should be
    boxed to reduce stack pressure instead.
    
    ## What Changed
    
    Enterprises can now pin feature values in `requirements.toml` with the
    requirements-side `features` table:
    
    ```toml
    [features]
    personality = true
    unified_exec = false
    ```
    
    Only canonical feature keys are allowed in the requirements `features`
    table; omitted keys remain unconstrained.
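
    A rough model of the pinning semantics, for illustration only:
    `PinnedFeatures` and its methods are hypothetical and much simpler than
    the real `ManagedFeatures` type, and `some_other_feature` is a made-up
    key.

    ```rust
    use std::collections::BTreeMap;

    /// Simplified stand-in for a constrained feature set (hypothetical type).
    struct PinnedFeatures {
        pinned: BTreeMap<String, bool>,
        effective: BTreeMap<String, bool>,
    }

    impl PinnedFeatures {
        fn new(pinned: &[(&str, bool)]) -> Self {
            let pinned: BTreeMap<String, bool> =
                pinned.iter().map(|(k, v)| (k.to_string(), *v)).collect();
            // Pinned keys start at their required value.
            let effective = pinned.clone();
            Self { pinned, effective }
        }

        /// A value is allowed if the key is unpinned or matches the pin.
        fn can_set(&self, key: &str, enabled: bool) -> Result<(), String> {
            match self.pinned.get(key) {
                Some(&required) if required != enabled => {
                    Err(format!("feature `{key}` is pinned to {required}"))
                }
                _ => Ok(()),
            }
        }

        /// Apply a change only after it passes the constraint check.
        fn set_enabled(&mut self, key: &str, enabled: bool) -> Result<(), String> {
            self.can_set(key, enabled)?;
            self.effective.insert(key.to_string(), enabled);
            Ok(())
        }

        fn is_enabled(&self, key: &str) -> bool {
            self.effective.get(key).copied().unwrap_or(false)
        }
    }

    fn main() {
        let mut features =
            PinnedFeatures::new(&[("personality", true), ("unified_exec", false)]);
        // Conflicts with the pin, so the change is rejected.
        println!("{:?}", features.set_enabled("unified_exec", true));
        // Unpinned key, so the change is accepted.
        println!("{:?}", features.set_enabled("some_other_feature", true));
        println!("unified_exec enabled: {}", features.is_enabled("unified_exec"));
    }
    ```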
    
    - Added a requirements-side pinned feature map to
    `ConfigRequirementsToml`, threaded it through source-preserving
    requirements merge and normalization in `codex-config`, and made the
    TOML surface use `[features]` (while still accepting legacy
    `[feature_requirements]` for compatibility).
    - Exposed `featureRequirements` from `configRequirements/read`,
    regenerated the JSON/TypeScript schema artifacts, and updated the
    app-server README.
    - Wrapped the effective feature set in `ManagedFeatures`, backed by
    `ConstrainedWithSource<Features>`, and changed its API to mirror
    `Constrained<T>`: `can_set(...)`, `set(...) -> ConstraintResult<()>`,
    and result-returning `enable` / `disable` / `set_enabled` helpers.
    - Removed the legacy-usage and bulk-map passthroughs from
    `ManagedFeatures`; callers that need those behaviors now mutate a plain
    `Features` value and reapply it through `set(...)`, so the constrained
    wrapper remains the enforcement boundary.
    - Removed the production loophole for constructing unconstrained
    `ManagedFeatures`. Non-test code now creates it through the configured
    feature-loading path, and `impl From<Features> for ManagedFeatures` is
    restricted to `#[cfg(test)]`.
    - Rejected legacy feature aliases in enterprise feature requirements,
    and return a load error when a pinned combination cannot survive
    dependency normalization.
    - Validated config writes against enterprise feature requirements before
    persisting changes, including explicit conflicting writes and
    profile-specific feature states that normalize into invalid
    combinations.
    - Updated runtime and TUI feature-toggle paths to use the constrained
    setter API and to persist or apply the effective post-constraint value
    rather than the requested value.
    - Updated the `core_test_support` Bazel target to include the bundled
    core model-catalog fixtures in its runtime data, so helper code that
    resolves `core/models.json` through runfiles works in remote Bazel test
    environments.
    - Renamed the core config test coverage to emphasize that effective
    feature values are normalized at runtime, while conflicting persisted
    config writes are rejected.
    - Ran `compact_resume_after_second_compaction_preserves_history` inside
    an explicit 8 MiB test thread and Tokio runtime worker stack, following
    the existing larger-stack integration-test pattern, to keep the Windows
    `compact_resume_fork` test slice from aborting while a parallel
    investigation continues into whether some of the underlying async
    futures should be boxed.
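
    The explicit-stack pattern looks roughly like the sketch below. The
    helper and the recursive workload are hypothetical stand-ins; the real
    harness also sizes the Tokio worker stacks (for example through
    `tokio::runtime::Builder::thread_stack_size`), which this std-only
    sketch leaves out.

    ```rust
    use std::thread;

    // 8 MiB, matching the size mentioned above.
    const LARGE_STACK_BYTES: usize = 8 * 1024 * 1024;

    /// Run `f` on a dedicated thread with an explicit stack size.
    fn run_with_large_stack<T, F>(f: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        thread::Builder::new()
            .stack_size(LARGE_STACK_BYTES)
            .spawn(f)
            .expect("failed to spawn test thread")
            .join()
            .expect("test thread panicked")
    }

    /// Deliberately stack-hungry recursion standing in for a deep call chain.
    fn depth_sum(n: u64) -> u64 {
        let pad = [n; 64]; // 512 bytes of locals per frame
        if n == 0 {
            0
        } else {
            pad[0] + depth_sum(n - 1)
        }
    }

    fn main() {
        let total = run_with_large_stack(|| depth_sum(1_000));
        println!("{total}");
    }
    ```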
    
    ## Verification
    
    - `cargo test -p codex-config`
    - `cargo test -p codex-core feature_requirements_ -- --nocapture`
    - `cargo test -p codex-core
    load_requirements_toml_produces_expected_constraints -- --nocapture`
    - `cargo test -p codex-core
    compact_resume_after_second_compaction_preserves_history -- --nocapture`
    - `cargo test -p codex-core compact_resume_fork -- --nocapture`
    - Re-ran the built `codex-core` `tests/all` binary with
    `RUST_MIN_STACK=262144` for
    `compact_resume_after_second_compaction_preserves_history` to confirm
    the explicit-stack harness fixes the deterministic low-stack repro.
    - `cargo test -p codex-core`
    - This still fails locally in unrelated integration areas that expect
    the `codex` / `test_stdio_server` binaries or hit existing `search_tool`
    wiremock mismatches.
    
    ## Docs
    
    `developers.openai.com/codex` should document the requirements-side
    `[features]` table for enterprise and MDM-managed configuration,
    including that it only accepts canonical feature keys and that
    conflicting config writes are rejected.
  • add fast mode toggle (#13212)
    - add a local Fast mode setting in codex-core (similar to how model id
    is currently stored on disk locally)
    - send `service_tier=priority` on requests when Fast is enabled
    - add `/fast` in the TUI and persist it locally
    - feature flag
  • Update realtime websocket API (#13265)
    - migrate the realtime websocket transport to the new session and
    handoff flow
    - make the realtime model configurable in config.toml and use API-key
    auth for the websocket
    
    ---------
    
    Co-authored-by: Codex <noreply@openai.com>
  • Support multimodal custom tool outputs (#12948)
    ## Summary
    
    This changes `custom_tool_call_output` to use the same output payload
    shape as `function_call_output`, so freeform tools can return either
    plain text or structured content items.
    
    The main goal is to let `js_repl` return image content from nested
    `view_image` calls in its own `custom_tool_call_output`, instead of
    relying on a separate injected message.
    
    ## What changed
    
    - Changed `custom_tool_call_output.output` from `string` to
    `FunctionCallOutputPayload`
    - Updated freeform tool plumbing to preserve structured output bodies
    - Updated `js_repl` to aggregate nested tool content items and attach
    them to the outer `js_repl` result
    - Removed the old `js_repl` special case that injected `view_image`
    results as a separate pending user image message
    - Updated normalization/history/truncation paths to handle multimodal
    `custom_tool_call_output`
    - Regenerated app-server protocol schema artifacts
    
    ## Behavior
    
    Direct `view_image` calls still return a `function_call_output` with
    image content.
    
    When `view_image` is called inside `js_repl`, the outer `js_repl`
    `custom_tool_call_output` now carries:
    - an `input_text` item if the JS produced text output
    - one or more `input_image` items from nested tool results
    
    So the nested image result now stays inside the `js_repl` tool output
    instead of being injected as a separate message.
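
    A simplified model of that aggregation, as a sketch: the type and
    function names below are hypothetical stand-ins rather than the real
    protocol types, and the decision to return plain text when no nested
    images exist is an assumption of this example.

    ```rust
    /// One structured item in a tool output (variants follow the item kinds above).
    #[derive(Debug, Clone, PartialEq)]
    enum ContentItem {
        InputText { text: String },
        InputImage { image_url: String },
    }

    /// Hypothetical stand-in for a payload that is either plain text or items.
    #[derive(Debug, Clone, PartialEq)]
    enum OutputBody {
        Text(String),
        ContentItems(Vec<ContentItem>),
    }

    /// Combine the script's own text with images returned by nested tool calls.
    fn aggregate_output(text: &str, nested_images: &[&str]) -> OutputBody {
        if nested_images.is_empty() {
            return OutputBody::Text(text.to_string());
        }
        let mut items = Vec::new();
        if !text.is_empty() {
            items.push(ContentItem::InputText { text: text.to_string() });
        }
        items.extend(
            nested_images
                .iter()
                .map(|url| ContentItem::InputImage { image_url: url.to_string() }),
        );
        OutputBody::ContentItems(items)
    }

    fn main() {
        let body = aggregate_output("rendered chart", &["data:image/png;base64,AAAA"]);
        println!("{body:?}");
    }
    ```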
    
    ## Compatibility
    
    This is intended to be backward-compatible for resumed conversations.
    
    Older histories that stored `custom_tool_call_output.output` as a plain
    string still deserialize correctly, and older histories that used the
    previous injected-image-message flow also continue to resume.
    
    Added regression coverage for resuming a pre-change rollout containing:
    - string-valued `custom_tool_call_output`
    - legacy injected image message history
    
    
    #### [git stack](https://github.com/magus/git-stack-cli)
    - 👉 `1` https://github.com/openai/codex/pull/12948
  • Allow clients not to send summary as an option (#12950)
    Summary is a required parameter on UserTurn. Ideally we'd like the core
    to decide the appropriate summary level.
    
    Make the summary optional and don't send it when not needed.
  • Agent jobs (spawn_agents_on_csv) + progress UI (#10935)
    ## Summary
    - Add agent job support: spawn a batch of sub-agents from CSV, auto-run,
    auto-export, and store results in SQLite.
    - Simplify workflow: remove run/resume/get-status/export tools; spawn is
    deterministic and completes in one call.
    - Improve exec UX: stable, single-line progress bar with ETA; suppress
    sub-agent chatter in exec.
    
    ## Why
    Enables map-reduce style workflows over arbitrarily large repos using
    the existing Codex orchestrator. This addresses review feedback about
    overly complex job controls and non-deterministic monitoring.
    
    ## Demo (progress bar)
    ```
    ./codex-rs/target/debug/codex exec \
      --enable collab \
      --enable sqlite \
      --full-auto \
      --progress-cursor \
      -c agents.max_threads=16 \
      -C /Users/daveaitel/code/codex \
      - <<'PROMPT'
    Create /tmp/agent_job_progress_demo.csv with columns: path,area and 30 rows:
    path = item-01..item-30, area = test.
    
    Then call spawn_agents_on_csv with:
    - csv_path: /tmp/agent_job_progress_demo.csv
    - instruction: "Run `python - <<'PY'` to sleep a random 0.3–1.2s, then output JSON with keys: path, score (int). Set score = 1."
    - output_csv_path: /tmp/agent_job_progress_demo_out.csv
    PROMPT
    ```
    
    ## Review feedback addressed
    - Auto-start jobs on spawn; removed run/resume/status/export tools.
    - Auto-export on success.
    - More descriptive tool spec + clearer prompts.
    - Avoid deadlocks on spawn failure; pending/running handled safely.
    - Progress bar no longer scrolls; stable single-line redraw.
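
    A stable single-line redraw with an ETA can be sketched as below. This
    is an illustration only: `render_progress` is a hypothetical helper, and
    the ETA assumes the remaining items take as long on average as the
    finished ones, which may differ from what `codex exec` does.

    ```rust
    use std::io::{self, Write};

    /// Render a fixed-width, single-line progress bar with a rough ETA.
    fn render_progress(done: usize, total: usize, elapsed_secs: f64, width: usize) -> String {
        let filled = if total == 0 { width } else { done.min(total) * width / total };
        let bar: String = "#".repeat(filled) + &"-".repeat(width - filled);
        // No ETA before the first item finishes or once everything is done.
        let eta = if done == 0 || total <= done {
            None
        } else {
            Some(elapsed_secs / done as f64 * (total - done) as f64)
        };
        match eta {
            Some(secs) => format!("[{bar}] {done}/{total} ETA {secs:.0}s"),
            None => format!("[{bar}] {done}/{total}"),
        }
    }

    fn main() {
        let total = 30;
        for done in 0..=total {
            // `\r` returns to the start of the line so each update overwrites
            // the previous one; the padding clears leftover characters.
            print!("\r{:<48}", render_progress(done, total, done as f64 * 0.5, 20));
            io::stdout().flush().unwrap();
        }
        println!();
    }
    ```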
    
    ## Tests
    - `cd codex-rs && cargo test -p codex-exec`
    - `cd codex-rs && cargo build -p codex-cli`
  • chore: remove codex-core public protocol/shell re-exports (#12432)
    ## Why
    
    `codex-rs/core/src/lib.rs` re-exported a broad set of types and modules
    from `codex-protocol` and `codex-shell-command`. That made it easy for
    workspace crates to import those APIs through `codex-core`, which in
    turn hides dependency edges and makes it harder to reduce compile-time
    coupling over time.
    
    This change removes those public re-exports so call sites must import
    from the source crates directly. Even when a crate still depends on
    `codex-core` today, this makes dependency boundaries explicit and
    unblocks future work to drop `codex-core` dependencies where possible.
    
    ## What Changed
    
    - Removed public re-exports from `codex-rs/core/src/lib.rs` for:
      - `codex_protocol::protocol` and related protocol/model types (including
        `InitialHistory`)
      - `codex_protocol::config_types` (`protocol_config_types`)
      - `codex_shell_command::{bash, is_dangerous_command, is_safe_command,
        parse_command, powershell}`
    - Migrated workspace Rust call sites to import directly from:
      - `codex_protocol::protocol`
      - `codex_protocol::config_types`
      - `codex_protocol::models`
      - `codex_shell_command`
    - Added explicit `Cargo.toml` dependencies (`codex-protocol` /
    `codex-shell-command`) in crates that now import those crates directly.
    - Kept `codex-core` internal modules compiling by using `pub(crate)`
    aliases in `core/src/lib.rs` (internal-only, not part of the public
    API).
    - Updated the two utility crates that can already drop a `codex-core`
    dependency edge entirely:
      - `codex-utils-approval-presets`
      - `codex-utils-cli`
    
    ## Verification
    
    - `cargo test -p codex-utils-approval-presets`
    - `cargo test -p codex-utils-cli`
    - `cargo check --workspace --all-targets`
    - `just clippy`
  • Remove test-support feature from codex-core and replace it with explicit test toggles (#11405)
    ## Why
    
    `codex-core` was being built in multiple feature-resolved permutations
    because test-only behavior was modeled as crate features. For a large
    crate, those permutations increase compile cost and reduce cache reuse.
    
    ## Net Change
    
    - Removed the `test-support` crate feature and related feature wiring so
    `codex-core` no longer needs separate feature shapes for test consumers.
    - Standardized cross-crate test-only access behind
    `codex_core::test_support`.
    - External test code now imports helpers from
    `codex_core::test_support`.
    - Underlying implementation hooks are kept internal (`pub(crate)`)
    instead of broadly public.
    
    ## Outcome
    
    - Fewer `codex-core` build permutations.
    - Better incremental cache reuse across test targets.
    - No intended production behavior change.
  • Remove WebSocket wire format (#10179)
    I'd like `WireApi` to go away (when chat is removed), and WebSockets is
    still the Responses API, just over a different transport.
  • Use test_codex more (#9961)
    Reduces boilerplate.
  • feat(core) update Personality on turn (#9644)
    ## Summary
    Support updating Personality mid-Thread via UserTurn/OverwriteTurn. This
    is explicitly unused by the clients so far, to simplify PRs - app-server
    and tui implementations will be follow-ups.
    
    ## Testing
    - [x] added integration tests
  • Add text element metadata to types (#9235)
    Initial type tweaking PR to make the diff of
    https://github.com/openai/codex/pull/9116 smaller
    
    This should not change any behavior, just adds some fields to types
  • Support response.done and add integration tests (#9129)
    The agent loop now uses a persistent, incremental WebSocket connection.
  • chore: unify conversation with thread name (#8830)
    Done and verified by Codex + refactor feature of RustRover
  • feat: agent controller (#8783)
    Added an agent control plane that lets sessions spawn or message other
    conversations via `AgentControl`.
    
    `AgentBus` (core/src/agent/bus.rs) keeps track of the last known status
    of a conversation.
    
    `ConversationManager` now holds shared state behind an `Arc`, so
    `AgentControl` keeps only a weak back-reference; the goal is simply to
    avoid an explicit reference cycle.
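
    The ownership shape is roughly the following. This is a sketch with
    hypothetical stand-in types (`Manager`, `Control`, `ManagerState`), not
    the actual `ConversationManager` / `AgentControl` code.

    ```rust
    use std::sync::{Arc, Mutex, Weak};

    /// Shared state owned by the manager (hypothetical stand-in).
    struct ManagerState {
        statuses: Mutex<Vec<String>>,
    }

    /// Owns the shared state through a strong reference.
    struct Manager {
        state: Arc<ManagerState>,
        control: Control,
    }

    /// Holds only a weak back-reference, so it does not keep the state alive.
    #[derive(Clone)]
    struct Control {
        state: Weak<ManagerState>,
    }

    impl Manager {
        fn new() -> Self {
            let state = Arc::new(ManagerState { statuses: Mutex::new(Vec::new()) });
            let control = Control { state: Arc::downgrade(&state) };
            Self { state, control }
        }
    }

    impl Control {
        /// Record a status if the manager still exists; return false otherwise.
        fn record(&self, status: &str) -> bool {
            match self.state.upgrade() {
                Some(state) => {
                    state.statuses.lock().unwrap().push(status.to_string());
                    true
                }
                None => false,
            }
        }
    }

    fn main() {
        let manager = Manager::new();
        let control = manager.control.clone();
        println!("recorded: {}", control.record("running"));
        println!("stored: {}", manager.state.statuses.lock().unwrap().len());
        drop(manager);
        // The state is gone once the manager is dropped, so this reports false.
        println!("recorded: {}", control.record("done"));
    }
    ```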
    
    Follow-ups:
    * Build a small tool in the TUI to see every agent and send a manual
    message to each of them
    * Handle approval requests in this TUI
    * Add tools to spawn/communicate between agents (see related design)
    * Define agent types
  • feat: introduce codex-utils-cargo-bin as an alternative to assert_cmd::Command (#8496)
    This PR introduces a `codex-utils-cargo-bin` utility crate that
    wraps/replaces our use of `assert_cmd::Command` and
    `escargot::CargoBuild`.
    
    As you can infer from the introduction of `buck_project_root()` in this
    PR, I am attempting to make it possible to build Codex under
    [Buck2](https://buck2.build) as well as `cargo`. With Buck2, I hope to
    achieve faster incremental local builds (largely due to Buck2's
    [dice](https://buck2.build/docs/insights_and_knowledge/modern_dice/)
    build strategy, as well as benefits from its local build daemon) as well
    as faster CI builds if we invest in remote execution and caching.
    
    See
    https://buck2.build/docs/getting_started/what_is_buck2/#why-use-buck2-key-advantages
    for more details about the performance advantages of Buck2.
    
    Buck2 enforces stronger requirements in terms of build and test
    isolation. It discourages assumptions about absolute paths (which is key
    to enabling remote execution). Because the `CARGO_BIN_EXE_*` environment
    variables that Cargo provides are absolute paths (which
    `assert_cmd::Command` reads), this is a problem for Buck2, which is why
    we need this `codex-utils-cargo-bin` utility.
    
    My WIP-Buck2 setup sets the `CARGO_BIN_EXE_*` environment variables
    passed to a `rust_test()` build rule as relative paths.
    `codex-utils-cargo-bin` will resolve these values to absolute paths,
    when necessary.
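
    The resolution step can be pictured like this. It is a minimal sketch:
    the function names are hypothetical, and using the current directory as
    the base is an assumption, since the base the real crate uses is not
    described here.

    ```rust
    use std::env;
    use std::io;
    use std::path::{Path, PathBuf};

    /// Resolve a possibly-relative binary path against `base`.
    fn resolve_bin_path(raw: &Path, base: &Path) -> PathBuf {
        if raw.is_absolute() {
            raw.to_path_buf()
        } else {
            base.join(raw)
        }
    }

    /// Look up a `CARGO_BIN_EXE_*` style variable and make its value absolute.
    fn cargo_bin_from_env(var: &str) -> io::Result<Option<PathBuf>> {
        // Assumption: relative values are interpreted against the current directory.
        let base = env::current_dir()?;
        Ok(env::var_os(var).map(|raw| resolve_bin_path(Path::new(&raw), &base)))
    }

    fn main() -> io::Result<()> {
        // Example variable name; it is only set when the build system provides it.
        match cargo_bin_from_env("CARGO_BIN_EXE_codex")? {
            Some(path) => println!("resolved: {}", path.display()),
            None => println!("CARGO_BIN_EXE_codex is not set"),
        }
        Ok(())
    }
    ```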
    
    
    ---
    [//]: # (BEGIN SAPLING FOOTER)
    Stack created with [Sapling](https://sapling-scm.com). Best reviewed
    with [ReviewStack](https://reviewstack.dev/openai/codex/pull/8496).
    * #8498
    * __->__ #8496
  • chore: migrate from Config::load_from_base_config_with_overrides to ConfigBuilder (#8276)
    https://github.com/openai/codex/pull/8235 introduced `ConfigBuilder`, and
    this PR updates all non-test call sites to use it instead of
    `Config::load_from_base_config_with_overrides()`.
    
    This is important because `load_from_base_config_with_overrides()` uses
    an empty `ConfigRequirements`, which is a reasonable default for testing
    so the tests are not influenced by the settings on the host. This method
    is now guarded by `#[cfg(test)]` so it cannot be used by business logic.
    
    Because `ConfigBuilder::build()` is `async`, many of the test methods
    had to be migrated to be `async`, as well. On the bright side, this made
    it possible to eliminate a bunch of `block_on_future()` stuff.
  • Reimplement skills loading using SkillsManager + skills/list op. (#7914)
    Refactor the way we load and manage skills:
    1. Move skill discovery/caching into SkillsManager and reuse it across
    sessions.
    2. Add the skills/list API (Op::ListSkills/SkillsListResponse) to fetch
    skills for one or more cwds. Also update app-server for VSCE/App.
    3. Trigger skills/list during session startup so UIs preload skills and
    handle errors immediately.
  • Inject SKILL.md when it's explicitly mentioned. (#7763)
    1. Skills load once in core at session start; the cached outcome is
    reused across core and surfaced to TUI via SessionConfigured.
    2. TUI detects explicit skill selections, and core injects the matching
    SKILL.md content into the turn when a selected skill is present.
  • make model optional in config (#7769)
    - Make Config.model optional and centralize default-selection logic in
    ModelsManager, including a default_model helper (with
    codex-auto-balanced when available) so sessions now carry an explicit
    chosen model separate from the base config.
    - Resolve `model` once in `core` and `tui` from config. Then store the
    state of it on other structs.
    - Move refreshing models to be before resolving the default model
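
The default-selection order described above might look roughly like the sketch below. The function name `resolve_model`, its parameters, and the explicit `fallback` argument are hypothetical; only the `codex-auto-balanced` slug comes from the commit message, and the real `ModelsManager` logic may differ.

```rust
/// Slug preferred as the default when the models list contains it
/// (taken from the commit message above).
const PREFERRED_DEFAULT: &str = "codex-auto-balanced";

/// Pick the model a session should use.
/// - `configured`: the optional `model` value from config.
/// - `available`: model slugs known after refreshing the models list.
/// - `fallback`: hypothetical last-resort default, used when the preferred
///   slug is not available.
fn resolve_model(configured: Option<&str>, available: &[&str], fallback: &str) -> String {
    // An explicit value in config always wins.
    if let Some(model) = configured {
        return model.to_string();
    }
    // Otherwise prefer the auto-balanced model when it is offered.
    if available.contains(&PREFERRED_DEFAULT) {
        PREFERRED_DEFAULT.to_string()
    } else {
        fallback.to_string()
    }
}
```

Because the choice depends on `available`, the models list has to be refreshed before this function runs, which matches the ordering change in the last bullet.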
  • remove model_family from `config` (#7571)
    - Remove `model_family` from `config`
    - Make sure to still override config elements related to `model_family`
    like supporting reasoning
  • Migrate model family to models manager (#7565)
    This PR moves `ModelsFamily` to `openai_models`. It also propagates
    `ModelsManager` to session services and uses it to drive the model
    family. We also make `derive_default_model_family` private because it is
    a step towards what we want: one place that provides model configuration.
    
    This is a second step towards having one source of truth for model
    information and config: `ModelsManager`.
    
    Next steps would be to remove `ModelsFamily` from config. That is a large
    change because it is used in 41 places, mostly before launching `codex`.
    Also, we need to make `find_family_for_model` private. That is also a
    large change because it is used in 21 places, nearly all of them in tests.
  • fix(apply_patch) tests for shell_command (#7307)
    ## Summary
    Adds test coverage for invocations of apply_patch via shell_command with
    heredoc, to validate behavior.
    
    ## Testing
    - [x] These are tests
  • feat: remote compaction (#6795)
    Co-authored-by: pakrym-oai <pakrym@openai.com>
  • chore(core) Add shell_serialization coverage (#6810)
    ## Summary
    Similar to #6545, this PR updates the shell_serialization test suite to
    cover the various `shell` tool invocations we have. Note that this does
    not cover unified_exec, which has its own suite of tests. This should
    provide some test coverage for when we eventually consolidate
    serialization logic.
    
    ## Testing
    - [x] These are tests
  • Promote shared helpers for suite tests (#6460)
    ## Summary
    - add `TestCodex::submit_turn_with_policies` and extend the response
    helpers with reusable tool-call utilities
    - update the grep_files, read_file, list_dir, shell_serialization, and
    tools suites to rely on the shared helpers instead of local copies
    - make the list_dir helper return `anyhow::Result` so clippy no longer
    warns about `expect`
    
    ## Testing
    - `just fix -p codex-core`
    - `cargo test -p codex-core --test all
    suite::grep_files::grep_files_tool_collects_matches`
    - `cargo test -p codex-core
    suite::grep_files::grep_files_tool_collects_matches -- --ignored`
    (filter requests ignored tests so nothing runs, but the build stays
    clean)
    
    
    ------
    [Codex
    Task](https://chatgpt.com/codex/tasks/task_i_69112d53abac83219813cab4d7cb6446)
  • chore(core) Consolidate apply_patch tests (#6545)
    ## Summary
    Consolidates our apply_patch tests into one suite, and ensures each test
    case tests the various ways the harness supports apply_patch:
    1. Freeform custom tool call
    2. JSON function tool
    3. Simple shell call
    4. Heredoc shell call
    
    There are a few test cases that are specific to a particular variant,
    I've left those alone.
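
One way to run a single test body against all four invocation styles is sketched below. The enum, its variant names, and `run_for_all_variants` are hypothetical illustrations, not the helpers the suite actually uses.

```rust
/// The four ways the harness can invoke apply_patch, as listed above.
/// Hypothetical names chosen for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ApplyPatchVariant {
    Freeform,
    Function,
    Shell,
    ShellHeredoc,
}

impl ApplyPatchVariant {
    /// Every variant, in the order given in the summary.
    const ALL: [ApplyPatchVariant; 4] = [
        ApplyPatchVariant::Freeform,
        ApplyPatchVariant::Function,
        ApplyPatchVariant::Shell,
        ApplyPatchVariant::ShellHeredoc,
    ];
}

/// Run `case` once per variant and return the variants for which it failed.
/// An empty result means the test case passed for every invocation style.
fn run_for_all_variants<F>(mut case: F) -> Vec<ApplyPatchVariant>
where
    F: FnMut(ApplyPatchVariant) -> Result<(), String>,
{
    let mut failures = Vec::new();
    for variant in ApplyPatchVariant::ALL {
        if case(variant).is_err() {
            failures.push(variant);
        }
    }
    failures
}
```

Test cases that only apply to one variant can skip this helper and call their variant directly, which is the situation the paragraph above describes.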
    
    ## Testing
    - [x] This adds a significant number of tests
  • Set verbosity to low for 5.1 (#6568)
    And improve test coverage
  • chore: testing on freeform apply_patch (#5952)
    ## Summary
    Duplicates the tests in `apply_patch_cli.rs`, but tests the freeform
    apply_patch tool as opposed to the function call path. The good news is
    that all the tests pass with zero logical changes, with the exception of
    the heredoc test, which doesn't really make sense in the freeform tool
    context anyway.
    
    @jif-oai since you wrote the original tests in #5557, I'd love your
    opinion on the right way to DRY these test cases between the two. Happy
    to set up a more sophisticated harness, but didn't want to go down the
    rabbit hole until we agreed on the right pattern
    
    ## Testing
    - [x] These are tests
  • feat: deprecation warning (#5825)
  • fix: apply_patch shell_serialization tests (#4786)
    ## Summary
    Adds additional shell_serialization tests specifically for apply_patch
    and other cases.
    
    ## Test Plan
    - [x] These are all tests
  • [MCP] Add support for MCP Oauth credentials (#4517)
    This PR adds OAuth login support to streamable HTTP servers when
    `experimental_use_rmcp_client` is enabled.
    
    This PR is large but represents the minimal amount of work required for
    this to work. To keep this PR smaller, login can only be done with
    `codex mcp login` and `codex mcp logout` but it doesn't appear in `/mcp`
    or `codex mcp list` yet. Fingers crossed that this is the last large MCP
    PR and that subsequent PRs can be smaller.
    
    Under the hood, credentials are stored in platform credential managers
    via the [keyring crate](https://crates.io/crates/keyring). When the
    keyring isn't available, it falls back to storing credentials in
    `CODEX_HOME/.credentials.json`, which is consistent with how other
    coding agents handle authentication.
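
The keyring-then-file fallback can be pictured with the sketch below. It deliberately does not use the keyring crate: `CredentialStore`, `KeyringStore`, `FileStore`, and `FallbackStore` are hypothetical in-memory stand-ins, so this shows only the fallback logic, not the PR's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over a place where credentials can be kept.
trait CredentialStore {
    fn save(&mut self, server: &str, token: &str) -> Result<(), String>;
    fn load(&self, server: &str) -> Option<String>;
}

/// Stand-in for a platform keyring that may be unavailable on this machine.
struct KeyringStore {
    available: bool,
    entries: HashMap<String, String>,
}

impl CredentialStore for KeyringStore {
    fn save(&mut self, server: &str, token: &str) -> Result<(), String> {
        if !self.available {
            return Err("keyring unavailable".to_string());
        }
        self.entries.insert(server.to_string(), token.to_string());
        Ok(())
    }
    fn load(&self, server: &str) -> Option<String> {
        if self.available { self.entries.get(server).cloned() } else { None }
    }
}

/// Stand-in for the file-backed store; a real one would read and write
/// `CODEX_HOME/.credentials.json` instead of an in-memory map.
struct FileStore {
    entries: HashMap<String, String>,
}

impl CredentialStore for FileStore {
    fn save(&mut self, server: &str, token: &str) -> Result<(), String> {
        self.entries.insert(server.to_string(), token.to_string());
        Ok(())
    }
    fn load(&self, server: &str) -> Option<String> {
        self.entries.get(server).cloned()
    }
}

/// Try `primary` first and use `secondary` only when `primary` fails.
struct FallbackStore<P, S> {
    primary: P,
    secondary: S,
}

impl<P: CredentialStore, S: CredentialStore> CredentialStore for FallbackStore<P, S> {
    fn save(&mut self, server: &str, token: &str) -> Result<(), String> {
        match self.primary.save(server, token) {
            Ok(()) => Ok(()),
            Err(_) => self.secondary.save(server, token),
        }
    }
    fn load(&self, server: &str) -> Option<String> {
        self.primary.load(server).or_else(|| self.secondary.load(server))
    }
}
```

With `available: false` on the keyring stand-in, a `save` followed by a `load` goes entirely through the file stand-in, which is the fallback path the description says was verified on Linux.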
    
    I tested this on macOS, Windows, WSL (Ubuntu), and Linux. I wasn't able
    to test the D-Bus store on Linux but did verify that the fallback works.
    
    One quirk is that, during development, every build produces its own
    ad-hoc binary, so if you already have stored credentials the keyring
    won't recognize the reader as being the same as the writer and may ask
    for the user's password. I may add an override to disable this or allow
    users/enterprises to opt out of the keyring storage if it causes issues.
    
  • Add notifier tests (#4064)
    Proposal:
    1. Use anyhow for tests and avoid unwrap
    2. Extract a helper for starting a test instance of codex