36 Commits

  • test: branch on target OS instead of runner flavor (#29712)
    ## Why
    
    Core tests should branch on the executor's operating system, not on
    runner details such as Docker or Wine. This keeps platform behavior
    stable as new test backends are added and reserves Wine-specific skips
    for actual runner debt.
    
    ## What
    
    - Add `TestTargetOs` and target/host-aware skip helpers while keeping
    `TestEnvironment` internal.
    - Replace topology enum access with remote predicates and a narrow
    Docker accessor.
    - Migrate OS-semantic Wine skips, preserve runner-specific gaps, and
    document the skip taxonomy.
    
    ## Validation
    
    - `just test -p core_test_support`
    - `just test -p codex-core
    remote_test_env_can_connect_and_use_filesystem`
    - `bazel test //codex-rs/core:core-all-wine-exec-test
    --test_output=errors` reached test execution; unrelated existing
    view-image, path, and timing failures remain.
    - `just test -p codex-core` and `just test` reached broad test
    execution; this checkout has unrelated helper, sandbox, and timing
    failures.
  • [codex] Use expect in integration tests (#28441)
    The workspace denies `clippy::expect_used` in production. Although
    `clippy.toml` allows `expect` in tests, Bazel Clippy compiles
    integration-test helper code in a way that does not receive that
    exemption, which encouraged verbose `unwrap_or_else(... panic!(...))`
    and equivalent `match`/`let else` forms.
    
    This allows `clippy::expect_used` once at each integration-test crate
    root (including aggregated suites and test-support libraries), then
    replaces manual panic-based Result and Option unwraps with
    `expect`/`expect_err`. Standalone `tests/*.rs` files remain their own
    crate roots. Intentional assertion and unexpected-variant panics remain
    unchanged, and the production `expect_used = "deny"` lint remains in
    place.
    
    The cleanup is mechanical and net-negative in line count.
  • Run core integration tests against a Wine-backed Windows executor (#28401)
    ## Why
    
    We want to exercise a linux app-server against a windows exec-server
    without having to repeat every test case. This approach has slight
    precedent in the remote docker test setup.
    
    ## What
    
    Run the shared `codex-core` integration suite against Windows
    exec-server behavior from Linux. This makes cross-OS path and shell
    regressions visible while keeping unsupported cases owned by individual
    tests.
    
    - Add `local`, `docker`, and `wine-exec` test environment selection with
    legacy Docker compatibility.
    - Extend `codex_rust_crate` to generate a sharded Wine-exec variant
    using a cross-built Windows server and pinned Bazel Wine/PowerShell
    runtimes.
    - Teach remote-aware helpers about Windows paths and track temporary
    incompatibilities with source-local `skip_if_wine_exec!` calls and
    follow-up reasons.
  • [codex] Remove legacy shell output formatting paths (#22706)
    ## Why
    
    The client and tool pipeline still carried compatibility code for legacy
    structured shell output. Current shell and apply_patch responses are
    already plain text for model consumption, so keeping a
    JSON-serialization path plus shell-item rewrite logic makes the request
    formatter and tests preserve a format we do not need anymore.
    
    ## What Changed
    
    - Removed the client-side shell output rewrite from
    `core/src/client_common.rs`.
    - Removed the structured exec-output formatter and the shell `freeform`
    switch so tool emitters use one model-facing formatter.
    - Collapsed apply_patch/shell serialization tests around the remaining
    plain-text output expectations and removed duplicate one-variant
    parameterized cases.
    - Kept the `ApplyPatchModelOutput::ShellCommandViaHeredoc` compatibility
    input shape, but no longer treats it as a separate output-format mode.
    
    ## Validation
    
    - `cargo test -p codex-core client_common`
    - `cargo test -p codex-core shell_serialization`
    - `cargo test -p codex-core apply_patch_cli`
    - `just fix -p codex-core`
    
    ## Documentation
    
    No external Codex documentation update is needed.
  • chore(features) rm Feature::ApplyPatchFreeform (#22711)
    ## Summary
    Removes the feature since this is effectively on by default in all cases
    where we should use it, or can be configured via models.json.
    
    ## Testing
    - [x] unit tests pass
  • [codex] Remove unused legacy shell tools (#22246)
    ## Why
    
    Recent session history showed no active use of the raw `shell`,
    `local_shell`, or `container.exec` execution surfaces. Keeping those
    handlers/specs wired into core leaves duplicate shell execution paths
    alongside the supported `shell_command` and unified exec tools.
    
    ## What changed
    
    - Removed the raw `shell` handler/spec and its `ShellToolCallParams`
    protocol helper.
    - Removed the legacy `local_shell` and `container.exec` handler/spec
    plumbing while preserving persisted-history compatibility for old
    response items.
    - Normalized model/config `default` and `local` shell selections to
    `shell_command`.
    - Pruned tests that exercised removed raw-shell/local-shell/apply-patch
    variants and kept coverage on `shell_command`, unified exec, and
    freeform `apply_patch`.
    
    ## Verification
    
    - `git diff --check`
    - `cargo test -p codex-protocol`
    - `cargo test -p codex-tools`
    - `cargo test -p codex-core tools::handlers::shell`
    - `cargo test -p codex-core tools::spec`
    - `cargo test -p codex-core tools::router`
    - `cargo test -p codex-core
    active_call_preserves_triggering_command_context`
    - `cargo test -p codex-core guardian_tests`
    - `cargo test -p codex-core --test all shell_serialization`
    - `cargo test -p codex-core --test all apply_patch_cli`
    - `cargo test -p codex-core --test all shell_command_`
    - `cargo test -p codex-core --test all local_shell`
    - `cargo test -p codex-core --test all otel::`
    - `cargo test -p codex-core --test all hooks::`
    - `just fix -p codex-core`
    - `just fix -p codex-tools`
  • [codex] Delete function-style apply_patch (#21651)
    ## Why
    
    `apply_patch` is now a freeform/custom tool. Keeping the old
    JSON/function-style registration and parsing path left another way for
    models and tests to invoke `apply_patch`, which made the tool surface
    harder to reason about.
    
    ## What changed
    
    - Removed the `ApplyPatchToolType::Function` variant, JSON `apply_patch`
    spec, and handler support for function payloads.
    - Kept `apply_patch_tool_type = freeform` as the supported model
    metadata path, including Bedrock catalog metadata.
    - Migrated `apply_patch` tests and SSE fixtures to custom/freeform tool
    calls.
    
    ## Verification
    
    - `cargo test -p codex-tools -p codex-protocol -p codex-model-provider`
    - `cargo test -p codex-core tools::handlers::apply_patch --lib`
    - `cargo test -p codex-core --test all
    apply_patch_tool_executes_and_emits_patch_events`
    - `cargo test -p codex-core --test all
    apply_patch_reports_parse_diagnostics`
    - `cargo test -p codex-exec test_apply_patch_tool`
    - `just fix -p codex-core`
    - `just fix -p codex-tools -p codex-protocol -p codex-model-provider -p
    codex-exec`
  • core tests: submit turns with permission profiles (#20010)
    ## Summary
    
    - Add `PermissionProfile`-based turn submission helpers to
    `core_test_support`, while keeping the legacy `SandboxPolicy` helper for
    tests that intentionally exercise legacy fallback behavior.
    - Switch the default `TestCodex::submit_turn()` path to send a real
    `PermissionProfile` plus the required legacy compatibility projection in
    `Op::UserTurn`.
    - Migrate straightforward app/search/shell/truncation tests from
    `SandboxPolicy::{DangerFullAccess, ReadOnly}` to
    `PermissionProfile::{Disabled, read_only}`.
    - Add a TUI compatibility projection helper for legacy app-server fields
    so non-legacy writable roots are preserved instead of being downgraded
    to read-only.
    - Fix remote start/resume/fork sandbox-mode projection to classify any
    managed profile with writable roots as workspace-write, not only
    profiles that can write `cwd`.
    - Reduce `SandboxPolicy` references in `codex-rs/core/tests` from 47
    files to 41 files without changing production behavior.
    
    ## Testing
    
    - `cargo check -p codex-core --tests`
    - `cargo test -p codex-tui
    compatibility_profile_preserves_unbridgeable_write_roots`
    - `cargo test -p codex-tui
    sandbox_mode_preserves_non_cwd_write_roots_for_remote_sessions`
    - `just fmt`
    - `just fix -p core_test_support`
    - `just fix -p codex-core`
  • Update models.json (#18586)
    - Replace the active models-manager catalog with the deleted core
    catalog contents.
    - Replace stale hardcoded test model slugs with current bundled model
    slugs.
    - Keep this as a stacked change on top of the cleanup PR.
  • [codex] Apply patches through executor filesystem (#17048)
    ## Summary
    - run apply_patch through the executor filesystem when a remote
    environment is present instead of shelling out to the local process
    - thread the executor FileSystem into apply_patch interception and keep
    existing local behavior for non-remote turns
    - make the apply_patch integration harness use the executor filesystem
    for setup/assertions
    - add remote-aware skips for turn-diff coverage that still reads the
    test-runner filesystem
    
    ## Why
    Remote apply_patch needed to mutate the remote workspace instead of the
    local checkout. The tests also needed to seed and assert workspace state
    through the same filesystem abstraction so local and remote runs
    exercise the same behavior.
    
    ## Validation
    - `just fmt`
    - `git diff --check`
    - `cargo check -p core_test_support --tests`
    - `cargo test -p codex-core --test all
    suite::shell_serialization::apply_patch_custom_tool_call -- --nocapture`
    - `cargo test -p codex-core --test all
    suite::apply_patch_cli::apply_patch_cli_updates_file_appends_trailing_newline
    -- --nocapture`
    - remote `cargo test -p codex-core --test all apply_patch_cli --
    --nocapture` (229 passed)
  • chore: clean up argument-comment lint and roll out all-target CI on macOS (#16054)
    ## Why
    
    `argument-comment-lint` was green in CI even though the repo still had
    many uncommented literal arguments. The main gap was target coverage:
    the repo wrapper did not force Cargo to inspect test-only call sites, so
    examples like the `latest_session_lookup_params(true, ...)` tests in
    `codex-rs/tui_app_server/src/lib.rs` never entered the blocking CI path.
    
    This change cleans up the existing backlog, makes the default repo lint
    path cover all Cargo targets, and starts rolling that stricter CI
    enforcement out on the platform where it is currently validated.
    
    ## What changed
    
    - mechanically fixed existing `argument-comment-lint` violations across
    the `codex-rs` workspace, including tests, examples, and benches
    - updated `tools/argument-comment-lint/run-prebuilt-linter.sh` and
    `tools/argument-comment-lint/run.sh` so non-`--fix` runs default to
    `--all-targets` unless the caller explicitly narrows the target set
    - fixed both wrappers so forwarded cargo arguments after `--` are
    preserved with a single separator
    - documented the new default behavior in
    `tools/argument-comment-lint/README.md`
    - updated `rust-ci` so the macOS lint lane keeps the plain wrapper
    invocation and therefore enforces `--all-targets`, while Linux and
    Windows temporarily pass `-- --lib --bins`
    
    That temporary CI split keeps the stricter all-targets check where it is
    already cleaned up, while leaving room to finish the remaining Linux-
    and Windows-specific target-gated cleanup before enabling
    `--all-targets` on those runners. The Linux and Windows failures on the
    intermediate revision were caused by the wrapper forwarding bug, not by
    additional lint findings in those lanes.
    
    ## Validation
    
    - `bash -n tools/argument-comment-lint/run.sh`
    - `bash -n tools/argument-comment-lint/run-prebuilt-linter.sh`
    - shell-level wrapper forwarding check for `-- --lib --bins`
    - shell-level wrapper forwarding check for `-- --tests`
    - `just argument-comment-lint`
    - `cargo test` in `tools/argument-comment-lint`
    - `cargo test -p codex-terminal-detection`
    
    ## Follow-up
    
    - Clean up remaining Linux-only target-gated callsites, then switch the
    Linux lint lane back to the plain wrapper invocation.
    - Clean up remaining Windows-only target-gated callsites, then switch
    the Windows lint lane back to the plain wrapper invocation.
  • Stabilize shell serialization tests (#13877)
    ## What changed
    - The duration-recording fixture sleep was reduced from a large
    artificial delay to `0.2s`, and the assertion floor was lowered to
    `0.1s`.
    - The shell tool fixtures now force `login = false` so they do not
    invoke login-shell startup paths.
    
    ## Why this fixes the flake
    - The old tests were paying for two kinds of noise that had nothing to
    do with the feature being validated: oversized sleep time and variable
    shell initialization cost.
    - Login shells can pick up runner-specific startup files and incur
    inconsistent startup latency.
    - The test only needs to prove that we record a nontrivial duration and
    preserve shell output. A shorter fixture delay plus a non-login shell
    keeps that coverage while removing runner-dependent wall-clock variance.
    
    ## Scope
    - Test-only change.
  • chore: remove codex-core public protocol/shell re-exports (#12432)
    ## Why
    
    `codex-rs/core/src/lib.rs` re-exported a broad set of types and modules
    from `codex-protocol` and `codex-shell-command`. That made it easy for
    workspace crates to import those APIs through `codex-core`, which in
    turn hides dependency edges and makes it harder to reduce compile-time
    coupling over time.
    
    This change removes those public re-exports so call sites must import
    from the source crates directly. Even when a crate still depends on
    `codex-core` today, this makes dependency boundaries explicit and
    unblocks future work to drop `codex-core` dependencies where possible.
    
    ## What Changed
    
    - Removed public re-exports from `codex-rs/core/src/lib.rs` for:
    - `codex_protocol::protocol` and related protocol/model types (including
    `InitialHistory`)
      - `codex_protocol::config_types` (`protocol_config_types`)
    - `codex_shell_command::{bash, is_dangerous_command, is_safe_command,
    parse_command, powershell}`
    - Migrated workspace Rust call sites to import directly from:
      - `codex_protocol::protocol`
      - `codex_protocol::config_types`
      - `codex_protocol::models`
      - `codex_shell_command`
    - Added explicit `Cargo.toml` dependencies (`codex-protocol` /
    `codex-shell-command`) in crates that now import those crates directly.
    - Kept `codex-core` internal modules compiling by using `pub(crate)`
    aliases in `core/src/lib.rs` (internal-only, not part of the public
    API).
    - Updated the two utility crates that can already drop a `codex-core`
    dependency edge entirely:
      - `codex-utils-approval-presets`
      - `codex-utils-cli`
    
    ## Verification
    
    - `cargo test -p codex-utils-approval-presets`
    - `cargo test -p codex-utils-cli`
    - `cargo check --workspace --all-targets`
    - `just clippy`
  • remove model_family from `config (#7571)
    - Remove `model_family` from `config`
    - Make sure to still override config elements related to `model_family`
    like supporting reasoning
  • Migrate model family to models manager (#7565)
    This PR moves `ModelsFamily` to `openai_models`. It also propagates
    `ModelsManager` to session services and use it to drive model family. We
    also make `derive_default_model_family` private because it's a step
    towards what we want: one place that gives model configuration.
    
    This is a second step at having one source of truth for models
    information and config: `ModelsManager`.
    
    Next steps would be to remove `ModelsFamily` from config. That's massive
    because it's being used in 41 occasions mostly pre launching `codex`.
    Also, we need to make `find_family_for_model` private. It's also big
    because it's being used in 21 occasions ~ all tests.
  • core: make shell behavior portable on FreeBSD (#7039)
    - Use /bin/sh instead of /bin/bash on FreeBSD/OpenBSD in the process
    group timeout test to avoid command-not-found failures.
    
    - Accept /usr/local/bin/bash as a valid SHELL path to match common
    FreeBSD installations.
    
    - Switch the shell serialization duration test to /bin/sh for improved
    portability across Unix platforms.
    
    With this change, `cargo test -p codex-core --lib` runs and passes on
    FreeBSD.
  • Move shell to use truncate_text (#6842)
    Move shell to use the configurable `truncate_text`
    
    ---------
    
    Co-authored-by: pakrym-oai <pakrym@openai.com>
  • shell_command returns freeform output (#6860)
    Instead of returning structured out and then re-formatting it into
    freeform, return the freeform output from shell_command tool.
    
    Keep `shell` as the default tool for GPT-5.
  • chore(core) Add shell_serialization coverage (#6810)
    ## Summary
    Similar to #6545, this PR updates the shell_serialization test suite to
    cover the various `shell` tool invocations we have. Note that this does
    not cover unified_exec, which has its own suite of tests. This should
    provide some test coverage for when we eventually consolidate
    serialization logic.
    
    ## Testing
    - [x] These are tests
  • Update defaults to gpt-5.1 (#6652)
    ## Summary
    - update documentation, example configs, and automation defaults to
    reference gpt-5.1 / gpt-5.1-codex
    - bump the CLI and core configuration defaults, model presets, and error
    messaging to the new models while keeping the model-family/tool coverage
    for legacy slugs
    - refresh tests, fixtures, and TUI snapshots so they expect the upgraded
    defaults
    
    ## Testing
    - `cargo test -p codex-core
    config::tests::test_precedence_fixture_with_gpt5_profile`
    
    
    ------
    [Codex
    Task](https://chatgpt.com/codex/tasks/task_i_6916c5b3c2b08321ace04ee38604fc6b)
  • fix(core) serialize shell_command (#6744)
    ## Summary
    Ensures we're serializing calls to `shell_command`
    
    ## Testing
    - [x] Added unit test
  • Promote shared helpers for suite tests (#6460)
    ## Summary
    - add `TestCodex::submit_turn_with_policies` and extend the response
    helpers with reusable tool-call utilities
    - update the grep_files, read_file, list_dir, shell_serialization, and
    tools suites to rely on the shared helpers instead of local copies
    - make the list_dir helper return `anyhow::Result` so clippy no longer
    warns about `expect`
    
    ## Testing
    - `just fix -p codex-core`
    - `cargo test -p codex-core --test all
    suite::grep_files::grep_files_tool_collects_matches`
    - `cargo test -p codex-core
    suite::grep_files::grep_files_tool_collects_matches -- --ignored`
    (filter requests ignored tests so nothing runs, but the build stays
    clean)
    
    
    ------
    [Codex
    Task](https://chatgpt.com/codex/tasks/task_i_69112d53abac83219813cab4d7cb6446)
  • chore: Add shell serialization tests for json (#6043)
    ## Summary
    Can never have enough tests on this code path - checking that json
    inside a shell call is deserialized correctly.
    
    ## Tests
    - [x] These are tests 😎
  • Add ItemStarted/ItemCompleted events for UserInputItem (#5306)
    Adds a new ItemStarted event and delivers UserMessage as the first item
    type (more to come).
    
    
    Renames `InputItem` to `UserInput` considering we're using the `Item`
    suffix for actual items.
  • fix(core) use regex for all shell_serialization tests (#5189)
    ## Summary
    Thought I switched all of these to using a regex instead, but missed 2.
    This should address our [flakey test
    problem](https://github.com/openai/codex/actions/runs/18511206616/job/52752341520?pr=5185).
    
    ## Test Plan
    - [x] Only updated unit tests
  • fix: apply_patch shell_serialization tests (#4786)
    ## Summary
    Adds additional shell_serialization tests specifically for apply_patch
    and other cases.
    
    ## Test Plan
    - [x] These are all tests
  • feat: feature flag (#4948)
    Add proper feature flag instead of having custom flags for everything.
    This is just for experimental/wip part of the code
    It can be used through CLI:
    ```bash
    codex --enable unified_exec --disable view_image_tool
    ```
    
    Or in the `config.toml`
    ```toml
    # Global toggles applied to every profile unless overridden.
    [features]
    apply_patch_freeform = true
    view_image_tool = false
    ```
    
    Follow-up:
    In a following PR, the goal is to have a default have `bundles` of
    features that we can associate to a model
  • feat: make shortcut works even with capslock (#5049)
    Shortcut where not working in caps-lock. Fixing this
  • Make output assertions more explicit (#4784)
    Match using precise regexes.
  • Add helper for response created SSE events in tests (#4758)
    ## Summary
    - add a reusable `ev_response_created` helper that builds
    `response.created` SSE events for integration tests
    - update the exec and core integration suites to use the new helper
    instead of repeating manual JSON literals
    - keep the streaming fixtures consistent by relying on the shared helper
    in every touched test
    
    ## Testing
    - `just fmt`
    
    
    ------
    https://chatgpt.com/codex/tasks/task_i_68e1fe885bb883208aafffb94218da61
  • Use wait_for_event helpers in tests (#4753)
    ## Summary
    - replace manual event polling loops in several core test suites with
    the shared wait_for_event helpers
    - keep prior assertions intact by using closure captures for stateful
    expectations, including plan updates, patch lifecycles, and review flow
    checks
    - rely on wait_for_event_with_timeout where longer waits are required,
    simplifying timeout handling
    
    ## Testing
    - just fmt
    
    
    ------
    https://chatgpt.com/codex/tasks/task_i_68e1d58582d483208febadc5f90dd95e
  • feat: Freeform apply_patch with simple shell output (#4718)
    ## Summary
    This PR is an alternative approach to #4711, but instead of changing our
    storage, parses out shell calls in the client and reserializes them on
    the fly before we send them out as part of the request.
    
    What this changes:
    1. Adds additional serialization logic when the
    ApplyPatchToolType::Freeform is in use.
    2. Adds a --custom-apply-patch flag to enable this setting on a
    session-by-session basis.
    
    This change is delicate, but is not meant to be permanent. It is meant
    to be the first step in a migration:
    1. (This PR) Add in-flight serialization with config
    2. Update model_family default
    3. Update serialization logic to store turn outputs in a structured
    format, with logic to serialize based on model_family setting.
    4. Remove this rewrite in-flight logic.
    
    ## Test Plan
    - [x] Additional unit tests added
    - [x] Integration tests added
    - [x] Tested locally