Commit Graph

43 Commits

  • Add remote env CI matrix and integration test (#14869)
    `CODEX_TEST_REMOTE_ENV` makes `test_codex` start the executor
    "remotely" (inside a Docker container), turning any integration test
    into a remote test.
  • Split features into codex-features crate (#15253)
    - Split the feature system into a new `codex-features` crate.
    - Cut `codex-core` and workspace consumers over to the new config and
    warning APIs.
    
    Co-authored-by: Ahmed Ibrahim <219906144+aibrahim-oai@users.noreply.github.com>
    Co-authored-by: Codex <noreply@openai.com>
  • feat(core, tracing): create turn spans over websockets (#14632)
    ## Description
    
    Dependent on:
    - [responsesapi] https://github.com/openai/openai/pull/760991 
    - [codex-backend] https://github.com/openai/openai/pull/760985
    
    `codex app-server -> codex-backend -> responsesapi` now reuses a
    persistent websocket connection across many turns. This PR updates
    tracing when using websockets so that each `response.create` websocket
    request propagates the current tracing context, so we can get a holistic
    end-to-end trace for each turn.
    
    Tracing is propagated via special keys (`ws_request_header_traceparent`,
    `ws_request_header_tracestate`) set in the `client_metadata` param in
    Responses API.
    
    Currently, tracing on websockets is partly broken because we only set
    the tracing context at ws connection time, so it is detached from a
    `turn/start` request.
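
    A minimal sketch of the per-request propagation described above. Only
    the two metadata keys come from this commit message; the helper name,
    the map type, and the sample `traceparent` value (W3C Trace Context
    layout) are assumptions for illustration, not the actual client code.

    ```rust
    use std::collections::BTreeMap;

    // Metadata keys named in the commit message.
    const TRACEPARENT_KEY: &str = "ws_request_header_traceparent";
    const TRACESTATE_KEY: &str = "ws_request_header_tracestate";

    /// Hypothetical helper: copies the current trace context into the
    /// `client_metadata` map that accompanies one `response.create` request.
    fn inject_trace_context(
        client_metadata: &mut BTreeMap<String, String>,
        traceparent: &str,
        tracestate: Option<&str>,
    ) {
        client_metadata.insert(TRACEPARENT_KEY.to_string(), traceparent.to_string());
        // `tracestate` is optional, so the key is only set when a value exists.
        if let Some(state) = tracestate {
            client_metadata.insert(TRACESTATE_KEY.to_string(), state.to_string());
        }
    }
    ```

    Calling the helper once per turn, rather than once per connection, is
    what would keep each trace attached to its `turn/start` request.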
  • Apply argument comment lint across codex-rs (#14652)
    ## Why
    
    Once the repo-local lint exists, `codex-rs` needs to follow the
    checked-in convention and CI needs to keep it from drifting. This commit
    applies the fallback `/*param*/` style consistently across existing
    positional literal call sites without changing those APIs.
    
    The longer-term preference is still to avoid APIs that require comments
    by choosing clearer parameter types and call shapes. This PR is
    intentionally the mechanical follow-through for the places where the
    existing signatures stay in place.
    
    After rebasing onto newer `main`, the rollout also had to cover newly
    introduced `tui_app_server` call sites. That made it clear the first cut
    of the CI job was too expensive for the common path: it was spending
    almost as much time installing `cargo-dylint` and re-testing the lint
    crate as a representative test job spends running product tests. The CI
    update keeps the full workspace enforcement but trims that extra
    overhead from ordinary `codex-rs` PRs.
    
    ## What changed
    
    - keep a dedicated `argument_comment_lint` job in `rust-ci`
    - mechanically annotate remaining opaque positional literals across
    `codex-rs` with exact `/*param*/` comments, including the rebased
    `tui_app_server` call sites that now fall under the lint
    - keep the checked-in style aligned with the lint policy by using
    `/*param*/` and leaving string and char literals uncommented
    - cache `cargo-dylint`, `dylint-link`, and the relevant Cargo
    registry/git metadata in the lint job
    - split changed-path detection so the lint crate's own `cargo test` step
    runs only when `tools/argument-comment-lint/*` or `rust-ci.yml` changes
    - continue to run the repo wrapper over the `codex-rs` workspace, so
    product-code enforcement is unchanged
    
    Most of the code changes in this commit are intentionally mechanical
    comment rewrites or insertions driven by the lint itself.
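
    For readers unfamiliar with the convention, here is a small sketch of
    what an annotated call site looks like. The functions below are
    hypothetical and exist only to show the `/*param*/` style; they are not
    taken from `codex-rs`.

    ```rust
    /// Hypothetical API whose positional literals are opaque at the call site.
    fn retry_delay_ms(base_ms: u64, attempt: u32, capped: bool) -> u64 {
        let delay = base_ms * 2u64.pow(attempt);
        if capped { delay.min(500) } else { delay }
    }

    /// Hypothetical API that also takes a string literal.
    fn greet(name: &str, shout: bool) -> String {
        if shout {
            format!("HELLO {}", name.to_uppercase())
        } else {
            format!("hello {name}")
        }
    }

    fn annotated_calls() -> (u64, String) {
        // Numeric and boolean literals carry the exact parameter name.
        let delay = retry_delay_ms(/*base_ms*/ 100, /*attempt*/ 3, /*capped*/ false);
        // String literals stay uncommented; only the boolean is annotated.
        let greeting = greet("codex", /*shout*/ true);
        (delay, greeting)
    }
    ```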
    
    ## Verification
    
    - `./tools/argument-comment-lint/run.sh --workspace`
    - `cargo test -p codex-tui-app-server -p codex-tui`
    - parsed `.github/workflows/rust-ci.yml` locally with PyYAML
    
    ---
    
    * -> #14652
    * #14651
  • fix(ci) fix guardian ci (#13911)
    ## Summary
    #13910 was merged with some unused imports, let's fix this
    
    ## Testing
    - [x] Let's make sure CI is green
    
    ---------
    
    Co-authored-by: Charles Cunningham <ccunningham@openai.com>
    Co-authored-by: Codex <noreply@openai.com>
  • core: adopt host_executable() rules in zsh-fork (#13046)
    ## Why
    
    [#12964](https://github.com/openai/codex/pull/12964) added
    `host_executable()` support to `codex-execpolicy`, but the zsh-fork
    interception path in `unix_escalation.rs` was still evaluating commands
    with the default exact-token matcher.
    
    That meant an intercepted absolute executable such as `/usr/bin/git
    status` could still miss basename rules like `prefix_rule(pattern =
    ["git", "status"])`, even when the policy also defined a matching
    `host_executable(name = "git", ...)` entry.
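
    The mismatch can be modeled with a short sketch. This is not the
    `codex-execpolicy` implementation: the `HostExecutable` struct, its
    `paths` field, and both functions are assumptions made for
    illustration, and the paths assume a Unix host.

    ```rust
    use std::path::Path;

    /// Hypothetical stand-in for a `host_executable(name = ..., ...)` entry.
    struct HostExecutable {
        name: &'static str,
        paths: &'static [&'static str],
    }

    /// Picks the token compared against the first element of a rule pattern.
    /// With resolution off this is the exact program token.
    fn match_token<'a>(program: &'a str, hosts: &[HostExecutable], resolve: bool) -> &'a str {
        if !resolve {
            return program;
        }
        let path = Path::new(program);
        if !path.is_absolute() {
            return program;
        }
        let Some(base) = path.file_name().and_then(|name| name.to_str()) else {
            return program;
        };
        // Only listed absolute paths may stand in for the basename.
        let allowed = hosts
            .iter()
            .any(|host| host.name == base && host.paths.iter().any(|p| *p == program));
        if allowed { base } else { program }
    }

    /// Returns true when `pattern` is a prefix of the (possibly resolved) command.
    fn matches_prefix_rule(
        pattern: &[&str],
        program: &str,
        args: &[&str],
        hosts: &[HostExecutable],
        resolve: bool,
    ) -> bool {
        let Some((first, rest)) = pattern.split_first() else {
            return false;
        };
        *first == match_token(program, hosts, resolve)
            && args.len() >= rest.len()
            && rest.iter().zip(args).all(|(expected, actual)| expected == actual)
    }
    ```

    With exact-token matching, `/usr/bin/git status` misses the
    `["git", "status"]` pattern; with resolution enabled and a matching
    entry, it is treated as `git status`.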
    
    This PR adopts the new matching behavior in the zsh-fork runtime only.
    That keeps the rollout intentionally narrow: zsh-fork already requires
    explicit user opt-in, so it is a safer first caller to exercise the new
    `host_executable()` scheme before expanding it to other execpolicy call
    sites.
    
    It also brings zsh-fork back in line with the current `prefix_rule()`
    execution model. Until prefix rules can carry their own permission
    profiles, a matched `prefix_rule()` is expected to rerun the intercepted
    command unsandboxed on `allow`, or after the user accepts `prompt`,
    instead of merely continuing inside the inherited shell sandbox.
    
    ## What Changed
    
    - added `evaluate_intercepted_exec_policy()` in
    `core/src/tools/runtimes/shell/unix_escalation.rs` to centralize
    execpolicy evaluation for intercepted commands
    - switched intercepted direct execs in the zsh-fork path to
    `check_multiple_with_options(...)` with `MatchOptions {
    resolve_host_executables: true }`
    - added `commands_for_intercepted_exec_policy()` so zsh-fork policy
    evaluation works from intercepted `(program, argv)` data instead of
    reconstructing a synthetic command before matching
    - left shell-wrapper parsing intentionally disabled by default behind
    `ENABLE_INTERCEPTED_EXEC_POLICY_SHELL_WRAPPER_PARSING`, so
    path-sensitive matching relies on later direct exec interception rather
    than shell-script parsing
    - made matched `prefix_rule()` decisions rerun intercepted commands with
    `EscalationExecution::Unsandboxed`, while unmatched-command fallback
    keeps the existing sandbox-preserving behavior
    - extracted the zsh-fork test harness into
    `core/tests/common/zsh_fork.rs` so both the skill-focused and
    approval-focused integration suites can exercise the same runtime setup
    - limited this change to the intercepted zsh-fork path rather than
    changing every execpolicy caller at once
    - added runtime coverage in
    `core/src/tools/runtimes/shell/unix_escalation_tests.rs` for allowed and
    disallowed `host_executable()` mappings and the wrapper-parsing modes
    - added integration coverage in `core/tests/suite/approvals.rs` to
    verify a saved `prefix_rule(pattern=["touch"], decision="allow")` reruns
    under zsh-fork outside a restrictive `WorkspaceWrite` sandbox
    
    ---
    [//]: # (BEGIN SAPLING FOOTER)
    Stack created with [Sapling](https://sapling-scm.com). Best reviewed
    with [ReviewStack](https://reviewstack.dev/openai/codex/pull/13046).
    * #13065
    * __->__ #13046
  • test: vendor zsh fork via DotSlash and stabilize zsh-fork tests (#12518)
    ## Why
    
    The zsh integration tests were still brittle in two ways:
    
    - they relied on `CODEX_TEST_ZSH_PATH` / environment-specific setup, so
    they often did not exercise the patched zsh fork that `shell-tool-mcp`
    ships
    - once the tests consistently used the vendored zsh fork, they exposed
    real Linux-specific zsh-fork issues in CI
    
    In particular, the Linux failures were not just test noise:
    
    - the zsh-fork launch path was dropping `ExecRequest.arg0`, so Linux
    `codex-linux-sandbox` arg0 dispatch did not run and zsh wrapper-mode
    could receive malformed arguments
    - the
    `turn_start_shell_zsh_fork_subcommand_decline_marks_parent_declined_v2`
    test uses the zsh exec bridge (which talks to the parent over a Unix
    socket), but Linux restricted sandbox seccomp denies `connect(2)`,
    causing timeouts on `ubuntu-24.04` x86/arm
    
    This PR makes the zsh tests consistently run against the intended
    vendored zsh fork and fixes/hardens the zsh-fork path so the Linux CI
    signal is meaningful.
    
    ## What Changed
    
    - Added a single shared test-only DotSlash file for the patched zsh fork
    at `codex-rs/exec-server/tests/suite/zsh` (analogous to the existing
    `bash` test resource).
    - Updated both app-server and exec-server zsh tests to use that shared
    DotSlash zsh (no duplicate zsh DotSlash file, no `CODEX_TEST_ZSH_PATH`
    dependency).
    - Updated the app-server zsh-fork test helper to resolve the shared
    DotSlash zsh and avoid silently falling back to host zsh.
    - Kept the app-server zsh-fork tests configured via `config.toml`, using
    a test wrapper path where needed to force `zsh -df` (and rewrite `-lc`
    to `-c`) for the subcommand-decline test.
    - Hardened the app-server subcommand-decline zsh-fork test for CI
    variability:
      - tolerate an extra `/responses` POST with a no-op mock response
      - tolerate non-target approval ordering while remaining strict on the
        two `/usr/bin/true` approvals and decline behavior
      - use `DangerFullAccess` on Linux for this one test because it validates
        zsh approval flow, not Linux sandbox socket restrictions
    - Fixed zsh-fork process launching on Linux by preserving `req.arg0` in
    `ZshExecBridge::execute_shell_request(...)` so `codex-linux-sandbox`
    arg0 dispatch continues to work.
    - Moved `maybe_run_zsh_exec_wrapper_mode()` under
    `arg0_dispatch_or_else(...)` in `app-server` and `cli` so wrapper-mode
    handling coexists correctly with arg0-dispatched helper modes.
    - Consolidated duplicated `dotslash -- fetch` resolution logic into
    shared test support (`core/tests/common/lib.rs`).
    - Updated `codex-rs/exec-server/tests/suite/accept_elicitation.rs` to
    use DotSlash zsh and hardened the zsh elicitation test for Bazel/zsh
    differences by:
      - resolving an absolute `git` path
      - running `git init --quiet .`
      - asserting success / `.git` creation instead of relying on banner text
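
    To illustrate why dropping `arg0` matters, here is a rough sketch of
    arg0 dispatch. The `ExecRequest` fields, the `Mode` enum, and both
    functions are hypothetical simplifications; only the
    `codex-linux-sandbox` name and the existence of `ExecRequest.arg0` come
    from this commit message.

    ```rust
    use std::path::Path;

    #[derive(Debug, PartialEq)]
    enum Mode {
        LinuxSandbox,
        Main,
    }

    /// Chooses a helper mode by looking only at the name the process was
    /// invoked as (argv[0]).
    fn mode_for_arg0(arg0: &str) -> Mode {
        let exe_name = Path::new(arg0)
            .file_name()
            .and_then(|name| name.to_str())
            .unwrap_or("");
        if exe_name == "codex-linux-sandbox" {
            Mode::LinuxSandbox
        } else {
            Mode::Main
        }
    }

    /// Hypothetical, simplified request shape.
    struct ExecRequest {
        program: String,
        args: Vec<String>,
        arg0: Option<String>,
    }

    /// Builds the child's argv, preserving an explicit arg0 when present.
    fn child_argv(req: &ExecRequest) -> Vec<String> {
        let arg0 = req.arg0.clone().unwrap_or_else(|| req.program.clone());
        std::iter::once(arg0).chain(req.args.iter().cloned()).collect()
    }
    ```

    If the launcher ignores `arg0`, the child sees the program path as
    argv[0] and falls through to the default mode instead of the sandbox
    helper.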
    
    ## Verification
    
    - `cargo test -p codex-app-server turn_start_zsh_fork -- --nocapture`
    - `cargo test -p codex-exec-server accept_elicitation -- --nocapture`
    - `bazel test //codex-rs/exec-server:exec-server-all-test
    --test_output=streamed --test_arg=--nocapture
    --test_arg=accept_elicitation_for_prompt_rule_with_zsh`
    - CI (`rust-ci`) on the final cleaned commit: `Tests — ubuntu-24.04 -
    x86_64-unknown-linux-gnu` and `Tests — ubuntu-24.04-arm -
    aarch64-unknown-linux-gnu` passed in [run
    22291424358](https://github.com/openai/codex/actions/runs/22291424358)
  • chore: remove codex-core public protocol/shell re-exports (#12432)
    ## Why
    
    `codex-rs/core/src/lib.rs` re-exported a broad set of types and modules
    from `codex-protocol` and `codex-shell-command`. That made it easy for
    workspace crates to import those APIs through `codex-core`, which in
    turn hides dependency edges and makes it harder to reduce compile-time
    coupling over time.
    
    This change removes those public re-exports so call sites must import
    from the source crates directly. Even when a crate still depends on
    `codex-core` today, this makes dependency boundaries explicit and
    unblocks future work to drop `codex-core` dependencies where possible.
    
    ## What Changed
    
    - Removed public re-exports from `codex-rs/core/src/lib.rs` for:
      - `codex_protocol::protocol` and related protocol/model types (including
        `InitialHistory`)
      - `codex_protocol::config_types` (`protocol_config_types`)
      - `codex_shell_command::{bash, is_dangerous_command, is_safe_command,
        parse_command, powershell}`
    - Migrated workspace Rust call sites to import directly from:
      - `codex_protocol::protocol`
      - `codex_protocol::config_types`
      - `codex_protocol::models`
      - `codex_shell_command`
    - Added explicit `Cargo.toml` dependencies (`codex-protocol` /
    `codex-shell-command`) in crates that now import those crates directly.
    - Kept `codex-core` internal modules compiling by using `pub(crate)`
    aliases in `core/src/lib.rs` (internal-only, not part of the public
    API).
    - Updated the two utility crates that can already drop a `codex-core`
    dependency edge entirely:
      - `codex-utils-approval-presets`
      - `codex-utils-cli`
    
    ## Verification
    
    - `cargo test -p codex-utils-approval-presets`
    - `cargo test -p codex-utils-cli`
    - `cargo check --workspace --all-targets`
    - `just clippy`
  • bazel: fix snapshot parity for tests/*.rs rust_test targets (#11893)
    ## Summary
    - make `rust_test` targets generated from `tests/*.rs` use Cargo-style
    crate names (file stem) so snapshot names match Cargo (`all__...`
    instead of Bazel-derived names)
    - split lib vs `tests/*.rs` test env wiring in `codex_rust_crate` to
    keep existing lib snapshot behavior while applying Bazel
    runfiles-compatible workspace root for `tests/*.rs`
    - compute the `tests/*.rs` snapshot workspace root from package depth so
    `insta` resolves committed snapshots under Bazel `--noenable_runfiles`
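
    A small sketch of the naming rule behind the first bullet. Both
    functions are hypothetical helpers written for this note, and the
    `::` to `__` mapping is an assumption about how snapshot file names are
    derived from the module path; the Bazel macro itself is not shown.

    ```rust
    use std::path::Path;

    /// Cargo-style crate name for a `tests/*.rs` file: the file stem, with
    /// hyphens mapped to underscores.
    fn cargo_test_crate_name(test_file: &str) -> Option<String> {
        let stem = Path::new(test_file).file_stem()?.to_str()?;
        Some(stem.replace('-', "_"))
    }

    /// Assumed snapshot prefix: the module path with `::` replaced by `__`.
    fn snapshot_prefix(module_path: &str) -> String {
        module_path.replace("::", "__")
    }
    ```

    Under these assumptions, `tests/all.rs` yields the crate name `all`, so
    a test in `all::suite::compact` gets an `all__suite__compact` prefix
    under both build systems.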
    
    ## Validation
    - `bazelisk test //codex-rs/core:core-all-test
    --test_arg=suite::compact:: --cache_test_results=no`
    - `bazelisk test //codex-rs/core:core-all-test
    --test_arg=suite::compact_remote:: --cache_test_results=no`
  • feat: persist and restore codex app's tools after search (#11780)
    ### What changed
    1. Removed per-turn MCP selection reset in `core/src/tasks/mod.rs`.
    2. Added `SessionState::set_mcp_tool_selection(Vec<String>)` in
    `core/src/state/session.rs` for authoritative restore behavior (deduped,
    order-preserving, empty clears).
    3. Added rollout parsing in `core/src/codex.rs` to recover
    `active_selected_tools` from prior `search_tool_bm25` outputs:
       - tracks matching `call_id`s
       - parses function output text JSON
       - extracts `active_selected_tools`
       - latest valid payload wins
       - malformed/non-matching payloads are ignored
    4. Applied restore logic to resumed and forked startup paths in
    `core/src/codex.rs`.
    5. Updated instruction text to session/thread scope in
    `core/templates/search_tool/tool_description.md`.
    6. Expanded tests in `core/tests/suite/search_tool.rs`, plus unit
    coverage in:
       - `core/src/codex.rs`
       - `core/src/state/session.rs`
    
    ### Behavior after change
    1. Search activates matched tools.
    2. Additional searches union into active selection.
    3. Selection survives new turns in the same thread.
    4. Resume/fork restores selection from rollout history.
    5. Separate threads do not inherit selection unless forked.
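
    The selection semantics above can be sketched as follows. Only the
    `set_mcp_tool_selection(Vec<String>)` signature and its stated behavior
    (deduped, order-preserving, empty clears) come from this description;
    the struct layout and `merge_mcp_tool_selection` are assumptions.

    ```rust
    use std::collections::HashSet;

    #[derive(Default)]
    struct SessionState {
        active_selected_tools: Option<Vec<String>>,
    }

    /// Removes duplicates while keeping the first occurrence of each name.
    fn dedup_preserving_order(tools: Vec<String>) -> Vec<String> {
        let mut seen = HashSet::new();
        tools.into_iter().filter(|tool| seen.insert(tool.clone())).collect()
    }

    impl SessionState {
        /// Authoritative restore: replaces the selection; an empty list clears it.
        fn set_mcp_tool_selection(&mut self, tools: Vec<String>) {
            let tools = dedup_preserving_order(tools);
            self.active_selected_tools = if tools.is_empty() { None } else { Some(tools) };
        }

        /// Hypothetical helper: unions a later search result into the selection.
        fn merge_mcp_tool_selection(&mut self, tools: Vec<String>) {
            let mut merged = self.active_selected_tools.take().unwrap_or_default();
            merged.extend(tools);
            self.set_mcp_tool_selection(merged);
        }
    }
    ```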
  • core: snapshot tests for compaction requests, post-compaction layout, some additional compaction tests (#11487)
    This PR keeps compaction context-layout test coverage separate from
    runtime compaction behavior changes, so runtime logic review can stay
    focused.
    
    ## Included
    - Adds reusable context snapshot helpers in
    `core/tests/common/context_snapshot.rs` for rendering model-visible
    request/history shapes.
    - Standardizes helper naming for readability:
      - `format_request_input_snapshot`
      - `format_response_items_snapshot`
      - `format_labeled_requests_snapshot`
      - `format_labeled_items_snapshot`
    - Expands snapshot coverage for both local and remote compaction flows:
      - pre-turn auto-compaction
      - pre-turn failure/context-window-exceeded paths
      - mid-turn continuation compaction
      - manual `/compact` with and without prior user turns
    - Captures both sides where relevant:
      - compaction request shape
      - post-compaction history layout shape
    - Adds/uses shared request-inspection helpers so assertions target
    structured request content instead of ad-hoc JSON string parsing.
    - Aligns snapshots/assertions to current behavior and leaves explicit
    `TODO(ccunningham)` notes where behavior is known and intentionally
    deferred.
    
    ## Not Included
    - No runtime compaction logic changes.
    - No model-visible context/state behavior changes.
  • Remove test-support feature from codex-core and replace it with explicit test toggles (#11405)
    ## Why
    
    `codex-core` was being built in multiple feature-resolved permutations
    because test-only behavior was modeled as crate features. For a large
    crate, those permutations increase compile cost and reduce cache reuse.
    
    ## Net Change
    
    - Removed the `test-support` crate feature and related feature wiring so
    `codex-core` no longer needs separate feature shapes for test consumers.
    - Standardized cross-crate test-only access behind
    `codex_core::test_support`.
    - External test code now imports helpers from
    `codex_core::test_support`.
    - Underlying implementation hooks are kept internal (`pub(crate)`)
    instead of broadly public.
    
    ## Outcome
    
    - Fewer `codex-core` build permutations.
    - Better incremental cache reuse across test targets.
    - No intended production behavior change.
  • Remove deterministic_process_ids feature to avoid duplicate codex-core builds (#11393)
    ## Why
    
    `codex-core` enabled `deterministic_process_ids` through a self
    dev-dependency. That forced a second feature-resolved build of the same
    crate, which increased compile time and test latency.
    
    ## What Changed
    
    - Removed the `deterministic_process_ids` feature from
    `codex-rs/core/Cargo.toml`.
    - Removed the self dev-dependency on `codex-core` that enabled that
    feature.
    - Removed the Bazel `deterministic_process_ids` crate feature for
    `codex-core`.
    - Added a test-only `AtomicBool` override in unified exec process-id
    allocation.
    - Added a test-support setter for that override and re-exported it from
    `codex-core`.
    - Enabled deterministic process IDs in integration tests via
    `core_test_support` ctor.
    
    ## Behavior
    
    - Production behavior remains random process IDs.
    - Unit tests remain deterministic via `cfg(test)`.
    - Integration tests remain deterministic via explicit test-support
    initialization.
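
    A rough sketch of the runtime toggle that replaces the crate feature.
    All names, the starting ID, and the random range are assumptions; the
    real allocator may differ (for example, in how it avoids collisions).

    ```rust
    use std::collections::hash_map::RandomState;
    use std::hash::{BuildHasher, Hasher};
    use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

    // Test-only override, flipped by integration-test setup.
    static FORCE_DETERMINISTIC_IDS: AtomicBool = AtomicBool::new(false);
    // Counter used only in deterministic mode.
    static NEXT_DETERMINISTIC_ID: AtomicU32 = AtomicU32::new(1000);

    /// Hypothetical test-support setter for the override.
    fn set_deterministic_process_ids(enabled: bool) {
        FORCE_DETERMINISTIC_IDS.store(enabled, Ordering::Relaxed);
    }

    fn allocate_process_id() -> u32 {
        // Unit tests take this branch via `cfg(test)`; integration tests
        // take it after calling the setter.
        if cfg!(test) || FORCE_DETERMINISTIC_IDS.load(Ordering::Relaxed) {
            return NEXT_DETERMINISTIC_ID.fetch_add(1, Ordering::Relaxed);
        }
        // Production path: a randomly keyed hasher serves as a
        // dependency-free source of randomness for this sketch.
        let random = RandomState::new().build_hasher().finish();
        1_000 + (random % 99_000) as u32
    }
    ```

    Because the switch is a static rather than a Cargo feature, the same
    compiled `codex-core` artifact can serve both production and test
    consumers.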
    
    ## Validation
    
    - `just fmt`
    - `cargo test -p codex-core unified_exec::`
    - `cargo test -p codex-core --test all unified_exec -- --test-threads=1`
    - `cargo tree -p codex-core -e features` (verified the removed feature
    path)
  • Update tests to stop using sse_completed fixture (#10638)
    Summary:
    - replace the `sse_completed` fixture and related JSON template with
    direct `responses::ev_completed` payload builders
    - cascade the new SSE helpers through all affected core tests for
    consistency and clarity
    - remove legacy fixtures that were no longer needed once the helpers are
    in place
    
    Testing:
    - Not run (not requested)
  • feat: replace custom mcp-types crate with equivalents from rmcp (#10349)
    We started working with MCP in Codex before
    https://crates.io/crates/rmcp was mature, so we had our own crate for
    MCP types that was generated from the MCP schema:
    
    
    https://github.com/openai/codex/blob/8b95d3e082376f4cb23e92641705a22afb28a9da/codex-rs/mcp-types/README.md
    
    Now that `rmcp` is more mature, it makes more sense to use their MCP
    types in Rust, as they handle details (like the `_meta` field) that our
    custom version ignored. Though one advantage that our custom types had
    is that our generated types implemented `JsonSchema` and `ts_rs::TS`,
    whereas the types in `rmcp` do not. As such, part of the work of this PR
    is leveraging the adapters between `rmcp` types and the serializable
    types that are API for us (app server and MCP) introduced in #10356.
    
    Note this PR results in a number of changes to
    `codex-rs/app-server-protocol/schema`, which merit special attention
    during review. We must ensure that these changes are still
    backwards-compatible, which is possible because we have:
    
    ```diff
    - export type CallToolResult = { content: Array<ContentBlock>, isError?: boolean, structuredContent?: JsonValue, };
    + export type CallToolResult = { content: Array<JsonValue>, structuredContent?: JsonValue, isError?: boolean, _meta?: JsonValue, };
    ```
    
    so `ContentBlock` has been replaced with the more general `JsonValue`.
    Note that `ContentBlock` was defined as:
    
    ```typescript
    export type ContentBlock = TextContent | ImageContent | AudioContent | ResourceLink | EmbeddedResource;
    ```
    
    so the deletion of those individual variants should not be a cause of
    great concern.
    
    Similarly, we have the following change in
    `codex-rs/app-server-protocol/schema/typescript/Tool.ts`:
    
    ```diff
    - export type Tool = { annotations?: ToolAnnotations, description?: string, inputSchema: ToolInputSchema, name: string, outputSchema?: ToolOutputSchema, title?: string, };
    + export type Tool = { name: string, title?: string, description?: string, inputSchema: JsonValue, outputSchema?: JsonValue, annotations?: JsonValue, icons?: Array<JsonValue>, _meta?: JsonValue, };
    ```
    
    so:
    
    - `annotations?: ToolAnnotations` ➡️ `JsonValue`
    - `inputSchema: ToolInputSchema` ➡️ `JsonValue`
    - `outputSchema?: ToolOutputSchema` ➡️ `JsonValue`
    
    and two new fields: `icons?: Array<JsonValue>, _meta?: JsonValue`
    
    ---
    [//]: # (BEGIN SAPLING FOOTER)
    Stack created with [Sapling](https://sapling-scm.com). Best reviewed
    with [ReviewStack](https://reviewstack.dev/openai/codex/pull/10349).
    * #10357
    * __->__ #10349
    * #10356
  • fix: leverage codex_utils_cargo_bin() in codex-rs/core/tests/suite (#8887)
    This eliminates our dependency on the `escargot` crate and better
    prepares us for Bazel builds: https://github.com/openai/codex/pull/8875.
  • chore: unify conversation with thread name (#8830)
    Done and verified by Codex + refactor feature of RustRover
  • feat: introduce codex-utils-cargo-bin as an alternative to assert_cmd::Command (#8496)
    This PR introduces a `codex-utils-cargo-bin` utility crate that
    wraps/replaces our use of `assert_cmd::Command` and
    `escargot::CargoBuild`.
    
    As you can infer from the introduction of `buck_project_root()` in this
    PR, I am attempting to make it possible to build Codex under
    [Buck2](https://buck2.build) as well as `cargo`. With Buck2, I hope to
    achieve faster incremental local builds (largely due to Buck2's
    [dice](https://buck2.build/docs/insights_and_knowledge/modern_dice/)
    build strategy, as well as benefits from its local build daemon) as well
    as faster CI builds if we invest in remote execution and caching.
    
    See
    https://buck2.build/docs/getting_started/what_is_buck2/#why-use-buck2-key-advantages
    for more details about the performance advantages of Buck2.
    
    Buck2 enforces stronger requirements in terms of build and test
    isolation. It discourages assumptions about absolute paths (which is key
    to enabling remote execution). Because the `CARGO_BIN_EXE_*` environment
    variables that Cargo provides are absolute paths (which
    `assert_cmd::Command` reads), this is a problem for Buck2, which is why
    we need this `codex-utils-cargo-bin` utility.
    
    My WIP-Buck2 setup sets the `CARGO_BIN_EXE_*` environment variables
    passed to a `rust_test()` build rule as relative paths.
    `codex-utils-cargo-bin` will resolve these values to absolute paths,
    when necessary.
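
    The resolution step might look roughly like this. It is a sketch, not
    the crate's actual API: `resolve_bin_path` and `cargo_bin_env_var` are
    hypothetical names, and the project root is passed in explicitly
    rather than discovered.

    ```rust
    use std::path::{Path, PathBuf};

    /// Name of the environment variable Cargo sets for a binary target.
    fn cargo_bin_env_var(bin_name: &str) -> String {
        format!("CARGO_BIN_EXE_{bin_name}")
    }

    /// Hypothetical helper: absolute values are returned unchanged, while
    /// relative values are resolved against the project root.
    fn resolve_bin_path(value: &str, project_root: &Path) -> PathBuf {
        let path = Path::new(value);
        if path.is_absolute() {
            path.to_path_buf()
        } else {
            project_root.join(path)
        }
    }
    ```

    Under Cargo the value is already absolute and passes through; under a
    build system that supplies relative paths, the join makes it usable
    regardless of the test's working directory.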
    
    
    ---
    [//]: # (BEGIN SAPLING FOOTER)
    Stack created with [Sapling](https://sapling-scm.com). Best reviewed
    with [ReviewStack](https://reviewstack.dev/openai/codex/pull/8496).
    * #8498
    * __->__ #8496
  • chore: migrate from Config::load_from_base_config_with_overrides to ConfigBuilder (#8276)
    https://github.com/openai/codex/pull/8235 introduced `ConfigBuilder` and
    this PR updates all non-test call sites to use it instead of
    `Config::load_from_base_config_with_overrides()`.
    
    This is important because `load_from_base_config_with_overrides()` uses
    an empty `ConfigRequirements`, which is a reasonable default for testing
    so the tests are not influenced by the settings on the host. This method
    is now guarded by `#[cfg(test)]` so it cannot be used by business logic.
    
    Because `ConfigBuilder::build()` is `async`, many of the test methods
    had to be migrated to be `async`, as well. On the bright side, this made
    it possible to eliminate a bunch of `block_on_future()` stuff.
  • Fix unified_exec on windows (#7620)
    Fix unified_exec on windows
    
    Requires removal of the PSUEDOCONSOLE_INHERIT_CURSOR flag so child
    processes don't attempt to wait for a cursor position response (and
    time out).
    
    
    https://github.com/wezterm/wezterm/compare/main...pakrym:wezterm:PSUEDOCONSOLE_INHERIT_CURSOR?expand=1
    
    ---------
    
    Co-authored-by: pakrym-oai <pakrym@openai.com>
  • feat(core) Add login to shell_command tool (#6846)
    ## Summary
    Adds the `login` parameter to the `shell_command` tool - optional,
    defaults to true.
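
    A sketch of how an optional, default-true `login` flag could be
    handled. The struct, the function, and the mapping of `login` to
    `-lc` versus `-c` are assumptions for illustration, not the tool's
    actual implementation.

    ```rust
    /// Hypothetical parameter shape for the `shell_command` tool.
    struct ShellCommandParams {
        command: String,
        /// Optional; treated as `true` when omitted.
        login: Option<bool>,
    }

    /// Builds the shell invocation, assuming a POSIX-style shell where
    /// `-l` requests a login shell.
    fn shell_argv(shell: &str, params: &ShellCommandParams) -> Vec<String> {
        let login = params.login.unwrap_or(true);
        let flag = if login { "-lc" } else { "-c" };
        vec![shell.to_string(), flag.to_string(), params.command.clone()]
    }
    ```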
    
    ## Testing
    - [x] Tested locally
  • [App-server] v2 for account/updated and account/logout (#6175)
    V2 for `account/updated` and `account/logout` for app server. These
    correspond to the old `authStatusChange` and `LogoutChatGpt`
    respectively. Follow-up PRs will make other v2 endpoints call
    `account/updated` instead of `authStatusChange` too.
  • Add ItemStarted/ItemCompleted events for UserInputItem (#5306)
    Adds a new ItemStarted event and delivers UserMessage as the first item
    type (more to come).
    
    
    Renames `InputItem` to `UserInput` considering we're using the `Item`
    suffix for actual items.
  • test: reduce time dependency on test harness (#5053)
    Tightened the CLI integration tests to stop relying on wall-clock
    sleeps: a new fs watcher helper waits for session files instead of
    timing out, and SSE mocks/fixtures make the flows deterministic.
  • Make output assertions more explicit (#4784)
    Match using precise regexes.
  • chore: refactor tool handling (#4510)
    # Tool System Refactor
    
    - Centralizes tool definitions and execution in `core/src/tools/*`:
    specs (`spec.rs`), handlers (`handlers/*`), router (`router.rs`),
    registry/dispatch (`registry.rs`), and shared context (`context.rs`).
    One registry now builds the model-visible tool list and binds handlers.
    - Router converts model responses to tool calls; Registry dispatches
    with consistent telemetry via `codex-rs/otel` and unified error
    handling. Function, Local Shell, MCP, and experimental `unified_exec`
    all flow through this path; legacy shell aliases still work.
    - Rationale: reduce per‑tool boilerplate, keep spec/handler in sync, and
    make adding tools predictable and testable.
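
    The single-registry idea can be sketched like this. The types below are
    simplified assumptions (string arguments, function-pointer handlers)
    and do not mirror the real `registry.rs`; telemetry is omitted.

    ```rust
    use std::collections::BTreeMap;

    /// Simplified handler signature: raw arguments in, output or error out.
    type Handler = fn(&str) -> Result<String, String>;

    struct ToolSpec {
        name: &'static str,
        description: &'static str,
    }

    /// One registry owns both the model-visible specs and the handlers,
    /// so the two cannot drift apart.
    #[derive(Default)]
    struct ToolRegistry {
        tools: BTreeMap<&'static str, (ToolSpec, Handler)>,
    }

    impl ToolRegistry {
        fn register(&mut self, spec: ToolSpec, handler: Handler) {
            self.tools.insert(spec.name, (spec, handler));
        }

        /// Builds the model-visible tool list.
        fn specs(&self) -> Vec<&ToolSpec> {
            self.tools.values().map(|(spec, _)| spec).collect()
        }

        /// Dispatches a call with uniform handling of unknown tools.
        fn dispatch(&self, name: &str, arguments: &str) -> Result<String, String> {
            match self.tools.get(name) {
                Some((_, handler)) => handler(arguments),
                None => Err(format!("unknown tool: {name}")),
            }
        }
    }

    /// Example handler used to exercise the registry.
    fn echo(arguments: &str) -> Result<String, String> {
        Ok(arguments.to_string())
    }
    ```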
    
    Example: `read_file`
    - Spec: `core/src/tools/spec.rs` (see `create_read_file_tool`,
    registered by `build_specs`).
    - Handler: `core/src/tools/handlers/read_file.rs` (absolute `file_path`,
    1‑indexed `offset`, `limit`, `L#: ` prefixes, safe truncation).
    - E2E test: `core/tests/suite/read_file.rs` validates the tool returns
    the requested lines.
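
    As an illustration of the handler's line selection, here is a sketch
    that assumes the `L#: ` prefix means `L<line number>: `. The function
    name is hypothetical, it works on an in-memory string rather than a
    file, and truncation of long lines is left out.

    ```rust
    /// Returns up to `limit` lines starting at the 1-indexed `offset`,
    /// each prefixed with its line number.
    fn format_lines(contents: &str, offset: usize, limit: usize) -> Result<Vec<String>, String> {
        if offset == 0 {
            return Err("offset must be 1 or greater".to_string());
        }
        Ok(contents
            .lines()
            .enumerate()
            .skip(offset - 1)
            .take(limit)
            // `enumerate` is 0-based, so add 1 to recover the line number.
            .map(|(index, line)| format!("L{}: {}", index + 1, line))
            .collect())
    }
    ```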
    
    ## Next steps:
    - Decompose `handle_container_exec_with_params` 
    - Add parallel tool calls
  • Add codex exec testing helpers (#4254)
    Add a shortcut to create working directories and run codex exec with
    fake server.
  • make tests pass cleanly in sandbox (#4067)
    This changes the reqwest client used in tests to be sandbox-friendly,
    and skips a bunch of other tests that don't work inside the
    sandbox/without network.
  • Add notifier tests (#4064)
    Proposal:
    1. Use anyhow for tests and avoid unwrap
    2. Extract a helper for starting a test instance of codex
  • [tools] Add apply_patch tool (#2303)
    ## Summary
    We've been seeing a number of issues and reports with our synthetic
    `apply_patch` tool, e.g. #802. Let's make this a real tool - in my
    anecdotal testing, it's critical for GPT-OSS models, but I'd like to
    make it the standard across GPT-5 and codex models as well.
    
    ## Testing
    - [x] Tested locally
    - [x] Integration test
  • Added allow-expect-in-tests / allow-unwrap-in-tests (#2328)
    This PR:
    * Added the clippy.toml to configure allowable expect / unwrap usage in
    tests
    * Removed as many expect/allow lines as possible from tests
    * moved a bunch of allows to expects where possible
    
    Note: in integration tests, non-`#[test]` helper functions are not
    covered by this, so we had to leave a few lingering `expect(expect_used)`
    checks around.
  • chore: introduce ConversationManager as a clearinghouse for all conversations (#2240)
    This PR does two things because after I got deep into the first one I
    started pulling on the thread to the second:
    
    - Makes `ConversationManager` the place where all in-memory
    conversations are created and stored. Previously, `MessageProcessor` in
    the `codex-mcp-server` crate was doing this via its `session_map`, but
    this is something that should be done in `codex-core`.
    - It unwinds the `ctrl_c: tokio::sync::Notify` that was threaded
    throughout our code. I think this made sense at one time, but now that
    we handle Ctrl-C within the TUI and have a proper `Op::Interrupt` event,
    I don't think this was quite right, so I removed it. For `codex exec`
    and `codex proto`, we now use `tokio::signal::ctrl_c()` directly, but we
    no longer make `Notify` a field of `Codex` or `CodexConversation`.
    
    Changes of note:
    
    - Adds the files `conversation_manager.rs` and `codex_conversation.rs`
    to `codex-core`.
    - `Codex` and `CodexSpawnOk` are no longer exported from `codex-core`:
    other crates must use `CodexConversation` instead (which is created via
    `ConversationManager`).
    - `core/src/codex_wrapper.rs` has been deleted in favor of
    `ConversationManager`.
    - `ConversationManager::new_conversation()` returns `NewConversation`,
    which is in line with the `new_conversation` tool we want to add to the
    MCP server. Note `NewConversation` includes `SessionConfiguredEvent`, so
    we eliminate checks in cases like `codex-rs/core/tests/client.rs` to
    verify `SessionConfiguredEvent` is the first event because that is now
    internal to `ConversationManager`.
    - Quite a bit of code was deleted from
    `codex-rs/mcp-server/src/message_processor.rs` since it no longer has to
    manage multiple conversations itself: it goes through
    `ConversationManager` instead.
    - `core/tests/live_agent.rs` has been deleted because I had to update a
    bunch of tests and all the tests in here were ignored, and I don't think
    anyone ever ran them, so this was just technical debt, at this point.
    - Removed `notify_on_sigint()` from `util.rs` (and in a follow-up, I
    hope to refactor the blandly-named `util.rs` into more descriptive
    files).
    - In general, I started replacing local variables named `codex` as
    `conversation`, where appropriate, though admittedly I didn't do it
    through all the integration tests because that would have added a lot of
    noise to this PR.
    
    
    
    
    ---
    [//]: # (BEGIN SAPLING FOOTER)
    Stack created with [Sapling](https://sapling-scm.com). Best reviewed
    with [ReviewStack](https://reviewstack.dev/openai/codex/pull/2240).
    * #2264
    * #2263
    * __->__ #2240
  • Re-add markdown streaming (#2029)
    Wait for newlines, then render markdown on a line by line basis. Word wrap it for the current terminal size and then spit it out line by line into the UI. Also adds tests and fixes some UI regressions.
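
    A minimal sketch of the buffering and wrapping steps, under stated
    assumptions: the type and function names are hypothetical, markdown
    rendering itself is omitted, and width is counted in `char`s rather
    than terminal display columns.

    ```rust
    /// Accumulates streamed chunks and releases only complete lines.
    #[derive(Default)]
    struct LineStreamer {
        pending: String,
    }

    impl LineStreamer {
        /// Appends a chunk and returns every line completed by it; any
        /// trailing partial line stays buffered.
        fn push(&mut self, chunk: &str) -> Vec<String> {
            self.pending.push_str(chunk);
            let mut lines = Vec::new();
            while let Some(pos) = self.pending.find('\n') {
                let line: String = self.pending.drain(..=pos).collect();
                lines.push(line.trim_end_matches('\n').to_string());
            }
            lines
        }
    }

    /// Greedy word wrap to at most `width` characters per row.
    fn wrap(line: &str, width: usize) -> Vec<String> {
        let mut rows = Vec::new();
        let mut current = String::new();
        for word in line.split_whitespace() {
            let needed = current.chars().count() + 1 + word.chars().count();
            if !current.is_empty() && needed > width {
                rows.push(std::mem::take(&mut current));
            }
            if !current.is_empty() {
                current.push(' ');
            }
            current.push_str(word);
        }
        // Keep an empty row for blank input so blank lines survive.
        if !current.is_empty() || rows.is_empty() {
            rows.push(current);
        }
        rows
    }
    ```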
  • [core] Allow resume after client errors (#2053)
    ## Summary
    Allow tui conversations to resume after the client fails out of retries.
    I tested this with exec / mocked api failures as well, and it appears to
    be fine. But happy to add an exec integration test as well!
    
    ## Testing
    - [x] Added integration test
    - [x] Tested locally
  • fix: create separate test_support crates to eliminate #[allow(dead_code)] (#1667)
    Because of a quirk of how integration tests work in Rust, we had a
    number of `#[allow(dead_code)]` annotations that were misleading because
    the functions _were_ being used, just not by all integration tests in a
    `tests/` folder, so when compiling a test that did not use the
    function, clippy would complain that it was unused.
    
    This fixes things by creating a "test_support" crate under the `tests/`
    folder that is imported as a dev dependency for the respective crate.