Commit Graph

153 Commits

  • Separate interactive and non-interactive sessions (#4612)
    Do not show exec session in VSCode/TUI selector.
  • Support CODEX_API_KEY for codex exec (#4615)
    Allows setting the API key per invocation of `codex exec`.
  • chore: sandbox extraction (#4286)
    # Extract and Centralize Sandboxing
    - Goal: Improve safety and clarity by centralizing sandbox planning and
      execution.
    - Approach:
      - Add planner (ExecPlan) and backend registry (Direct/Seatbelt/Linux)
        with run_with_plan.
      - Refactor codex.rs to plan-then-execute; handle failures/escalation via
        the plan.
      - Delegate apply_patch to the codex binary and run it with an empty env
        for determinism.
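    
    A minimal sketch of the plan-then-execute shape described above. Only the
    names `ExecPlan`, `run_with_plan`, and the Direct/Seatbelt/Linux backends
    come from the commit message; the fields, the `plan` helper, its selection
    policy, and the string-returning backends are hypothetical stand-ins, not
    the code in codex.rs.
    
    ```rust
    /// Sandbox backends named in the commit message.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    enum Backend {
        Direct,
        Seatbelt,
        Linux,
    }
    
    /// Hypothetical plan: what to run and which backend runs it.
    #[derive(Debug, Clone, PartialEq, Eq)]
    struct ExecPlan {
        command: Vec<String>,
        backend: Backend,
    }
    
    /// Planning step (assumed policy): sandbox unless escalation was approved.
    fn plan(command: &[&str], escalated: bool, is_macos: bool) -> ExecPlan {
        let backend = if escalated {
            Backend::Direct
        } else if is_macos {
            Backend::Seatbelt
        } else {
            Backend::Linux
        };
        ExecPlan {
            command: command.iter().map(|s| s.to_string()).collect(),
            backend,
        }
    }
    
    /// Execution step: dispatch on the plan. A real backend would spawn a
    /// process; this sketch only reports what it would do.
    fn run_with_plan(exec_plan: &ExecPlan) -> String {
        let joined = exec_plan.command.join(" ");
        match exec_plan.backend {
            Backend::Direct => format!("direct: {}", joined),
            Backend::Seatbelt => format!("seatbelt: {}", joined),
            Backend::Linux => format!("linux-sandbox: {}", joined),
        }
    }
    ```
    
    Keeping the backend choice inside the plan means a failed sandboxed run can
    be retried by building a new plan with `escalated = true` rather than by
    branching inside the executor.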
  • fix: remove mcp-types from app server protocol (#4537)
    We continue the separation between `codex app-server` and `codex
    mcp-server`.
    
    In particular, we introduce a new crate, `codex-app-server-protocol`,
    and migrate `codex-rs/protocol/src/mcp_protocol.rs` into it, renaming it
    `codex-rs/app-server-protocol/src/protocol.rs`.
    
    Because `ConversationId` was defined in `mcp_protocol.rs`, we move it
    into its own file, `codex-rs/protocol/src/conversation_id.rs`, and
    because it is referenced in a ton of places, we have to touch a lot of
    files as part of this PR.
    
    We also decide to get away from proper JSON-RPC 2.0 semantics, so we
    also introduce `codex-rs/app-server-protocol/src/jsonrpc_lite.rs`, which
    is basically the same `JSONRPCMessage` type defined in `mcp-types`
    except with all of the `"jsonrpc": "2.0"` removed.
    
    Getting rid of `"jsonrpc": "2.0"` makes our serialization logic
    considerably simpler, as we can lean more heavily on serde to serialize
    directly into the wire format that we use now.
  • OpenTelemetry events (#2103)
    ## otel
    
    Codex can emit [OpenTelemetry](https://opentelemetry.io/) **log events**
    that describe each run: outbound API requests, streamed responses, user
    input, tool-approval decisions, and the result of every tool invocation.
    Export is **disabled by default** so local runs remain self-contained. Opt
    in by adding an `[otel]` table and choosing an exporter.
    
    ```toml
    [otel]
    environment = "staging"   # defaults to "dev"
    exporter = "none"          # defaults to "none"; set to otlp-http or otlp-grpc to send events
    log_user_prompt = false    # defaults to false; redact prompt text unless explicitly enabled
    ```
    
    Codex tags every exported event with `service.name = "codex-cli"`, the CLI
    version, and an `env` attribute so downstream collectors can distinguish
    dev/staging/prod traffic. Only telemetry produced inside the `codex_otel`
    crate (the events listed below) is forwarded to the exporter.
    
    ### Event catalog
    
    Every event shares a common set of metadata fields: `event.timestamp`,
    `conversation.id`, `app.version`, `auth_mode` (when available),
    `user.account_id` (when available), `terminal.type`, `model`, and
    `slug`.
    
    With OTEL enabled, Codex emits the following event types (in addition to
    the metadata above):
    
    - `codex.api_request`
      - `cf_ray` (optional)
      - `attempt`
      - `duration_ms`
      - `http.response.status_code` (optional)
      - `error.message` (failures)
    - `codex.sse_event`
      - `event.kind`
      - `duration_ms`
      - `error.message` (failures)
      - `input_token_count` (completion only)
      - `output_token_count` (completion only)
      - `cached_token_count` (completion only, optional)
      - `reasoning_token_count` (completion only, optional)
      - `tool_token_count` (completion only)
    - `codex.user_prompt`
      - `prompt_length`
      - `prompt` (redacted unless `log_user_prompt = true`)
    - `codex.tool_decision`
      - `tool_name`
      - `call_id`
      - `decision` (`approved`, `approved_for_session`, `denied`, or `abort`)
      - `source` (`config` or `user`)
    - `codex.tool_result`
      - `tool_name`
      - `call_id`
      - `arguments`
      - `duration_ms` (execution time for the tool)
      - `success` (`"true"` or `"false"`)
      - `output`
    
    ### Choosing an exporter
    
    Set `otel.exporter` to control where events go:
    
    - `none` – leaves instrumentation active but skips exporting. This is the
      default.
    - `otlp-http` – posts OTLP log records to an OTLP/HTTP collector. Specify
      the endpoint, protocol, and headers your collector expects:
    
      ```toml
      [otel]
      exporter = { otlp-http = {
        endpoint = "https://otel.example.com/v1/logs",
        protocol = "binary",
        headers = { "x-otlp-api-key" = "${OTLP_TOKEN}" }
      }}
      ```
    
    - `otlp-grpc` – streams OTLP log records over gRPC. Provide the endpoint
      and any metadata headers:
    
      ```toml
      [otel]
      exporter = { otlp-grpc = {
        endpoint = "https://otel.example.com:4317",
        headers = { "x-otlp-meta" = "abc123" }
      }}
      ```
    
    If the exporter is `none`, nothing is written anywhere; otherwise you must
    run or point to your own collector. All exporters run on a background
    batch worker that is flushed on shutdown.
    
    If you build Codex from source, the OTEL crate is still behind an `otel`
    feature flag; the official prebuilt binaries ship with the feature
    enabled. When the feature is disabled, the telemetry hooks become no-ops so
    the CLI continues to function without the extra dependencies.
    
    ---------
    
    Co-authored-by: Anton Panasenko <apanasenko@openai.com>
  • [MCP] Add experimental support for streamable HTTP MCP servers (#4317)
    This PR adds support for streamable HTTP MCP servers when the
    `experimental_use_rmcp_client` is enabled.
    
    To set one up, simply add a new mcp server config with the url:
    ```
    [mcp_servers.figma]
    url = "http://127.0.0.1:3845/mcp"
    ```
    
    It also supports an optional `bearer_token` which will be provided in an
    authorization header. The full oauth flow is not supported yet.
    
    The config parsing will throw if it detects that the user mixed and
    matched config fields (like command + bearer token or url + env).
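    
    The mixed-field check can be sketched roughly as follows. The field names
    `command`, `env`, `url`, and `bearer_token` and the `Transport` name come
    from this PR description; `RawServerConfig`, `parse_transport`, the variant
    names, and the error strings are hypothetical and are not the actual
    parsing code.
    
    ```rust
    use std::collections::HashMap;
    
    /// Raw, untyped fields as they might appear under `[mcp_servers.<name>]`.
    #[derive(Default)]
    struct RawServerConfig {
        command: Option<String>,
        env: Option<HashMap<String, String>>,
        url: Option<String>,
        bearer_token: Option<String>,
    }
    
    /// Hypothetical typed result: either a stdio server or an HTTP server.
    #[derive(Debug, PartialEq)]
    enum Transport {
        Stdio { command: String, env: HashMap<String, String> },
        StreamableHttp { url: String, bearer_token: Option<String> },
    }
    
    /// Rejects configs that mix stdio-only and HTTP-only fields.
    fn parse_transport(raw: RawServerConfig) -> Result<Transport, String> {
        match (raw.command, raw.url) {
            (Some(_), Some(_)) => Err("set either `command` or `url`, not both".to_string()),
            (None, None) => Err("one of `command` or `url` is required".to_string()),
            (Some(command), None) => {
                if raw.bearer_token.is_some() {
                    return Err("`bearer_token` is only valid with `url`".to_string());
                }
                Ok(Transport::Stdio { command, env: raw.env.unwrap_or_default() })
            }
            (None, Some(url)) => {
                if raw.env.is_some() {
                    return Err("`env` is only valid with `command`".to_string());
                }
                Ok(Transport::StreamableHttp { url, bearer_token: raw.bearer_token })
            }
        }
    }
    ```
    
    Resolving the raw fields into an enum once, at parse time, means the rest
    of the code never sees a half-stdio, half-HTTP configuration.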
    
    The best way to review it is to start with `core/src` and then
    `rmcp-client/src/rmcp_client.rs`. The rest is tests and propagating the
    `Transport` struct around the codebase.
    
    Example with the Figma MCP:
    <img width="5084" height="1614" alt="CleanShot 2025-09-26 at 13 35 40"
    src="https://github.com/user-attachments/assets/eaf2771e-df3e-4300-816b-184d7dec5a28"
    />
  • update composer + user message styling (#4240)
    Changes:
    
    - the composer and user messages now have a colored background that
    stretches the entire width of the terminal.
    - the prompt character was changed from a cyan `▌` to a bold `›`.
    - the "working" shimmer now follows the terminal's "dark gray" color,
    better matching the terminal's color scheme.
    
    | Terminal + Background | Screenshot |
    |------------------------------|------------|
    | iTerm with dark bg | <img width="810" height="641" alt="Screenshot 2025-09-25 at 11 44 52 AM" src="https://github.com/user-attachments/assets/1317e579-64a9-4785-93e6-98b0258f5d92" /> |
    | iTerm with light bg | <img width="845" height="540" alt="Screenshot 2025-09-25 at 11 46 29 AM" src="https://github.com/user-attachments/assets/e671d490-c747-4460-af0b-3f8d7f7a6b8e" /> |
    | iTerm with color bg | <img width="825" height="564" alt="Screenshot 2025-09-25 at 11 47 12 AM" src="https://github.com/user-attachments/assets/141cda1b-1164-41d5-87da-3be11e6a3063" /> |
    | Terminal.app with dark bg | <img width="577" height="367" alt="Screenshot 2025-09-25 at 11 45 22 AM" src="https://github.com/user-attachments/assets/93fc4781-99f7-4ee7-9c8e-3db3cd854fe5" /> |
    | Terminal.app with light bg | <img width="577" height="367" alt="Screenshot 2025-09-25 at 11 46 04 AM" src="https://github.com/user-attachments/assets/19bf6a3c-91e0-447b-9667-b8033f512219" /> |
    | Terminal.app with color bg | <img width="577" height="367" alt="Screenshot 2025-09-25 at 11 45 50 AM" src="https://github.com/user-attachments/assets/dd7c4b5b-342e-4028-8140-f4e65752bd0b" /> |
  • [MCP] Introduce an experimental official rust sdk based mcp client (#4252)
    The [official Rust
    SDK](https://github.com/modelcontextprotocol/rust-sdk/tree/57fc428c578a1a3fe851ee0838bf068bda120eb3)
    has come a long way since we first started our mcp client implementation
    5 months ago and, today, it is much more complete than our own
    stdio-only implementation.
    
    This PR introduces a new config flag `experimental_use_rmcp_client`
    which will use a new mcp client powered by the sdk instead of our own.
    
    To keep this PR simple, I've only implemented the same stdio MCP
    functionality that we had but will expand on it with future PRs.
    
    ---------
    
    Co-authored-by: pakrym-oai <pakrym@openai.com>
  • ref: state - 2 (#4229)
    Extract tasks into a module and start abstracting them behind a trait
    (more to come on this, but each task will be tackled in a dedicated PR).
    The goal is to drop `ActiveTask` and to allow (potentially) a set of tasks
    during each turn.
  • Actually mount sse once (#4264)
    Mock server was responding with the same result many times.
  • Add codex exec testing helpers (#4254)
    Add a shortcut to create working directories and run `codex exec` with a
    fake server.
  • make tests pass cleanly in sandbox (#4067)
    This changes the reqwest client used in tests to be sandbox-friendly,
    and skips a bunch of other tests that don't work inside the
    sandbox/without network.
  • Add Reset in for rate limits (#4111)
    - Parse the headers
    - Reorganize the struct because it's getting too long
    - Show the "resets at" time in the TUI
    
    <img width="324" height="79" alt="image"
    src="https://github.com/user-attachments/assets/ca15cd48-f112-4556-91ab-1e3a9bc4683d"
    />
  • Send limits when getting rate limited (#4102)
    Users need visibility on rate limits when they are rate limited.
  • Add exec output-schema parameter (#4079)
    Adds structured output to `exec` via the `--output-schema` parameter.
  • chore: compact do not modify instructions (#4088)
    Keep the developer instructions and insert the summarisation message as a
    user message instead.
  • Use TestCodex builder in stream retry tests (#4096)
    ## Summary
    - refactor the stream retry integration tests to construct conversations
    through `TestCodex`
    - remove bespoke config and tempdir setup now handled by the shared
    builder
    
    ## Testing
    - cargo test -p codex-core --test all stream_error_allows_next_turn::continue_after_stream_error
    - cargo test -p codex-core --test all stream_no_completed::retries_on_early_close
    
    ------
    https://chatgpt.com/codex/tasks/task_i_68d2b94d83888320bc75a0bc3bd77b49
  • Add notifier tests (#4064)
    Proposal:
    1. Use anyhow for tests and avoid unwrap
    2. Extract a helper for starting a test instance of codex
  • feat: update default (#4076)
    Changes:
    - Default model and docs now use gpt-5-codex. 
    - Disables the GPT-5 Codex NUX by default.
    - Keeps presets available for API key users.
  • chore: clippy on redundant closure (#4058)
    Add the redundant-closure clippy rules and let Codex fix the resulting
    warnings by minimising FQP.
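    
    For readers unfamiliar with the lint, a small illustration of what
    `clippy::redundant_closure` flags and the usual fix. The `double` function
    and both helpers are made up for this example and do not come from the
    Codex codebase.
    
    ```rust
    fn double(x: i32) -> i32 {
        x * 2
    }
    
    /// Before: the closure only forwards its argument to `double`, which is
    /// the pattern `clippy::redundant_closure` reports.
    fn doubled_with_closure(values: &[i32]) -> Vec<i32> {
        values.iter().copied().map(|x| double(x)).collect()
    }
    
    /// After: pass the function itself instead of wrapping it in a closure.
    fn doubled_with_path(values: &[i32]) -> Vec<i32> {
        values.iter().copied().map(double).collect()
    }
    ```
    
    Both functions return the same result; the second simply drops the wrapper.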
  • chore: unify cargo versions (#4044)
    Unify cargo versions at root
  • Forward Rate limits to the UI (#3965)
    We currently get information about rate limits in the response headers.
    We want to forward them to the clients to have better transparency.
    UI/UX plans have been discussed and this information is needed.
  • Use helpers instead of fixtures (#3888)
    Move to using test helper methods everywhere.
  • fix: ensure cwd for conversation and sandbox are separate concerns (#3874)
    Prior to this PR, both of these functions took a single `cwd`:
    
    
    https://github.com/openai/codex/blob/71038381aa0f51aa62e1a2bcc7cbf26a05b141f3/codex-rs/core/src/seatbelt.rs#L19-L25
    
    
    https://github.com/openai/codex/blob/71038381aa0f51aa62e1a2bcc7cbf26a05b141f3/codex-rs/core/src/landlock.rs#L16-L23
    
    whereas `cwd` and `sandbox_cwd` should be set independently (fixed in
    this PR).
    
    Added `sandbox_distinguishes_command_and_policy_cwds()` to
    `codex-rs/exec/tests/suite/sandbox.rs` to verify this.
  • Make ESC button work when auto-compaction (#3857)
    Only emit a task-finished event when the compaction comes from a `/compact`
    command.
  • Add dev message upon review out (#3758)
    Proposal: We want to record a dev message like so:
    
    ```
    {
      "type": "message",
      "role": "user",
      "content": [
        {
          "type": "input_text",
          "text": "<user_action>
      <context>User initiated a review task. Here's the full review output from reviewer model. User may select one or more comments to resolve.</context>
      <action>review</action>
      <results>
      {findings_str}
      </results>
    </user_action>"
        }
      ]
    },
    ```
    
    The message is recorded without being shown in the chat transcript.
    
    This is a rough idea, but it fixes the issue where the user finishes a
    review thread and asks the parent to "fix the rest of the review issues",
    thinking that the parent knows about it.
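    
    A rough sketch of how the text of that message could be rendered. The tag
    layout and the `findings_str` placeholder are taken from the proposal
    above; the function name `review_user_action` and the exact whitespace are
    assumptions.
    
    ```rust
    /// Renders the `<user_action>` text from the proposal, substituting the
    /// reviewer's findings for the `{findings_str}` placeholder.
    fn review_user_action(findings_str: &str) -> String {
        format!(
            "<user_action>\n  <context>User initiated a review task. Here's the full review output from reviewer model. User may select one or more comments to resolve.</context>\n  <action>review</action>\n  <results>\n  {}\n  </results>\n</user_action>",
            findings_str
        )
    }
    ```
    
    The returned string would be placed in the `text` field of the
    `input_text` content item shown in the JSON above.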
    
    ### Question: Why not a tool call?
    
    Because a human made the call, not the agent. We also haven't implemented
    sub-agents yet, and we'll need to think about how we represent these
    human-led tool calls for the agent.
  • Review mode core updates (#3701)
    1. Adds the environment prompt (including cwd) to review thread
    2. Prepends the review prompt as a user message (temporary fix so the
    instructions are not replaced on backend)
    3. Sets reasoning to low
    4. Sets default review model to `gpt-5-codex`
  • fix: Record EnvironmentContext in SendUserTurn (#3678)
    ## Summary
    SendUserTurn has not been correctly handling updates to policies. While
    the tui protocol handles this in `Op::OverrideTurnContext`, the
    SendUserTurn should be appending `EnvironmentContext` messages when the
    sandbox settings change. MCP client behavior should match the cli
    behavior, so we update `SendUserTurn` message to match.
    
    ## Testing
    - [x] Added prompt caching tests
  • Revert "refactor transcript view to handle HistoryCells" (#3614)
    Reverts openai/codex#3538
    It panics when forking the first message. It also calculates the index
    incorrectly.
  • enable-resume (#3537)
    Adds the ability to resume conversations. There is one verb, `resume`.
    
    Behavior:
    
    `tui`:
    - `codex resume`: opens the session picker
    - `codex resume --last`: continues the last conversation
    - `codex resume <session id>`: continues the conversation with `session id`
    
    `exec`:
    - `codex resume --last`: continues the last conversation
    - `codex resume <session id>`: continues the conversation with `session id`
    
    Implementation:
    - I added a function to find the path in `~/.codex/sessions/` with a
    `UUID`. This is helpful in resuming with session id.
    - Added the above mentioned flags
    - Added lots of testing
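    
    A minimal sketch of the lookup described in the first implementation
    bullet, assuming session files live somewhere under the sessions directory
    and carry the UUID in their file name. The function name, the nested
    directory layout, and the file-name convention are all assumptions, not
    the code added by this PR.
    
    ```rust
    use std::fs;
    use std::io;
    use std::path::{Path, PathBuf};
    
    /// Recursively searches `root` for a file whose name contains
    /// `session_id`, returning the first match in directory-listing order.
    fn find_session_path(root: &Path, session_id: &str) -> io::Result<Option<PathBuf>> {
        for entry in fs::read_dir(root)? {
            let path = entry?.path();
            if path.is_dir() {
                // Descend into subdirectories before giving up on this branch.
                if let Some(found) = find_session_path(&path, session_id)? {
                    return Ok(Some(found));
                }
            } else if path
                .file_name()
                .and_then(|name| name.to_str())
                .map_or(false, |name| name.contains(session_id))
            {
                return Ok(Some(path));
            }
        }
        Ok(None)
    }
    ```
    
    A caller would pass the sessions directory as `root` and fall back to the
    session picker or an error when the result is `None`.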
  • Fix flaky windows test (#3564)
    There are exactly 4 types of flaky tests in Windows x86 right now:
    
    1. `review_input_isolated_from_parent_history` => Times out waiting for
    closing events
    2. `review_does_not_emit_agent_message_on_structured_output` => Times
    out waiting for closing events
    3. `auto_compact_runs_after_token_limit_hit` => Times out waiting for
    closing events
    4. `auto_compact_runs_after_token_limit_hit` => Also has a problem where
    auto compact should add a third request, but receives 4 requests.
    
    1, 2, and 3 seem to be solved by increasing the threads on the Windows
    runner from 2 to 4.
    
    It is not yet clear why 4 is happening, but it is probably also caused by
    WireMock issues on Windows causing races.
  • Include command output when sending timeout to model (#3576)
    Being able to see the output helps the model decide how to handle the
    timeout.
  • Handle resuming/forking after compact (#3533)
    We need to construct the history differently when compaction happens. For
    this, we consider only the history after the compaction and convert the
    compaction itself to a response item.
    
    This needs to change to use `build_compact_history` once #3446 is merged.
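    
    One way to picture the rule above, as a sketch only: keep everything from
    the last compaction onward and turn the compaction marker into an ordinary
    history entry. The `RolloutItem` variants, the string-based history, and
    the `[summary]` prefix are hypothetical simplifications.
    
    ```rust
    /// Simplified stand-in for recorded session items.
    #[derive(Debug, Clone, PartialEq)]
    enum RolloutItem {
        ResponseItem(String),
        Compacted { summary: String },
    }
    
    /// Keeps only the history from the last compaction onward, converting the
    /// compaction marker itself into a response item.
    fn history_for_resume(items: &[RolloutItem]) -> Vec<String> {
        // With no compaction recorded, the whole history is kept.
        let start = items
            .iter()
            .rposition(|item| matches!(item, RolloutItem::Compacted { .. }))
            .unwrap_or(0);
        items[start..]
            .iter()
            .map(|item| match item {
                RolloutItem::ResponseItem(text) => text.clone(),
                RolloutItem::Compacted { summary } => format!("[summary] {}", summary),
            })
            .collect()
    }
    ```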
  • refactor transcript view to handle HistoryCells (#3538)
    No (intended) functional change.
    
    This refactors the transcript view to hold a list of HistoryCells
    instead of a list of Lines. This simplifies and makes much of the logic
    more robust, as well as laying the groundwork for future changes, e.g.
    live-updating history cells in the transcript.
    
    Similar to #2879 in goal. Fixes #2755.
  • Always request encrypted cot (#3539)
    Otherwise future requests will fail with 500
  • Review Mode (Core) (#3401)
    ## 📝 Review Mode -- Core
    
    This PR introduces the Core implementation for Review mode:
    
    - New op `Op::Review { prompt: String }`: spawns a child review task
    with isolated context, a review-specific system prompt, and a
    `Config.review_model`.
    - `EnteredReviewMode`: emitted when the child review session starts.
    Every event from this point onwards reflects the review session.
    - `ExitedReviewMode(Option<ReviewOutputEvent>)`: emitted when the review
    finishes or is interrupted, with optional structured findings:
    
    ```json
    {
      "findings": [
        {
          "title": "<≤ 80 chars, imperative>",
          "body": "<valid Markdown explaining *why* this is a problem; cite files/lines/functions>",
          "confidence_score": <float 0.0-1.0>,
          "priority": <int 0-3>,
          "code_location": {
            "absolute_file_path": "<file path>",
            "line_range": {"start": <int>, "end": <int>}
          }
        }
      ],
      "overall_correctness": "patch is correct" | "patch is incorrect",
      "overall_explanation": "<1-3 sentence explanation justifying the overall_correctness verdict>",
      "overall_confidence_score": <float 0.0-1.0>
    }
    ```
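    
    The finding shape above could map onto Rust types roughly like this. The
    field names mirror the JSON; the concrete numeric types, the
    `validate_finding` helper, and the `start <= end` check are assumptions
    made for illustration, not the types introduced by this PR.
    
    ```rust
    /// Field names mirror the JSON above; the Rust types are assumptions.
    #[derive(Debug, Clone, PartialEq)]
    struct LineRange {
        start: u32,
        end: u32,
    }
    
    #[derive(Debug, Clone, PartialEq)]
    struct CodeLocation {
        absolute_file_path: String,
        line_range: LineRange,
    }
    
    #[derive(Debug, Clone, PartialEq)]
    struct ReviewFinding {
        title: String,
        body: String,
        confidence_score: f32,
        priority: u8,
        code_location: CodeLocation,
    }
    
    /// Hypothetical check of the documented ranges for a single finding.
    fn validate_finding(finding: &ReviewFinding) -> Result<(), String> {
        if finding.title.chars().count() > 80 {
            return Err("title must be at most 80 characters".to_string());
        }
        if !(0.0..=1.0).contains(&finding.confidence_score) {
            return Err("confidence_score must be within 0.0-1.0".to_string());
        }
        if finding.priority > 3 {
            return Err("priority must be within 0-3".to_string());
        }
        let range = &finding.code_location.line_range;
        if range.start > range.end {
            return Err("line_range.start must not exceed line_range.end".to_string());
        }
        Ok(())
    }
    ```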
    
    ## Questions
    
    ### Why separate out its own message history?
    
    We want the review thread to match the training of our review models as
    much as possible -- that means using a custom prompt, removing user
    instructions, and starting a clean chat history.
    
    We also want to make sure the review thread doesn't leak into the parent
    thread.
    
    ### Why do this as a mode, vs. sub-agents?
    
    1. We want review to be a synchronous task, so it's fine for now to do a
    bespoke implementation.
    2. We're still unclear about the final structure for sub-agents. We'd
    prefer to land this quickly and then refactor into sub-agents without
    rushing that implementation.
  • Add Azure Responses API workaround (#3528)
    The Azure Responses API doesn't work well with `store: false` and response
    items:
    
    - If `store = false` and an ID is sent, an error is thrown saying the ID is
      not found.
    - If `store = false` and no ID is sent, an error is thrown saying an ID is
      required.
    
    Add detection for Azure URLs and a workaround that preserves reasoning
    item IDs and sends `store: true`.
  • feat: context compaction (#3446)
    ## Compact feature:
    1. Stops the model when the context window becomes too large.
    2. Adds a user turn asking the model to summarize.
    3. Builds a bridge, rendered from a template, that contains all previous
    user messages plus the summary.
    4. Starts sampling again from a clean conversation containing only that
    bridge.
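    
    The four steps can be sketched as follows. This is an illustration under
    stated assumptions: the `Message` type, the token-count threshold check,
    the `summarize` callback standing in for the model's summary turn, the
    bridge being a user message, and the bridge wording are all hypothetical;
    the real bridge is rendered from a template not shown in this commit
    message.
    
    ```rust
    #[derive(Debug, Clone, PartialEq)]
    enum Message {
        User(String),
        Assistant(String),
    }
    
    /// Step 3: collect every previous user message plus the summary.
    fn build_bridge(history: &[Message], summary: &str) -> Message {
        let mut text = String::from("Previous user messages:\n");
        for message in history {
            if let Message::User(user_text) = message {
                text.push_str("- ");
                text.push_str(user_text);
                text.push('\n');
            }
        }
        text.push_str("Summary:\n");
        text.push_str(summary);
        Message::User(text)
    }
    
    /// Steps 1, 2 and 4: when over the limit, ask for a summary and restart
    /// from a history that contains only the bridge.
    fn compact_if_needed(
        history: Vec<Message>,
        token_count: usize,
        limit: usize,
        summarize: impl Fn(&[Message]) -> String,
    ) -> Vec<Message> {
        if token_count <= limit {
            return history;
        }
        let summary = summarize(&history);
        vec![build_bridge(&history, &summary)]
    }
    ```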
  • feat: reasoning effort as optional (#3527)
    Allow the reasoning effort to be optional
  • bug: fix model save (#3525)
    Fix these two behaviors:
    1. The model does not get saved unless we press CTRL + S
    2. The reasoning effort gets saved
  • Add Compact and Turn Context to the rollout items (#3444)
    Adding compact and turn context to the rollout items
    
    based on #3440
  • NIT unified exec (#3479)
    Fix the default value of the experimental flag of unified_exec