codex

[codex] Propagate traces through exec-server HTTP (#30117 )

Fixes distributed trace continuity across exec-server JSON-RPC HTTP
egress by adding an executor client span and injecting its W3C context
through a reusable `codex-otel` helper.

This preserves the caller trace across core/tool → executor →
provider/MCP instead of dropping parentage at raw reqwest.

Note that this does not cover the WebSocket path, which is still needed
for end-to-end trace continuity. This change covers only the basic HTTP
path.
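As a rough illustration of the injection step, the sketch below formats a W3C Trace Context `traceparent` value and places it in an outgoing header map. `SpanContext`, `traceparent`, and `inject` are hypothetical names for this example and are not the `codex-otel` helper's API; only the header name and its `version-traceid-parentid-flags` layout come from the W3C specification.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the executor client span's context.
struct SpanContext {
    trace_id: u128,
    span_id: u64,
    sampled: bool,
}

/// Formats a W3C `traceparent` value: version "00", a 32-hex-digit trace id,
/// a 16-hex-digit parent (span) id, and a 2-hex-digit flags byte.
fn traceparent(ctx: &SpanContext) -> String {
    let flags: u8 = if ctx.sampled { 1 } else { 0 };
    format!("00-{:032x}-{:016x}-{:02x}", ctx.trace_id, ctx.span_id, flags)
}

/// Adds the header to an outgoing request's headers (modelled as a plain map).
fn inject(ctx: &SpanContext, headers: &mut HashMap<String, String>) {
    headers.insert("traceparent".to_string(), traceparent(ctx));
}

fn main() {
    let ctx = SpanContext {
        trace_id: 0x4bf92f3577b34da6a3ce929d0e0e4736,
        span_id: 0x00f067aa0ba902b7,
        sampled: true,
    };
    let mut headers = HashMap::new();
    inject(&ctx, &mut headers);
    println!("{}", headers["traceparent"]);
}
```

The receiving service would parse this header and parent its own spans under the same trace id, which is what keeps the caller trace intact across the hop.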

Tom · 2026-06-25 23:22:22 +00:00

8ce931ab76

[codex] Record exec-server lifecycle metrics (#27467 )

## Summary

- Record bounded connection, request, and process lifecycle metrics.
- Report active gauges from callbacks on every collection, including
delta exports.
- Serialize active-count updates so concurrent starts and finishes
cannot publish stale values.
- Serialize process exit, explicit termination, and shutdown through the
process registry so exactly one completion result wins.
- Keep the implementation small with single-owner RAII guards and one
real OTLP/HTTP integration test using the existing `wiremock`
dependency.

## Root cause

Process exit and session shutdown previously used cloned completion
state. That avoided duplicate emission, but it duplicated lifecycle
ownership and made the ordering harder to reason about. The process
registry mutex already defines the lifecycle ordering, so the final
implementation stores the metric guard and termination flag directly on
the process entry. Whichever path claims the entry first owns the
completion result.

Production metric export uses delta temporality. Event-only synchronous
gauge recordings disappear after the next collection when no count
changes, so active counts now use observable callbacks that report
current state on every collection.
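The ownership rule and the active-count reporting described above can be sketched roughly as follows. This is a simplified model, not the exec-server code: `ProcessRegistry`, `ProcessEntry`, `ActiveGuard`, and `Completion` are hypothetical names, and the atomic counter stands in for the state an observable gauge callback would read on each collection.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Completion {
    Exited,
    Terminated,
    Shutdown,
}

/// RAII guard: counts one active process for as long as it is alive.
struct ActiveGuard(Arc<AtomicI64>);

impl ActiveGuard {
    fn new(active: Arc<AtomicI64>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        ActiveGuard(active)
    }
}

impl Drop for ActiveGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}

/// The entry is the single owner of the metric guard.
struct ProcessEntry {
    _guard: ActiveGuard,
}

struct ProcessRegistry {
    active: Arc<AtomicI64>,
    entries: Mutex<HashMap<u32, ProcessEntry>>,
}

impl ProcessRegistry {
    fn new() -> Self {
        ProcessRegistry {
            active: Arc::new(AtomicI64::new(0)),
            entries: Mutex::new(HashMap::new()),
        }
    }

    fn start(&self, id: u32) {
        let entry = ProcessEntry {
            _guard: ActiveGuard::new(Arc::clone(&self.active)),
        };
        self.entries
            .lock()
            .expect("registry mutex poisoned")
            .insert(id, entry);
    }

    /// Whichever caller removes the entry first owns the completion result;
    /// later callers get `None`. Dropping the entry releases its guard.
    fn claim(&self, id: u32, result: Completion) -> Option<Completion> {
        let entry = self
            .entries
            .lock()
            .expect("registry mutex poisoned")
            .remove(&id);
        entry.map(|_entry| result)
    }

    /// What a gauge callback would report on every collection.
    fn active_count(&self) -> i64 {
        self.active.load(Ordering::SeqCst)
    }
}

fn main() {
    let registry = ProcessRegistry::new();
    registry.start(7);
    registry.start(8);
    println!("active = {}", registry.active_count());
    println!("{:?}", registry.claim(7, Completion::Exited));
    println!("{:?}", registry.claim(7, Completion::Shutdown));
    println!("{:?}", registry.claim(8, Completion::Terminated));
    println!("active = {}", registry.active_count());
}
```

Because the count is read from current state rather than recorded as an event, a delta exporter would still see the value on collections where nothing changed.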

The cleanup also removes the constant `result="accepted"` connection
tag, redundant route and response assertions, a custom HTTP collector,
and fallback initialization machinery that did not add behavior.

## Stack

Review and land this stack in order:

1. #27466 — trace exec-server JSON-RPC requests
2. #27467 — record bounded connection, request, and process lifecycle
metrics **(this PR)**
3. #27470 — observe remote registration and Noise rendezvous lifecycle

## Validation

- `just test -p codex-exec-server --lib` (158 passed)
- `just test -p codex-cli --test exec_server` (3 passed)
- `just test -p codex-otel
observable_gauge_is_collected_on_every_delta_snapshot` (1 passed)
- `CARGO_BUILD_JOBS=1 just fix -p codex-otel -p codex-exec-server`
- `just fmt`
- `git diff --check`

richardopenai · 2026-06-25 11:02:11 -07:00

2dec46e30a

feat: use run agent task auth for inference (#19051 )

## Stack

This is PR 3 of the simplified HAI single-run-task stack:

- [#19047](https://github.com/openai/codex/pull/19047) Agent Identity
assertion and task-registration primitives, including the shared
run-task helper used by existing Agent Identity JWT auth.
- [#19049](https://github.com/openai/codex/pull/19049)
Disabled-by-default ChatGPT auth opt-in that provisions/reuses persisted
Agent Identity runtime auth and its single run task.
- [#19051](https://github.com/openai/codex/pull/19051) Run-scoped
provider auth that uses one backend-owned task id for first-party
inference and compaction requests.

[#19054](https://github.com/openai/codex/pull/19054) collapsed out of
the active stack because the simplified design no longer needs a
separate background/control-plane task helper.

## Summary

This PR moves Agent Identity usage into provider auth resolution. That
keeps `AgentAssertion` auth tied to first-party OpenAI provider requests
instead of applying a late session-wide override that could affect
local, custom, Bedrock, API-key, or external-bearer providers.

What changed:

- adds a small `ProviderAuthScope` struct carrying the run auth policy
and session source needed by provider-scoped auth resolution
- lets `Session` opt the existing `ModelClient` into `ChatGptAuth`
policy when `use_agent_identity` is enabled, without adding a second
model-client constructor
- resolves Agent Identity only for first-party OpenAI provider auth
paths
- uses the persisted run task id from the `AgentIdentityAuth` record to
build `AgentAssertion` auth for Responses requests
- routes shared request setup through scoped provider auth so unary
compact requests use the same run-task assertion path as inference turns
- keeps local/custom/Bedrock/env-key/external-bearer provider auth
unchanged
- lets missing run-task state surface through the existing model-request
error path instead of silently falling back to bearer auth

This PR intentionally does not create thread-scoped, target-scoped, or
background-scoped task identities. The run task is the only task Codex
registers in this POC shape.
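A minimal sketch of the scoping rule, under stated assumptions: `Provider`, `ResolvedAuth`, `AuthError`, and `resolve_auth` are hypothetical names chosen for illustration, and the real `ProviderAuthScope` carries more state than the two inputs modelled here. It shows only that Agent Identity applies to the first-party provider and that missing run-task state is an error rather than a silent bearer fallback.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Provider {
    OpenAiFirstParty,
    Bedrock,
    Custom,
}

#[derive(Debug, PartialEq, Eq)]
enum ResolvedAuth {
    /// Assertion auth built from the persisted run task id.
    AgentAssertion { run_task_id: String },
    /// Whatever the provider used before this change.
    ProviderDefault,
}

#[derive(Debug, PartialEq, Eq)]
enum AuthError {
    MissingRunTask,
}

fn resolve_auth(
    provider: &Provider,
    use_agent_identity: bool,
    run_task_id: Option<&str>,
) -> Result<ResolvedAuth, AuthError> {
    match (provider, use_agent_identity) {
        // Only first-party requests opt into Agent Identity.
        (Provider::OpenAiFirstParty, true) => run_task_id
            .map(|id| ResolvedAuth::AgentAssertion {
                run_task_id: id.to_string(),
            })
            // Surface the missing state instead of falling back to bearer auth.
            .ok_or(AuthError::MissingRunTask),
        // Every other provider keeps its existing auth path.
        _ => Ok(ResolvedAuth::ProviderDefault),
    }
}

fn main() {
    println!("{:?}", resolve_auth(&Provider::OpenAiFirstParty, true, Some("task-1")));
    println!("{:?}", resolve_auth(&Provider::OpenAiFirstParty, true, None));
    println!("{:?}", resolve_auth(&Provider::Bedrock, true, Some("task-1")));
    println!("{:?}", resolve_auth(&Provider::Custom, false, None));
}
```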

## Testing

- `just test -p codex-model-provider`
- `just test -p codex-core client::tests::provider_auth_scope_uses`
- `just test -p codex-core remote_compact_uses_agent_identity_assertion`

Adrian · 2026-06-24 22:31:41 -07:00

51864b0b4b

auth: move domain mode below app wire types (#29721 )

## Why

Authentication mode is a domain concept used by login, model selection,
telemetry, and transports. Keeping the canonical type in app-server
protocol forces those lower-level crates to depend on an unrelated wire
API.

## What changed

- Added canonical `codex_protocol::auth::AuthMode` domain values.
- Kept the app-server wire DTO unchanged and added an explicit app-side
conversion.
- Removed production app-server-protocol dependencies from login,
model-provider-info, models-manager, and otel call paths.
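The layering could look roughly like the sketch below, where the domain type knows nothing about the wire DTO and the conversion lives on the app side. The module names, the `AuthModeDto` name, and the two variants are assumptions for illustration; the actual `codex_protocol::auth::AuthMode` values are not listed in this description.

```rust
/// Lower-level domain crate: no dependency on the wire API.
mod domain {
    /// Hypothetical subset of the domain auth modes.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum AuthMode {
        ApiKey,
        ChatGpt,
    }
}

/// App-server side: owns the wire DTO and the explicit conversion.
mod wire {
    use super::domain::AuthMode;

    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum AuthModeDto {
        ApiKey,
        ChatGpt,
    }

    impl From<AuthMode> for AuthModeDto {
        fn from(mode: AuthMode) -> Self {
            // An exhaustive match makes a new domain value a compile error here
            // until the wire mapping is decided.
            match mode {
                AuthMode::ApiKey => AuthModeDto::ApiKey,
                AuthMode::ChatGpt => AuthModeDto::ChatGpt,
            }
        }
    }
}

fn main() {
    let api_key = wire::AuthModeDto::from(domain::AuthMode::ApiKey);
    let chatgpt = wire::AuthModeDto::from(domain::AuthMode::ChatGpt);
    println!("{:?} {:?}", api_key, chatgpt);
}
```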

## Stack

This is PR 2 of 6, stacked on [PR
#29714](https://github.com/openai/codex/pull/29714). Review only the
delta from `codex/split-json-rpc-protocols`. Next: [PR
#29722](https://github.com/openai/codex/pull/29722).

## Validation

- Auth and login coverage passed in the focused protocol/domain test
run.
- App-server account and auth conversion coverage passed.

Adam Perry @ OpenAI · 2026-06-24 03:10:20 +00:00

31372078d1

[codex] Use input items for Responses Lite tools (#27946 )

When using Responses Lite, we should use `additional_tools` and a
developer item instead of the top-level tools array and instructions
field. This keeps things 1-to-1.

Forced namespacing for _all_ tools will land in a following PR after
some coordination & fixes in Responses API (around collisions & return
items).

The goal is to eventually expand the scope of this to _all_ requests
from codex, but that will require larger coordination across providers &
slower rollout.

rka-oai · 2026-06-22 23:56:16 -07:00

33cc928d33

[codex] Expose service tier and reasoning effort in OTEL (#29155 )

## Summary

NVIDIA asked to measure Fast mode usage and reasoning effort from Codex
CLI OTEL logs. Add the finalized `service_tier` and
`model_reasoning_effort` to the existing `codex.sse_event`
`response.completed` record.

This intentionally reuses the existing completion event and leaves
transport APIs and shared telemetry plumbing unchanged.

## Testing

- `cargo build -p codex-cli --bin codex`
- `just test -p codex-core responses_api_emits_api_request_event`
- End-to-end with the built CLI and a local OTLP/HTTP collector:
- Fast/high emitted `service_tier=priority` and
`model_reasoning_effort=high` with token usage.
- Standard/low omitted `service_tier` and emitted
`model_reasoning_effort=low` with token usage.

daniel-oai · 2026-06-22 20:44:48 -07:00

e0ac5d3c15

Stop logging every Responses WebSocket event (#29432 )

## Why

Every successful Responses WebSocket event currently produces three
local log records: the full payload at TRACE, an OpenTelemetry log
event, and an OpenTelemetry trace event.

On busy threads these records fill the 1,000-row log partition in
seconds and cause continuous SQLite insert-and-prune churn.

Related to
https://openai.slack.com/archives/C095U48JNL9/p1782128972644209

## What changed

- Stop logging each successful Responses WebSocket payload at TRACE.
- Stop emitting `codex.websocket_event` as OpenTelemetry log and trace
events.
- Keep WebSocket event counters, duration metrics, response timing
metrics, parsing, and error handling.

jif · 2026-06-22 17:43:08 +02:00

91e6da943b

Propagate safety buffering events to app-server clients (#29371 )

Responses API safety buffering metadata currently stops at the transport
boundary, so app-server clients cannot render the in-progress safety
review state.

This change:
- decodes and deduplicates `safety_buffering` metadata from Responses
API SSE and WebSocket events without suppressing the original response
event
- emits a typed core event containing the requested model plus backend
use cases and reasons
- forwards that event as `turn/safetyBuffering/updated` through
app-server v2 and updates generated protocol schemas
- keeps the side-channel event out of persisted rollouts and turn timing

This supports the Codex Apps buffering UX and depends on the Responses
API backend work in https://github.com/openai/openai/pull/1044569 and
https://github.com/openai/openai/pull/1044571.

Validation:
- focused `codex-core` safety-buffering integration test passes
- `cargo check -p codex-core -p codex-app-server -p
codex-app-server-protocol`
- `just fix -p codex-api -p codex-protocol -p codex-core -p
codex-app-server-protocol -p codex-app-server -p codex-rollout -p
codex-rollout-trace -p codex-otel`
- `just fmt`
- broad package test run: 4,430/4,492 passed; 62 unrelated
local-environment/concurrency failures involved unavailable test
binaries, MCP subprocess setup, and app-server timeouts

Francis Chalissery · 2026-06-22 03:39:14 +00:00

566f7bf631

[codex] Use expect in integration tests (#28441 )

The workspace denies `clippy::expect_used` in production. Although
`clippy.toml` allows `expect` in tests, Bazel Clippy compiles
integration-test helper code in a way that does not receive that
exemption, which encouraged verbose `unwrap_or_else(... panic!(...))`
and equivalent `match`/`let else` forms.

This allows `clippy::expect_used` once at each integration-test crate
root (including aggregated suites and test-support libraries), then
replaces manual panic-based Result and Option unwraps with
`expect`/`expect_err`. Standalone `tests/*.rs` files remain their own
crate roots. Intentional assertion and unexpected-variant panics remain
unchanged, and the production `expect_used = "deny"` lint remains in
place.
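For illustration, the before and after forms look roughly like this. `parse_port` and the three wrappers are hypothetical test helpers, and the example places the `allow` on individual functions so it stays self-contained; the change itself applies the attribute once at each integration-test crate root.

```rust
use std::num::ParseIntError;

fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    raw.parse::<u16>()
}

/// Before: the verbose form that the lint encouraged.
fn port_verbose(raw: &str) -> u16 {
    parse_port(raw).unwrap_or_else(|err| panic!("port should parse: {}", err))
}

/// After: `expect` states the invariant directly.
#[allow(clippy::expect_used)]
fn port_with_expect(raw: &str) -> u16 {
    parse_port(raw).expect("port should parse")
}

/// After: `expect_err` replaces a manual match on the error case.
#[allow(clippy::expect_used)]
fn port_error(raw: &str) -> ParseIntError {
    parse_port(raw).expect_err("port should be rejected")
}

fn main() {
    println!("{}", port_verbose("8080"));
    println!("{}", port_with_expect("8080"));
    println!("{}", port_error("not-a-port"));
}
```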

The cleanup is mechanical and net-negative in line count.

pakrym-oai · 2026-06-15 21:53:47 -07:00

e752f7b4ae

[codex] Add second-based OTEL duration histograms (#27058 )

## Why

Exec-server request and connection latencies need fractional-second
histograms. The existing duration API records integer milliseconds and
uses millisecond-scale buckets.

## What changed

- Adds a described duration API that records `Duration` values as
fractional seconds.
- Uses second-scale explicit histogram boundaries.
- Caches duration histograms by name, unit, and description, matching
the existing instrument caching model.
- Covers exact boundaries, representative bucket placement, fractional
sums, and exported metadata.
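To make the bucket placement concrete, here is a small sketch that converts a `Duration` to fractional seconds and finds its bucket in a list of explicit upper bounds. The boundary values and the `bucket_index` helper are assumptions for this example, not the ones `codex-otel` ships; the upper-inclusive rule follows the OpenTelemetry explicit-bucket histogram convention.

```rust
use std::time::Duration;

/// Hypothetical second-scale boundaries, in seconds.
const BOUNDARIES_SECONDS: [f64; 5] = [0.01, 0.1, 0.5, 1.0, 5.0];

/// Returns the bucket index for `elapsed`. Bucket `i` covers
/// `(boundaries[i - 1], boundaries[i]]`; the last index is the overflow bucket.
fn bucket_index(boundaries: &[f64], elapsed: Duration) -> usize {
    let seconds = elapsed.as_secs_f64();
    // Count the boundaries strictly below the value, so a value equal to a
    // boundary lands in the bucket that the boundary closes.
    boundaries.partition_point(|upper| *upper < seconds)
}

fn main() {
    for millis in [0u64, 500, 1500, 10_000] {
        let elapsed = Duration::from_millis(millis);
        println!(
            "{:>6} ms -> {:.3} s -> bucket {}",
            millis,
            elapsed.as_secs_f64(),
            bucket_index(&BOUNDARIES_SECONDS, elapsed)
        );
    }
}
```

Recording seconds as `f64` keeps sub-second latencies distinguishable, which integer milliseconds with millisecond-scale buckets cannot do for a second-scale view.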

This PR only adds the duration primitive. It does not add exec-server
adoption.

## Stack

1. #26091: counter descriptions
2. #27057: gauge instruments
3. **#27058: second-based duration histograms**
4. #25019: initialize exec-server OpenTelemetry at startup

Related independent coverage: #27059 tests OTLP HTTP log and trace event
export.

## Validation

- `just test -p codex-otel`

richardopenai · 2026-06-15 17:10:52 -07:00

8aac63f477

feat(core): add metadata field to ResponseItem (#28355 )

## Description

This PR adds an optional `metadata` field to `ResponseItem` for
Responses API calls. Only mechanical plumbing, no actual values
populated and sent yet. Turns out just adding a new field to
`ResponseItem` has quite a large blast radius already.

This change is backwards compatible because `metadata` is optional and
omitted when absent, so existing response items and rollout history
without it still deserialize and requests that do not set it keep the
same wire shape. For provider compatibility, we strip out `metadata`
before non-OpenAI Responses requests so Azure and AWS Bedrock never see
this field.

My followup PR here will actually make use of it to start storing and
passing along `turn_id`: https://github.com/openai/codex/pull/28360

## What changed

- Added `ResponseItemMetadata` with optional `turn_id`, plus optional
`metadata` on Responses API item variants and inter-agent communication.
- Preserved item metadata through response-item rewrites such as
truncation, missing tool-output synthesis, compaction history
rebuilding, visible-history conversion, rollout/resume, and generated
app-server schemas/types.
- Strip item metadata from non-OpenAI Responses requests while
preserving it for OpenAI-shaped requests.
- Updated the mechanical fixture/test construction churn required by the
new optional field.

Owen Lin · 2026-06-15 15:05:28 -07:00

040dafa32d

[codex] Cover OTLP HTTP log and trace event export (#27059 )

## Why

The generic OTLP HTTP paths for log events and trace events need
end-to-end coverage before exec-server relies on them.

## What changed

- Adds loopback coverage for exporting `codex_otel.log_only` events to
`/v1/logs`.
- Verifies `codex_otel.trace_safe` events are present in the exported
trace payload.

This is a test-only PR. It does not change OTEL runtime behavior or
metric APIs.

## Related work

- #26091: counter descriptions
- #27057: gauge instruments
- #27058: second-based duration histograms

This PR is independent and can land directly on `main`.

## Validation

- `just test -p codex-otel`
- `just fix -p codex-otel`
- `just fmt`

richardopenai · 2026-06-15 09:59:26 -07:00

bed60e3746

feat: add Bedrock API key as a managed auth mode (#27443 )

## Why

Codex needs to manage Amazon Bedrock API key credentials through the
existing auth lifecycle instead of introducing a separate auth manager
or provider-specific credential file. Treating Bedrock API key login as
a primary auth mode gives it the same persistence, keyring, reload, and
logout behavior as the existing OpenAI API key and ChatGPT modes.

The credential is valid only for the `amazon-bedrock` model provider.
OpenAI-compatible providers must reject this auth mode rather than
treating the Bedrock key as an OpenAI bearer token.

## What changed

- Added `bedrockApiKey` as an app-server `AuthMode` and
`CodexAuth::BedrockApiKey` as a primary `AuthManager` mode.
- Added `BedrockApiKeyAuth`, containing the API key and AWS region, to
the existing `AuthDotJson` payload stored in `$CODEX_HOME/auth.json` or
the configured keyring backend.
- Added `login_with_bedrock_api_key(...)`, parallel to
`login_with_api_key(...)`, which replaces the current stored login with
Bedrock credentials.
- Reused generic auth reload and logout behavior instead of adding a
Bedrock-specific auth manager or logout path.
- Updated login restrictions, status reporting, diagnostics, telemetry
classification, generated app-server schemas, and auth fixtures for the
new mode.
- Added explicit errors when Bedrock API key auth is selected with an
OpenAI-compatible model provider.

This PR establishes managed storage and auth-mode behavior. Routing the
managed key and region into Amazon Bedrock requests will be in follow-up
PRs.

Celia Chen · 2026-06-10 20:42:38 -07:00

06afd63f4a

[codex] Add reusable OTEL gauge instruments (#27057 )

## Why

Exec-server observability needs current-value measurements in addition
to counters. The reusable OTEL client should expose that primitive
without coupling it to exec-server runtime behavior.

## What changed

- Adds integer gauge instruments, with optional descriptions.
- Caches gauges by name and description so instrument metadata remains
part of the declaration identity.
- Covers gauge values, descriptions, merged attributes, and OTLP HTTP
export.

This PR only adds the gauge primitive. It does not add second-based
duration histograms or exec-server adoption.

## Stack

1. #26091: counter descriptions
2. **#27057: gauge instruments**
3. #27058: second-based duration histograms

Related independent coverage: #27059 tests OTLP HTTP log and trace event
export.

## Validation

- `just test -p codex-otel`
- `just fix -p codex-otel`
- `just fmt`

richardopenai · 2026-06-10 21:36:38 +00:00

7e5e41daea

[codex] Add OTEL counter descriptions (#26091 )

## Why

Metric descriptions should be declared with reusable OTEL instruments
instead of being coupled to individual consumers. Counter descriptions
are the smallest API primitive needed by the exec-server observability
work.

## What changed

- Adds `counter_with_description` while preserving the existing counter
API.
- Caches counters by name and description so instrument metadata remains
part of the declaration identity.
- Covers the exported description together with the existing value and
attribute contract.
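One way to picture the caching rule is a map keyed by both name and description, as in the sketch below. `CounterCache`, `Counter`, and `get_or_create` are hypothetical names and do not reflect the signature of `counter_with_description`; the point is only that two declarations share an instrument when both parts of the key match.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for an OTEL counter instrument.
#[derive(Default)]
struct Counter {
    value: AtomicU64,
}

impl Counter {
    fn add(&self, amount: u64) {
        self.value.fetch_add(amount, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// Caches instruments by (name, description) so metadata is part of identity.
#[derive(Default)]
struct CounterCache {
    counters: Mutex<HashMap<(String, Option<String>), Arc<Counter>>>,
}

impl CounterCache {
    fn get_or_create(&self, name: &str, description: Option<&str>) -> Arc<Counter> {
        let key = (name.to_string(), description.map(str::to_string));
        let mut counters = self.counters.lock().expect("counter cache poisoned");
        counters.entry(key).or_default().clone()
    }
}

fn main() {
    let cache = CounterCache::default();
    let described = cache.get_or_create("requests", Some("Total requests"));
    let same = cache.get_or_create("requests", Some("Total requests"));
    let plain = cache.get_or_create("requests", None);
    described.add(2);
    same.add(3);
    println!("described = {}, plain = {}", described.get(), plain.get());
}
```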

This PR only adds counter descriptions. It does not add gauges,
second-based durations, or exec-server adoption.

## Stack

1. **#26091: counter descriptions**
2. #27057: gauge instruments
3. #27058: second-based duration histograms

Related independent coverage: #27059 tests OTLP HTTP log and trace event
export.

The `codex-exec-server` bounded service tag now stays with the
exec-server adoption change instead of this reusable infrastructure
stack.

## Validation

- `just test -p codex-otel`
- `just fix -p codex-otel`
- `just fmt`

richardopenai · 2026-06-08 22:29:51 +00:00

feca160da4

[codex-rs] support v2 personal access tokens (#25731 )

## Summary

- add v2 personal access token support for `codex login
--with-access-token` and `CODEX_ACCESS_TOKEN`
- classify opaque `at-` tokens separately from legacy Agent Identity
JWTs
- hydrate required ChatGPT account metadata through AuthAPI
`/v1/user-auth-credential/whoami`
- use PATs directly as bearer tokens while preserving existing ChatGPT
account surfaces
- expose PAT-backed auth as the explicit `personalAccessToken`
app-server auth mode

## Implementation

PAT auth is intentionally small and stateless. Loading a PAT performs
one AuthAPI metadata request, stores the hydrated metadata in the
in-memory auth object, and redacts the secret from debug output. Legacy
Agent Identity JWT handling remains unchanged. The shared access-token
classifier lives in a private neutral module because it dispatches
between both credential types.

PAT hydration fails closed when AuthAPI omits any required metadata,
including email. Hydrated metadata is intentionally not persisted:
startup performs a live `whoami` preflight so revoked tokens or changed
account metadata are not accepted from a stale cache.
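The classification, fail-closed hydration, and debug redaction described above can be sketched as follows. All type and function names here (`AccessTokenKind`, `WhoAmI`, `PatAuth`, `classify`, `hydrate`) are hypothetical, and the network call to `whoami` is replaced by a plain struct so the example stays self-contained.

```rust
use std::fmt;

#[derive(Debug, PartialEq, Eq)]
enum AccessTokenKind {
    PersonalAccessToken,
    AgentIdentityJwt,
}

/// Opaque `at-` tokens are PATs; anything else takes the legacy JWT path.
fn classify(token: &str) -> AccessTokenKind {
    if token.starts_with("at-") {
        AccessTokenKind::PersonalAccessToken
    } else {
        AccessTokenKind::AgentIdentityJwt
    }
}

/// Stand-in for the metadata returned by the `whoami` request.
struct WhoAmI {
    email: Option<String>,
    account_id: Option<String>,
}

/// In-memory auth object; never persisted in this sketch.
struct PatAuth {
    secret: String,
    email: String,
    account_id: String,
}

impl PatAuth {
    /// Fails closed: any missing required field rejects the token.
    fn hydrate(secret: &str, whoami: WhoAmI) -> Result<PatAuth, String> {
        let email = whoami.email.ok_or("whoami response is missing email")?;
        let account_id = whoami
            .account_id
            .ok_or("whoami response is missing account id")?;
        Ok(PatAuth {
            secret: secret.to_string(),
            email,
            account_id,
        })
    }

    /// The PAT is used directly as a bearer token.
    fn authorization_header(&self) -> String {
        format!("Bearer {}", self.secret)
    }
}

impl fmt::Debug for PatAuth {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // Redact the secret so it never reaches debug output.
        f.debug_struct("PatAuth")
            .field("secret", &"<redacted>")
            .field("email", &self.email)
            .field("account_id", &self.account_id)
            .finish()
    }
}

fn main() {
    println!("{:?}", classify("at-example"));
    let whoami = WhoAmI {
        email: Some("user@example.com".to_string()),
        account_id: Some("acct-1".to_string()),
    };
    match PatAuth::hydrate("at-example", whoami) {
        Ok(auth) => println!("{:?} -> {}", auth, auth.authorization_header()),
        Err(err) => println!("rejected: {}", err),
    }
}
```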

## Workspace restriction scope

This change intentionally does **not** apply
`forced_chatgpt_workspace_id` to PAT authentication. The setting is a
client-side config guardrail, not an authorization boundary, and PAT
does not currently require workspace-ID parity. The PAT login and
`CODEX_ACCESS_TOKEN` paths therefore validate through AuthAPI without
threading workspace-restriction state through access-token loading.
Existing workspace checks for non-PAT auth remain on their established
paths.

## App-server compatibility

The public app-server `AuthMode` is shared across v1 and v2, and
PAT-backed auth reports `personalAccessToken` through both APIs.
Following human review, this intentionally removes the temporary v1
compatibility mapping that reported PATs as `chatgpt`; the deprecated v1
API is kept in parity with v2 rather than maintaining a separate closed
enum. Clients with exhaustive auth-mode handling in either API version
must add the new case and should generally treat it as ChatGPT-backed
unless they need PAT-specific behavior.

The v1 auth-status response still omits the raw PAT when `includeToken`
is requested because that response cannot carry the account metadata
needed to reuse the credential safely. Persisted PAT auth also omits the
new enum value so older Codex builds can deserialize `auth.json` and
infer PAT auth from the credential field after a rollback.

## Validation

Latest review-fix validation:

- `CARGO_INCREMENTAL=0 just test -p codex-login` (126 passed)
- `CARGO_INCREMENTAL=0 just test -p codex-cli` (263 passed)
- `CARGO_INCREMENTAL=0 just test -p codex-cli
stored_auth_validation_handles_personal_access_token`
- `CARGO_INCREMENTAL=0 just test -p codex-app-server-protocol` (226
passed)
- `CARGO_INCREMENTAL=0 just test -p codex-models-manager
refresh_available_models_uses_remote_only_catalog_for_chatgpt_auth`
- `CARGO_INCREMENTAL=0 just test -p codex-tui
existing_non_oauth_chatgpt_login_counts_as_signed_in`
- `CARGO_INCREMENTAL=0 just fix -p codex-login -p
codex-app-server-protocol -p codex-models-manager -p codex-tui -p
codex-cli`
- `just fmt`
- `git diff --check`

The broader `codex-tui` suite previously compiled and ran 2,834 tests.
Three unrelated environment-sensitive guardian/IDE-socket tests failed
after retries; the PAT-relevant TUI coverage passed.

cooper-oai · 2026-06-05 17:36:18 -07:00

df7818c7d1

[codex] Forward turn moderation metadata through app-server (#25710 )

## Why
First-party backends can supply turn-scoped moderation metadata that
app-server clients need for client-side presentation. Exposing this as
an experimental typed notification lets opted-in clients consume it
without interpreting raw Responses API events.

## What changed
- forward `response.metadata.openai_chatgpt_moderation_metadata` from
Responses API SSE and WebSocket streams as turn-scoped moderation
metadata
- emit the experimental app-server v2 `turn/moderationMetadata`
notification with `{ threadId, turnId, metadata }`
- add app-server integration coverage for the typed moderation metadata
notification

## Testing
- `just test -p codex-core
build_ws_client_metadata_includes_window_lineage_and_turn_metadata`
- `just test -p codex-core` (fails locally: 46 failures and 1 timeout,
primarily missing `test_stdio_server` and shell snapshot timeouts)
- `just test -p codex-app-server-protocol`
- `just test -p codex-app-server
turn_moderation_metadata_emits_typed_notification_v2`
- `just test -p codex-app-server` (fails locally: 792 passed, 10 failed,
and 5 timed out; failures are in existing environment-sensitive tests,
primarily because nested macOS `sandbox-exec` is not permitted)
- `just write-app-server-schema --experimental --schema-root
/tmp/codex-app-server-schema-experimental`

carlc-oai · 2026-06-05 02:41:06 -07:00

55aa071b17

Encrypt multi-agent v2 message payloads (#26210 )

## Why

Multi-agent v2 currently routes agent instructions through normal tool
arguments and inter-agent context. That means the parent model can emit
plaintext task text, Codex can persist it in history/rollouts, and the
recipient can receive it as ordinary assistant-message JSON.

This changes the v2 path so agent instructions stay encrypted between
model calls: Responses encrypts the `message` argument returned by the
model, Codex forwards only that ciphertext, and Responses decrypts it
internally for the recipient model.

## What changed

- Mark the v2 `message` parameter as encrypted for `spawn_agent`,
`send_message`, and `followup_task`.
- Treat multi-agent v2 tool `message` values as ciphertext
unconditionally.
- Store v2 inter-agent task text in
`InterAgentCommunication.encrypted_content` with empty plaintext
`content`.
- Convert encrypted inter-agent communications into the Responses
`agent_message` input item before sending the child request.
- Preserve `agent_message` items across history, rollout, compaction,
telemetry, and app-server schema paths.
- Leave multi-agent v1 unchanged.

## Message shape

The model still calls the v2 tools with a `message` argument, but that
value is now ciphertext:

```json
{
  "name": "spawn_agent",
  "arguments": {
    "task_name": "worker",
    "message": "<ciphertext>"
  }
}
```

Codex stores the task as encrypted inter-agent communication:

```json
{
  "author": "/root",
  "recipient": "/root/worker",
  "content": "",
  "encrypted_content": "<ciphertext>",
  "trigger_turn": true
}
```

When Codex builds the recipient request, it forwards the ciphertext
using the new Responses input item:

```json
{
  "type": "agent_message",
  "author": "/root",
  "recipient": "/root/worker",
  "content": [
    {
      "type": "encrypted_content",
      "encrypted_content": "<ciphertext>"
    }
  ]
}
```

Responses decrypts that item internally for the recipient model.

## Context impact

- Parent context no longer carries plaintext v2 agent task instructions
from these tool arguments.
- Codex rollout/history stores ciphertext for v2 agent instructions.
- Recipient requests receive an `agent_message` item instead of
assistant commentary JSON for encrypted task delivery.
- Plaintext completion/status notifications are still plaintext because
they are Codex-generated status messages, not encrypted model tool
arguments.

## Validation

- `just test -p codex-tools`
- `just test -p codex-protocol`
- `just test -p codex-rollout`
- `just test -p codex-rollout-trace`
- `just test -p codex-otel`
- `just write-app-server-schema`

jif · 2026-06-05 10:25:57 +02:00

5f4d06ef18

[codex] Emit sandbox outcome telemetry event (#25955 )

## Summary

Adds a dedicated `codex.sandbox_outcome` telemetry event so we can query
sandbox edge outcomes without threading sandbox metadata through
tool-result output types.

This is meant to make sandbox failures and approved escalation retries
visible in OTEL while keeping the existing `codex.tool_result` event
shape focused on tool completion data.

## What changed

- Adds `SessionTelemetry::sandbox_outcome(...)`, which emits
`codex.sandbox_outcome` as both a log and trace event.
- Records the tool name, call id, sandbox outcome, initial attempt
duration, and escalated attempt duration when a retry runs.
- Emits `denied` when the sandbox blocks execution and no retry is run.
- Emits `timed_out` and `signal` when those sandbox errors surface from
tool execution.
- Emits `escalated` when the initial sandboxed attempt fails and the
approved unsandboxed retry succeeds.
- Adds OTEL coverage for the new event payload, including timing fields.
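The outcome values listed above might map from sandbox errors roughly as in this sketch. `SandboxError`, `Retry`, and `outcome` are hypothetical names; the string values match the bullets, but the case where an approved retry also fails is not described here and is deliberately left unmodelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SandboxOutcome {
    Denied,
    TimedOut,
    Signal,
    Escalated,
}

impl SandboxOutcome {
    /// Value recorded on the telemetry event.
    fn as_str(self) -> &'static str {
        match self {
            SandboxOutcome::Denied => "denied",
            SandboxOutcome::TimedOut => "timed_out",
            SandboxOutcome::Signal => "signal",
            SandboxOutcome::Escalated => "escalated",
        }
    }
}

/// Hypothetical error surfaced by the initial sandboxed attempt.
#[derive(Debug, Clone, Copy)]
enum SandboxError {
    Denied,
    Timeout,
    Signal,
}

/// Whether an approved unsandboxed retry ran and succeeded.
#[derive(Debug, Clone, Copy)]
enum Retry {
    NotRun,
    Succeeded,
}

fn outcome(error: SandboxError, retry: Retry) -> SandboxOutcome {
    match (error, retry) {
        // A successful approved retry is reported as an escalation.
        (_, Retry::Succeeded) => SandboxOutcome::Escalated,
        (SandboxError::Denied, Retry::NotRun) => SandboxOutcome::Denied,
        (SandboxError::Timeout, Retry::NotRun) => SandboxOutcome::TimedOut,
        (SandboxError::Signal, Retry::NotRun) => SandboxOutcome::Signal,
    }
}

fn main() {
    println!("{}", outcome(SandboxError::Denied, Retry::NotRun).as_str());
    println!("{}", outcome(SandboxError::Denied, Retry::Succeeded).as_str());
    println!("{}", outcome(SandboxError::Timeout, Retry::NotRun).as_str());
    println!("{}", outcome(SandboxError::Signal, Retry::NotRun).as_str());
}
```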

## Validation

- `RUST_MIN_STACK=8388608 just test -p codex-core
sandbox_outcome_event_records_outcome
handle_sandbox_error_user_approves_retry_records_tool_decision`
- `just test -p codex-otel
otel_export_routing_policy_routes_tool_result_log_and_trace_events
runtime_metrics_summary_collects_tool_api_and_streaming_metrics`
- `just fix -p codex-core`
- `just fix -p codex-otel`

rreichel3-oai · 2026-06-04 20:58:14 -04:00

ecae412740

[codex] Support model-defined reasoning efforts (#26444 )

## Summary
- accept non-empty model-defined reasoning effort values while
preserving built-in effort behavior
- propagate the non-Copy effort type through core, app-server, TUI,
telemetry, and persistence call sites
- preserve string wire encoding and expose an open-string schema for
clients
- update model selection and shortcut behavior for model-advertised
effort values

## Root cause
`ReasoningEffort` gained a string-backed custom variant, so it could no
longer implement `Copy` or rely on derived closed-enum serialization.
Existing consumers still moved effort values from shared references and
assumed a fixed built-in value set.
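A reduced example of why the call sites had to change: once an enum carries a `String`, it can no longer be `Copy`, so reading it through a shared reference needs an explicit clone. The variant set, `parse`, and `as_wire` below are assumptions for illustration and may not match the real `ReasoningEffort`.

```rust
/// Hypothetical shape: built-in values plus a string-backed custom variant.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ReasoningEffort {
    Low,
    Medium,
    High,
    Custom(String),
}

impl ReasoningEffort {
    /// Accepts any non-empty value, keeping built-ins as dedicated variants.
    fn parse(raw: &str) -> Option<Self> {
        match raw {
            "" => None,
            "low" => Some(ReasoningEffort::Low),
            "medium" => Some(ReasoningEffort::Medium),
            "high" => Some(ReasoningEffort::High),
            other => Some(ReasoningEffort::Custom(other.to_string())),
        }
    }

    /// String wire encoding is the same for built-in and custom values.
    fn as_wire(&self) -> &str {
        match self {
            ReasoningEffort::Low => "low",
            ReasoningEffort::Medium => "medium",
            ReasoningEffort::High => "high",
            ReasoningEffort::Custom(value) => value.as_str(),
        }
    }
}

struct Config {
    effort: ReasoningEffort,
}

fn effort_for_request(config: &Config) -> ReasoningEffort {
    // Returning `config.effort` directly would move out of a shared
    // reference, which compiled only while the type was `Copy`.
    config.effort.clone()
}

fn main() {
    for raw in ["low", "medium", "high", "model-specific", ""] {
        println!("{:?} -> {:?}", raw, ReasoningEffort::parse(raw));
    }
    let config = Config {
        effort: ReasoningEffort::Custom("model-specific".to_string()),
    };
    println!("{}", effort_for_request(&config).as_wire());
}
```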

## Validation
- `just fmt`
- Local tests and compilation were not run per request; relying on CI.

Ahmed Ibrahim · 2026-06-04 13:36:24 -07:00

8ac304c299

Add Guardian review metrics (#24897 )

## Why

Guardian reviews already emit analytics events, but we do not expose
aggregate OpenTelemetry metrics for review volume, latency, token usage,
or terminal outcomes. That makes it harder to monitor Guardian behavior
during rollouts and to compare review outcomes by source, action type,
session kind, model, and failure mode.

## What Changed

- Added Guardian review metric names for count, total duration, time to
first token, and token usage in `codex-rs/otel`.
- Added `core/src/guardian/metrics.rs` to convert
`GuardianReviewAnalyticsResult` into sanitized metric tags covering
decision, terminal status, failure reason, approval request source,
reviewed action, session kind, risk/outcome, model, reasoning effort,
and context/truncation state.
- Emitted the new metrics from `track_guardian_review` for each terminal
Guardian review result.

## Testing

- Added
`guardian_review_metrics_record_counts_durations_and_token_usage`, which
verifies the emitted count, duration, TTFT, token usage histograms, and
tag set through the in-memory metrics exporter.

jif-oai · 2026-05-28 14:07:25 +02:00

3abf96739b

otel: drop legacy profile usage telemetry (#24061 )

## Summary
- drop the dead legacy profile usage metric and active-profile
conversation-start fields
- update role comments so they describe provider and service-tier
preservation without legacy config-profile wording
- pair the code cleanup with the file-backed profile docs update in
openai/developers-website#1476

## Testing
- `just fmt`
- `cargo test -p codex-otel`
- `cargo test -p codex-core` *(fails: existing stack overflow in
`mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message`)*
- `cargo test -p codex-core --lib
mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message`
*(fails with the same stack overflow)*

jif-oai · 2026-05-22 13:14:44 +02:00

47476e8a8a

Split plugin install discovery into list and request tools (#23372 )

## Summary
- Add `list_available_plugins_to_install` as the inventory step for
plugin and connector install suggestions.
- Slim `request_plugin_install` so it only handles the actual
elicitation, instead of carrying the full discoverable list in its
prompt.
- Emit send-time telemetry when an install elicitation is dispatched,
including requested tool identity in the event payload.
- Emit install-result telemetry through `SessionTelemetry`, including
tool type, user response action, and completion status.
- Update registration and tests to cover the new two-step flow while
keeping the existing `tool_suggest` feature gate unchanged.

## Testing
- `just fmt`
- `cargo test -p codex-tools`
- `cargo test -p codex-core request_plugin_install`
- `cargo test -p codex-core list_available_plugins_to_install`
- `cargo test -p codex-core
install_suggestion_tools_can_be_registered_without_search_tool`
- `cargo test -p codex-otel
manager_records_plugin_install_suggestion_metric`
- `cargo test -p codex-otel
manager_records_plugin_install_elicitation_sent_metric`
- `just fix -p codex-core`
- `just fix -p codex-tools`
- `just fix -p codex-otel`
- `cargo check -p codex-core`

Matthew Zeng · 2026-05-19 14:45:37 -07:00

8335b56c33

goal: pause continuation loops on usage limits and blockers (#23094 )

Addresses #22833, #22245, #23067

## Why
`/goal` can keep synthesizing turns even when the next turn cannot make
meaningful progress. Hard usage exhaustion can replay failing turns, and
repeated permission or external-resource blockers can keep burning
tokens while waiting for user or system intervention.

## What changed
- Add resumable `blocked` and `usageLimited` goal states. As with
`paused`, goal continuation stops with these states.
- Move to `usageLimited` after usage-limit failures.
- Allow the built-in `update_goal` tool to set `blocked` only under
explicit repeated-impasse guidance. The goal continuation prompt now
specifies that the agent should use `blocked` only after it has made at
least three attempts to get past an impasse.
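As a rough model of the state rules above, the sketch below treats `blocked` and `usageLimited` like `paused`: continuation stops, and the goal stays resumable. `GoalStatus`, `should_continue`, and `may_set_blocked` are hypothetical names, and the three-attempt threshold is prompt guidance in the change itself, so enforcing it in code here is an assumption for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GoalStatus {
    Active,
    Paused,
    Blocked,
    UsageLimited,
}

impl GoalStatus {
    /// Only an active goal keeps synthesizing continuation turns.
    fn should_continue(self) -> bool {
        matches!(self, GoalStatus::Active)
    }

    /// Stopped states can be resumed later.
    fn is_resumable(self) -> bool {
        !self.should_continue()
    }
}

/// Mirrors the "at least three attempts" guidance as a simple check.
fn may_set_blocked(attempts_at_impasse: u32) -> bool {
    attempts_at_impasse >= 3
}

/// A usage-limit failure moves the goal out of the continuation loop.
fn after_usage_limit_failure(_current: GoalStatus) -> GoalStatus {
    GoalStatus::UsageLimited
}

fn main() {
    let statuses = [
        GoalStatus::Active,
        GoalStatus::Paused,
        GoalStatus::Blocked,
        GoalStatus::UsageLimited,
    ];
    for status in statuses {
        println!(
            "{:?}: continue = {}, resumable = {}",
            status,
            status.should_continue(),
            status.is_resumable()
        );
    }
    println!("{:?}", after_usage_limit_failure(GoalStatus::Active));
    println!("{}", may_set_blocked(2));
}
```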

Most of the files in this PR are touched only because of the small
app-server protocol update.

## Validation

I manually reproduced a number of situations where an agent can run into
a true impasse and verified that it properly enters `blocked` state. I
then resumed and verified that it once again entered `blocked` state
several turns later if the impasse still exists.

I also manually reproduced the usage-limit condition by creating a
simulated responses API endpoint that returns 429 errors with the
appropriate error message. Verified that the goal runtime properly moves
the goal into `usageLimited` state and the TUI updates appropriately.
Verified that `/goal resume` resumes (and immediately goes back into
`usageLimited` state if appropriate).


## Follow-up PRs

Small changes will be needed to the GUI clients to properly handle the
two new states.

Eric Traut · 2026-05-18 11:28:53 -07:00

0d344aca9b

chore: goal resumed metrics (#23301 )

Add metrics for goal resume

jif-oai · 2026-05-18 15:19:23 +02:00

500ef67ed1

Preserve image detail in app-server inputs (#20693 )

## Summary

- Add optional image detail to user image inputs across core, app-server
v2, thread history/event mapping, and the generated app-server
schemas/types.
- Preserve requested detail when serializing Responses image inputs:
omitted detail stays on the existing `high` default, while explicit
`original` keeps local images on the original-resolution path.
- Support `high`/`original` consistently for tool image outputs,
including MCP `codex/imageDetail`, code-mode image helpers, and
`view_image`.
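
The defaulting rule can be sketched as follows. This is an illustrative shape only: `ImageDetail`, `effective_detail`, and `wire_value` are hypothetical names, and the real types in core and app-server v2 may differ.

```rust
/// Hypothetical detail enum covering the two values named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageDetail {
    High,
    Original,
}

/// Omitted detail stays on the existing `high` default; an explicit
/// value is preserved as requested.
pub fn effective_detail(requested: Option<ImageDetail>) -> ImageDetail {
    requested.unwrap_or(ImageDetail::High)
}

/// String form assumed for serialization.
pub fn wire_value(detail: ImageDetail) -> &'static str {
    match detail {
        ImageDetail::High => "high",
        ImageDetail::Original => "original",
    }
}
```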

Curtis 'Fjord' Hawthorne · 2026-05-15 15:04:04 -07:00

8543e39885

[codex] Use compaction_trigger item for remote compaction v2 (#22809 )

## Why

Remote compaction v2 was still using `context_compaction` as both the
request trigger and the compacted output shape. The Responses API now
has the landed contract for this flow: Codex sends a dedicated `{
"type": "compaction_trigger" }` input item, and the backend returns the
standard `compaction` output item with encrypted content.

This aligns the v2 path with that wire contract while preserving the
existing local compacted-history post-processing behavior.

## What changed

- Add `ResponseItem::CompactionTrigger` and regenerate the app-server
protocol schema fixtures.
- Send `compaction_trigger` from `remote_compaction_v2` instead of a
payload-less `context_compaction`.
- Collect exactly one backend `compaction` output item, then reuse the
existing compacted-history rebuilding path.
- Treat the trigger item as a transient request marker rather than model
output or persisted rollout/memory content.
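
A minimal sketch of the "exactly one `compaction` output item" rule, assuming a simplified output type. `OutputItem`, `CompactionError`, and `single_compaction` are hypothetical names and not the actual `codex-core` API.

```rust
/// Simplified stand-in for backend output items (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OutputItem {
    Compaction { encrypted_content: String },
    Other(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum CompactionError {
    Missing,
    /// Carries the number of compaction items that were found.
    Multiple(usize),
}

/// Returns the encrypted content when the output holds exactly one
/// compaction item, and an error otherwise.
pub fn single_compaction(items: &[OutputItem]) -> Result<&str, CompactionError> {
    let mut found = items.iter().filter_map(|item| match item {
        OutputItem::Compaction { encrypted_content } => Some(encrypted_content.as_str()),
        OutputItem::Other(_) => None,
    });
    // `next()` takes the first match; `count()` tallies any extras.
    match (found.next(), found.count()) {
        (None, _) => Err(CompactionError::Missing),
        (Some(content), 0) => Ok(content),
        (Some(_), extra) => Err(CompactionError::Multiple(extra + 1)),
    }
}
```

Non-compaction items are ignored here; whether the real path tolerates or rejects them is not stated in this description.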

## Verification

- `cargo test -p codex-protocol compaction_trigger`
- `cargo test -p codex-core remote_compact_v2`
- `cargo test -p codex-core compact_remote_v2`
- `cargo test -p codex-core
responses_websocket_sends_response_processed_after_remote_compaction_v2`
- `just write-app-server-schema`
- `cargo test -p codex-app-server-protocol schema_fixtures`

jif-oai · 2026-05-15 11:40:35 +02:00

0322ac3df8

Simplify MCP tool handler plumbing (#21595 )

## Why
The MCP tool path had accumulated a few core-owned special cases: a
dedicated payload variant, resolver plumbing, a legacy `AfterToolUse`
translation path, and a side channel for parallel-call metadata. That
made `ToolRegistry` and the spec builder know more about MCP than they
needed to.

This change moves MCP-specific execution details back onto `ToolInfo`
and `McpHandler` so `codex-core` can treat MCP calls like normal
function calls while still preserving MCP-specific dispatch and
telemetry behavior where it belongs.

## What changed
- removed `resolve_mcp_tool_info`, `ToolPayload::Mcp`, `ToolKind`, and
the remaining registry-side MCP resolver path
- stored MCP routing metadata directly on `McpHandler` and `ToolInfo`,
including `supports_parallel_tool_calls`
- deleted the legacy `AfterToolUse` consumer in `core`, which removes
the need for handler-specific `after_tool_use_payload` implementations
- switched tool-result telemetry to handler-provided tags and kept
MCP-specific dispatch payload construction inside the handler
- simplified tool spec planning/building by passing `ToolInfo` directly
and dropping the direct/deferred MCP wrapper structs and the
parallel-server side table

## Testing
- `cargo check -p codex-core -p codex-mcp -p codex-otel`
- `cargo test -p codex-core
mcp_parallel_support_uses_exact_payload_server`
- `cargo test -p codex-core
direct_mcp_tools_register_namespaced_handlers`
- `cargo test -p codex-core
search_tool_description_lists_each_mcp_source_once`
- `cargo test -p codex-mcp
list_all_tools_uses_startup_snapshot_while_client_is_pending`
- `just fix -p codex-core -p codex-mcp -p codex-otel`

pakrym-oai · 2026-05-12 00:11:31 +00:00

ed5944ba1d

Add production startup and TTFT telemetry (#22198 )

## Why

While investigating `codex exec hi` startup latency, the useful
questions were not "is startup slow?" but "which durable bucket is slow
in production?"

The path we observed has a few distinct stages:

1. `thread/start` creates the session
2. startup prewarm builds the turn context, tools, and prompt
3. startup prewarm warms the websocket
4. the first real turn resolves the prewarm
5. the model produces the first token

Before this PR, production telemetry had some of the raw measurements
already:

- aggregate startup-prewarm duration / age-at-first-turn metrics
- TTFT as a metric
- websocket request telemetry

But there was no coherent production event stream for the startup
breakdown itself, and TTFT was metric-only. That made it hard to answer
the same latency questions from OpenTelemetry-backed logs without adding
one-off local instrumentation.

## What changed

Add durable production telemetry on the existing `SessionTelemetry`
path:

- new `codex.startup_phase` OTel log/trace events plus
`codex.startup.phase.duration_ms`
- new `codex.turn_ttft` OTel log/trace events while preserving the
existing TTFT metric

The startup phase event is emitted for the coarse buckets we actually
observed while running `exec hi`:

- `thread_start_create_thread`
- `startup_prewarm_total`
- `startup_prewarm_create_turn_context`
- `startup_prewarm_build_tools`
- `startup_prewarm_build_prompt`
- `startup_prewarm_websocket_warmup`
- `startup_prewarm_resolve`

These phases are intentionally low-cardinality so they remain safe as
production telemetry tags.
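
One way to keep such tags low-cardinality is a closed enum, sketched below. The phase names come from the list above; the `StartupPhase` type, `as_tag`, and `duration_ms` are hypothetical and not necessarily how `SessionTelemetry` models them.

```rust
use std::time::Duration;

/// Closed set of startup buckets, so the tag can never take arbitrary values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupPhase {
    ThreadStartCreateThread,
    StartupPrewarmTotal,
    StartupPrewarmCreateTurnContext,
    StartupPrewarmBuildTools,
    StartupPrewarmBuildPrompt,
    StartupPrewarmWebsocketWarmup,
    StartupPrewarmResolve,
}

impl StartupPhase {
    pub const ALL: [StartupPhase; 7] = [
        StartupPhase::ThreadStartCreateThread,
        StartupPhase::StartupPrewarmTotal,
        StartupPhase::StartupPrewarmCreateTurnContext,
        StartupPhase::StartupPrewarmBuildTools,
        StartupPhase::StartupPrewarmBuildPrompt,
        StartupPhase::StartupPrewarmWebsocketWarmup,
        StartupPhase::StartupPrewarmResolve,
    ];

    /// Tag value as listed in the PR description.
    pub fn as_tag(self) -> &'static str {
        match self {
            StartupPhase::ThreadStartCreateThread => "thread_start_create_thread",
            StartupPhase::StartupPrewarmTotal => "startup_prewarm_total",
            StartupPhase::StartupPrewarmCreateTurnContext => "startup_prewarm_create_turn_context",
            StartupPhase::StartupPrewarmBuildTools => "startup_prewarm_build_tools",
            StartupPhase::StartupPrewarmBuildPrompt => "startup_prewarm_build_prompt",
            StartupPhase::StartupPrewarmWebsocketWarmup => "startup_prewarm_websocket_warmup",
            StartupPhase::StartupPrewarmResolve => "startup_prewarm_resolve",
        }
    }
}

/// Converts an elapsed duration to whole milliseconds, saturating at u64::MAX.
pub fn duration_ms(elapsed: Duration) -> u64 {
    elapsed.as_millis().min(u64::MAX as u128) as u64
}
```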

## Why this shape

This keeps the instrumentation on the same production path as the rest
of the session telemetry instead of adding a local debug-only trace
mode. It also avoids changing startup behavior:

- prewarm still runs
- no control flow changes
- no extra remote calls
- no user-visible behavior changes

One boundary is intentional: very early process bootstrap that happens
before a session exists is not included here, because this PR uses
session-scoped production telemetry. The expensive buckets we were
trying to understand after `thread/start` are now covered durably.

## Verification

- `cargo test -p codex-otel`
- `cargo test -p codex-core turn_timing`
- `cargo test -p codex-core
regular_turn_emits_turn_started_without_waiting_for_startup_prewarm`
- `cargo test -p codex-core
interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted`
- `cargo test -p codex-app-server thread_start`
- `just fix -p codex-otel -p codex-core -p codex-app-server`

I also ran `cargo test -p codex-core`; it built successfully and then
hit an existing unrelated stack overflow in
`tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.

Matthew Zeng · 2026-05-11 23:58:36 +00:00

e15ecc9c35

Add process-scoped SQLite telemetry (#22154 )

## Summary
- add SQLite init, backfill-gate, and fallback telemetry without
introducing a cross-cutting state-db access wrapper
- install one process-scoped telemetry sink after OTEL startup and let
low-level state/rollout paths emit through it directly
- add process-start metrics for the process owners that initialize
SQLite

---------

Co-authored-by: Owen Lin <owen@openai.com>

jif-oai · 2026-05-11 11:32:40 -07:00

b401666ca5

codex-otel: validate provider span attributes consistently (#21749 )

Provider initialization installs process-global OTEL state, so invalid
trace metadata needs to fail before setup begins.

Use the same span attribute validator as config loading when traces are
exported so provider startup enforces the config contract without
duplicating validation logic.

bbrown-oai · 2026-05-08 08:20:49 -07:00

607b0dd1f0

codex-otel: add configurable trace metadata (#21556 )

Add Codex config for static trace span attributes and structured W3C
tracestate field upserts. The config flows through OtelSettings so
callers can attach trace metadata without touching every span call site.

Apply span attributes with an SDK span processor so every exported
trace span carries the configured metadata. Model tracestate as nested
member fields so configured keys can be upserted while unrelated
propagated state in the same member is preserved.

Validate configured tracestate before installing provider-global state,
including header-unsafe values the SDK does not reject by itself. This
keeps Codex from propagating malformed trace context from config.
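
The upsert behavior might look like the sketch below. The nested field encoding is an assumption made for illustration: fields inside one tracestate member value are taken to be `key:value` pairs separated by `;`. The helper names are hypothetical, and the safety check is a simplified version of the W3C value grammar (printable ASCII without `,` or `=`), further excluding the separators this sketch relies on.

```rust
/// Simplified safety check: non-blank printable ASCII, excluding the
/// tracestate delimiters (',' and '=') and the assumed nested-field
/// delimiters (';' and ':').
pub fn is_field_safe(text: &str) -> bool {
    text.bytes()
        .all(|b| (0x21..=0x7e).contains(&b) && !matches!(b, b',' | b'=' | b';' | b':'))
}

/// Upserts `key:value` into a member value while leaving unrelated
/// fields in the same member untouched.
pub fn upsert_field(member_value: &str, key: &str, value: &str) -> Result<String, String> {
    if key.is_empty() || !is_field_safe(key) || !is_field_safe(value) {
        return Err(format!("header-unsafe tracestate field: {:?}={:?}", key, value));
    }
    let prefix = format!("{}:", key);
    let updated = format!("{}:{}", key, value);
    let mut fields: Vec<String> = member_value
        .split(';')
        .filter(|field| !field.is_empty())
        .map(|field| field.to_string())
        .collect();
    match fields.iter().position(|field| field.starts_with(prefix.as_str())) {
        Some(index) => fields[index] = updated, // replace the configured key in place
        None => fields.push(updated),           // append when the key is new
    }
    Ok(fields.join(";"))
}
```

Validation runs before any state is changed, which mirrors the intent of rejecting bad config before provider-global state is installed.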

Update the config schema, public docs, and OTLP loopback coverage for
config parsing, span export, propagation, and invalid-header rejection.

bbrown-oai · 2026-05-07 16:06:57 -07:00

31b233c7c6

revert legacy notify deprecation (#21152 )

# Why

Revert #20524 for now because the computer use plugin has not migrated
off legacy `notify` yet. Keeping the deprecation in place today would
show users a warning before the plugin path is ready to move, so this
rolls the change back until that migration is complete.

# What

- revert the legacy `notify` deprecation change from #20524
- restore the prior `notify` behavior and remove the temporary
deprecation metrics/docs from that change

Once the computer use plugin has migrated, we can land the same
deprecation again.

Abhinav · 2026-05-05 10:34:44 -07:00

13be504063

Add goal lifecycle metrics (#20799 )

## Why

Adding goal metrics makes it possible to track how often goals are
created, completed, and stopped by budget limits, plus the final token
and wall-clock usage for terminal outcomes.

## What Changed

- Added OpenTelemetry metric constants for goal lifecycle tracking:
  - `codex.goal.created`: increments each time a new persisted goal is
    created or an existing goal is replaced with a new objective.
  - `codex.goal.completed`: increments when a goal transitions to
    `complete`.
  - `codex.goal.budget_limited`: increments when a goal transitions to
    `budget_limited` because its token budget has been reached.
  - `codex.goal.token_count`: records the final persisted token count when
    a goal transitions to `complete` or `budget_limited`.
  - `codex.goal.duration_s`: records the final persisted elapsed
    wall-clock time, in seconds, when a goal transitions to `complete` or
    `budget_limited`.
- Emitted creation metrics when a goal is created or replaced.
- Emitted terminal outcome counters and final usage histograms when a
goal transitions to `complete` or `budget_limited`, avoiding
double-counting later in-flight accounting for already budget-limited
goals.
- Added focused `codex-core` tests for create/complete metrics and
one-time budget-limit metrics.
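
The one-time emission rule can be sketched as a transition check. Everything here is a hypothetical in-memory stand-in (`GoalState`, `GoalMetrics`, `on_status_change`); the real code records through OpenTelemetry instruments rather than plain counters.

```rust
/// Hypothetical goal states relevant to terminal metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GoalState {
    Active,
    Complete,
    BudgetLimited,
}

/// In-memory stand-in for the counters and histograms listed above.
#[derive(Debug, Default)]
pub struct GoalMetrics {
    pub completed: u64,
    pub budget_limited: u64,
    pub token_counts: Vec<u64>,
    pub durations_s: Vec<u64>,
}

impl GoalMetrics {
    /// Records terminal metrics only when the status actually changes,
    /// so later accounting for an already budget-limited goal is ignored.
    pub fn on_status_change(
        &mut self,
        previous: GoalState,
        next: GoalState,
        tokens: u64,
        elapsed_s: u64,
    ) {
        if previous == next {
            return;
        }
        match next {
            GoalState::Complete => self.completed += 1,
            GoalState::BudgetLimited => self.budget_limited += 1,
            GoalState::Active => return,
        }
        self.token_counts.push(tokens);
        self.durations_s.push(elapsed_s);
    }
}
```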

Eric Traut · 2026-05-05 09:21:54 -07:00

91b7350187

feat: add remote compaction v2 Responses client path (#20773 )

## Why

This adds the `remote_compaction_v2` client path so remote compaction
can run through the normal Responses stream and install a
`context_compaction` item that triggers a compaction.

The goal is to migrate some of the compaction logic to the client side.

We keep the v2 transport behind a feature flag while letting follow-up
requests reuse the compacted context instead of falling back to the
legacy compaction item shape.

## What changed

- add `ResponseItem::ContextCompaction` and refresh the generated
app-server / schema / TypeScript fixtures that expose response items on
the wire
- add `core/src/compact_remote_v2.rs` to send compaction through the
standard streamed Responses client, require exactly one
`context_compaction` output item, and install that item into compacted
history
- route manual compact and auto-compaction through the v2 path when
`remote_compaction_v2` is enabled, while keeping the existing remote
compaction path as the fallback
- preserve the new item type across history retention, follow-up request
construction, telemetry, rollout persistence, and rollout-trace
normalization
- add targeted coverage for the feature flag, `context_compaction`
serialization, rollout-trace normalization, and remote-compaction
follow-up behavior

## Verification

- added protocol tests for `context_compaction`
serialization/deserialization in `protocol/src/models.rs`
- added rollout-trace coverage for `context_compaction` normalization in
`rollout-trace/src/reducer/conversation_tests.rs`
- added remote compaction integration coverage for v2 follow-up reuse
and mixed compaction output streams in
`core/tests/suite/compact_remote.rs`

---------

Co-authored-by: Codex <noreply@openai.com>

jif-oai · 2026-05-04 14:15:01 +02:00

d927f61208

deprecate legacy notify (#20524 )

# Why

`notify` is the remaining compatibility surface from the legacy hook
implementation. The newer lifecycle hook engine now owns the active hook
system, so we should start steering users away from adding new `notify`
configs before removing the old path entirely. This also adds a
lightweight watchpoint for the deprecation so we can see how much legacy
usage remains before the clean drop.

# What

- emit a startup deprecation notice when a non-empty `notify` command is
configured
- emit `codex.notify.configured` when a session starts with legacy
`notify` configured
- emit `codex.notify.run` when the legacy notify path fires after a
completed turn
- mark `notify` as deprecated in the config schema and repo docs
- remove the orphaned `codex-rs/hooks/src/user_notification.rs` file
that is no longer compiled
- add regression coverage for the new deprecation notice

# Next steps

A follow-up PR can remove the legacy notify path entirely once we are
ready for the clean drop. Before then, we can watch
`codex.notify.configured` and `codex.notify.run` to understand the
deprecation impact and remaining active usage. The cleanup PR should
then delete the `notify` config field, the `legacy_notify`
implementation, the old compatibility dispatch types and callsites that
only exist for the legacy path, and the remaining compatibility
docs/tests.

# Testing

- `cargo test -p codex-hooks`
- `cargo test -p codex-config`
- `cargo test -p codex-core emits_deprecation_notice_for_notify`

Abhinav · 2026-05-01 17:35:21 +00:00

78baa20780

install WFP filters for Windows sandbox setup (#20101 )

## Summary

This PR installs a first wave of WFP (Windows Filtering Platform)
filters that reduce the surface area of network egress vulnerabilities
for the Windows Sandbox.

- Add persistent Windows Filtering Platform provider, sublayer, and
filters for the Windows sandbox offline account.
- Install WFP filters during elevated full setup, log failures
non-fatally, and emit setup metrics when analytics are enabled.
- Bump the Windows sandbox setup version so existing users rerun full
setup and receive the new filters.

## What WFP is
Windows Filtering Platform (WFP) is the low-level Windows networking
policy engine underneath things like Windows Firewall. It lets
privileged code install persistent filtering rules at specific network
stack layers, with conditions like "only traffic from this Windows
account" or "only this remote port," and an action like block.

In this change, we create a Codex-owned persistent WFP provider and
sublayer, then install block filters scoped to the Windows sandbox's
offline user account via `ALE_USER_ID`. That means the filters are
targeted at sandboxed processes running as that account, rather than
globally affecting the host.

## Initial filter set
We are starting with 12 concrete WFP filters across a few high-value
bypass surfaces. The table below describes the filter families rather
than one filter per row:

| Area | Concrete filters | Purpose |
| --- | --- | --- |
| ICMP | 4 filters: ICMP v4/v6 on `ALE_AUTH_CONNECT` and
`ALE_RESOURCE_ASSIGNMENT` | Block direct ping-style network reachability
checks from the offline account. |
| DNS | 2 filters: remote port `53` on `ALE_AUTH_CONNECT_V4/V6` | Block
direct DNS queries that bypass our intended proxy/offline path. |
| DNS-over-TLS | 2 filters: remote port `853` on
`ALE_AUTH_CONNECT_V4/V6` | Block encrypted DNS attempts that could
bypass ordinary DNS interception. |
| SMB / NetBIOS | 4 filters: remote ports `445` and `139` on
`ALE_AUTH_CONNECT_V4/V6` | Block Windows file-sharing/network share
traffic from sandboxed processes. |

For IPv4/IPv6 coverage, the port-based filters are installed on both
`ALE_AUTH_CONNECT_V4` and `ALE_AUTH_CONNECT_V6`. ICMP also gets both
connect-layer and resource-assignment-layer coverage because ICMP
traffic is shaped differently from ordinary TCP/UDP port traffic.
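
To make the count of 12 concrete, the table can be expanded into one entry per filter. This is only a bookkeeping sketch: `FilterSpec` and `initial_filters` are hypothetical, the layer strings follow the abbreviated names used above, and the `_V4`/`_V6` suffix on the resource-assignment layer is an assumption. No real WFP API is called.

```rust
/// One planned filter: which family it belongs to, the layer it is
/// installed on, and the remote port condition if any.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FilterSpec {
    pub family: &'static str,
    pub layer: String,
    pub remote_port: Option<u16>,
}

pub fn initial_filters() -> Vec<FilterSpec> {
    let mut filters = Vec::new();
    // ICMP: two layers x two IP versions = 4 filters.
    for base in ["ALE_AUTH_CONNECT", "ALE_RESOURCE_ASSIGNMENT"] {
        for ip in ["V4", "V6"] {
            filters.push(FilterSpec {
                family: "ICMP",
                layer: format!("{}_{}", base, ip),
                remote_port: None,
            });
        }
    }
    // Port-based families: each port x two IP versions on the connect layer.
    let port_rules = [
        ("DNS", 53u16),
        ("DNS-over-TLS", 853),
        ("SMB / NetBIOS", 445),
        ("SMB / NetBIOS", 139),
    ];
    for (family, port) in port_rules {
        for ip in ["V4", "V6"] {
            filters.push(FilterSpec {
                family,
                layer: format!("ALE_AUTH_CONNECT_{}", ip),
                remote_port: Some(port),
            });
        }
    }
    filters
}
```

The total is 4 (ICMP) + 2 (DNS) + 2 (DNS-over-TLS) + 4 (SMB / NetBIOS) = 12.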

## Validation
- `cargo fmt -p codex-windows-sandbox` (completed with existing
stable-rustfmt warnings about `imports_granularity = Item`)
- `cargo test -p codex-windows-sandbox wfp::tests`
- `cargo test -p codex-windows-sandbox` (fails in existing legacy
PowerShell sandbox tests because `Microsoft.PowerShell.Utility` could
not be loaded; WFP tests passed before that failure)

iceweasel-oai · 2026-04-30 12:39:01 -07:00

8121710ffe

[codex] Add token usage to turn tracing spans (#19432 )

## Why

Slow Codex turns are easier to debug when token usage is visible in the
trace itself, without joining against separate analytics. This adds
token usage to existing turn-handling spans for regular user turns only.

[Example
turn](https://openai.datadoghq.com/apm/trace/9d353efa2cb5de1f4c5b93dc33c3df04?colorBy=service&graphType=flamegraph&shouldShowLegend=true&sort=time&spanID=3555541504891512675&spanViewType=metadata&traceQuery=)
<img width="1447" height="967" alt="Screenshot 2026-04-24 at 3 03 07 PM"
src="https://github.com/user-attachments/assets/ab7bb187-e7fc-41f0-a366-6c44610b2b2c"
/>

## What Changed

Added response-level token fields on completed `handle_responses` spans:

- `gen_ai.usage.input_tokens`
- `gen_ai.usage.cache_read.input_tokens`
- `gen_ai.usage.output_tokens`
- `codex.usage.reasoning_output_tokens`
- `codex.usage.total_tokens`

Added aggregate token fields on regular turn spans:

- `codex.turn.token_usage.*`

Added an explicit regular-turn opt-in via
`SessionTask::records_turn_token_usage_on_span()` so this is not coupled
to span-name strings.
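
As an illustration of the response-level fields, the mapping from a usage record to span attributes could look like this. The `TokenUsage` struct and `response_span_fields` function are hypothetical; only the attribute keys come from this description.

```rust
/// Hypothetical usage record for one completed response.
#[derive(Debug, Clone, Copy, Default)]
pub struct TokenUsage {
    pub input_tokens: u64,
    pub cached_input_tokens: u64,
    pub output_tokens: u64,
    pub reasoning_output_tokens: u64,
    pub total_tokens: u64,
}

/// Pairs each span attribute key with its value from the usage record.
pub fn response_span_fields(usage: &TokenUsage) -> [(&'static str, u64); 5] {
    [
        ("gen_ai.usage.input_tokens", usage.input_tokens),
        ("gen_ai.usage.cache_read.input_tokens", usage.cached_input_tokens),
        ("gen_ai.usage.output_tokens", usage.output_tokens),
        ("codex.usage.reasoning_output_tokens", usage.reasoning_output_tokens),
        ("codex.usage.total_tokens", usage.total_tokens),
    ]
}
```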

## Testing

- `cargo test -p codex-otel`
- `cargo test -p codex-core
turn_and_completed_response_spans_record_token_usage`
- `just fmt`
- `just fix -p codex-core`
- `just fix -p codex-otel`
- Manual local Electron/app-server smoke test: regular user turn emits
the new span fields

Known status: `cargo test -p codex-core` was attempted and failed in
unrelated existing areas: config approvals, request-permissions,
git-info ordering, and subagent metadata persistence.

charley-openai · 2026-04-28 11:41:32 -07:00

de2ccf9473

Remove ghost snapshots (#19481 )

## Summary
- Remove `ghost_snapshot` / `GhostCommit` from the Responses API surface
and generated SDK/schema artifacts.
- Keep legacy config loading compatible, but make undo a no-op that
reports the feature is unavailable.
- Clean up core history, compaction, telemetry, rollout, and tests to
stop carrying ghost snapshot items.

## Testing
- Unit tests passed for `codex-protocol`, `codex-core` targeted undo and
compaction flows, `codex-rollout`, and `codex-app-server-protocol`.
- Regenerated config and app-server schemas plus Python SDK artifacts
and verified they match the checked-in outputs.

pakrym-oai · 2026-04-27 18:48:57 -07:00

4e05f3053c

Add safety check notification and error handling (#19055 )

Adds a new app-server notification that fires when a user account has
been flagged for potential safety reasons.

Eric Traut · 2026-04-22 22:24:12 -07:00

bbff4ee61a

feat: Fairly trim skill descriptions within context budget (#18925 )

Preserve skill name/path entries whenever possible and trim descriptions
first, using round-robin character allocation so short descriptions do
not waste budget.
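
Round-robin character allocation can be sketched as below. This is a simple illustration of the idea, not the implementation in this PR: `allocate_round_robin` and `truncate_chars` are hypothetical names, and the loop hands out one character at a time for clarity rather than speed.

```rust
/// Splits `budget` characters across descriptions one character per
/// turn, skipping descriptions that are already fully covered. Short
/// descriptions stop consuming budget as soon as they fit.
pub fn allocate_round_robin(lengths: &[usize], budget: usize) -> Vec<usize> {
    let mut allocated = vec![0usize; lengths.len()];
    let mut remaining = budget;
    while remaining > 0 {
        let mut progressed = false;
        for index in 0..lengths.len() {
            if remaining == 0 {
                break;
            }
            if allocated[index] < lengths[index] {
                allocated[index] += 1;
                remaining -= 1;
                progressed = true;
            }
        }
        // Every description is fully covered; leftover budget is unused.
        if !progressed {
            break;
        }
    }
    allocated
}

/// Trims a description to at most `limit` characters (not bytes).
pub fn truncate_chars(description: &str, limit: usize) -> String {
    description.chars().take(limit).collect()
}
```

With lengths `[2, 10, 10]` and a budget of 12, the short description keeps both of its characters and the remaining 10 are split evenly, giving `[2, 5, 5]`.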

xl-openai · 2026-04-22 12:33:29 -07:00

b77791c228

feat: add explicit AgentIdentity auth mode (#18785 )

## Summary

This PR adds `CodexAuth::AgentIdentity` as an explicit auth mode.

An AgentIdentity auth record is a standalone `auth.json` mode. When
`AuthManager::auth().await` loads that mode, it registers one
process-scoped task and stores it in runtime-only state on the auth
value. Header creation stays synchronous after that because the task is
initialized before callers receive the auth object.

This PR also removes the old feature flag path. AgentIdentity is
selected by explicit auth mode, not by a hidden flag or lazy mutation of
ChatGPT auth records.

Reference old stack: https://github.com/openai/codex/pull/17387/changes

## Design Decisions

- AgentIdentity is a real auth enum variant because it can be the only
credential in `auth.json`.
- The process task is ephemeral runtime state. It is not serialized and
is not stored in rollout/session data.
- Account/user metadata needed by existing Codex backend checks lives on
the AgentIdentity record for now.
- `is_chatgpt_auth()` remains token-specific.
- `uses_codex_backend()` is the broader predicate for ChatGPT-token auth
and AgentIdentity auth.
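
The relationship between the two predicates can be sketched with a simplified enum. `AuthMode` here is a hypothetical stand-in for `CodexAuth`, and the `ApiKey` and `ChatGpt` variants are assumptions added for contrast.

```rust
/// Simplified, hypothetical view of the auth modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    ApiKey,        // assumed non-backend mode, for contrast
    ChatGpt,       // assumed name for ChatGPT-token auth
    AgentIdentity, // explicit mode added by this PR
}

impl AuthMode {
    /// Stays token-specific: true only for ChatGPT-token auth.
    pub fn is_chatgpt_auth(self) -> bool {
        matches!(self, AuthMode::ChatGpt)
    }

    /// Broader predicate: ChatGPT-token auth and AgentIdentity auth.
    pub fn uses_codex_backend(self) -> bool {
        matches!(self, AuthMode::ChatGpt | AuthMode::AgentIdentity)
    }
}
```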

## Stack

1. https://github.com/openai/codex/pull/18757: full revert
2. https://github.com/openai/codex/pull/18871: isolated Agent Identity
crate
3. This PR: explicit AgentIdentity auth mode and startup task allocation
4. https://github.com/openai/codex/pull/18811: migrate Codex backend
auth callsites through AuthProvider
5. https://github.com/openai/codex/pull/18904: accept AgentIdentity JWTs
and load `CODEX_AGENT_IDENTITY`

## Testing

Tests: targeted Rust checks, cargo-shear, Bazel lock check, and CI.

efrazer-oai · 2026-04-21 22:33:24 -07:00

69c8913e24

feat: add metric to track the number of turns with memory usage (#18662 )

Add a metric `codex.turn.memory` to know whether a turn used memories.
It is a separate metric rather than a label on the other turn metrics,
to limit cardinality.

jif-oai · 2026-04-20 14:31:22 +01:00

2c59806fe0

feat: Budget skill metadata and surface trimming as a warning (#18298 )

Cap the model-visible skills section to a small share of the context
window, with a fallback character budget, and keep only as many implicit
skills as fit within that budget.

Emit a non-fatal warning when enabled skills are omitted, and add a new
app-server warning notification.

Record thread-start skill metrics for total enabled skills, kept skills,
and whether truncation happened.

---------

Co-authored-by: Matthew Zeng <mzeng@openai.com>
Co-authored-by: Codex <noreply@openai.com>

xl-openai · 2026-04-17 18:11:47 -07:00

3f7222ec76

Stream apply_patch changes (#17862 )

Adds new events for streaming apply_patch changes from the Responses API.
This is to enable clients to show progress during file writes.

Caveat: This does not work with apply_patch in function call mode, since
that would require adding streaming JSON parsing.

Akshay Nathan · 2026-04-16 18:12:19 -07:00

7995c66032

Add OTEL metrics for hook runs (#18026 )

# Why
We already emit analytics for completed hook runs, but we don't have
matching OTEL metrics to track hook volume and latency.

# What
- add `codex.hooks.run` and `codex.hooks.run.duration_ms`
- tag both metrics with `hook_name`, `source`, and `status`
- emit the metrics from the completed hook path

Verified locally against a dummy OTLP collector

---------

Co-authored-by: Codex <noreply@openai.com>

Abhinav · 2026-04-16 21:30:38 +00:00

d9c71d41a9

[codex] reduce module visibility (#16978 )

## Summary
- reduce public module visibility across Rust crates, preferring private
or crate-private modules with explicit crate-root public exports
- update external call sites and tests to use the intended public crate
APIs instead of reaching through module trees
- add the module visibility guideline to AGENTS.md

## Validation
- `cargo check --workspace --all-targets --message-format=short` passed
before the final fix/format pass
- `just fix` completed successfully
- `just fmt` completed successfully
- `git diff --check` passed

pakrym-oai · 2026-04-07 08:03:35 -07:00

413c1e1fdf

Codex/windows bazel rust test coverage no rs (#16528 )

# Why this PR exists

This PR is trying to fix a coverage gap in the Windows Bazel Rust test
lane.

Before this change, the Windows `bazel test //...` job was nominally
part of PR CI, but a non-trivial set of `//codex-rs/...` Rust test
targets did not actually contribute test signal on Windows. In
particular, targets such as `//codex-rs/core:core-unit-tests`,
`//codex-rs/core:core-all-test`, and `//codex-rs/login:login-unit-tests`
were incompatible during Bazel analysis on the Windows gnullvm platform,
so they never reached test execution there. That is why the
Cargo-powered Windows CI job could surface Windows-only failures that
the Bazel-powered job did not report: Cargo was executing those tests,
while Bazel was silently dropping them from the runnable target set.

The main goal of this PR is to make the Windows Bazel test lane execute
those Rust test targets instead of skipping them during analysis, while
still preserving `windows-gnullvm` as the target configuration for the
code under test. In other words: use an MSVC host/exec toolchain where
Bazel helper binaries and build scripts need it, but continue compiling
the actual crate targets with the Windows gnullvm cfgs that our current
Bazel matrix is supposed to exercise.

# Important scope note

This branch intentionally removes the non-resource-loading `.rs` test
and production-code changes from the earlier
`codex/windows-bazel-rust-test-coverage` branch. The only Rust source
changes kept here are runfiles/resource-loading fixes in TUI tests:

- `codex-rs/tui/src/chatwidget/tests.rs`
- `codex-rs/tui/tests/manager_dependency_regression.rs`

That is deliberate. Since the corresponding tests already pass under
Cargo, this PR is meant to test whether Bazel infrastructure/toolchain
fixes alone are enough to get a healthy Windows Bazel test signal,
without changing test behavior for Windows timing, shell output, or
SQLite file-locking.

# How this PR changes the Windows Bazel setup

## 1. Split Windows host/exec and target concerns in the Bazel test lane

The core change is that the Windows Bazel test job now opts into an MSVC
host platform for Bazel execution-time tools, but only for `bazel test`,
not for the Bazel clippy build.

Files:

- `.github/workflows/bazel.yml`
- `.github/scripts/run-bazel-ci.sh`
- `MODULE.bazel`

What changed:

- `run-bazel-ci.sh` now accepts `--windows-msvc-host-platform`.
- When that flag is present on Windows, the wrapper appends
`--host_platform=//:local_windows_msvc` unless the caller already
provided an explicit `--host_platform`.
- `bazel.yml` passes that wrapper flag only for the Windows `bazel test
//...` job.
- The Bazel clippy job intentionally does **not** pass that flag, so
clippy stays on the default Windows gnullvm host/exec path and continues
linting against the target cfgs we care about.
- `run-bazel-ci.sh` also now forwards `CODEX_JS_REPL_NODE_PATH` on
Windows and normalizes the `node` executable path with `cygpath -w`, so
tests that need Node resolve the runner's Node installation correctly
under the Windows Bazel test environment.
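
The flag-handling rule can be sketched as follows. The actual wrapper is a shell script; this version only illustrates the decision logic, and the function and parameter names are hypothetical.

```rust
/// Appends the MSVC host platform flag when the wrapper flag was passed
/// on Windows and the caller did not already choose a host platform.
pub fn with_msvc_host_platform(
    args: &[&str],
    msvc_host_requested: bool,
    on_windows: bool,
) -> Vec<String> {
    let mut out: Vec<String> = args.iter().map(|arg| arg.to_string()).collect();
    let has_explicit = args
        .iter()
        .any(|arg| *arg == "--host_platform" || arg.starts_with("--host_platform="));
    if on_windows && msvc_host_requested && !has_explicit {
        out.push("--host_platform=//:local_windows_msvc".to_string());
    }
    out
}
```

Because the clippy job never passes the wrapper flag, it takes the `msvc_host_requested == false` path and its arguments pass through unchanged.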

Why this helps:

- The original incompatibility chain was mostly on the **exec/tool**
side of the graph, not in the Rust test code itself. Moving host tools
to MSVC lets Bazel resolve helper binaries and generators that were not
viable on the gnullvm exec platform.
- Keeping the target platform on gnullvm preserves cfg coverage for the
crates under test, which is important because some Windows behavior
differs between `msvc` and `gnullvm`.

## 2. Teach the repo's Bazel Rust macro about Windows link flags and
integration-test knobs

Files:

- `defs.bzl`
- `codex-rs/core/BUILD.bazel`
- `codex-rs/otel/BUILD.bazel`
- `codex-rs/tui/BUILD.bazel`

What changed:

- Replaced the old gnullvm-only linker flag block with
`WINDOWS_RUSTC_LINK_FLAGS`, which now handles both Windows ABIs:
  - gnullvm gets `-C link-arg=-Wl,--stack,8388608`
  - MSVC gets `-C link-arg=/STACK:8388608`,
    `-C link-arg=/NODEFAULTLIB:libucrt.lib`, and `-C link-arg=ucrt.lib`
- Threaded those Windows link flags into generated `rust_binary`,
unit-test binaries, and integration-test binaries.
- Extended `codex_rust_crate(...)` with:
  - `integration_test_args`
  - `integration_test_timeout`
- Used those new knobs to:
  - mark `//codex-rs/core:core-all-test` as a long-running integration
    test
  - serialize `//codex-rs/otel:otel-all-test` with `--test-threads=1`
- Added `src/**/*.rs` to `codex-rs/tui` test runfiles, because one
regression test scans source files at runtime and Bazel does not expose
source-tree directories unless they are declared as data.
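
For reference, the stack size in those flags is 8 MiB (8 × 1024 × 1024 = 8388608 bytes). The per-ABI selection is written in Starlark in `defs.bzl`; the sketch below restates it in Rust purely as an illustration, with hypothetical type and function names.

```rust
/// The two Windows ABIs handled by the link flag block.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WindowsAbi {
    Gnullvm,
    Msvc,
}

/// 8 MiB stack, matching the value in the flags above.
pub const STACK_BYTES: u64 = 8 * 1024 * 1024;

/// Returns the rustc flags listed above for the given ABI.
pub fn windows_rustc_link_flags(abi: WindowsAbi) -> Vec<String> {
    match abi {
        WindowsAbi::Gnullvm => vec![format!("-C link-arg=-Wl,--stack,{}", STACK_BYTES)],
        WindowsAbi::Msvc => vec![
            format!("-C link-arg=/STACK:{}", STACK_BYTES),
            "-C link-arg=/NODEFAULTLIB:libucrt.lib".to_string(),
            "-C link-arg=ucrt.lib".to_string(),
        ],
    }
}
```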

Why this helps:

- Once host-side MSVC tools are available, we still need the generated
Rust test binaries to link correctly on Windows. The MSVC-side
stack/UCRT flags make those binaries behave more like their Cargo-built
equivalents.
- The integration-test macro knobs avoid hardcoding one-off test
behavior in ad hoc BUILD rules and make the generated test targets more
expressive where Bazel and Cargo have different runtime defaults.

## 3. Patch `rules_rs` / `rules_rust` so Windows MSVC exec-side Rust and
build scripts are actually usable

Files:

- `MODULE.bazel`
- `patches/rules_rs_windows_exec_linker.patch`
- `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch`
- `patches/rules_rust_windows_build_script_runner_paths.patch`
- `patches/rules_rust_windows_exec_msvc_build_script_env.patch`
- `patches/rules_rust_windows_msvc_direct_link_args.patch`
- `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch`
- `patches/BUILD.bazel`

What these patches do:

- `rules_rs_windows_exec_linker.patch`
  - Adds a `rust-lld` filegroup for Windows Rust toolchain repos,
    symlinked to `lld-link.exe` from `PATH`.
  - Marks Windows toolchains as using a direct linker driver.
  - Supplies Windows stdlib link flags for both gnullvm and MSVC.
- `rules_rust_windows_bootstrap_process_wrapper_linker.patch`
  - For Windows MSVC Rust targets, prefers the Rust toolchain linker over
    an inherited C++ linker path like `clang++`.
  - This specifically avoids the broken mixed-mode command line where
    rustc emits MSVC-style `/NOLOGO` / `/LIBPATH:` / `/OUT:` arguments but
    Bazel still invokes `clang++.exe`.
- `rules_rust_windows_build_script_runner_paths.patch`
  - Normalizes forward-slash execroot-relative paths into Windows path
    separators before joining them on Windows.
  - Uses short Windows paths for `RUSTC`, `OUT_DIR`, and the build-script
    working directory to avoid path-length and quoting issues in third-party
    build scripts.
  - Exposes `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER=1` to build scripts so
    crate-local patches can detect "this is running under Bazel's
    build-script runner".
  - Fixes the Windows runfiles cleanup filter so generated files with
    retained suffixes are actually retained.
- `rules_rust_windows_exec_msvc_build_script_env.patch`
  - For exec-side Windows MSVC build scripts, stops force-injecting
    Bazel's `CC`, `CXX`, `LD`, `CFLAGS`, and `CXXFLAGS` when that would send
    GNU-flavored tool paths/flags into MSVC-oriented Cargo build scripts.
  - Rewrites or strips GNU-only `--sysroot`, MinGW include/library paths,
    stack-protector, and `_FORTIFY_SOURCE` flags on the MSVC exec path.
  - The practical effect is that build scripts can fall back to the Visual
    Studio toolchain environment already exported by CI instead of crashing
    inside Bazel's hermetic `clang.exe` setup.
- `rules_rust_windows_msvc_direct_link_args.patch`
  - When using a direct linker on Windows, stops forwarding GNU driver
    flags such as `-L...` and `--sysroot=...` that `lld-link.exe` does not
    understand.
  - Passes non-`.lib` native artifacts as explicit `-Clink-arg=<path>`
    entries when needed.
  - Filters C++ runtime libraries to `.lib` artifacts on the Windows
    direct-driver path.
- `rules_rust_windows_process_wrapper_skip_temp_outputs.patch`
  - Excludes transient `*.tmp*` and `*.rcgu.o` files from process-wrapper
    dependency search-path consolidation, so unstable compiler outputs do
    not get treated as real link search-path inputs.
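
To make the direct-linker argument filtering concrete, here is a minimal Rust sketch of the idea behind `rules_rust_windows_msvc_direct_link_args.patch`. It is not the patch itself (the real change lives in the `rules_rust` Starlark rules); the function names `is_gnu_driver_flag` and `filter_direct_link_args` are hypothetical, and the sketch only covers the two flag shapes named above.

```rust
/// Returns true for GNU-driver-style flags that a direct MSVC-style linker
/// such as `lld-link.exe` does not understand. Only the two shapes named in
/// the description (`-L...` and `--sysroot=...`) are covered here.
fn is_gnu_driver_flag(arg: &str) -> bool {
    arg.starts_with("-L") || arg.starts_with("--sysroot=")
}

/// Hypothetical helper: keeps only the arguments that are safe to forward
/// when the linker is invoked directly instead of through a GNU-style driver.
fn filter_direct_link_args(args: &[&str]) -> Vec<String> {
    let mut kept = Vec::new();
    for arg in args {
        if !is_gnu_driver_flag(arg) {
            kept.push(arg.to_string());
        }
    }
    kept
}
```

With this sketch, an input of `-Lfoo/lib`, `--sysroot=/opt/sysroot`, `/NOLOGO`, `kernel32.lib` keeps only the last two entries.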

Why this helps:

- The host-platform split alone was not enough. Once Bazel started
  analyzing/running previously incompatible Rust tests on Windows, the
  next failures were in toolchain plumbing:
  - MSVC-targeted Rust tests were being linked through `clang++` with
    MSVC-style arguments.
  - Cargo build scripts running under Bazel's Windows MSVC exec platform
    were handed Unix/GNU-flavored path and flag shapes.
  - Some generated paths were too long or had path-separator forms that
    third-party Windows build scripts did not tolerate.
- These patches make that mixed Bazel/Cargo/Rust/MSVC path workable
  enough for the test lane to actually build and run the affected crates.
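
As an illustration of the path-separator problem, the following Rust sketch normalizes a forward-slash, execroot-relative path before joining it onto a Windows root. It is string-based so it behaves the same on any host; the helper names and the example paths are assumptions for illustration, not code taken from `rules_rust_windows_build_script_runner_paths.patch`.

```rust
/// Converts a forward-slash, execroot-relative path into Windows separators.
fn to_windows_separators(relative: &str) -> String {
    relative.replace('/', "\\")
}

/// Hypothetical helper: joins an execroot and a relative path so that the
/// result uses backslashes only, avoiding mixed-separator paths.
fn join_windows(execroot: &str, relative: &str) -> String {
    // Trim a trailing backslash so the join never produces a doubled separator.
    let root = execroot.trim_end_matches('\\');
    format!("{}\\{}", root, to_windows_separators(relative))
}
```

For example, joining `C:\b\execroot\_main` with `bazel-out/x64_windows-opt/bin/pkg` yields a single backslash-separated path rather than a mixed-separator one.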

## 4. Patch third-party crate build scripts that were not robust under Bazel's Windows MSVC build-script path

Files:

- `MODULE.bazel`
- `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch`
- `patches/ring_windows_msvc_include_dirs.patch`
- `patches/zstd-sys_windows_msvc_include_dirs.patch`

What changed:

- `aws-lc-sys`
  - Detects Bazel's Windows MSVC build-script runner via
    `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER` or a `bazel-out` manifest-dir
    path.
  - Uses `clang-cl` for Bazel Windows MSVC builds when no explicit
    `CC`/`CXX` is set.
  - Allows prebuilt NASM on the Bazel Windows MSVC path even when `nasm`
    is not available directly in the runner environment.
  - Avoids canonicalizing `CARGO_MANIFEST_DIR` in the Bazel Windows MSVC
    case, because that path may point into Bazel output/runfiles state where
    preserving the given path is more reliable than forcing a local
    filesystem canonicalization.
- `ring`
  - Under the Bazel Windows MSVC build-script runner, copies the
    pregenerated source tree into `OUT_DIR` and uses that as the
    generated-source root.
  - Adds include paths needed by MSVC compilation for
    Fiat/curve25519/P-256 generated headers.
  - Rewrites a few relative includes in C sources so the added include
    directories are sufficient.
- `zstd-sys`
  - Adds MSVC-only include directories for `compress`, `decompress`, and
    feature-gated dictionary/legacy/seekable sources.
  - Skips `-fvisibility=hidden` on MSVC targets, where that
    GCC/Clang-style flag is not the right mechanism.
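
The runner detection described for `aws-lc-sys` can be sketched roughly as follows. This is an assumption-laden illustration, not the patched `build.rs`: the function name is hypothetical, and it treats mere presence of the variable as the signal, whereas the real patch may check the value.

```rust
/// Hypothetical detection helper. `runner_env` stands for the value of
/// `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER` (if set) and `manifest_dir` for
/// `CARGO_MANIFEST_DIR`; both are passed in so the logic is easy to test.
fn is_bazel_build_script_runner(runner_env: Option<&str>, manifest_dir: &str) -> bool {
    // Assumption: any value of the variable counts as "running under Bazel".
    let flagged = runner_env.is_some();
    // Fallback probe: a `bazel-out` component anywhere in the manifest dir,
    // accepting either separator style.
    let in_bazel_out = manifest_dir
        .split(|c: char| c == '/' || c == '\\')
        .any(|component| component == "bazel-out");
    flagged || in_bazel_out
}
```

In a build script, the two inputs would typically come from `std::env::var("RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER").ok()` and `std::env::var("CARGO_MANIFEST_DIR")`.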

Why this helps:

- After the `rules_rust` plumbing started running build scripts on the
Windows MSVC exec path, some third-party crates still failed for
crate-local reasons: wrong compiler choice, missing include directories,
build-script assumptions about manifest paths, or Unix-only C compiler
flags.
- These crate patches address those crate-local assumptions so the
larger toolchain change can actually reach first-party Rust test
execution.

## 5. Limit the `.rs` test changes to Bazel/Cargo runfiles parity

Files:

- `codex-rs/tui/src/chatwidget/tests.rs`
- `codex-rs/tui/tests/manager_dependency_regression.rs`

What changed:

- Instead of asking `find_resource!` for a directory runfile like
`src/chatwidget/snapshots` or `src`, these tests now resolve one known
file runfile first and then walk to its parent directory.

Why this helps:

- Bazel runfiles are more reliable for explicitly declared files than
for source-tree directories that happen to exist in a Cargo checkout.
- This keeps the tests working under both Cargo and Bazel without
changing their actual assertions.
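
A minimal sketch of the "resolve a file, then walk to its parent" pattern, using only `std::path`. The helper name and the file name `example.snap` are hypothetical; in the real tests the file path would come from `find_resource!` rather than a literal.

```rust
use std::path::{Path, PathBuf};

/// Given the resolved path of one known file, returns the directory that
/// contains it. Returns `None` only when the path has no parent component.
fn dir_of_known_file(known_file: &Path) -> Option<PathBuf> {
    known_file.parent().map(Path::to_path_buf)
}
```

Resolving `src/chatwidget/snapshots/example.snap` this way gives `src/chatwidget/snapshots` without ever asking the runfiles layer for a directory.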

# What we tried before landing on this shape, and why those attempts did not work

## Attempt 1: Force `--host_platform=//:local_windows_msvc` for all Windows Bazel jobs

This did make the previously incompatible test targets show up during
analysis, but it also pushed the Bazel clippy job and some unrelated
build actions onto the MSVC exec path.

Why that was bad:

- Windows clippy started running third-party Cargo build scripts with
Bazel's MSVC exec settings and crashed in crates such as `tree-sitter`
and `libsqlite3-sys`.
- That was a regression in a job that was previously giving useful
gnullvm-targeted lint signal.

What this PR does instead:

- The wrapper flag is opt-in, and `bazel.yml` uses it only for the
Windows `bazel test` lane.
- The clippy lane stays on the default Windows gnullvm host/exec
configuration.

## Attempt 2: Broaden the `rules_rust` linker override to all Windows Rust actions

This fixed the MSVC test-lane failure where normal `rust_test` targets
were linked through `clang++` with MSVC-style arguments, but it broke
the default gnullvm path.

Why that was bad:

- `@@rules_rs++rules_rust+rules_rust//util/process_wrapper:process_wrapper`
  on the gnullvm exec platform started linking with `lld-link.exe` and
  then failed to resolve MinGW-style libraries such as `-lkernel32`,
  `-luser32`, and `-lmingw32`.

What this PR does instead:

- The linker override is restricted to Windows MSVC targets only.
- The gnullvm path keeps its original linker behavior, while MSVC uses
the direct Windows linker.
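
The scoping rule can be pictured with a small Rust sketch. The actual override is implemented in the `rules_rust` patches, not in Rust; `LinkerChoice` and `choose_linker` are hypothetical names, and keying the decision on the target triple suffix is an assumption made for illustration.

```rust
/// Hypothetical model of the two linker behaviors discussed above.
#[derive(Debug, PartialEq)]
enum LinkerChoice {
    /// Invoke the Windows linker (for example `lld-link.exe`) directly.
    DirectWindowsLinker,
    /// Keep whatever linker the toolchain already used.
    ToolchainDefault,
}

/// Applies the override only to Windows MSVC targets, so gnullvm targets
/// keep resolving MinGW-style libraries through their original linker.
fn choose_linker(target_triple: &str) -> LinkerChoice {
    if target_triple.ends_with("-windows-msvc") {
        LinkerChoice::DirectWindowsLinker
    } else {
        LinkerChoice::ToolchainDefault
    }
}
```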

## Attempt 3: Keep everything on pure Windows gnullvm and patch the V8 / Python incompatibility chain instead

This would have preserved a single Windows ABI everywhere, but it is a
much larger project than this PR.

Why that was not the practical first step:

- The original incompatibility chain ran through exec-side generators
and helper tools, not only through crate code.
- `third_party/v8` is already special-cased on Windows gnullvm because
`rusty_v8` only publishes Windows prebuilts under MSVC names.
- Fixing that path likely means deeper changes in
V8/rules_python/rules_rust toolchain resolution and generator execution,
not just one local CI flag.

What this PR does instead:

- Keep gnullvm for the target cfgs we want to exercise.
- Move only the Windows test lane's host/exec platform to MSVC, then
patch the build-script/linker boundary enough for that split
configuration to work.

## Attempt 4: Validate compatibility with `bazel test --nobuild ...`

This turned out to be a misleading local validation command.

Why:

- `bazel test --nobuild ...` can successfully analyze targets and then
still exit 1 with "Couldn't start the build. Unable to run tests"
because there are no runnable test actions after `--nobuild`.

Better local check:

```powershell
bazel build --nobuild --keep_going --host_platform=//:local_windows_msvc //codex-rs/login:login-unit-tests //codex-rs/core:core-unit-tests //codex-rs/core:core-all-test
```

# Which patches probably deserve upstream follow-up

My rough take is that the `rules_rs` / `rules_rust` patches are the
highest-value upstream candidates, because they are fixing generic
Windows host/exec + MSVC direct-linker behavior rather than
Codex-specific test logic.

Strong upstream candidates:

- `patches/rules_rs_windows_exec_linker.patch`
- `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch`
- `patches/rules_rust_windows_build_script_runner_paths.patch`
- `patches/rules_rust_windows_exec_msvc_build_script_env.patch`
- `patches/rules_rust_windows_msvc_direct_link_args.patch`
- `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch`

Why these seem upstreamable:

- They address general-purpose problems in the Windows MSVC exec path:
  - missing direct-linker exposure for Rust toolchains
  - wrong linker selection when rustc emits MSVC-style args
  - Windows path normalization/short-path issues in the build-script
    runner
  - forwarding GNU-flavored CC/link flags into MSVC Cargo build scripts
  - unstable temp outputs polluting process-wrapper search-path state

Potentially upstreamable crate patches, but likely with more care:

- `patches/zstd-sys_windows_msvc_include_dirs.patch`
- `patches/ring_windows_msvc_include_dirs.patch`
- `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch`

Notes on those:

- The `zstd-sys` and `ring` include-path fixes look fairly generic for
MSVC/Bazel build-script environments and may be straightforward to
propose upstream after we confirm CI stability.
- The `aws-lc-sys` patch is useful, but it includes a Bazel-specific
environment probe and CI-specific compiler fallback behavior. That
probably needs a cleaner upstream-facing shape before sending it out, so
upstream maintainers are not forced to adopt Codex's exact CI
assumptions.

Probably not worth upstreaming as-is:

- The repo-local Starlark/test target changes in `defs.bzl`,
`codex-rs/*/BUILD.bazel`, and `.github/scripts/run-bazel-ci.sh` are
mostly Codex-specific policy and CI wiring, not generic rules changes.

# Validation notes for reviewers

On this branch, I ran the following local checks after dropping the
non-resource-loading Rust edits:

```powershell
cargo test -p codex-tui
just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc -- fix -p codex-tui
python .\tools\argument-comment-lint\run-prebuilt-linter.py -p codex-tui
just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc fmt
```

One local caveat:

- `just argument-comment-lint` still fails on this Windows machine for
an unrelated Bazel toolchain-resolution issue in
`//codex-rs/exec:exec-all-test`, so I used the direct prebuilt linter
for `codex-tui` as the local fallback.

# Expected reviewer takeaway

If this PR goes green, the important conclusion is that the Windows
Bazel test coverage gap was primarily a Bazel host/exec toolchain
problem, not a need to make the Rust tests themselves Windows-specific.
That would be a strong signal that the deleted non-resource-loading Rust
test edits from the earlier branch should stay out, and that future work
should focus on upstreaming the generic `rules_rs` / `rules_rust`
Windows fixes and reducing the crate-local patch surface.

Michael Bolin · 2026-04-03 15:34:03 -07:00

eaf12beacf

otel: remove the last workspace crate feature (#16469)

## Why

`codex-otel` still carried `disable-default-metrics-exporter`, which was
the last remaining workspace crate feature.

We are removing workspace crate features because they do not fit our
current build model well:

- our Bazel setup does not honor crate features today, which can let
feature-gated issues go unnoticed
- they create extra crate build permutations that we want to avoid

For this case, the feature was only being used to keep the built-in
Statsig metrics exporter off in test and debug-oriented contexts. This
repo already treats `debug_assertions` as the practical proxy for that
class of behavior, so OTEL should follow the same convention instead of
keeping a dedicated crate feature alive.

## What changed

- removed `disable-default-metrics-exporter` from
`codex-rs/otel/Cargo.toml`
- removed the `codex-otel` dev-dependency feature activation from
`codex-rs/core/Cargo.toml`
- changed `codex-rs/otel/src/config.rs` so the built-in
`OtelExporter::Statsig` default resolves to `None` when
`debug_assertions` is enabled, with a focused unit test covering that
behavior
- removed the final feature exceptions from
`.github/scripts/verify_cargo_workspace_manifests.py`, so workspace
crate features are now hard-banned instead of temporarily allowlisted
- expanded the verifier error message to explain the Bazel mismatch and
build-permutation cost behind that policy
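
For readers unfamiliar with the `debug_assertions` convention, here is a rough sketch of how a default can be switched off without a crate feature. `StandInExporter` and `resolve_default_exporter` are hypothetical stand-ins, not the real types or functions in `codex-rs/otel/src/config.rs`.

```rust
/// Hypothetical stand-in for the exporter setting; the real type in
/// `codex-otel` has more variants and carries configuration.
#[derive(Debug, PartialEq)]
enum StandInExporter {
    None,
    Statsig,
}

/// Resolves the built-in Statsig default to `None` in builds compiled with
/// debug assertions (typically dev and test profiles), and leaves every
/// other setting untouched.
fn resolve_default_exporter(requested: StandInExporter) -> StandInExporter {
    match requested {
        StandInExporter::Statsig if cfg!(debug_assertions) => StandInExporter::None,
        other => other,
    }
}
```

Because `cfg!(debug_assertions)` is evaluated at compile time, the result for `Statsig` depends on the build profile rather than on any Cargo feature, which is the point of the change.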

## How tested

- `python3 .github/scripts/verify_cargo_workspace_manifests.py`
- `cargo test -p codex-otel`
- `cargo test -p codex-core metrics_exporter_defaults_to_statsig_when_missing`
- `cargo test -p codex-app-server app_server_default_analytics_`
- `just bazel-lock-check`

Michael Bolin · 2026-04-01 13:45:23 -07:00

323aa968c3

chore: clean up argument-comment lint and roll out all-target CI on macOS (#16054)

## Why

`argument-comment-lint` was green in CI even though the repo still had
many uncommented literal arguments. The main gap was target coverage:
the repo wrapper did not force Cargo to inspect test-only call sites, so
examples like the `latest_session_lookup_params(true, ...)` tests in
`codex-rs/tui_app_server/src/lib.rs` never entered the blocking CI path.

This change cleans up the existing backlog, makes the default repo lint
path cover all Cargo targets, and starts rolling that stricter CI
enforcement out on the platform where it is currently validated.

## What changed

- mechanically fixed existing `argument-comment-lint` violations across
the `codex-rs` workspace, including tests, examples, and benches
- updated `tools/argument-comment-lint/run-prebuilt-linter.sh` and
`tools/argument-comment-lint/run.sh` so non-`--fix` runs default to
`--all-targets` unless the caller explicitly narrows the target set
- fixed both wrappers so forwarded cargo arguments after `--` are
preserved with a single separator
- documented the new default behavior in
`tools/argument-comment-lint/README.md`
- updated `rust-ci` so the macOS lint lane keeps the plain wrapper
invocation and therefore enforces `--all-targets`, while Linux and
Windows temporarily pass `-- --lib --bins`

That temporary CI split keeps the stricter all-targets check where it is
already cleaned up, while leaving room to finish the remaining Linux-
and Windows-specific target-gated cleanup before enabling
`--all-targets` on those runners. The Linux and Windows failures on the
intermediate revision were caused by the wrapper forwarding bug, not by
additional lint findings in those lanes.

## Validation

- `bash -n tools/argument-comment-lint/run.sh`
- `bash -n tools/argument-comment-lint/run-prebuilt-linter.sh`
- shell-level wrapper forwarding check for `-- --lib --bins`
- shell-level wrapper forwarding check for `-- --tests`
- `just argument-comment-lint`
- `cargo test` in `tools/argument-comment-lint`
- `cargo test -p codex-terminal-detection`

## Follow-up

- Clean up remaining Linux-only target-gated callsites, then switch the
Linux lint lane back to the plain wrapper invocation.
- Clean up remaining Windows-only target-gated callsites, then switch
the Windows lint lane back to the plain wrapper invocation.

Michael Bolin · 2026-03-27 19:00:44 -07:00

61dfe0b86c

127 Commits