mirror of
https://github.com/pchuan98/codex.git
synced 2026-07-01 00:31:56 +08:00
dev
127 Commits
-
[codex] Propagate traces through exec-server HTTP (#30117)
Fixes distributed trace continuity across exec-server JSON-RPC HTTP egress by adding an executor client span and injecting its W3C context through a reusable `codex-otel` helper. This preserves the caller trace across core/tool → executor → provider/MCP instead of dropping parentage at raw reqwest. Note that this doesn't include the websocket path, which is needed to really get the full story but at least we cover the basic http path with this change.
Tom ·
2026-06-25 23:22:22 +00:00 -
[codex] Record exec-server lifecycle metrics (#27467)
## Summary - Record bounded connection, request, and process lifecycle metrics. - Report active gauges from callbacks on every collection, including delta exports. - Serialize active-count updates so concurrent starts and finishes cannot publish stale values. - Serialize process exit, explicit termination, and shutdown through the process registry so exactly one completion result wins. - Keep the implementation small with single-owner RAII guards and one real OTLP/HTTP integration test using the existing `wiremock` dependency. ## Root cause Process exit and session shutdown previously used cloned completion state. That avoided duplicate emission, but it duplicated lifecycle ownership and made the ordering harder to reason about. The process registry mutex already defines the lifecycle ordering, so the final implementation stores the metric guard and termination flag directly on the process entry. Whichever path claims the entry first owns the completion result. Production metric export uses delta temporality. Event-only synchronous gauge recordings disappear after the next collection when no count changes, so active counts now use observable callbacks that report current state on every collection. The cleanup also removes the constant `result="accepted"` connection tag, redundant route and response assertions, a custom HTTP collector, and fallback initialization machinery that did not add behavior. ## Stack Review and land this stack in order: 1. #27466 — trace exec-server JSON-RPC requests 2. #27467 — record bounded connection, request, and process lifecycle metrics **(this PR)** 3. #27470 — observe remote registration and Noise rendezvous lifecycle ## Validation - `just test -p codex-exec-server --lib` (158 passed) - `just test -p codex-cli --test exec_server` (3 passed) - `just test -p codex-otel observable_gauge_is_collected_on_every_delta_snapshot` (1 passed) - `CARGO_BUILD_JOBS=1 just fix -p codex-otel -p codex-exec-server` - `just fmt` - `git diff --check`
richardopenai ·
2026-06-25 11:02:11 -07:00 -
feat: use run agent task auth for inference (#19051)
## Stack This is PR 3 of the simplified HAI single-run-task stack: - [#19047](https://github.com/openai/codex/pull/19047) Agent Identity assertion and task-registration primitives, including the shared run-task helper used by existing Agent Identity JWT auth. - [#19049](https://github.com/openai/codex/pull/19049) Disabled-by-default ChatGPT auth opt-in that provisions/reuses persisted Agent Identity runtime auth and its single run task. - [#19051](https://github.com/openai/codex/pull/19051) Run-scoped provider auth that uses one backend-owned task id for first-party inference and compaction requests. [#19054](https://github.com/openai/codex/pull/19054) collapsed out of the active stack because the simplified design no longer needs a separate background/control-plane task helper. ## Summary This PR moves Agent Identity usage into provider auth resolution. That keeps `AgentAssertion` auth tied to first-party OpenAI provider requests instead of applying a late session-wide override that could affect local, custom, Bedrock, API-key, or external-bearer providers. What changed: - adds a small `ProviderAuthScope` struct carrying the run auth policy and session source needed by provider-scoped auth resolution - lets `Session` opt the existing `ModelClient` into `ChatGptAuth` policy when `use_agent_identity` is enabled, without adding a second model-client constructor - resolves Agent Identity only for first-party OpenAI provider auth paths - uses the persisted run task id from the `AgentIdentityAuth` record to build `AgentAssertion` auth for Responses requests - routes shared request setup through scoped provider auth so unary compact requests use the same run-task assertion path as inference turns - keeps local/custom/Bedrock/env-key/external-bearer provider auth unchanged - lets missing run-task state surface through the existing model-request error path instead of silently falling back to bearer auth This PR intentionally does not create thread-scoped, target-scoped, or background-scoped task identities. The run task is the only task Codex registers in this POC shape. ## Testing - `just test -p codex-model-provider` - `just test -p codex-core client::tests::provider_auth_scope_uses` - `just test -p codex-core remote_compact_uses_agent_identity_assertion`
Adrian ·
2026-06-24 22:31:41 -07:00 -
auth: move domain mode below app wire types (#29721)
## Why Authentication mode is a domain concept used by login, model selection, telemetry, and transports. Keeping the canonical type in app-server protocol forces those lower-level crates to depend on an unrelated wire API. ## What changed - Added canonical `codex_protocol::auth::AuthMode` domain values. - Kept the app-server wire DTO unchanged and added an explicit app-side conversion. - Removed production app-server-protocol dependencies from login, model-provider-info, models-manager, and otel call paths. ## Stack This is PR 2 of 6, stacked on [PR #29714](https://github.com/openai/codex/pull/29714). Review only the delta from `codex/split-json-rpc-protocols`. Next: [PR #29722](https://github.com/openai/codex/pull/29722). ## Validation - Auth and login coverage passed in the focused protocol/domain test run. - App-server account and auth conversion coverage passed.
Adam Perry @ OpenAI ·
2026-06-24 03:10:20 +00:00 -
[codex] Use input items for Responses Lite tools (#27946)
When using Responses Lite, we should all use `additional_tools` and a developer item instead of the top level tools array & instructions field. This keeps things 1-to-1. Forced namespacing for _all_ tools will land in a following PR after some coordination & fixes in Responses API (around collisions & return items). The goal is to eventually expand the scope of this to _all_ requests from codex, but that will require larger coordination across providers & slower rollout.
rka-oai ·
2026-06-22 23:56:16 -07:00 -
[codex] Expose service tier and reasoning effort in OTEL (#29155)
## Summary NVIDIA asked to measure Fast mode usage and reasoning effort from Codex CLI OTEL logs. Add the finalized `service_tier` and `model_reasoning_effort` to the existing `codex.sse_event` `response.completed` record. This intentionally reuses the existing completion event and leaves transport APIs and shared telemetry plumbing unchanged. ## Testing - `cargo build -p codex-cli --bin codex` - `just test -p codex-core responses_api_emits_api_request_event` - End-to-end with the built CLI and a local OTLP/HTTP collector: - Fast/high emitted `service_tier=priority` and `model_reasoning_effort=high` with token usage. - Standard/low omitted `service_tier` and emitted `model_reasoning_effort=low` with token usage.
daniel-oai ·
2026-06-22 20:44:48 -07:00 -
Stop logging every Responses WebSocket event (#29432)
## Why Every successful Responses WebSocket event currently produces three local log records: the full payload at TRACE, an OpenTelemetry log event, and an OpenTelemetry trace event. On busy threads these records fill the 1,000-row log partition in seconds and cause continuous SQLite insert-and-prune churn. Related to https://openai.slack.com/archives/C095U48JNL9/p1782128972644209 ## What changed - Stop logging each successful Responses WebSocket payload at TRACE. - Stop emitting `codex.websocket_event` as OpenTelemetry log and trace events. - Keep WebSocket event counters, duration metrics, response timing metrics, parsing, and error handling.
jif ·
2026-06-22 17:43:08 +02:00 -
Propagate safety buffering events to app-server clients (#29371)
Responses API safety buffering metadata currently stops at the transport boundary, so app-server clients cannot render the in-progress safety review state. This change: - decodes and deduplicates `safety_buffering` metadata from Responses API SSE and WebSocket events without suppressing the original response event - emits a typed core event containing the requested model plus backend use cases and reasons - forwards that event as `turn/safetyBuffering/updated` through app-server v2 and updates generated protocol schemas - keeps the side-channel event out of persisted rollouts and turn timing This supports the Codex Apps buffering UX and depends on the Responses API backend work in https://github.com/openai/openai/pull/1044569 and https://github.com/openai/openai/pull/1044571. Validation: - focused `codex-core` safety-buffering integration test passes - `cargo check -p codex-core -p codex-app-server -p codex-app-server-protocol` - `just fix -p codex-api -p codex-protocol -p codex-core -p codex-app-server-protocol -p codex-app-server -p codex-rollout -p codex-rollout-trace -p codex-otel` - `just fmt` - broad package test run: 4,430/4,492 passed; 62 unrelated local-environment/concurrency failures involved unavailable test binaries, MCP subprocess setup, and app-server timeouts
Francis Chalissery ·
2026-06-22 03:39:14 +00:00 -
[codex] Use expect in integration tests (#28441)
The workspace denies `clippy::expect_used` in production. Although `clippy.toml` allows `expect` in tests, Bazel Clippy compiles integration-test helper code in a way that does not receive that exemption, which encouraged verbose `unwrap_or_else(... panic!(...))` and equivalent `match`/`let else` forms. This allows `clippy::expect_used` once at each integration-test crate root (including aggregated suites and test-support libraries), then replaces manual panic-based Result and Option unwraps with `expect`/`expect_err`. Standalone `tests/*.rs` files remain their own crate roots. Intentional assertion and unexpected-variant panics remain unchanged, and the production `expect_used = "deny"` lint remains in place. The cleanup is mechanical and net-negative in line count.
pakrym-oai ·
2026-06-15 21:53:47 -07:00 -
[codex] Add second-based OTEL duration histograms (#27058)
## Why Exec-server request and connection latencies need fractional-second histograms. The existing duration API records integer milliseconds and uses millisecond-scale buckets. ## What changed - Adds a described duration API that records `Duration` values as fractional seconds. - Uses second-scale explicit histogram boundaries. - Caches duration histograms by name, unit, and description, matching the existing instrument caching model. - Covers exact boundaries, representative bucket placement, fractional sums, and exported metadata. This PR only adds the duration primitive. It does not add exec-server adoption. ## Stack 1. #26091: counter descriptions 2. #27057: gauge instruments 3. **#27058: second-based duration histograms** 4. #25019: initialize exec-server OpenTelemetry at startup Related independent coverage: #27059 tests OTLP HTTP log and trace event export. ## Validation - `just test -p codex-otel`
richardopenai ·
2026-06-15 17:10:52 -07:00 -
feat(core): add metadata field to ResponseItem (#28355)
## Description This PR adds an optional `metadata` field to `ResponseItem` for Responses API calls. Only mechanical plumbing, no actual values populated and sent yet. Turns out just adding a new field to `ResponseItem` has quite a large blast radius already. This change is backwards compatible because `metadata` is optional and omitted when absent, so existing response items and rollout history without it still deserialize and requests that do not set it keep the same wire shape. For provider compatibility, we strip out `metadata` before non-OpenAI Responses requests so Azure and AWS Bedrock never see this field. My followup PR here will actually make use of it to start storing and passing along `turn_id`: https://github.com/openai/codex/pull/28360 ## What changed - Added `ResponseItemMetadata` with optional `turn_id`, plus optional `metadata` on Responses API item variants and inter-agent communication. - Preserved item metadata through response-item rewrites such as truncation, missing tool-output synthesis, compaction history rebuilding, visible-history conversion, rollout/resume, and generated app-server schemas/types. - Strip item metadata from non-OpenAI Responses requests while preserving it for OpenAI-shaped requests. - Updated the mechanical fixture/test construction churn required by the new optional field.
Owen Lin ·
2026-06-15 15:05:28 -07:00 -
[codex] Cover OTLP HTTP log and trace event export (#27059)
## Why The generic OTLP HTTP paths for log events and trace events need end-to-end coverage before exec-server relies on them. ## What changed - Adds loopback coverage for exporting `codex_otel.log_only` events to `/v1/logs`. - Verifies `codex_otel.trace_safe` events are present in the exported trace payload. This is a test-only PR. It does not change OTEL runtime behavior or metric APIs. ## Related work - #26091: counter descriptions - #27057: gauge instruments - #27058: second-based duration histograms This PR is independent and can land directly on `main`. ## Validation - `just test -p codex-otel` - `just fix -p codex-otel` - `just fmt`
richardopenai ·
2026-06-15 09:59:26 -07:00 -
feat: add Bedrock API key as a managed auth mode (#27443)
## Why Codex needs to manage Amazon Bedrock API key credentials through the existing auth lifecycle instead of introducing a separate auth manager or provider-specific credential file. Treating Bedrock API key login as a primary auth mode gives it the same persistence, keyring, reload, and logout behavior as the existing OpenAI API key and ChatGPT modes. The credential is valid only for the `amazon-bedrock` model provider. OpenAI-compatible providers must reject this auth mode rather than treating the Bedrock key as an OpenAI bearer token. ## What changed - Added `bedrockApiKey` as an app-server `AuthMode` and `CodexAuth::BedrockApiKey` as a primary `AuthManager` mode. - Added `BedrockApiKeyAuth`, containing the API key and AWS region, to the existing `AuthDotJson` payload stored in `$CODEX_HOME/auth.json` or the configured keyring backend. - Added `login_with_bedrock_api_key(...)`, parallel to `login_with_api_key(...)`, which replaces the current stored login with Bedrock credentials. - Reused generic auth reload and logout behavior instead of adding a Bedrock-specific auth manager or logout path. - Updated login restrictions, status reporting, diagnostics, telemetry classification, generated app-server schemas, and auth fixtures for the new mode. - Added explicit errors when Bedrock API key auth is selected with an OpenAI-compatible model provider. This PR establishes managed storage and auth-mode behavior. Routing the managed key and region into Amazon Bedrock requests will be in follow-up PRs.
Celia Chen ·
2026-06-10 20:42:38 -07:00 -
[codex] Add reusable OTEL gauge instruments (#27057)
## Why Exec-server observability needs current-value measurements in addition to counters. The reusable OTEL client should expose that primitive without coupling it to exec-server runtime behavior. ## What changed - Adds integer gauge instruments, with optional descriptions. - Caches gauges by name and description so instrument metadata remains part of the declaration identity. - Covers gauge values, descriptions, merged attributes, and OTLP HTTP export. This PR only adds the gauge primitive. It does not add second-based duration histograms or exec-server adoption. ## Stack 1. #26091: counter descriptions 2. **#27057: gauge instruments** 3. #27058: second-based duration histograms Related independent coverage: #27059 tests OTLP HTTP log and trace event export. ## Validation - `just test -p codex-otel` - `just fix -p codex-otel` - `just fmt`
richardopenai ·
2026-06-10 21:36:38 +00:00 -
[codex] Add OTEL counter descriptions (#26091)
## Why Metric descriptions should be declared with reusable OTEL instruments instead of being coupled to individual consumers. Counter descriptions are the smallest API primitive needed by the exec-server observability work. ## What changed - Adds `counter_with_description` while preserving the existing counter API. - Caches counters by name and description so instrument metadata remains part of the declaration identity. - Covers the exported description together with the existing value and attribute contract. This PR only adds counter descriptions. It does not add gauges, second-based durations, or exec-server adoption. ## Stack 1. **#26091: counter descriptions** 2. #27057: gauge instruments 3. #27058: second-based duration histograms Related independent coverage: #27059 tests OTLP HTTP log and trace event export. The `codex-exec-server` bounded service tag now stays with the exec-server adoption change instead of this reusable infrastructure stack. ## Validation - `just test -p codex-otel` - `just fix -p codex-otel` - `just fmt`
richardopenai ·
2026-06-08 22:29:51 +00:00 -
[codex-rs] support v2 personal access tokens (#25731)
## Summary - add v2 personal access token support for `codex login --with-access-token` and `CODEX_ACCESS_TOKEN` - classify opaque `at-` tokens separately from legacy Agent Identity JWTs - hydrate required ChatGPT account metadata through AuthAPI `/v1/user-auth-credential/whoami` - use PATs directly as bearer tokens while preserving existing ChatGPT account surfaces - expose PAT-backed auth as the explicit `personalAccessToken` app-server auth mode ## Implementation PAT auth is intentionally small and stateless. Loading a PAT performs one AuthAPI metadata request, stores the hydrated metadata in the in-memory auth object, and redacts the secret from debug output. Legacy Agent Identity JWT handling remains unchanged. The shared access-token classifier lives in a private neutral module because it dispatches between both credential types. PAT hydration fails closed when AuthAPI omits any required metadata, including email. Hydrated metadata is intentionally not persisted: startup performs a live `whoami` preflight so revoked tokens or changed account metadata are not accepted from a stale cache. ## Workspace restriction scope This change intentionally does **not** apply `forced_chatgpt_workspace_id` to PAT authentication. The setting is a client-side config guardrail, not an authorization boundary, and PAT does not currently require workspace-ID parity. The PAT login and `CODEX_ACCESS_TOKEN` paths therefore validate through AuthAPI without threading workspace-restriction state through access-token loading. Existing workspace checks for non-PAT auth remain on their established paths. ## App-server compatibility The public app-server `AuthMode` is shared across v1 and v2, and PAT-backed auth reports `personalAccessToken` through both APIs. Following human review, this intentionally removes the temporary v1 compatibility mapping that reported PATs as `chatgpt`; the deprecated v1 API is kept in parity with v2 rather than maintaining a separate closed enum. Clients with exhaustive auth-mode handling in either API version must add the new case and should generally treat it as ChatGPT-backed unless they need PAT-specific behavior. The v1 auth-status response still omits the raw PAT when `includeToken` is requested because that response cannot carry the account metadata needed to reuse the credential safely. Persisted PAT auth also omits the new enum value so older Codex builds can deserialize `auth.json` and infer PAT auth from the credential field after a rollback. ## Validation Latest review-fix validation: - `CARGO_INCREMENTAL=0 just test -p codex-login` (126 passed) - `CARGO_INCREMENTAL=0 just test -p codex-cli` (263 passed) - `CARGO_INCREMENTAL=0 just test -p codex-cli stored_auth_validation_handles_personal_access_token` - `CARGO_INCREMENTAL=0 just test -p codex-app-server-protocol` (226 passed) - `CARGO_INCREMENTAL=0 just test -p codex-models-manager refresh_available_models_uses_remote_only_catalog_for_chatgpt_auth` - `CARGO_INCREMENTAL=0 just test -p codex-tui existing_non_oauth_chatgpt_login_counts_as_signed_in` - `CARGO_INCREMENTAL=0 just fix -p codex-login -p codex-app-server-protocol -p codex-models-manager -p codex-tui -p codex-cli` - `just fmt` - `git diff --check` The broader `codex-tui` suite previously compiled and ran 2,834 tests. Three unrelated environment-sensitive guardian/IDE-socket tests failed after retries; the PAT-relevant TUI coverage passed.
cooper-oai ·
2026-06-05 17:36:18 -07:00 -
[codex] Forward turn moderation metadata through app-server (#25710)
## Why First-party backends can supply turn-scoped moderation metadata that app-server clients need for client-side presentation. Exposing this as an experimental typed notification lets opted-in clients consume it without interpreting raw Responses API events. ## What changed - forward `response.metadata.openai_chatgpt_moderation_metadata` from Responses API SSE and WebSocket streams as turn-scoped moderation metadata - emit the experimental app-server v2 `turn/moderationMetadata` notification with `{ threadId, turnId, metadata }` - add app-server integration coverage for the typed moderation metadata notification ## Testing - `just test -p codex-core build_ws_client_metadata_includes_window_lineage_and_turn_metadata` - `just test -p codex-core` (fails locally: 46 failures and 1 timeout, primarily missing `test_stdio_server` and shell snapshot timeouts) - `just test -p codex-app-server-protocol` - `just test -p codex-app-server turn_moderation_metadata_emits_typed_notification_v2` - `just test -p codex-app-server` (fails locally: 792 passed, 10 failed, and 5 timed out; failures are in existing environment-sensitive tests, primarily because nested macOS `sandbox-exec` is not permitted) - `just write-app-server-schema --experimental --schema-root /tmp/codex-app-server-schema-experimental`carlc-oai ·
2026-06-05 02:41:06 -07:00 -
Encrypt multi-agent v2 message payloads (#26210)
## Why Multi-agent v2 currently routes agent instructions through normal tool arguments and inter-agent context. That means the parent model can emit plaintext task text, Codex can persist it in history/rollouts, and the recipient can receive it as ordinary assistant-message JSON. This changes the v2 path so agent instructions stay encrypted between model calls: Responses encrypts the `message` argument returned by the model, Codex forwards only that ciphertext, and Responses decrypts it internally for the recipient model. ## What changed - Mark the v2 `message` parameter as encrypted for `spawn_agent`, `send_message`, and `followup_task`. - Treat multi-agent v2 tool `message` values as ciphertext unconditionally. - Store v2 inter-agent task text in `InterAgentCommunication.encrypted_content` with empty plaintext `content`. - Convert encrypted inter-agent communications into the Responses `agent_message` input item before sending the child request. - Preserve `agent_message` items across history, rollout, compaction, telemetry, and app-server schema paths. - Leave multi-agent v1 unchanged. ## Message shape The model still calls the v2 tools with a `message` argument, but that value is now ciphertext: ```json { "name": "spawn_agent", "arguments": { "task_name": "worker", "message": "<ciphertext>" } } ``` Codex stores the task as encrypted inter-agent communication: ```json { "author": "/root", "recipient": "/root/worker", "content": "", "encrypted_content": "<ciphertext>", "trigger_turn": true } ``` When Codex builds the recipient request, it forwards the ciphertext using the new Responses input item: ```json { "type": "agent_message", "author": "/root", "recipient": "/root/worker", "content": [ { "type": "encrypted_content", "encrypted_content": "<ciphertext>" } ] } ``` Responses decrypts that item internally for the recipient model. ## Context impact - Parent context no longer carries plaintext v2 agent task instructions from these tool arguments. - Codex rollout/history stores ciphertext for v2 agent instructions. - Recipient requests receive an `agent_message` item instead of assistant commentary JSON for encrypted task delivery. - Plaintext completion/status notifications are still plaintext because they are Codex-generated status messages, not encrypted model tool arguments. ## Validation - `just test -p codex-tools` - `just test -p codex-protocol` - `just test -p codex-rollout` - `just test -p codex-rollout-trace` - `just test -p codex-otel` - `just write-app-server-schema`jif ·
2026-06-05 10:25:57 +02:00 -
[codex] Emit sandbox outcome telemetry event (#25955)
## Summary Adds a dedicated `codex.sandbox_outcome` telemetry event so we can query sandbox edge outcomes without threading sandbox metadata through tool-result output types. This is meant to make sandbox failures and approved escalation retries visible in OTEL while keeping the existing `codex.tool_result` event shape focused on tool completion data. ## What changed - Adds `SessionTelemetry::sandbox_outcome(...)`, which emits `codex.sandbox_outcome` as both a log and trace event. - Records the tool name, call id, sandbox outcome, initial attempt duration, and escalated attempt duration when a retry runs. - Emits `denied` when the sandbox blocks execution and no retry is run. - Emits `timed_out` and `signal` when those sandbox errors surface from tool execution. - Emits `escalated` when the initial sandboxed attempt fails and the approved unsandboxed retry succeeds. - Adds OTEL coverage for the new event payload, including timing fields. ## Validation - `RUST_MIN_STACK=8388608 just test -p codex-core sandbox_outcome_event_records_outcome handle_sandbox_error_user_approves_retry_records_tool_decision` - `just test -p codex-otel otel_export_routing_policy_routes_tool_result_log_and_trace_events runtime_metrics_summary_collects_tool_api_and_streaming_metrics` - `just fix -p codex-core` - `just fix -p codex-otel`
rreichel3-oai ·
2026-06-04 20:58:14 -04:00 -
[codex] Support model-defined reasoning efforts (#26444)
## Summary - accept non-empty model-defined reasoning effort values while preserving built-in effort behavior - propagate the non-Copy effort type through core, app-server, TUI, telemetry, and persistence call sites - preserve string wire encoding and expose an open-string schema for clients - update model selection and shortcut behavior for model-advertised effort values ## Root cause `ReasoningEffort` gained a string-backed custom variant, so it could no longer implement `Copy` or rely on derived closed-enum serialization. Existing consumers still moved effort values from shared references and assumed a fixed built-in value set. ## Validation - `just fmt` - Local tests and compilation were not run per request; relying on CI.
Ahmed Ibrahim ·
2026-06-04 13:36:24 -07:00 -
Add Guardian review metrics (#24897)
## Why Guardian reviews already emit analytics events, but we do not expose aggregate OpenTelemetry metrics for review volume, latency, token usage, or terminal outcomes. That makes it harder to monitor Guardian behavior during rollouts and to compare review outcomes by source, action type, session kind, model, and failure mode. ## What Changed - Added Guardian review metric names for count, total duration, time to first token, and token usage in `codex-rs/otel`. - Added `core/src/guardian/metrics.rs` to convert `GuardianReviewAnalyticsResult` into sanitized metric tags covering decision, terminal status, failure reason, approval request source, reviewed action, session kind, risk/outcome, model, reasoning effort, and context/truncation state. - Emitted the new metrics from `track_guardian_review` for each terminal Guardian review result. ## Testing - Added `guardian_review_metrics_record_counts_durations_and_token_usage`, which verifies the emitted count, duration, TTFT, token usage histograms, and tag set through the in-memory metrics exporter.
jif-oai ·
2026-05-28 14:07:25 +02:00 -
otel: drop legacy profile usage telemetry (#24061)
## Summary - drop the dead legacy profile usage metric and active-profile conversation-start fields - update role comments so they describe provider and service-tier preservation without legacy config-profile wording - pair the code cleanup with the file-backed profile docs update in openai/developers-website#1476 ## Testing - `just fmt` - `cargo test -p codex-otel` - `cargo test -p codex-core` *(fails: existing stack overflow in `mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message`)* - `cargo test -p codex-core --lib mcp_tool_call::tests::guardian_mode_mcp_denial_returns_rationale_message` *(fails with the same stack overflow)*
jif-oai ·
2026-05-22 13:14:44 +02:00 -
Split plugin install discovery into list and request tools (#23372)
## Summary - Add `list_available_plugins_to_install` as the inventory step for plugin and connector install suggestions. - Slim `request_plugin_install` so it only handles the actual elicitation, instead of carrying the full discoverable list in its prompt. - Emit send-time telemetry when an install elicitation is dispatched, including requested tool identity in the event payload. - Emit install-result telemetry through `SessionTelemetry`, including tool type, user response action, and completion status. - Update registration and tests to cover the new two-step flow while keeping the existing `tool_suggest` feature gate unchanged. ## Testing - `just fmt` - `cargo test -p codex-tools` - `cargo test -p codex-core request_plugin_install` - `cargo test -p codex-core list_available_plugins_to_install` - `cargo test -p codex-core install_suggestion_tools_can_be_registered_without_search_tool` - `cargo test -p codex-otel manager_records_plugin_install_suggestion_metric` - `cargo test -p codex-otel manager_records_plugin_install_elicitation_sent_metric` - `just fix -p codex-core` - `just fix -p codex-tools` - `just fix -p codex-otel` - `cargo check -p codex-core`
Matthew Zeng ·
2026-05-19 14:45:37 -07:00 -
goal: pause continuation loops on usage limits and blockers (#23094)
Addresses #22833, #22245, #23067 ## Why `/goal` can keep synthesizing turns even when the next turn cannot make meaningful progress. Hard usage exhaustion can replay failing turns, and repeated permission or external-resource blockers can keep burning tokens while waiting for user or system intervention. ## What changed - Add resumable `blocked` and `usageLimited` goal states. As with `paused`, goal continuation stops with these states. - Move to `usageLimited` after usage-limit failures. - Allow the built-in `update_goal` tool to set `blocked` only under explicit repeated-impasse guidance. Updated goal continuation prompt to specify that agent should use `blocked` only when it has made at least three attempts to get past an impasse. Most of the files touched by this PR are because of the small app server protocol update. ## Validation I manually reproduced a number of situations where an agent can run into a true impasse and verified that it properly enters `blocked` state. I then resumed and verified that it once again entered `blocked` state several turns later if the impasse still exists. I also manually reproduced the usage-limit condition by creating a simulated responses API endpoint that returns 429 errors with the appropriate error message. Verified that the goal runtime properly moves the goal into `usageLimited` state and TUI UI updates appropriately. Verified that `/goal resume` resumes (and immediately goes back into `ussageLImited` state if appropriate). ## Follow-up PRs Small changes will be needed to the GUI clients to properly handle the two new states.
Eric Traut ·
2026-05-18 11:28:53 -07:00 -
chore: goal resumed metrics (#23301)
Add metrics for goal resume
jif-oai ·
2026-05-18 15:19:23 +02:00 -
Preserve image detail in app-server inputs (#20693)
## Summary - Add optional image detail to user image inputs across core, app-server v2, thread history/event mapping, and the generated app-server schemas/types. - Preserve requested detail when serializing Responses image inputs: omitted detail stays on the existing `high` default, while explicit `original` keeps local images on the original-resolution path. - Support `high`/`original` consistently for tool image outputs, including MCP `codex/imageDetail`, code-mode image helpers, and `view_image`.
Curtis 'Fjord' Hawthorne ·
2026-05-15 15:04:04 -07:00 -
[codex] Use compaction_trigger item for remote compaction v2 (#22809)
## Why Remote compaction v2 was still using `context_compaction` as both the request trigger and the compacted output shape. The Responses API now has the landed contract for this flow: Codex sends a dedicated `{ "type": "compaction_trigger" }` input item, and the backend returns the standard `compaction` output item with encrypted content. This aligns the v2 path with that wire contract while preserving the existing local compacted-history post-processing behavior. ## What changed - Add `ResponseItem::CompactionTrigger` and regenerate the app-server protocol schema fixtures. - Send `compaction_trigger` from `remote_compaction_v2` instead of a payload-less `context_compaction`. - Collect exactly one backend `compaction` output item, then reuse the existing compacted-history rebuilding path. - Treat the trigger item as a transient request marker rather than model output or persisted rollout/memory content. ## Verification - `cargo test -p codex-protocol compaction_trigger` - `cargo test -p codex-core remote_compact_v2` - `cargo test -p codex-core compact_remote_v2` - `cargo test -p codex-core responses_websocket_sends_response_processed_after_remote_compaction_v2` - `just write-app-server-schema` - `cargo test -p codex-app-server-protocol schema_fixtures`jif-oai ·
2026-05-15 11:40:35 +02:00 -
Simplify MCP tool handler plumbing (#21595)
## Why The MCP tool path had accumulated a few core-owned special cases: a dedicated payload variant, resolver plumbing, a legacy `AfterToolUse` translation path, and a side channel for parallel-call metadata. That made `ToolRegistry` and the spec builder know more about MCP than they needed to. This change moves MCP-specific execution details back onto `ToolInfo` and `McpHandler` so `codex-core` can treat MCP calls like normal function calls while still preserving MCP-specific dispatch and telemetry behavior where it belongs. ## What changed - removed `resolve_mcp_tool_info`, `ToolPayload::Mcp`, `ToolKind`, and the remaining registry-side MCP resolver path - stored MCP routing metadata directly on `McpHandler` and `ToolInfo`, including `supports_parallel_tool_calls` - deleted the legacy `AfterToolUse` consumer in `core`, which removes the need for handler-specific `after_tool_use_payload` implementations - switched tool-result telemetry to handler-provided tags and kept MCP-specific dispatch payload construction inside the handler - simplified tool spec planning/building by passing `ToolInfo` directly and dropping the direct/deferred MCP wrapper structs and the parallel-server side table ## Testing - `cargo check -p codex-core -p codex-mcp -p codex-otel` - `cargo test -p codex-core mcp_parallel_support_uses_exact_payload_server` - `cargo test -p codex-core direct_mcp_tools_register_namespaced_handlers` - `cargo test -p codex-core search_tool_description_lists_each_mcp_source_once` - `cargo test -p codex-mcp list_all_tools_uses_startup_snapshot_while_client_is_pending` - `just fix -p codex-core -p codex-mcp -p codex-otel`
pakrym-oai ·
2026-05-12 00:11:31 +00:00 -
Add production startup and TTFT telemetry (#22198)
## Why While investigating `codex exec hi` startup latency, the useful questions were not "is startup slow?" but "which durable bucket is slow in production?" The path we observed has a few distinct stages: 1. `thread/start` creates the session 2. startup prewarm builds the turn context, tools, and prompt 3. startup prewarm warms the websocket 4. the first real turn resolves the prewarm 5. the model produces the first token Before this PR, production telemetry had some of the raw measurements already: - aggregate startup-prewarm duration / age-at-first-turn metrics - TTFT as a metric - websocket request telemetry But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation. ## What changed Add durable production telemetry on the existing `SessionTelemetry` path: - new `codex.startup_phase` OTel log/trace events plus `codex.startup.phase.duration_ms` - new `codex.turn_ttft` OTel log/trace events while preserving the existing TTFT metric The startup phase event is emitted for the coarse buckets we actually observed while running `exec hi`: - `thread_start_create_thread` - `startup_prewarm_total` - `startup_prewarm_create_turn_context` - `startup_prewarm_build_tools` - `startup_prewarm_build_prompt` - `startup_prewarm_websocket_warmup` - `startup_prewarm_resolve` These phases are intentionally low-cardinality so they remain safe as production telemetry tags. ## Why this shape This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior: - prewarm still runs - no control flow changes - no extra remote calls - no user-visible behavior changes One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after `thread/start` are now covered durably. ## Verification - `cargo test -p codex-otel` - `cargo test -p codex-core turn_timing` - `cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm` - `cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted` - `cargo test -p codex-app-server thread_start` - `just fix -p codex-otel -p codex-core -p codex-app-server` I also ran `cargo test -p codex-core`; it built successfully and then hit an existing unrelated stack overflow in `tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed`.
Matthew Zeng ·
2026-05-11 23:58:36 +00:00 -
Add process-scoped SQLite telemetry (#22154)
## Summary - add SQLite init, backfill-gate, and fallback telemetry without introducing a cross-cutting state-db access wrapper - install one process-scoped telemetry sink after OTEL startup and let low-level state/rollout paths emit through it directly - add process-start metrics for the process owners that initialize SQLite --------- Co-authored-by: Owen Lin <owen@openai.com>
jif-oai ·
2026-05-11 11:32:40 -07:00 -
codex-otel: validate provider span attributes consistently (#21749)
Provider initialization installs process-global OTEL state, so invalid trace metadata needs to fail before setup begins. Use the same span attribute validator as config loading when traces are exported so provider startup enforces the config contract without duplicating validation logic.
bbrown-oai ·
2026-05-08 08:20:49 -07:00 -
codex-otel: add configurable trace metadata (#21556)
Add Codex config for static trace span attributes and structured W3C tracestate field upserts. The config flows through OtelSettings so callers can attach trace metadata without touching every span call site. Apply span attributes with an SDK span processor so every exported trace span carries the configured metadata. Model tracestate as nested member fields so configured keys can be upserted while unrelated propagated state in the same member is preserved. Validate configured tracestate before installing provider-global state, including header-unsafe values the SDK does not reject by itself. This keeps Codex from propagating malformed trace context from config. Update the config schema, public docs, and OTLP loopback coverage for config parsing, span export, propagation, and invalid-header rejection.
bbrown-oai ·
2026-05-07 16:06:57 -07:00 -
revert legacy notify deprecation (#21152)
# Why Revert #20524 for now because the computer use plugin has not migrated off legacy `notify` yet. Keeping the deprecation in place today would show users a warning before the plugin path is ready to move, so this rolls the change back until that migration is complete. # What - revert the legacy `notify` deprecation change from #20524 - restore the prior `notify` behavior and remove the temporary deprecation metrics/docs from that change Once the computer use plugin has migrated, we can land the same deprecation again.
Abhinav ·
2026-05-05 10:34:44 -07:00 -
Add goal lifecycle metrics (#20799)
## Why Adding goal metrics makes it possible to track how often goals are created, completed, and stopped by budget limits, plus the final token and wall-clock usage for terminal outcomes. ## What Changed - Added OpenTelemetry metric constants for goal lifecycle tracking: - `codex.goal.created`: increments each time a new persisted goal is created or an existing goal is replaced with a new objective. - `codex.goal.completed`: increments when a goal transitions to `complete`. - `codex.goal.budget_limited`: increments when a goal transitions to `budget_limited` because its token budget has been reached. - `codex.goal.token_count`: records the final persisted token count when a goal transitions to `complete` or `budget_limited`. - `codex.goal.duration_s`: records the final persisted elapsed wall-clock time, in seconds, when a goal transitions to `complete` or `budget_limited`. - Emitted creation metrics when a goal is created or replaced. - Emitted terminal outcome counters and final usage histograms when a goal transitions to `complete` or `budget_limited`, avoiding double-counting later in-flight accounting for already budget-limited goals. - Added focused `codex-core` tests for create/complete metrics and one-time budget-limit metrics.
Eric Traut ·
2026-05-05 09:21:54 -07:00 -
feat: add remote compaction v2 Responses client path (#20773)
## Why This adds the `remote_compaction_v2` client path so remote compaction can run through the normal Responses stream and install a `context_compaction` item that trigger a compaction. The goal is to migrate some of the compaction logic on the client side We keeps the v2 transport behind a feature flag while letting follow-up requests reuse the compacted context instead of falling back to the legacy compaction item shape. ## What changed - add `ResponseItem::ContextCompaction` and refresh the generated app-server / schema / TypeScript fixtures that expose response items on the wire - add `core/src/compact_remote_v2.rs` to send compaction through the standard streamed Responses client, require exactly one `context_compaction` output item, and install that item into compacted history - route manual compact and auto-compaction through the v2 path when `remote_compaction_v2` is enabled, while keeping the existing remote compaction path as the fallback - preserve the new item type across history retention, follow-up request construction, telemetry, rollout persistence, and rollout-trace normalization - add targeted coverage for the feature flag, `context_compaction` serialization, rollout-trace normalization, and remote-compaction follow-up behavior ## Verification - added protocol tests for `context_compaction` serialization/deserialization in `protocol/src/models.rs` - added rollout-trace coverage for `context_compaction` normalization in `rollout-trace/src/reducer/conversation_tests.rs` - added remote compaction integration coverage for v2 follow-up reuse and mixed compaction output streams in `core/tests/suite/compact_remote.rs` --------- Co-authored-by: Codex <noreply@openai.com>
jif-oai ·
2026-05-04 14:15:01 +02:00 -
deprecate legacy notify (#20524)
# Why `notify` is the remaining compatibility surface from the legacy hook implementation. The newer lifecycle hook engine now owns the active hook system, so we should start steering users away from adding new `notify` configs before removing the old path entirely. This also adds a lightweight watchpoint for the deprecation so we can see how much legacy usage remains before the clean drop. # What - emit a startup deprecation notice when a non-empty `notify` command is configured - emit `codex.notify.configured` when a session starts with legacy `notify` configured - emit `codex.notify.run` when the legacy notify path fires after a completed turn - mark `notify` as deprecated in the config schema and repo docs - remove the orphaned `codex-rs/hooks/src/user_notification.rs` file that is no longer compiled - add regression coverage for the new deprecation notice # Next steps A follow-up PR can remove the legacy notify path entirely once we are ready for the clean drop. Before then, we can watch `codex.notify.configured` and `codex.notify.run` to understand the deprecation impact and remaining active usage. The cleanup PR should then delete the `notify` config field, the `legacy_notify` implementation, the old compatibility dispatch types and callsites that only exist for the legacy path, and the remaining compatibility docs/tests. # Testing - `cargo test -p codex-hooks` - `cargo test -p codex-config` - `cargo test -p codex-core emits_deprecation_notice_for_notify`
Abhinav ·
2026-05-01 17:35:21 +00:00 -
install WFP filters for Windows sandbox setup (#20101)
## Summary This PR installs a first wave of WFP (Windows Filtering Platform) filters that reduce the surface area of network egress vulnerabilities for the Windows Sandbox. - Add persistent Windows Filtering Platform provider, sublayer, and filters for the Windows sandbox offline account. - Install WFP filters during elevated full setup, log failures non-fatally, and emit setup metrics when analytics are enabled. - Bump the Windows sandbox setup version so existing users rerun full setup and receive the new filters. ## What WFP is Windows Filtering Platform (WFP) is the low-level Windows networking policy engine underneath things like Windows Firewall. It lets privileged code install persistent filtering rules at specific network stack layers, with conditions like "only traffic from this Windows account" or "only this remote port," and an action like block. In this change, we create a Codex-owned persistent WFP provider and sublayer, then install block filters scoped to the Windows sandbox's offline user account via `ALE_USER_ID`. That means the filters are targeted at sandboxed processes running as that account, rather than globally affecting the host. ## Initial filter set We are starting with 12 concrete WFP filters across a few high-value bypass surfaces. The table below describes the filter families rather than one filter per row: | Area | Concrete filters | Purpose | | --- | --- | --- | | ICMP | 4 filters: ICMP v4/v6 on `ALE_AUTH_CONNECT` and `ALE_RESOURCE_ASSIGNMENT` | Block direct ping-style network reachability checks from the offline account. | | DNS | 2 filters: remote port `53` on `ALE_AUTH_CONNECT_V4/V6` | Block direct DNS queries that bypass our intended proxy/offline path. | | DNS-over-TLS | 2 filters: remote port `853` on `ALE_AUTH_CONNECT_V4/V6` | Block encrypted DNS attempts that could bypass ordinary DNS interception. | | SMB / NetBIOS | 4 filters: remote ports `445` and `139` on `ALE_AUTH_CONNECT_V4/V6` | Block Windows file-sharing/network share traffic from sandboxed processes. | For IPv4/IPv6 coverage, the port-based filters are installed on both `ALE_AUTH_CONNECT_V4` and `ALE_AUTH_CONNECT_V6`. ICMP also gets both connect-layer and resource-assignment-layer coverage because ICMP traffic is shaped differently from ordinary TCP/UDP port traffic. ## Validation - `cargo fmt -p codex-windows-sandbox` (completed with existing stable-rustfmt warnings about `imports_granularity = Item`) - `cargo test -p codex-windows-sandbox wfp::tests` - `cargo test -p codex-windows-sandbox` (fails in existing legacy PowerShell sandbox tests because `Microsoft.PowerShell.Utility` could not be loaded; WFP tests passed before that failure)
iceweasel-oai ·
2026-04-30 12:39:01 -07:00 -
[codex] Add token usage to turn tracing spans (#19432)
## Why Slow Codex turns are easier to debug when token usage is visible in the trace itself, without joining against separate analytics. This adds token usage to existing turn-handling spans for regular user turns only. [Example turn](https://openai.datadoghq.com/apm/trace/9d353efa2cb5de1f4c5b93dc33c3df04?colorBy=service&graphType=flamegraph&shouldShowLegend=true&sort=time&spanID=3555541504891512675&spanViewType=metadata&traceQuery=) <img width="1447" height="967" alt="Screenshot 2026-04-24 at 3 03 07 PM" src="https://github.com/user-attachments/assets/ab7bb187-e7fc-41f0-a366-6c44610b2b2c" /> ## What Changed Added response-level token fields on completed handle_responses spans: gen_ai.usage.input_tokens gen_ai.usage.cache_read.input_tokens gen_ai.usage.output_tokens codex.usage.reasoning_output_tokens codex.usage.total_tokens Added aggregate token fields on regular turn spans: codex.turn.token_usage.* Added an explicit regular-turn opt-in via SessionTask::records_turn_token_usage_on_span() so this is not coupled to span-name strings. ## Testing - `cargo test -p codex-otel` - `cargo test -p codex-core turn_and_completed_response_spans_record_token_usage` - `just fmt` - `just fix -p codex-core` - `just fix -p codex-otel` - Manual local Electron/app-server smoke test: regular user turn emits the new span fields Known status: `cargo test -p codex-core` was attempted and failed in unrelated existing areas: config approvals, request-permissions, git-info ordering, and subagent metadata persistence.
charley-openai ·
2026-04-28 11:41:32 -07:00 -
Remove ghost snapshots (#19481)
## Summary - Remove `ghost_snapshot` / `GhostCommit` from the Responses API surface and generated SDK/schema artifacts. - Keep legacy config loading compatible, but make undo a no-op that reports the feature is unavailable. - Clean up core history, compaction, telemetry, rollout, and tests to stop carrying ghost snapshot items. ## Testing - Unit tests passed for `codex-protocol`, `codex-core` targeted undo and compaction flows, `codex-rollout`, and `codex-app-server-protocol`. - Regenerated config and app-server schemas plus Python SDK artifacts and verified they match the checked-in outputs.
pakrym-oai ·
2026-04-27 18:48:57 -07:00 -
Add safety check notification and error handling (#19055)
Adds a new app-server notification that fires when a user account has been flagged for potential safety reasons.
Eric Traut ·
2026-04-22 22:24:12 -07:00 -
feat: Fairly trim skill descriptions within context budget (#18925)
Preserve skill name/path entries whenever possible and trim descriptions first, using round-robin character allocation so short descriptions do not waste budget.
xl-openai ·
2026-04-22 12:33:29 -07:00 -
feat: add explicit AgentIdentity auth mode (#18785)
## Summary This PR adds `CodexAuth::AgentIdentity` as an explicit auth mode. An AgentIdentity auth record is a standalone `auth.json` mode. When `AuthManager::auth().await` loads that mode, it registers one process-scoped task and stores it in runtime-only state on the auth value. Header creation stays synchronous after that because the task is initialized before callers receive the auth object. This PR also removes the old feature flag path. AgentIdentity is selected by explicit auth mode, not by a hidden flag or lazy mutation of ChatGPT auth records. Reference old stack: https://github.com/openai/codex/pull/17387/changes ## Design Decisions - AgentIdentity is a real auth enum variant because it can be the only credential in `auth.json`. - The process task is ephemeral runtime state. It is not serialized and is not stored in rollout/session data. - Account/user metadata needed by existing Codex backend checks lives on the AgentIdentity record for now. - `is_chatgpt_auth()` remains token-specific. - `uses_codex_backend()` is the broader predicate for ChatGPT-token auth and AgentIdentity auth. ## Stack 1. https://github.com/openai/codex/pull/18757: full revert 2. https://github.com/openai/codex/pull/18871: isolated Agent Identity crate 3. This PR: explicit AgentIdentity auth mode and startup task allocation 4. https://github.com/openai/codex/pull/18811: migrate Codex backend auth callsites through AuthProvider 5. https://github.com/openai/codex/pull/18904: accept AgentIdentity JWTs and load `CODEX_AGENT_IDENTITY` ## Testing Tests: targeted Rust checks, cargo-shear, Bazel lock check, and CI.
efrazer-oai ·
2026-04-21 22:33:24 -07:00 -
feat: add metric to track the number of turns with memory usage (#18662)
Add a metric `codex.turn.memory` to know if a turn used memories or not. This is not part of the other turn metrics as a label to limit cardinality
jif-oai ·
2026-04-20 14:31:22 +01:00 -
feat: Budget skill metadata and surface trimming as a warning (#18298)
Cap the model-visible skills section to a small share of the context window, with a fallback character budget, and keep only as many implicit skills as fit within that budget. Emit a non-fatal warning when enabled skills are omitted, and add a new app-server warning notification Record thread-start skill metrics for total enabled skills, kept skills, and whether truncation happened --------- Co-authored-by: Matthew Zeng <mzeng@openai.com> Co-authored-by: Codex <noreply@openai.com>
xl-openai ·
2026-04-17 18:11:47 -07:00 -
Stream apply_patch changes (#17862)
Adds new events for streaming apply_patch changes from responses api. This is to enable clients to show progress during file writes. Caveat: This does not work with apply_patch in function call mode, since that required adding streaming json parsing.
Akshay Nathan ·
2026-04-16 18:12:19 -07:00 -
Add OTEL metrics for hook runs (#18026)
# Why We already emit analytics for completed hook runs, but we don't have matching OTEL metrics to track hook volume and latency. # What - add `codex.hooks.run` and `codex.hooks.run.duration_ms` - tag both metrics with `hook_name`, `source`, and `status` - emit the metrics from the completed hook path Verified locally against a dummy OTLP collector --------- Co-authored-by: Codex <noreply@openai.com>
Abhinav ·
2026-04-16 21:30:38 +00:00 -
[codex] reduce module visibility (#16978)
## Summary - reduce public module visibility across Rust crates, preferring private or crate-private modules with explicit crate-root public exports - update external call sites and tests to use the intended public crate APIs instead of reaching through module trees - add the module visibility guideline to AGENTS.md ## Validation - `cargo check --workspace --all-targets --message-format=short` passed before the final fix/format pass - `just fix` completed successfully - `just fmt` completed successfully - `git diff --check` passed
pakrym-oai ·
2026-04-07 08:03:35 -07:00 -
Codex/windows bazel rust test coverage no rs (#16528)
# Why this PR exists This PR is trying to fix a coverage gap in the Windows Bazel Rust test lane. Before this change, the Windows `bazel test //...` job was nominally part of PR CI, but a non-trivial set of `//codex-rs/...` Rust test targets did not actually contribute test signal on Windows. In particular, targets such as `//codex-rs/core:core-unit-tests`, `//codex-rs/core:core-all-test`, and `//codex-rs/login:login-unit-tests` were incompatible during Bazel analysis on the Windows gnullvm platform, so they never reached test execution there. That is why the Cargo-powered Windows CI job could surface Windows-only failures that the Bazel-powered job did not report: Cargo was executing those tests, while Bazel was silently dropping them from the runnable target set. The main goal of this PR is to make the Windows Bazel test lane execute those Rust test targets instead of skipping them during analysis, while still preserving `windows-gnullvm` as the target configuration for the code under test. In other words: use an MSVC host/exec toolchain where Bazel helper binaries and build scripts need it, but continue compiling the actual crate targets with the Windows gnullvm cfgs that our current Bazel matrix is supposed to exercise. # Important scope note This branch intentionally removes the non-resource-loading `.rs` test and production-code changes from the earlier `codex/windows-bazel-rust-test-coverage` branch. The only Rust source changes kept here are runfiles/resource-loading fixes in TUI tests: - `codex-rs/tui/src/chatwidget/tests.rs` - `codex-rs/tui/tests/manager_dependency_regression.rs` That is deliberate. Since the corresponding tests already pass under Cargo, this PR is meant to test whether Bazel infrastructure/toolchain fixes alone are enough to get a healthy Windows Bazel test signal, without changing test behavior for Windows timing, shell output, or SQLite file-locking. # How this PR changes the Windows Bazel setup ## 1. Split Windows host/exec and target concerns in the Bazel test lane The core change is that the Windows Bazel test job now opts into an MSVC host platform for Bazel execution-time tools, but only for `bazel test`, not for the Bazel clippy build. Files: - `.github/workflows/bazel.yml` - `.github/scripts/run-bazel-ci.sh` - `MODULE.bazel` What changed: - `run-bazel-ci.sh` now accepts `--windows-msvc-host-platform`. - When that flag is present on Windows, the wrapper appends `--host_platform=//:local_windows_msvc` unless the caller already provided an explicit `--host_platform`. - `bazel.yml` passes that wrapper flag only for the Windows `bazel test //...` job. - The Bazel clippy job intentionally does **not** pass that flag, so clippy stays on the default Windows gnullvm host/exec path and continues linting against the target cfgs we care about. - `run-bazel-ci.sh` also now forwards `CODEX_JS_REPL_NODE_PATH` on Windows and normalizes the `node` executable path with `cygpath -w`, so tests that need Node resolve the runner's Node installation correctly under the Windows Bazel test environment. Why this helps: - The original incompatibility chain was mostly on the **exec/tool** side of the graph, not in the Rust test code itself. Moving host tools to MSVC lets Bazel resolve helper binaries and generators that were not viable on the gnullvm exec platform. - Keeping the target platform on gnullvm preserves cfg coverage for the crates under test, which is important because some Windows behavior differs between `msvc` and `gnullvm`. ## 2. Teach the repo's Bazel Rust macro about Windows link flags and integration-test knobs Files: - `defs.bzl` - `codex-rs/core/BUILD.bazel` - `codex-rs/otel/BUILD.bazel` - `codex-rs/tui/BUILD.bazel` What changed: - Replaced the old gnullvm-only linker flag block with `WINDOWS_RUSTC_LINK_FLAGS`, which now handles both Windows ABIs: - gnullvm gets `-C link-arg=-Wl,--stack,8388608` - MSVC gets `-C link-arg=/STACK:8388608`, `-C link-arg=/NODEFAULTLIB:libucrt.lib`, and `-C link-arg=ucrt.lib` - Threaded those Windows link flags into generated `rust_binary`, unit-test binaries, and integration-test binaries. - Extended `codex_rust_crate(...)` with: - `integration_test_args` - `integration_test_timeout` - Used those new knobs to: - mark `//codex-rs/core:core-all-test` as a long-running integration test - serialize `//codex-rs/otel:otel-all-test` with `--test-threads=1` - Added `src/**/*.rs` to `codex-rs/tui` test runfiles, because one regression test scans source files at runtime and Bazel does not expose source-tree directories unless they are declared as data. Why this helps: - Once host-side MSVC tools are available, we still need the generated Rust test binaries to link correctly on Windows. The MSVC-side stack/UCRT flags make those binaries behave more like their Cargo-built equivalents. - The integration-test macro knobs avoid hardcoding one-off test behavior in ad hoc BUILD rules and make the generated test targets more expressive where Bazel and Cargo have different runtime defaults. ## 3. Patch `rules_rs` / `rules_rust` so Windows MSVC exec-side Rust and build scripts are actually usable Files: - `MODULE.bazel` - `patches/rules_rs_windows_exec_linker.patch` - `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch` - `patches/rules_rust_windows_build_script_runner_paths.patch` - `patches/rules_rust_windows_exec_msvc_build_script_env.patch` - `patches/rules_rust_windows_msvc_direct_link_args.patch` - `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch` - `patches/BUILD.bazel` What these patches do: - `rules_rs_windows_exec_linker.patch` - Adds a `rust-lld` filegroup for Windows Rust toolchain repos, symlinked to `lld-link.exe` from `PATH`. - Marks Windows toolchains as using a direct linker driver. - Supplies Windows stdlib link flags for both gnullvm and MSVC. - `rules_rust_windows_bootstrap_process_wrapper_linker.patch` - For Windows MSVC Rust targets, prefers the Rust toolchain linker over an inherited C++ linker path like `clang++`. - This specifically avoids the broken mixed-mode command line where rustc emits MSVC-style `/NOLOGO` / `/LIBPATH:` / `/OUT:` arguments but Bazel still invokes `clang++.exe`. - `rules_rust_windows_build_script_runner_paths.patch` - Normalizes forward-slash execroot-relative paths into Windows path separators before joining them on Windows. - Uses short Windows paths for `RUSTC`, `OUT_DIR`, and the build-script working directory to avoid path-length and quoting issues in third-party build scripts. - Exposes `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER=1` to build scripts so crate-local patches can detect "this is running under Bazel's build-script runner". - Fixes the Windows runfiles cleanup filter so generated files with retained suffixes are actually retained. - `rules_rust_windows_exec_msvc_build_script_env.patch` - For exec-side Windows MSVC build scripts, stops force-injecting Bazel's `CC`, `CXX`, `LD`, `CFLAGS`, and `CXXFLAGS` when that would send GNU-flavored tool paths/flags into MSVC-oriented Cargo build scripts. - Rewrites or strips GNU-only `--sysroot`, MinGW include/library paths, stack-protector, and `_FORTIFY_SOURCE` flags on the MSVC exec path. - The practical effect is that build scripts can fall back to the Visual Studio toolchain environment already exported by CI instead of crashing inside Bazel's hermetic `clang.exe` setup. - `rules_rust_windows_msvc_direct_link_args.patch` - When using a direct linker on Windows, stops forwarding GNU driver flags such as `-L...` and `--sysroot=...` that `lld-link.exe` does not understand. - Passes non-`.lib` native artifacts as explicit `-Clink-arg=<path>` entries when needed. - Filters C++ runtime libraries to `.lib` artifacts on the Windows direct-driver path. - `rules_rust_windows_process_wrapper_skip_temp_outputs.patch` - Excludes transient `*.tmp*` and `*.rcgu.o` files from process-wrapper dependency search-path consolidation, so unstable compiler outputs do not get treated as real link search-path inputs. Why this helps: - The host-platform split alone was not enough. Once Bazel started analyzing/running previously incompatible Rust tests on Windows, the next failures were in toolchain plumbing: - MSVC-targeted Rust tests were being linked through `clang++` with MSVC-style arguments. - Cargo build scripts running under Bazel's Windows MSVC exec platform were handed Unix/GNU-flavored path and flag shapes. - Some generated paths were too long or had path-separator forms that third-party Windows build scripts did not tolerate. - These patches make that mixed Bazel/Cargo/Rust/MSVC path workable enough for the test lane to actually build and run the affected crates. ## 4. Patch third-party crate build scripts that were not robust under Bazel's Windows MSVC build-script path Files: - `MODULE.bazel` - `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch` - `patches/ring_windows_msvc_include_dirs.patch` - `patches/zstd-sys_windows_msvc_include_dirs.patch` What changed: - `aws-lc-sys` - Detects Bazel's Windows MSVC build-script runner via `RULES_RUST_BAZEL_BUILD_SCRIPT_RUNNER` or a `bazel-out` manifest-dir path. - Uses `clang-cl` for Bazel Windows MSVC builds when no explicit `CC`/`CXX` is set. - Allows prebuilt NASM on the Bazel Windows MSVC path even when `nasm` is not available directly in the runner environment. - Avoids canonicalizing `CARGO_MANIFEST_DIR` in the Bazel Windows MSVC case, because that path may point into Bazel output/runfiles state where preserving the given path is more reliable than forcing a local filesystem canonicalization. - `ring` - Under the Bazel Windows MSVC build-script runner, copies the pregenerated source tree into `OUT_DIR` and uses that as the generated-source root. - Adds include paths needed by MSVC compilation for Fiat/curve25519/P-256 generated headers. - Rewrites a few relative includes in C sources so the added include directories are sufficient. - `zstd-sys` - Adds MSVC-only include directories for `compress`, `decompress`, and feature-gated dictionary/legacy/seekable sources. - Skips `-fvisibility=hidden` on MSVC targets, where that GCC/Clang-style flag is not the right mechanism. Why this helps: - After the `rules_rust` plumbing started running build scripts on the Windows MSVC exec path, some third-party crates still failed for crate-local reasons: wrong compiler choice, missing include directories, build-script assumptions about manifest paths, or Unix-only C compiler flags. - These crate patches address those crate-local assumptions so the larger toolchain change can actually reach first-party Rust test execution. ## 5. Keep the only `.rs` test changes to Bazel/Cargo runfiles parity Files: - `codex-rs/tui/src/chatwidget/tests.rs` - `codex-rs/tui/tests/manager_dependency_regression.rs` What changed: - Instead of asking `find_resource!` for a directory runfile like `src/chatwidget/snapshots` or `src`, these tests now resolve one known file runfile first and then walk to its parent directory. Why this helps: - Bazel runfiles are more reliable for explicitly declared files than for source-tree directories that happen to exist in a Cargo checkout. - This keeps the tests working under both Cargo and Bazel without changing their actual assertions. # What we tried before landing on this shape, and why those attempts did not work ## Attempt 1: Force `--host_platform=//:local_windows_msvc` for all Windows Bazel jobs This did make the previously incompatible test targets show up during analysis, but it also pushed the Bazel clippy job and some unrelated build actions onto the MSVC exec path. Why that was bad: - Windows clippy started running third-party Cargo build scripts with Bazel's MSVC exec settings and crashed in crates such as `tree-sitter` and `libsqlite3-sys`. - That was a regression in a job that was previously giving useful gnullvm-targeted lint signal. What this PR does instead: - The wrapper flag is opt-in, and `bazel.yml` uses it only for the Windows `bazel test` lane. - The clippy lane stays on the default Windows gnullvm host/exec configuration. ## Attempt 2: Broaden the `rules_rust` linker override to all Windows Rust actions This fixed the MSVC test-lane failure where normal `rust_test` targets were linked through `clang++` with MSVC-style arguments, but it broke the default gnullvm path. Why that was bad: - `@@rules_rs++rules_rust+rules_rust//util/process_wrapper:process_wrapper` on the gnullvm exec platform started linking with `lld-link.exe` and then failed to resolve MinGW-style libraries such as `-lkernel32`, `-luser32`, and `-lmingw32`. What this PR does instead: - The linker override is restricted to Windows MSVC targets only. - The gnullvm path keeps its original linker behavior, while MSVC uses the direct Windows linker. ## Attempt 3: Keep everything on pure Windows gnullvm and patch the V8 / Python incompatibility chain instead This would have preserved a single Windows ABI everywhere, but it is a much larger project than this PR. Why that was not the practical first step: - The original incompatibility chain ran through exec-side generators and helper tools, not only through crate code. - `third_party/v8` is already special-cased on Windows gnullvm because `rusty_v8` only publishes Windows prebuilts under MSVC names. - Fixing that path likely means deeper changes in V8/rules_python/rules_rust toolchain resolution and generator execution, not just one local CI flag. What this PR does instead: - Keep gnullvm for the target cfgs we want to exercise. - Move only the Windows test lane's host/exec platform to MSVC, then patch the build-script/linker boundary enough for that split configuration to work. ## Attempt 4: Validate compatibility with `bazel test --nobuild ...` This turned out to be a misleading local validation command. Why: - `bazel test --nobuild ...` can successfully analyze targets and then still exit 1 with "Couldn't start the build. Unable to run tests" because there are no runnable test actions after `--nobuild`. Better local check: ```powershell bazel build --nobuild --keep_going --host_platform=//:local_windows_msvc //codex-rs/login:login-unit-tests //codex-rs/core:core-unit-tests //codex-rs/core:core-all-test ``` # Which patches probably deserve upstream follow-up My rough take is that the `rules_rs` / `rules_rust` patches are the highest-value upstream candidates, because they are fixing generic Windows host/exec + MSVC direct-linker behavior rather than Codex-specific test logic. Strong upstream candidates: - `patches/rules_rs_windows_exec_linker.patch` - `patches/rules_rust_windows_bootstrap_process_wrapper_linker.patch` - `patches/rules_rust_windows_build_script_runner_paths.patch` - `patches/rules_rust_windows_exec_msvc_build_script_env.patch` - `patches/rules_rust_windows_msvc_direct_link_args.patch` - `patches/rules_rust_windows_process_wrapper_skip_temp_outputs.patch` Why these seem upstreamable: - They address general-purpose problems in the Windows MSVC exec path: - missing direct-linker exposure for Rust toolchains - wrong linker selection when rustc emits MSVC-style args - Windows path normalization/short-path issues in the build-script runner - forwarding GNU-flavored CC/link flags into MSVC Cargo build scripts - unstable temp outputs polluting process-wrapper search-path state Potentially upstreamable crate patches, but likely with more care: - `patches/zstd-sys_windows_msvc_include_dirs.patch` - `patches/ring_windows_msvc_include_dirs.patch` - `patches/aws-lc-sys_windows_msvc_prebuilt_nasm.patch` Notes on those: - The `zstd-sys` and `ring` include-path fixes look fairly generic for MSVC/Bazel build-script environments and may be straightforward to propose upstream after we confirm CI stability. - The `aws-lc-sys` patch is useful, but it includes a Bazel-specific environment probe and CI-specific compiler fallback behavior. That probably needs a cleaner upstream-facing shape before sending it out, so upstream maintainers are not forced to adopt Codex's exact CI assumptions. Probably not worth upstreaming as-is: - The repo-local Starlark/test target changes in `defs.bzl`, `codex-rs/*/BUILD.bazel`, and `.github/scripts/run-bazel-ci.sh` are mostly Codex-specific policy and CI wiring, not generic rules changes. # Validation notes for reviewers On this branch, I ran the following local checks after dropping the non-resource-loading Rust edits: ```powershell cargo test -p codex-tui just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc -- fix -p codex-tui python .\tools\argument-comment-lint\run-prebuilt-linter.py -p codex-tui just --shell 'C:\Program Files\Git\bin\bash.exe' --shell-arg -lc fmt ``` One local caveat: - `just argument-comment-lint` still fails on this Windows machine for an unrelated Bazel toolchain-resolution issue in `//codex-rs/exec:exec-all-test`, so I used the direct prebuilt linter for `codex-tui` as the local fallback. # Expected reviewer takeaway If this PR goes green, the important conclusion is that the Windows Bazel test coverage gap was primarily a Bazel host/exec toolchain problem, not a need to make the Rust tests themselves Windows-specific. That would be a strong signal that the deleted non-resource-loading Rust test edits from the earlier branch should stay out, and that future work should focus on upstreaming the generic `rules_rs` / `rules_rust` Windows fixes and reducing the crate-local patch surface.
Michael Bolin ·
2026-04-03 15:34:03 -07:00 -
otel: remove the last workspace crate feature (#16469)
## Why `codex-otel` still carried `disable-default-metrics-exporter`, which was the last remaining workspace crate feature. We are removing workspace crate features because they do not fit our current build model well: - our Bazel setup does not honor crate features today, which can let feature-gated issues go unnoticed - they create extra crate build permutations that we want to avoid For this case, the feature was only being used to keep the built-in Statsig metrics exporter off in test and debug-oriented contexts. This repo already treats `debug_assertions` as the practical proxy for that class of behavior, so OTEL should follow the same convention instead of keeping a dedicated crate feature alive. ## What changed - removed `disable-default-metrics-exporter` from `codex-rs/otel/Cargo.toml` - removed the `codex-otel` dev-dependency feature activation from `codex-rs/core/Cargo.toml` - changed `codex-rs/otel/src/config.rs` so the built-in `OtelExporter::Statsig` default resolves to `None` when `debug_assertions` is enabled, with a focused unit test covering that behavior - removed the final feature exceptions from `.github/scripts/verify_cargo_workspace_manifests.py`, so workspace crate features are now hard-banned instead of temporarily allowlisted - expanded the verifier error message to explain the Bazel mismatch and build-permutation cost behind that policy ## How tested - `python3 .github/scripts/verify_cargo_workspace_manifests.py` - `cargo test -p codex-otel` - `cargo test -p codex-core metrics_exporter_defaults_to_statsig_when_missing` - `cargo test -p codex-app-server app_server_default_analytics_` - `just bazel-lock-check`
Michael Bolin ·
2026-04-01 13:45:23 -07:00 -
chore: clean up argument-comment lint and roll out all-target CI on macOS (#16054)
## Why `argument-comment-lint` was green in CI even though the repo still had many uncommented literal arguments. The main gap was target coverage: the repo wrapper did not force Cargo to inspect test-only call sites, so examples like the `latest_session_lookup_params(true, ...)` tests in `codex-rs/tui_app_server/src/lib.rs` never entered the blocking CI path. This change cleans up the existing backlog, makes the default repo lint path cover all Cargo targets, and starts rolling that stricter CI enforcement out on the platform where it is currently validated. ## What changed - mechanically fixed existing `argument-comment-lint` violations across the `codex-rs` workspace, including tests, examples, and benches - updated `tools/argument-comment-lint/run-prebuilt-linter.sh` and `tools/argument-comment-lint/run.sh` so non-`--fix` runs default to `--all-targets` unless the caller explicitly narrows the target set - fixed both wrappers so forwarded cargo arguments after `--` are preserved with a single separator - documented the new default behavior in `tools/argument-comment-lint/README.md` - updated `rust-ci` so the macOS lint lane keeps the plain wrapper invocation and therefore enforces `--all-targets`, while Linux and Windows temporarily pass `-- --lib --bins` That temporary CI split keeps the stricter all-targets check where it is already cleaned up, while leaving room to finish the remaining Linux- and Windows-specific target-gated cleanup before enabling `--all-targets` on those runners. The Linux and Windows failures on the intermediate revision were caused by the wrapper forwarding bug, not by additional lint findings in those lanes. ## Validation - `bash -n tools/argument-comment-lint/run.sh` - `bash -n tools/argument-comment-lint/run-prebuilt-linter.sh` - shell-level wrapper forwarding check for `-- --lib --bins` - shell-level wrapper forwarding check for `-- --tests` - `just argument-comment-lint` - `cargo test` in `tools/argument-comment-lint` - `cargo test -p codex-terminal-detection` ## Follow-up - Clean up remaining Linux-only target-gated callsites, then switch the Linux lint lane back to the plain wrapper invocation. - Clean up remaining Windows-only target-gated callsites, then switch the Windows lint lane back to the plain wrapper invocation.
Michael Bolin ·
2026-03-27 19:00:44 -07:00