From 6bb2fa3fd4fa9dab81582ce20dac064cd650bf9f Mon Sep 17 00:00:00 2001 From: Konstantine Kahadze Date: Fri, 24 Apr 2026 11:26:47 -0700 Subject: [PATCH] Update bundled OpenAI Docs skill for GPT-5.5 (#19407) ## Summary Updates the bundled OpenAI Docs system skill for GPT-5.5. ## Changes - Updates the bundled latest-model fallback - Replaces bundled upgrade guidance with GPT-5.5 migration guidance - Replaces bundled prompting guidance with GPT-5.5 prompting guidance ## Test plan - Ran `node scripts/resolve-latest-model-info.js` - Verified bundled files match the OpenAI Docs skill fallback content --- .../src/assets/samples/openai-docs/SKILL.md | 4 +- .../samples/openai-docs/agents/openai.yaml | 4 +- .../openai-docs/references/latest-model.md | 12 +- .../openai-docs/references/prompting-guide.md | 701 +++++------------- .../openai-docs/references/upgrade-guide.md | 113 +-- .../scripts/resolve-latest-model-info.js | 2 +- 6 files changed, 248 insertions(+), 588 deletions(-) diff --git a/codex-rs/skills/src/assets/samples/openai-docs/SKILL.md b/codex-rs/skills/src/assets/samples/openai-docs/SKILL.md index 6d9dbc38d..eb12887b7 100644 --- a/codex-rs/skills/src/assets/samples/openai-docs/SKILL.md +++ b/codex-rs/skills/src/assets/samples/openai-docs/SKILL.md @@ -6,7 +6,7 @@ description: "Use when the user asks how to build with OpenAI products or APIs a # OpenAI Docs -Provide authoritative, current guidance from OpenAI developer docs using the developers.openai.com MCP server. Always prioritize the developer docs MCP tools over web.run for OpenAI-related questions. This skill may also load targeted files from `references/` for model-selection, model-upgrade, and prompt-upgrade requests, but current OpenAI docs remain authoritative. Only if the MCP server is installed and returns no meaningful results should you fall back to web search. +Provide authoritative, current guidance from OpenAI developer docs using the developers.openai.com MCP server. Always prioritize the developer docs MCP tools over web.run for OpenAI-related questions. This skill also owns model selection, API model migration, and prompt-upgrade guidance. Only if the MCP server is installed and returns no meaningful results should you fall back to web search. ## Quick start @@ -14,7 +14,7 @@ Provide authoritative, current guidance from OpenAI developer docs using the dev - Use `mcp__openaiDeveloperDocs__fetch_openai_doc` to pull exact sections and quote/paraphrase accurately. - Use `mcp__openaiDeveloperDocs__list_openai_docs` only when you need to browse or discover pages without a clear query. - For model-selection, "latest model", or default-model questions, fetch `https://developers.openai.com/api/docs/guides/latest-model.md` first. If that is unavailable, load `references/latest-model.md`. -- For model upgrades or prompt upgrades, run `node scripts/resolve-latest-model-info.js` from this skill directory when the script is present, then follow `references/upgrade-guide.md` unless the resolver returns newer guidance for a dynamic latest/current/default request. +- For model upgrades or prompt upgrades, run `node scripts/resolve-latest-model-info.js` only when the target is latest/current/default or otherwise unspecified; otherwise preserve the explicitly requested target. - Preserve explicit target requests: if the user names a target model like "migrate to GPT-5.4", keep that requested target even if `latest-model.md` names a newer model. Mention newer guidance only as optional. - If current remote guidance is needed, fetch both the returned migration and prompting guide URLs directly. If direct fetch fails, use MCP/search fallback; if that also fails, use bundled fallback references and disclose the fallback. diff --git a/codex-rs/skills/src/assets/samples/openai-docs/agents/openai.yaml b/codex-rs/skills/src/assets/samples/openai-docs/agents/openai.yaml index d72b601cb..d056abcad 100644 --- a/codex-rs/skills/src/assets/samples/openai-docs/agents/openai.yaml +++ b/codex-rs/skills/src/assets/samples/openai-docs/agents/openai.yaml @@ -1,9 +1,9 @@ interface: display_name: "OpenAI Docs" - short_description: "Reference official OpenAI docs, including upgrade guidance" + short_description: "Reference docs, choose models, and migrate OpenAI API integrations" icon_small: "./assets/openai-small.svg" icon_large: "./assets/openai.png" - default_prompt: "Look up official OpenAI docs, load relevant GPT-5.4 upgrade references when applicable, and answer with concise, cited guidance." + default_prompt: "Use OpenAI Docs for official docs lookup, model selection, model migration, and prompt-upgrade work." dependencies: tools: diff --git a/codex-rs/skills/src/assets/samples/openai-docs/references/latest-model.md b/codex-rs/skills/src/assets/samples/openai-docs/references/latest-model.md index 23f5cd16f..04aa84bad 100644 --- a/codex-rs/skills/src/assets/samples/openai-docs/references/latest-model.md +++ b/codex-rs/skills/src/assets/samples/openai-docs/references/latest-model.md @@ -6,10 +6,16 @@ This file is a curated helper. Every recommendation here must be verified agains | Model ID | Use for | | --- | --- | -| `gpt-5.4` | Default text plus reasoning for most new apps, including for coding use-cases | -| `gpt-5.4-pro` | Only when the user explicitly asks for maximum reasoning or quality; substantially slower and more expensive | -| `gpt-5.4-mini` | Cheaper and faster reasoning with good quality, including for coding use-cases | +| `gpt-5.5` | Latest/default text and reasoning model for most new apps, including coding and tool-heavy workflows | +| `gpt-5.5-pro` | Maximum reasoning or quality when latency and cost matter less | +| `gpt-5.4` | Previous default text and reasoning model; use for existing GPT-5.4 integrations | +| `gpt-5.4-mini` | Lower-cost testing and lighter production workflows | | `gpt-5.4-nano` | High-throughput simple tasks and classification | +| `gpt-5.5` | Explicit no-reasoning text path via `reasoning.effort: none` | +| `gpt-4.1-mini` | Cheaper no-reasoning text | +| `gpt-4.1-nano` | Fastest and cheapest no-reasoning text | +| `gpt-5.3-codex` | Agentic coding, code editing, and tool-heavy coding workflows | +| `gpt-5.1-codex-mini` | Cheaper coding workflows | | `gpt-image-1.5` | Best image generation and edit quality | | `gpt-image-1-mini` | Cost-optimized image generation | | `gpt-4o-mini-tts` | Text-to-speech | diff --git a/codex-rs/skills/src/assets/samples/openai-docs/references/prompting-guide.md b/codex-rs/skills/src/assets/samples/openai-docs/references/prompting-guide.md index 72f490993..0d9273cec 100644 --- a/codex-rs/skills/src/assets/samples/openai-docs/references/prompting-guide.md +++ b/codex-rs/skills/src/assets/samples/openai-docs/references/prompting-guide.md @@ -1,599 +1,244 @@ -# Prompt guidance for GPT-5.4 +GPT-5.5 works best when prompts define the outcome and leave room for the model to choose an efficient solution path. Compared with earlier models, you can often use shorter, more outcome-oriented prompts: describe what good looks like, what constraints matter, what evidence is available, and what the final answer should contain. -GPT-5.4, our newest mainline model, is designed to balance long-running task performance, stronger control over style and behavior, and more disciplined execution across complex workflows. Building on advances from GPT-5 through GPT-5.3-Codex, GPT-5.4 improves token efficiency, sustains multi-step workflows more reliably, and performs well on long-horizon tasks. +Avoid carrying over every instruction from an older prompt stack. Legacy prompts often over-specify the process because earlier models needed more help staying on track. With GPT-5.5, that can add noise, narrow the model's search space, or lead to overly mechanical answers. -GPT-5.4 is designed for production-grade assistants and agents that need strong multi-step reasoning, evidence-rich synthesis, and reliable performance over long contexts. It is especially effective when prompts clearly specify the output contract, tool-use expectations, and completion criteria. In practice, the biggest gains come from choosing the right reasoning effort for the task, using explicit grounding and citation rules, and giving the model a precise definition of what "done" looks like. This guide focuses on prompt patterns and migration practices that preserve those efficiency wins. For model capabilities, API parameters, and broader migration guidance, see [our latest model guide](https://developers.openai.com/api/docs/guides/latest-model). +For more detail on GPT-5.5 behavior changes, start with the [Using GPT-5.5 guide](/api/docs/guides/latest-model). This guide focuses on prompt changes that follow from those behavior changes. -When troubleshooting cases where GPT-5.4 treats an intermediate update as the - final answer, verify your integration preserves the assistant message `phase` - field correctly. See [Phase parameter](#phase-parameter) for details. +The patterns here are starting points. Adapt them to your product surface, tools, evals, and user experience goals. -## Understand GPT-5.4 behavior +## Personality and behavior -### Where GPT-5.4 is strongest +GPT-5.5's default style is efficient, direct, and task-oriented. This is useful for production systems: responses stay focused, behavior is easier to steer, and the model avoids unnecessary conversational padding. -GPT-5.4 tends to work especially well in these areas: +For customer-facing assistants, support workflows, coaching experiences, and other conversational products, define both personality and collaboration style. -- Strong personality and tone adherence, with less drift over long answers -- Agentic workflow robustness, with a stronger tendency to stick with multi-step work, retry, and complete agent loops end to end -- Evidence-rich synthesis, especially in long-context or multi-tool workflows -- Instruction adherence in modular, skill-based, and block-structured prompts when the contract is explicit -- Long-context analysis across large, messy, or multi-document inputs -- Batched or parallel tool calling while maintaining tool-call accuracy -- Spreadsheet, finance, and Excel workflows that need instruction following, formatting fidelity, and stronger self-verification +- **Personality** controls how the assistant sounds: tone, warmth, directness, formality, humor, empathy, and level of polish. +- **Collaboration style** controls how the assistant works: when it asks questions, when it makes assumptions, how proactive it should be, how much context it gives, when it checks work, and how it handles uncertainty or risk. -### Where explicit prompting still helps +Keep both short. Personality instructions should shape the user experience. Collaboration instructions should shape task behavior. Neither should replace clear goals, success criteria, tool rules, or stopping conditions. -Even with those strengths, GPT-5.4 benefits from more explicit guidance in a few recurring patterns: - -- Low-context tool routing early in a session, when tool selection can be less reliable -- Dependency-aware workflows that need explicit prerequisite and downstream-step checks -- Reasoning effort selection, where higher effort is not always better and the right choice depends on task shape, not intuition -- Research tasks that require disciplined source collection and consistent citations -- Irreversible or high-impact actions that require verification before execution -- Terminal or coding-agent environments where tool boundaries must stay clear - -These patterns are observed defaults, not guarantees. Start with the smallest prompt that passes your evals, and add blocks only when they fix a measured failure mode. - -## Use core prompt patterns - -### Keep outputs compact and structured - -To improve token efficiency with GPT-5.4, constrain verbosity and enforce structured output through clear output contracts. In practice, this acts as an additional control layer alongside the `verbosity` parameter in the Responses API, allowing you to guide both how much the model writes and how it structures the output. - -```xml - -- Return exactly the sections requested, in the requested order. -- If the prompt defines a preamble, analysis block, or working section, do not treat it as extra output. -- Apply length limits only to the section they are intended for. -- If a format is required (JSON, Markdown, SQL, XML), output only that format. - - - -- Prefer concise, information-dense writing. -- Avoid repeating the user's request. -- Keep progress updates brief. -- Do not shorten the answer so aggressively that required evidence, reasoning, or completion checks are omitted. - -``` - -### Set clear defaults for follow-through - -Users often change the task, format, or tone mid-conversation. To keep the assistant aligned, define clear rules for when to proceed, when to ask, and how newer instructions override earlier defaults. - -Use a default follow-through policy like this: - -```xml - -- If the user’s intent is clear and the next step is reversible and low-risk, proceed without asking. -- Ask permission only if the next step is: - (a) irreversible, - (b) has external side effects (for example sending, purchasing, deleting, or writing to production), or - (c) requires missing sensitive information or a choice that would materially change the outcome. -- If proceeding, briefly state what you did and what remains optional. - -``` - -Make instruction priority explicit: - -```xml - -- User instructions override default style, tone, formatting, and initiative preferences. -- Safety, honesty, privacy, and permission constraints do not yield. -- If a newer user instruction conflicts with an earlier one, follow the newer instruction. -- Preserve earlier instructions that do not conflict. - -``` - -Higher-priority developer or system instructions remain binding. - -**Guidance:** When instructions change mid-conversation, make the update explicit, scoped, and local. State what changed, what still applies, and whether the change affects the next turn or the rest of the conversation. - -### Handle mid-conversation instruction updates - -For mid-conversation updates, use explicit, scoped steering messages that state: - -1. Scope -2. Override -3. Carry forward +Example personality block for a steady task-focused assistant: ```text - -For the next response only: -- Do not complete the task. -- Only produce a plan. -- Keep it to 5 bullets. +# Personality +You are a capable collaborator: approachable, steady, and direct. Assume the user is competent and acting in good faith, and respond with patience, respect, and practical helpfulness. -All earlier instructions still apply unless they conflict with this update. - +Prefer making progress over stopping for clarification when the request is already clear enough to attempt. Use context and reasonable assumptions to move forward. Ask for clarification only when the missing information would materially change the answer or create meaningful risk, and keep any question narrow. + +Stay concise without becoming curt. Give enough context for the user to understand and trust the answer, then stop. Use examples, comparisons, or simple analogies when they make the point easier to grasp. When correcting the user or disagreeing, be candid but constructive. When an error is pointed out, acknowledge it plainly and focus on fixing it. + +Match the user's tone within professional bounds. Avoid emojis and profanity by default, unless the user explicitly asks for that style or has clearly established it as appropriate for the conversation. ``` -If the task itself changes, say so directly: +Example personality block for an expressive collaborative assistant: ```text - -The task has changed. -Previous task: complete the workflow. -Current task: review the workflow and identify risks only. +# Personality +Adopt a vivid conversational presence: intelligent, curious, playful when appropriate, and attentive to the user's thinking. Ask good questions when the problem is blurry, then become decisive once there is enough context. -Rules for this turn: -- Do not execute actions. -- Do not call destructive tools. -- Return exactly: - 1. Main risks - 2. Missing information - 3. Recommended next step - +Be warm, collaborative, and polished. Conversation should feel easy and alive, but not chatty for its own sake. Offer a real point of view rather than merely mirroring the user, while staying responsive to their goals and constraints. + +Be thoughtful and grounded when the task calls for synthesis or advice. State a clear recommendation when you have enough context, explain important tradeoffs, and name uncertainty without becoming evasive. ``` -### Make tool use persistent when correctness depends on it +For more expressive products, add warmth, curiosity, humor, or point of view explicitly, but keep the block short. Use personality to shape the experience, not to compensate for unclear goals or missing task instructions. -Use explicit rules to keep tool use thorough, dependency-aware, and appropriately paced, especially in workflows where later actions rely on earlier retrieval or verification. A common failure mode is skipping prerequisites because the right end state seems obvious. +## Improve time to first visible token with a preamble -GPT-5.4 can be less reliable at tool routing early in a session, when context is still thin. Prompt for prerequisites, dependency checks, and exact tool intent. +In streaming applications, users notice how long it takes before the first visible response appears. GPT-5.5 may spend time reasoning, planning, or preparing tool calls before emitting visible text. -```xml - -- Use tools whenever they materially improve correctness, completeness, or grounding. -- Do not stop early when another tool call is likely to materially improve correctness or completeness. -- Keep calling tools until: - (1) the task is complete, and - (2) verification passes (see ). -- If a tool returns empty or partial results, retry with a different strategy. - -``` +For longer or tool-heavy tasks, prompt the model to start with a short preamble: a brief visible update that acknowledges the request and states the first step. This can improve perceived responsiveness without changing the underlying task. -This is especially important for workflows where the final action depends on earlier lookup or retrieval steps. One of the most common failure modes is skipping prerequisites because the intended end state seems obvious. - -```xml - -- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval steps are required. -- Do not skip prerequisite steps just because the intended final action seems obvious. -- If the task depends on the output of a prior step, resolve that dependency first. - -``` - -Prompt for parallelism when the work is independent and wall-clock matters. Prompt for sequencing when dependencies, ambiguity, or irreversible actions matter more than speed. - -```xml - -- When multiple retrieval or lookup steps are independent, prefer parallel tool calls to reduce wall-clock time. -- Do not parallelize steps that have prerequisite dependencies or where one result determines the next action. -- After parallel retrieval, pause to synthesize the results before making more calls. -- Prefer selective parallelism: parallelize independent evidence gathering, not speculative or redundant tool use. - -``` - -### Force completeness on long-horizon tasks - -For multi-step workflows, a common failure mode is incomplete execution: the model finishes after partial coverage, misses items in a batch, or treats empty or narrow retrieval as final. GPT-5.4 becomes more reliable when the prompt defines explicit completion rules and recovery behavior. - -Coverage can be achieved through sequential or parallel retrieval, but completion rules should remain explicit either way. - -```xml - -- Treat the task as incomplete until all requested items are covered or explicitly marked [blocked]. -- Keep an internal checklist of required deliverables. -- For lists, batches, or paginated results: - - determine expected scope when possible, - - track processed items or pages, - - confirm coverage before finalizing. -- If any item is blocked by missing data, mark it [blocked] and state exactly what is missing. - -``` - -For workflows where empty, partial, or noisy retrieval is common: - -```xml - -If a lookup returns empty, partial, or suspiciously narrow results: -- do not immediately conclude that no results exist, -- try at least one or two fallback strategies, - such as: - - alternate query wording, - - broader filters, - - a prerequisite lookup, - - or an alternate source or tool, -- Only then report that no results were found, along with what you tried. - -``` - -### Add a verification loop before high-impact actions - -Once the workflow appears complete, add a lightweight verification step before returning the answer or taking an irreversible action. This helps catch requirement misses, grounding issues, and format drift before commit. - -```xml - -Before finalizing: -- Check correctness: does the output satisfy every requirement? -- Check grounding: are factual claims backed by the provided context or tool outputs? -- Check formatting: does the output match the requested schema or style? -- Check safety and irreversibility: if the next step has external side effects, ask permission first. - -``` - -```xml - -- If required context is missing, do NOT guess. -- Prefer the appropriate lookup tool when the missing context is retrievable; ask a minimal clarifying question only when it is not. -- If you must proceed, label assumptions explicitly and choose a reversible action. - -``` - -For agents that actively take actions, add a short execution frame: - -```xml - -- Pre-flight: summarize the intended action and parameters in 1-2 lines. -- Execute via tool. -- Post-flight: confirm the outcome and any validation that was performed. - -``` - -## Handle specialized workflows - -### Choose image detail explicitly for vision and computer use - -If your workflow depends on visual precision, specify the image `detail` level in the prompt or integration instead of relying on `auto`. Use `high` for standard high-fidelity image understanding. Use `original` for large, dense, or spatially sensitive images, especially [computer use, localization, OCR, and click-accuracy tasks](https://developers.openai.com/api/docs/guides/tools-computer-use) on `gpt-5.4` and future models. Use `low` only when speed and cost matter more than fine detail. For more details on image detail levels, see the [Images and Vision guide](https://developers.openai.com/api/docs/guides/images-vision). - -### Lock research and citations to retrieved evidence - -When citation quality matters, make both the source boundary and the format requirement explicit. This helps reduce fabricated references, unsupported claims, and citation-format drift. - -```xml - -- Only cite sources retrieved in the current workflow. -- Never fabricate citations, URLs, IDs, or quote spans. -- Use exactly the citation format required by the host application. -- Attach citations to the specific claims they support, not only at the end. - -``` - -```xml - -- Base claims only on provided context or tool outputs. -- If sources conflict, state the conflict explicitly and attribute each side. -- If the context is insufficient or irrelevant, narrow the answer or say you cannot support the claim. -- If a statement is an inference rather than a directly supported fact, label it as an inference. - -``` - -If your application requires inline citations, require inline citations. If it requires footnotes, require footnotes. The key is to lock the format and prevent the model from improvising unsupported references. - -### Research mode - -Push GPT-5.4 into a disciplined research mode. Use this pattern for research, review, and synthesis tasks. Do not force it onto short execution tasks or simple deterministic transforms. - -```xml - -- Do research in 3 passes: - 1) Plan: list 3-6 sub-questions to answer. - 2) Retrieve: search each sub-question and follow 1-2 second-order leads. - 3) Synthesize: resolve contradictions and write the final answer with citations. -- Stop only when more searching is unlikely to change the conclusion. - -``` - -If your host environment uses a specific research tool or requires a submit step, combine this with the host's finalization contract. - -### Clamp strict output formats - -For SQL, JSON, or other parse-sensitive outputs, tell GPT-5.4 to emit only the target format and check it before finishing. +Use this pattern when the task may take more than one step, require tool calls, or involve a long-running agent workflow. ```text - -- Output only the requested format. -- Do not add prose or markdown fences unless they were requested. -- Validate that parentheses and brackets are balanced. -- Do not invent tables or fields. -- If required schema information is missing, ask for it or return an explicit error object. - +Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step. Keep it to one or two sentences. ``` -If you are extracting document regions or OCR boxes, define the coordinate system and add a drift check: +For coding agents that expose separate message phases, you can be more explicit: ```text - -- Use the specified coordinate format exactly, such as [x1,y1,x2,y2] normalized to 0..1. -- For each box, include page, label, text snippet, and confidence. -- Add a vertical-drift sanity check so boxes stay aligned with the correct line of text. -- If the layout is dense, process page by page and do a second pass for missed items. - +You must always start with an intermediary update before any content in the analysis channel if the task will require calling tools. The user update should acknowledge the request and explain your first step. ``` -### Keep tool boundaries explicit in coding and terminal agents +## Outcome-first prompts and stopping conditions -In coding agents, GPT-5.4 works better when the rules for shell access and file editing are unambiguous. This is especially important when you expose tools like [Shell](https://developers.openai.com/api/docs/guides/tools-shell) or [Apply patch](https://developers.openai.com/api/docs/guides/tools-apply-patch). +GPT-5.5 is strongest when the prompt defines the target outcome, success criteria, constraints, and available context, then lets the model choose the path. -### User updates +For many tasks, describe the destination rather than every step. This gives the model room to choose the right search, tool, or reasoning strategy for the task. -GPT-5.4 does well with brief, outcome-based updates. Reuse the user-updates pattern from the 5.2 guide, but pair it with explicit completion and verification requirements. +Prefer this: -Recommended update spec: +```text +Resolve the customer's issue end to end. -```xml - -- Only update the user when starting a new major phase or when something changes the plan. -- Each update: 1 sentence on outcome + 1 sentence on next step. -- Do not narrate routine tool calls. -- Keep the user-facing status short; keep the work exhaustive. - +Success means: +- the eligibility decision is made from the available policy and account data +- any allowed action is completed before responding +- the final answer includes completed_actions, customer_message, and blockers +- if evidence is missing, ask for the smallest missing field ``` -For coding agents, see the Prompting patterns for coding tasks section below for more specific guidance. +**Avoid unnecessary absolute rules.** Older prompts often use strict instructions like `ALWAYS`, `NEVER`, `must`, and `only` to control model behavior. Use those words for true invariants, such as safety rules, required output fields, or actions that should never happen. For judgment calls, such as when to search, ask for clarification, use a tool, or keep iterating, prefer decision rules instead. -### Prompting patterns for coding tasks +Avoid this style of instruction unless every step is truly required: -**Autonomy and persistence** - -GPT-5.4 is generally more thorough end to end than earlier mainline models on coding and tool-use tasks, so you often need less explicit "verify everything" prompting. Still, for high-stakes changes such as production, migrations, or security work, keep a lightweight verification clause. - -```xml - -Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you. - -Unless the user explicitly asks for a plan, asks a question about the code, is brainstorming potential solutions, or some other intent that makes it clear that code should not be written, assume the user wants you to make code changes or run tools to solve the user's problem. In these cases, it's bad to output your proposed solution in a message, you should go ahead and actually implement the change. If you encounter challenges or blockers, you should attempt to resolve them yourself. - +```text +First inspect A, then inspect B, then compare every field, then think through +all possible exceptions, then decide which tool to call, then call the tool, +then explain the entire process to the user. ``` -**Intermediary updates** +Add explicit stopping conditions: -Keep updates sparse and high-signal. In coding tasks, prefer updates at key points. +```text +Resolve the user query in the fewest useful tool loops, but do not let loop minimization outrank correctness, accessible fallback evidence, calculations, or required citation tags for factual claims. -```xml - -- Intermediary updates go to the `commentary` channel. -- User updates are short updates while you are working. They are not final answers. -- Use 1-2 sentence updates to communicate progress and new information while you work. -- Do not begin responses with conversational interjections or meta commentary. Avoid openers such as acknowledgements ("Done -", "Got it", or "Great question") or similar framing. -- Before exploring or doing substantial work, send a user update explaining your understanding of the request and your first step. Avoid commenting on the request or starting with phrases such as "Got it" or "Understood." -- Provide updates roughly every 30 seconds while working. -- When exploring, explain what context you are gathering and what you learned. Vary sentence structure so the updates do not become repetitive. -- When working for a while, keep updates informative and varied, but stay concise. -- When work is substantial, provide a longer plan after you have enough context. This is the only update that may be longer than 2 sentences and may contain formatting. -- Before file edits, explain what you are about to change. -- While thinking, keep the user informed of progress without narrating every tool call. Even if you are not taking actions, send frequent progress updates rather than going silent, especially if you are thinking for more than a short stretch. -- Keep the tone of progress updates consistent with the assistant's overall personality. - +After each result, ask: "Can I answer the user's core request now with useful evidence and citations for the factual claims?" If yes, answer. ``` -**Formatting** +Define missing-evidence behavior: -GPT-5.4 often defaults to more structured formatting and may overuse bullet lists. If you want a clean final response, explicitly clamp list shape. - -```xml -Never use nested bullets. Keep lists flat (single level). If you need hierarchy, split into separate lists or sections or if you use : just include the line you might usually render using a nested bullet immediately after it. For numbered lists, only use the `1. 2. 3.` style markers (with a period), never `1)`. +```text +Use the minimum evidence sufficient to answer correctly, cite it precisely, then stop. ``` -**Frontend tasks** +## Formatting -Use this only when additional frontend guidance is useful. +GPT-5.5 is highly steerable on output format and structure. Use that control when it improves comprehension or product fit. -```xml - -When doing frontend design tasks, avoid generic, overbuilt layouts. +Set `text.verbosity`, describe the expected output shape, and reserve heavier structure for cases where it improves comprehension or your product UI needs a stable artifact. The API default for `text.verbosity` is `medium`; use `low` when you prefer shorter, more concise responses. -Use these hard rules: -- One composition: The first viewport must read as one composition, not a dashboard, unless it is a dashboard. -- Brand first: On branded pages, the brand or product name must be a hero-level signal, not just nav text or an eyebrow. No headline should overpower the brand. -- Brand test: If the first viewport could belong to another brand after removing the nav, the branding is too weak. -- Full-bleed hero only: On landing pages and promotional surfaces, the hero image should usually be a dominant edge-to-edge visual plane or background. Do not default to inset hero images, side-panel hero images, rounded media cards, tiled collages, or floating image blocks unless the existing design system clearly requires them. -- Hero budget: The first viewport should usually contain only the brand, one headline, one short supporting sentence, one CTA group, and one dominant image. Do not place stats, schedules, event listings, address blocks, promos, "this week" callouts, metadata rows, or secondary marketing content there. -- No hero overlays: Do not place detached labels, floating badges, promo stickers, info chips, or callout boxes on top of hero media. -- Cards: Default to no cards. Never use cards in the hero unless they are the container for a user interaction. If removing a border, shadow, background, or radius does not hurt interaction or understanding, it should not be a card. -- One job per section: Each section should have one purpose, one headline, and usually one short supporting sentence. -- Real visual anchor: Imagery should show the product, place, atmosphere, or context. -- Reduce clutter: Avoid pill clusters, stat strips, icon rows, boxed promos, schedule snippets, and competing text blocks. -- Use motion to create presence and hierarchy, not noise. Ship 2-3 intentional motions for visually led work, and prefer Framer Motion when it is available. +Plain conversational formatting: -Exception: If working within an existing website or design system, preserve the established patterns, structure, and visual language. - +```text +Let formatting serve comprehension. Use plain paragraphs as the default format for normal conversation, explanations, reports, documentation, and technical writeups. Keep the presentation clean and readable without making the structure feel heavier than the content. + +Use headers, bold text, bullets, and numbered lists sparingly. Reach for them when the user requests them, when the answer needs clear comparison or ranking, or when the information would be harder to scan as prose. Otherwise, favor short paragraphs and natural transitions. + +Respect formatting preferences from the user. If they ask for a terse answer, minimal formatting, no bullets, no headers, or a specific structure, follow that preference unless there is a strong reason not to. ``` -```xml - -- Only run shell commands via the terminal tool. -- Never "run" tool names as shell commands. -- If a patch or edit tool exists, use it directly; do not attempt it in bash. -- After changes, run a lightweight verification step such as ls, tests, or a build before declaring the task done. - +Add explicit audience and length guidance: + +```text +Write for a senior business audience. Keep the answer under 400 words. Use short paragraphs and only include bullets when they improve scannability. Prioritize the conclusion first, then the reasoning, then caveats. ``` -### Document localization and OCR boxes +For editing, rewriting, summaries, or customer-facing messages, tell the model what to preserve before asking it to improve style. This pattern is useful when you want polish without expansion. -For bbox tasks, be explicit about coordinate conventions and add drift tests. - -```xml - -- Use the specified coordinate format exactly (for example [x1,y1,x2,y2] normalized 0..1). -- For each bbox, include: page, label, text snippet, confidence. -- Add a vertical-drift sanity check: - - ensure bboxes align with the line of text (not shifted up or down). -- If dense layout, process page by page and do a second pass for missed items. - +```text +Preserve the requested artifact, length, structure, and genre first. Quietly improve clarity, flow, and correctness. Do not add new claims, extra sections, or a more promotional tone unless explicitly requested. ``` -### Use runtime and API integration notes +## Grounding, citations, and retrieval budgets -For long-running or tool-heavy agents, the runtime contract matters as much as the prompt contract. +For grounded answers, citation behavior should be part of the prompt. Define what needs support, what counts as enough evidence, and how the model should behave when evidence is missing. Absence of evidence shouldn't automatically become a factual "no." For more details and examples, see the [citation formatting guide](/api/docs/guides/citation-formatting). -#### Phase parameter +### Add an explicit retrieval budget -For GPT-5.4, `gpt-5.3-codex`, and later Responses models, the `phase` field can -help in the small number of long-running or tool-heavy flows where preambles or -other intermediate assistant updates are mistaken for the final answer. +Retrieval budgets are stopping rules for search. They tell the model when enough evidence is enough. -- `phase` is optional at the API level, but it is highly recommended. Best-effort inference may exist server-side, but explicit round-tripping of `phase` is strictly better. -- Use `phase` for long-running or tool-heavy agents that may emit commentary before tool calls or before a final answer. -- Preserve `phase` when replaying prior assistant items so the model can distinguish working commentary from the completed answer. This matters most in multi-step flows with preambles, tool-related updates, or multiple assistant messages in the same turn. +```text +For ordinary Q&A, start with one broad search using short, discriminative keywords. If the top results contain enough citable support for the core request, answer from those results instead of searching again. + +Make another retrieval call only when: +- The top results do not answer the core question. +- A required fact, parameter, owner, date, ID, or source is missing. +- The user asked for exhaustive coverage, a comparison, or a comprehensive list. +- A specific document, URL, email, meeting, record, or code artifact must be read. +- The answer would otherwise contain an important unsupported factual claim. + +Do not search again to improve phrasing, add examples, cite nonessential details, or support wording that can safely be made more generic. +``` + +## Creative drafting guardrails + +For drafting tasks, tell the model which claims must come from sources and which parts may be creatively written. This is especially important for slides, launch copy, customer summaries, talk tracks, leadership blurbs, and narrative framing. + +```text +For creative or generative requests such as slides, leadership blurbs, outbound copy, summaries for sharing, talk tracks, or narrative framing, distinguish source-backed facts from creative wording. + +- Use retrieved or provided facts for concrete product, customer, metric, roadmap, date, capability, and competitive claims, and cite those claims. +- Do not invent specific names, first-party data claims, metrics, roadmap status, customer outcomes, or product capabilities to make the draft sound stronger. +- If there is little or no citable support, write a useful generic draft with placeholders or clearly labeled assumptions rather than unsupported specifics. +``` + +## Frontend engineering and visual taste + +For frontend work, refer to the [example instructions](/api/docs/guides/frontend-prompt) for practical ways to steer UI quality. They cover product and user context, design-system alignment, first-screen usability, familiar controls, expected states, responsive behavior, and common generated-UI defaults to avoid, such as generic heroes, nested cards, decorative gradients, visible instructional text, and broken layouts. + +## Prompt the model to check its work + +Give GPT-5.5 access to tools that let it check outputs when validation is possible. + +For coding agents, ask for concrete validation commands: + +```text +After making changes, run the most relevant validation available: +- targeted unit tests for changed behavior +- type checks or lint checks when applicable +- build checks for affected packages +- a minimal smoke test when full validation is too expensive + +If validation cannot be run, explain why and describe the next best check. +``` + +For visual artifacts, ask for inspection after rendering: + +```text +Render the artifact before finalizing. Inspect the rendered output for layout, clipping, spacing, missing content, and visual consistency. Revise until the rendered output matches the requirements. +``` + +For engineering and planning tasks, make implementation plans traceable: + +```text +For implementation plans, include: +- requirements and where each is addressed +- named resources, files, APIs, or systems involved +- state transitions or data flow where relevant +- validation commands or checks +- failure behavior +- privacy and security considerations +- open questions that materially affect implementation +``` + +## Phase parameter + +Starting with GPT-5.4, long-running or tool-heavy Responses workflows can use assistant-item `phase` values to distinguish intermediate updates from final answers. GPT-5.5 uses the same pattern. + +If you use `previous_response_id`, the API preserves prior assistant state automatically. If your application manually replays assistant output items into the next request, preserve each original `phase` value and pass it back unchanged. This matters most when a response includes preambles, repeated tool calls, or a final answer after intermediate assistant updates. + +```text +If manually replaying assistant items: +- Preserve assistant `phase` values exactly. +- Use `phase: "commentary"` for intermediate user-visible updates. +- Use `phase: "final_answer"` for the completed answer. - Do not add `phase` to user messages. -- If you use `previous_response_id`, that is usually the simplest path, since OpenAI can often recover prior state without manually replaying assistant items. -- If you replay assistant history yourself, preserve the original `phase` values. -- Missing or dropped `phase` can cause preambles to be interpreted as final answers and degrade behavior on those multi-step tasks. - -### Preserve behavior in long sessions - -Compaction unlocks significantly longer effective context windows, where user conversations can persist for many turns without hitting context limits or long-context performance degradation, and agents can perform very long trajectories that exceed a typical context window for long-running, complex tasks. - -If you are using [Compaction](https://developers.openai.com/api/docs/guides/compaction) in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction. The endpoint is ZDR compatible and returns an `encrypted_content` item that you can pass into future requests. GPT-5.4 tends to remain more coherent and reliable over longer, multi-turn conversations with fewer breakdowns as sessions grow. - -For more guidance, see the [`/responses/compact` API reference](https://developers.openai.com/api/docs/api-reference/responses/compact). - -### Control personality for customer-facing workflows - -GPT-5.4 can be steered more effectively when you separate persistent personality from per-response writing controls. This is especially useful for customer-facing workflows such as emails, support replies, announcements, and blog-style content. - -- **Personality (persistent):** sets the default tone, verbosity, and decision style across the session. -- **Writing controls (per response):** define the channel, register, formatting, and length for a specific artifact. -- **Reminder:** personality should not override task-specific output requirements. If the user asks for JSON, return JSON. - -For natural, high-quality prose, the highest-leverage controls are: - -- Give the model a clear persona. -- Specify the channel and emotional register. -- Explicitly ban formatting when you want prose. -- Use hard length limits. - -```xml - -- Persona: -- Channel: -- Emotional register: + "not " -- Formatting: -- Length: -- Default follow-through: if the request is clear and low-risk, proceed without asking permission. - ``` -For more personality patterns you can lift directly, see the [Prompt Personalities cookbook](https://developers.openai.com/cookbook/examples/gpt-5/prompt_personalities). +## Suggested prompt structure -**Professional memo mode** +Use this structure as a starting point for complex prompts. Keep each section short. Add detail only where it changes behavior. -For memos, reviews, and other professional writing tasks, general writing instructions are often not enough. These workflows benefit from explicit guidance on specificity, domain conventions, synthesis, and calibrated certainty. +```text +Role: [1-2 sentences defining the model's function, context, and job] -```xml - -- Write in a polished, professional memo style. -- Use exact names, dates, entities, and authorities when supported by the record. -- Follow domain-specific structure if one is requested. -- Prefer precise conclusions over generic hedging. -- When uncertainty is real, tie it to the exact missing fact or conflicting source. -- Synthesize across documents rather than summarizing each one independently. - +# Personality +[tone, demeanor, and collaboration style] + +# Goal +[user-visible outcome] + +# Success criteria +[what must be true before the final answer] + +# Constraints +[policy, safety, business, evidence, and side-effect limits] + +# Output +[sections, length, and tone] + +# Stop rules +[when to retry, fallback, abstain, ask, or stop] ``` - -This mode is especially useful for legal, policy, research, and executive-facing writing, where the goal is not just fluency, but disciplined synthesis and clear conclusions. - -## Tune reasoning and migration - -### Treat reasoning effort as a last-mile knob - -Reasoning effort is not one-size-fits-all. Treat it as a last-mile tuning knob, not the primary way to improve quality. In many cases, stronger prompts, clear output contracts, and lightweight verification loops recover much of the performance teams might otherwise seek through higher reasoning settings. - -Recommended defaults: - -- `none`: Best for fast, cost-sensitive, latency-sensitive tasks where the model does not need to think. -- `low`: Works well for latency-sensitive tasks where a small amount of thinking can produce a meaningful accuracy gain, especially with complex instructions. -- `medium` or `high`: Reserve for tasks that truly require stronger reasoning and can absorb the latency and cost tradeoff. Choose between them based on how much performance gain your task gets from additional reasoning. -- `xhigh`: Avoid as a default unless your evals show clear benefits. It is best suited for long, agentic, reasoning-heavy tasks where maximum intelligence matters more than speed or cost. - -In practice, most teams should default to the `none`, `low`, or `medium` range. - -Start with `none` for execution-heavy workloads such as workflow steps, field extraction, support triage, and short structured transforms. - -Start with `medium` or higher for research-heavy workloads such as long-context synthesis, multi-document review, conflict resolution, and strategy writing. With `medium` and a well-engineered prompt, you can squeeze out a lot of performance. - -For GPT-5.4 workloads, `none` can already perform well on action-selection and tool-discipline tasks. If your workload depends on nuanced interpretation, such as implicit requirements, ambiguity, or cancelled-tool-call recovery, start with `low` or `medium` instead. - -Before increasing reasoning effort, first add: - -- `` -- `` -- `` - -If the model still feels too literal or stops at the first plausible answer, add an initiative nudge before raising reasoning effort: - -```xml - -- Don’t stop at the first plausible answer. -- Look for second-order issues, edge cases, and missing constraints. -- If the task is safety or accuracy critical, perform at least one verification step. - -``` - -### Migrate prompts to GPT-5.4 one change at a time - -Use the same one-change-at-a-time discipline as the 5.2 guide: switch model first, pin `reasoning_effort`, run evals, then iterate. - -These starting points work well for many migrations: - -| Current setup | Suggested GPT-5.4 start | Notes | -| ------------------------- | ---------------------------------- | ------------------------------------------------------------------- | -| `gpt-5.2` | Match the current reasoning effort | Preserve the existing latency and quality profile first, then tune. | -| `gpt-5.3-codex` | Match the current reasoning effort | For coding workflows, keep the reasoning effort the same. | -| `gpt-4.1` or `gpt-4o` | `none` | Keep snappy behavior, and increase only if evals regress. | -| Research-heavy assistants | `medium` or `high` | Use explicit research multi-pass and citation gating. | -| Long-horizon agents | `medium` or `high` | Add tool persistence and completeness accounting. | - -### Small-model guidance for `gpt-5.4-mini` and `gpt-5.4-nano` - -`gpt-5.4-mini` and `gpt-5.4-nano` are highly steerable, but they are less likely than larger models to infer missing steps, resolve ambiguity implicitly, or package outputs the way you intended unless you specify that behavior directly. In practice, prompts for smaller models are often a bit longer and more explicit. - -**How `gpt-5.4-mini` differs** - -- `gpt-5.4-mini` is more literal and makes fewer assumptions. -- It is strong when the task is clearly structured, but weaker on implicit workflows and ambiguity handling. -- By default, it may try to keep the conversation going with a follow-up question unless you suppress that behavior explicitly. - -**Prompting `gpt-5.4-mini`** - -- Put critical rules first. -- Specify the full execution order when tool use or side effects matter. -- Do not rely on "you MUST" alone. Use structural scaffolding such as numbered steps, decision rules, and explicit action definitions. -- Separate "do the action" from "report the action." -- Show the correct flow, not just the final format. -- Define ambiguity behavior explicitly: when to ask, abstain, or proceed. -- Specify packaging directly: answer length, whether to ask a follow-up question, citation style, and section order. -- Be careful with `output nothing else`. Prefer scoped instructions such as `after the final JSON, output nothing further`. - -**Prompting `gpt-5.4-nano`** - -- Use `gpt-5.4-nano` only for narrow, well-bounded tasks. -- Prefer closed outputs: labels, enums, short JSON, or fixed templates. -- Avoid multi-step orchestration unless the flow is extremely constrained. -- Route ambiguous or planning-heavy tasks to a stronger model instead of over-prompting `gpt-5.4-nano`. - -**Good default pattern** - -1. Task -2. Critical rule -3. Exact step order -4. Edge cases or clarification behavior -5. Output format -6. One correct example - -**Avoid** - -- Implied next steps -- Unspecified edge cases -- Schema-only prompts for tool workflows -- Generic instructions without structure - -### Web search and deep research - -If you are migrating a research agent in particular, make these prompt updates before increasing reasoning effort: - -- Add `` -- Add `` -- Add `` -- Increase `reasoning_effort` one notch only after prompt fixes. - -You can start from the 5.2 research block and then layer in citation gating and finalization contracts as needed. - -GPT-5.4 performs especially well when the task requires multi-step evidence gathering, long-context synthesis, and explicit prompt contracts. In practice, the highest-leverage prompt changes are choosing reasoning effort by task shape, defining exact output and citation formats, adding dependency-aware tool rules, and making completion criteria explicit. The model is often strong out of the box, but it is most reliable when prompts clearly specify how to search, how to verify, and what counts as done. - -## Next steps - -- Read [our latest model guide](https://developers.openai.com/api/docs/guides/latest-model) for model capabilities, parameters, and API compatibility details. -- Read [Prompt engineering](https://developers.openai.com/api/docs/guides/prompt-engineering) for broader prompting strategies that apply across model families. -- Read [Compaction](https://developers.openai.com/api/docs/guides/compaction) if you are building long-running GPT-5.4 sessions in the Responses API. \ No newline at end of file diff --git a/codex-rs/skills/src/assets/samples/openai-docs/references/upgrade-guide.md b/codex-rs/skills/src/assets/samples/openai-docs/references/upgrade-guide.md index 749bf1b37..07b90c655 100644 --- a/codex-rs/skills/src/assets/samples/openai-docs/references/upgrade-guide.md +++ b/codex-rs/skills/src/assets/samples/openai-docs/references/upgrade-guide.md @@ -1,14 +1,15 @@ -# Upgrading to GPT-5.4 +# Upgrading to GPT-5.5 -Use this guide when the user explicitly asks to upgrade an existing integration to GPT-5.4. Pair it with current OpenAI docs lookups. The default target string is `gpt-5.4`. +Use this guide when the user explicitly asks to upgrade an existing integration to GPT-5.5. Pair it with current OpenAI docs lookups. The default target string is `gpt-5.5`. ## Freshness check -Before applying this bundled guide, run `node scripts/resolve-latest-model-info.js` from the OpenAI Docs skill directory. +Before applying this bundled guide for a latest/current/default model upgrade, run `node scripts/resolve-latest-model-info.js` from the OpenAI Docs skill directory. -- If the command returns `modelSlug: "gpt-5p4"`, continue with this bundled guide and use `references/prompting-guide.md` when prompt updates are needed. +- If the command returns `modelSlug: "gpt-5p5"`, continue with this bundled guide and use `references/prompting-guide.md` when prompt updates are needed. - If the command returns a different `modelSlug`, fetch both the returned `migrationGuideUrl` and `promptingGuideUrl` and use them as the current source of truth instead of the bundled references. -- If the command fails, the metadata is missing, or either remote guide cannot be fetched, continue with the bundled fallback references and say the remote freshness check was unavailable. +- If the command fails, metadata is missing, or either remote guide cannot be fetched, continue with bundled fallback references and say the remote freshness check was unavailable. +- If the user explicitly named a target model, preserve that target and use current docs only to check compatibility or caveats. ## Upgrade posture @@ -16,8 +17,9 @@ Upgrade with the narrowest safe change set: - replace the model string first - update only the prompts that are directly tied to that model usage +- do not automatically upgrade older or ambiguous model usages that may be intentionally pinned, such as historical docs, examples, tests, eval baselines, comparison code, or low-cost fallback/routing paths. Unless the user explicitly asks to upgrade all model usage, leave those sites unchanged and list them as confirmation-needed - prefer prompt-only upgrades when possible -- if the upgrade would require API-surface changes, parameter rewrites, tool rewiring, or broader code edits, mark it as blocked instead of stretching the scope +- if the upgrade would require API-surface changes, parameter rewrites, tool rewiring, provider migration, or broader code edits, mark it as blocked instead of stretching the scope ## Upgrade workflow @@ -28,34 +30,39 @@ Upgrade with the narrowest safe change set: - Prefer the closest prompt surface first: inline system or developer text, then adjacent prompt files, then shared templates. - If you cannot confidently tie a prompt to the model usage, say so instead of guessing. 3. Classify the source model family. - - Common buckets: `gpt-4o` or `gpt-4.1`, `o1` or `o3` or `o4-mini`, early `gpt-5`, later `gpt-5.x`, or mixed and unclear. + - Common buckets: GPT-5.4, GPT-5.3-Codex or GPT-5.2-Codex, earlier GPT-5.x, GPT-4o or GPT-4.1, reasoning models such as o1 or o3 or o4-mini, third-party model, or mixed and unclear. 4. Decide the upgrade class. - `model string only` - `model string + light prompt rewrite` - `blocked without code changes` -5. Run the no-code compatibility gate. - - Check whether the current integration can accept `gpt-5.4` without API-surface changes or implementation changes. +5. Run the compatibility gate. + - Check whether the current integration can accept `gpt-5.5` without API-surface changes or implementation changes. + - Check whether structured outputs, tool schemas, function names, and downstream parsers can remain unchanged. - For long-running Responses or tool-heavy agents, check whether `phase` is already preserved or round-tripped when the host replays assistant items or uses preambles. - If compatibility depends on code changes, return `blocked`. - If compatibility is unclear, return `unknown` rather than improvising. -6. Recommend the upgrade. - - Default replacement string: `gpt-5.4` +6. Apply the upgrade when it is in scope. + - Default replacement string: `gpt-5.5`. - Keep the intervention small and behavior-preserving. -7. Deliver a structured recommendation. + - Start from the current reasoning effort when it is visible unless there is a measured reason to change it. + - For in-scope changes, update the model string and directly related prompts. + - For blocked or unknown changes, do not edit; report the blocker or uncertainty. +7. Summarize the result. - `Current model usage` - - `Recommended model-string updates` - - `Starting reasoning recommendation` + - `Model-string updates` + - `Reasoning-effort handling` - `Prompt updates` + - `Structured output and formatting assessment` + - `Tool-use assessment` when the flow uses tools, retrieval, or terminal actions - `Phase assessment` when the flow is long-running, replayed, or tool-heavy - - `No-code compatibility check` - - `Validation plan` - - `Launch-day refresh items` + - `Compatibility check` + - `Validation performed` Output rule: - Always emit a starting `reasoning_effort_recommendation` for each usage site. -- If the repo exposes the current reasoning setting, preserve it first unless the source guide says otherwise. -- If the repo does not expose the current setting, use the source-family starting mapping instead of returning `null`. +- If the repo exposes the current reasoning setting, preserve it first unless current OpenAI docs say otherwise. +- If the repo does not expose the current setting, do not add one unless current OpenAI docs require it. ## Upgrade outcomes @@ -63,39 +70,41 @@ Output rule: Choose this when: +- the source model is GPT-5.4 - the existing prompts are already short, explicit, and task-bounded -- the workflow is not strongly research-heavy, tool-heavy, multi-agent, batch or completeness-sensitive, or long-horizon +- the workflow does not rely on strict output formats, tool-call behavior, batch completeness, or long-horizon execution that should be validated after the upgrade - there are no obvious compatibility blockers Default action: -- replace the model string with `gpt-5.4` +- replace the model string with `gpt-5.5` +- preserve the current reasoning effort - keep prompts unchanged -- validate behavior with existing evals or spot checks +- validate behavior with existing tests, realistic spot checks, or an existing eval suite when one is already available ### `model string + light prompt rewrite` Choose this when: -- the old prompt was compensating for weaker instruction following -- the workflow needs more persistence than the default tool-use behavior will likely provide -- the task needs stronger completeness, citation discipline, or verification -- the upgraded model becomes too verbose or under-complete unless instructed otherwise +- the task needs stronger completeness, citation discipline, verification, or dependency handling +- the upgraded model becomes too verbose, too dense, or hard to scan unless formatting is constrained +- the workflow has strict output shape requirements and lacks an explicit format contract, schema, or parser validation - the workflow is research-heavy and needs stronger handling of sparse or empty retrieval results -- the workflow is coding-oriented, tool-heavy, or multi-agent, but the existing API surface and tool definitions can remain unchanged +- the workflow is coding-oriented, terminal-based, tool-heavy, or multi-agent, but the existing API surface and tool definitions can remain unchanged Default action: -- replace the model string with `gpt-5.4` -- add one or two targeted prompt blocks -- read `references/prompting-guide.md` to choose the smallest prompt changes that preserve the intended behavior and take advantage of relevant model-specific guidance +- replace the model string with `gpt-5.5` +- preserve the current reasoning effort for the first pass +- make only the smallest prompt edits needed for the observed workflow risk +- read the [GPT-5.5 prompting guide](/api/docs/guides/prompt-guidance?model=gpt-5.5) to choose the smallest prompt changes that recover or improve behavior - avoid broad prompt cleanup unrelated to the upgrade -- for research workflows, default to `research_mode` + `citation_rules` + `empty_result_recovery`; add `tool_persistence_rules` when the host already uses retrieval tools +- for research workflows, default to `research_mode` + `citation_rules` + `empty_result_handling`; add `tool_persistence_rules` when the host already uses retrieval tools - for dependency-aware or tool-heavy workflows, default to `tool_persistence_rules` + `dependency_checks` + `verification_loop`; add `parallel_tool_calling` only when retrieval steps are truly independent - for coding or terminal workflows, default to `terminal_tool_hygiene` + `verification_loop` - for multi-agent support or triage workflows, default to at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop` - for long-running Responses agents with preambles or multiple assistant messages, explicitly review whether `phase` is already handled; if adding or preserving `phase` would require code edits, mark the path as `blocked` -- do not classify a coding or tool-using Responses workflow as `blocked` just because the visible snippet is minimal; prefer `model string + light prompt rewrite` unless the repo clearly shows that a safe GPT-5.4 path would require host-side code changes +- do not classify a coding or tool-using Responses workflow as `blocked` just because the visible snippet is minimal; prefer `model string + light prompt rewrite` unless the repo clearly shows that a safe GPT-5.5 path would require host-side code changes ### `blocked` @@ -104,25 +113,30 @@ Choose this when: - the upgrade appears to require API-surface changes - the upgrade appears to require parameter rewrites or reasoning-setting changes that are not exposed outside implementation code - the upgrade would require changing tool definitions, tool handler wiring, or schema contracts +- the user is asking for a tooling, IDE, plugin, shell, or environment migration rather than a model and prompt migration +- the integration depends on provider-specific APIs that do not map to the current OpenAI API surface without implementation work - you cannot confidently identify the prompt surface tied to the model usage Default action: - do not improvise a broader upgrade - report the blocker and explain that the fix is out of scope for this guide +- if useful, describe the smallest follow-up implementation task that would unblock the migration -## No-code compatibility checklist +## Compatibility checklist -Before recommending a no-code upgrade, check: +Before applying or recommending a model-and-prompt-only upgrade, check: -1. Can the current host accept the `gpt-5.4` model string without changing client code or API surface? +1. Can the current host accept the `gpt-5.5` model string without changing client code or API surface? 2. Are the related prompts identifiable and editable? -3. Does the host depend on behavior that likely needs API-surface changes, parameter rewrites, or tool rewiring? +3. Does the host depend on behavior that likely needs API-surface changes, parameter rewrites, provider migration, or tool rewiring? 4. Would the likely fix be prompt-only, or would it need implementation changes? 5. Is the prompt surface close enough to the model usage that you can make a targeted change instead of a broad cleanup? -6. For long-running Responses or tool-heavy agents, is `phase` already preserved if the host relies on preambles, replayed assistant items, or multiple assistant messages? +6. Do strict structured outputs, schemas, or downstream parsers still have an explicit contract? +7. For long-running Responses or tool-heavy agents, is `phase` already preserved if the host relies on preambles, replayed assistant items, or multiple assistant messages? +8. Are latency, token, or price assumptions validated by tests, realistic spot checks, or an existing eval suite rather than inferred from general model positioning? -If item 1 is no, items 3 through 4 point to implementation work, or item 6 is no and the fix needs code changes, return `blocked`. +If item 1 is no, items 3 through 4 point to implementation work, or item 7 is no and the fix needs code changes, return `blocked`. If item 2 is no, return `unknown` unless the user can point to the prompt location. @@ -131,6 +145,7 @@ Important: - Existing use of tools, agents, or multiple usage sites is not by itself a blocker. - If the current host can keep the same API surface and the same tool definitions, prefer `model string + light prompt rewrite` over `blocked`. - Reserve `blocked` for cases that truly require implementation changes, not cases that only need stronger prompt steering. +- Do not claim token savings without task-level validation. ## Scope boundaries @@ -141,32 +156,26 @@ This guide may: - inspect code and prompt files to understand where those changes belong - inspect whether existing Responses flows already preserve `phase` - flag compatibility blockers +- propose validation with existing tests, realistic spot checks, or existing eval suites This guide may not: - move Chat Completions code to Responses - move Responses code to another API surface +- migrate SDKs, APIs, IDE configuration, shell hooks, plugins, or provider-specific tooling - rewrite parameter shapes - change tool definitions or tool-call handling - change structured-output wiring - add or retrofit `phase` handling in implementation code -- edit business logic, orchestration logic, or SDK usage beyond a literal model-string replacement +- edit business logic, orchestration logic, SDK usage, IDE configuration, shell hooks, or plugin integration behavior except for model-string replacements and directly related prompt edits -If a safe GPT-5.4 upgrade requires any of those changes, mark the path as blocked and out of scope. +If a safe GPT-5.5 upgrade requires any of those changes, mark the path as blocked and out of scope. ## Validation plan -- Validate each upgraded usage site with existing evals or realistic spot checks. -- Check whether the upgraded model still matches expected latency, output shape, and quality. +- Validate each upgraded usage site with existing tests, realistic spot checks, or an existing eval suite when one is already available. +- Compare against the current GPT-5.4 baseline when available. +- Check task success, retry count, tool-call count, total tokens, latency, output shape, and user-visible quality. +- For specialized workflows, validate the contract that matters most instead of judging only general output quality. - If prompt edits were added, confirm each block is doing real work instead of adding noise. - If the workflow has downstream impact, add a lightweight verification pass before finalization. - -## Launch-day refresh items - -When final GPT-5.4 guidance changes: - -1. Replace release-candidate assumptions with final GPT-5.4 guidance where appropriate. -2. Re-check whether the default target string should stay `gpt-5.4` for all source families. -3. Re-check any prompt-block recommendations whose semantics may have changed. -4. Re-check research, citation, and compatibility guidance against the final model behavior. -5. Re-run the same upgrade scenarios and confirm the blocked-versus-viable boundaries still hold. diff --git a/codex-rs/skills/src/assets/samples/openai-docs/scripts/resolve-latest-model-info.js b/codex-rs/skills/src/assets/samples/openai-docs/scripts/resolve-latest-model-info.js index 2a47ff05e..1bd16ac9b 100755 --- a/codex-rs/skills/src/assets/samples/openai-docs/scripts/resolve-latest-model-info.js +++ b/codex-rs/skills/src/assets/samples/openai-docs/scripts/resolve-latest-model-info.js @@ -71,7 +71,7 @@ function parseFlatInfo(block) { const info = {}; for (const line of block.split(/\r?\n/)) { - const match = line.match(/^([A-Za-z][A-Za-z0-9_-]*):\s*(.+?)\s*$/); + const match = line.match(/^\s*([A-Za-z][A-Za-z0-9_-]*):\s*(.+?)\s*$/); if (match) { info[match[1]] = match[2].replace(/^["']|["']$/g, ""); }