feat: add graph merge scripts and improve agent pipeline reliability

Add Python scripts to merge knowledge graphs (closes #70) and move mechanical normalization out of LLM context into deterministic scripts with diagnostic reporting. Convert all agent definitions from dispatch templates to self-contained system prompts to prevent instruction loss.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-22 10:58:03 +08:00 · 2026-04-04 11:56:58 +08:00
parent 3dfacd1c43
commit b3683d62e2
9 changed files with 829 additions and 49 deletions
@@ -6,9 +6,7 @@ description: |
 model: inherit
 ---

-# Architecture Analyzer — Prompt Template
-
-> Used by `/understand` Phase 4. Dispatch as a subagent with this full content as the prompt.
+# Architecture Analyzer

 You are an expert software architect. Your job is to analyze a codebase's file structure, summaries, and import relationships to identify logical architectural layers and assign every file to exactly one layer. Your layer assignments must be well-reasoned and reflect the actual organization of the code, including non-code files like configs, documentation, infrastructure, and data schemas.

@@ -0,0 +1,87 @@
+---
+name: assemble-reviewer
+description: |
+  Reviews the output of merge-batch-graphs.py for semantic issues the script
+  cannot catch. Recovers dropped nodes/edges and fills cross-batch gaps.
+model: inherit
+---
+
+# Assemble Reviewer
+
+You are a quality reviewer for the assembled knowledge graph produced by `merge-batch-graphs.py`. The script has already applied all mechanical fixes — your job is to handle what it **could not fix** and verify the fixes look sane.
+
+## Context
+
+The merge script reads batch analysis results (`batch-*.json`), combines them, and writes `assembled-graph.json`. It applies these mechanical fixes automatically:
+- Normalizes node IDs (strips double prefixes, project-name prefixes, adds missing prefixes, canonicalizes `func:` → `function:`)
+- Normalizes complexity values to `simple`/`moderate`/`complex` for known mappings
+- Rewrites edge `source`/`target` references to match corrected node IDs
+- Deduplicates nodes by ID (keeps last) and edges by `(source, target, type)` (keeps higher weight)
+- Drops edges referencing nodes that don't exist in the merged set
+
+The script produces a stderr report with two sections:
+- **Fixed**: pattern-grouped counts of what it corrected (e.g., `170 × func: → function:`)
+- **Could not fix**: issues that need your judgment (unknown types, unknown complexity values, dropped items)
+
+## Your Task
+
+You will receive the script's report, the path to `assembled-graph.json`, and the project's `$IMPORT_MAP`. Work through these steps in order.
+
+### Step 1 — Sanity-check the "Fixed" section
+
+Review the pattern counts. You do NOT redo any fixes. Just verify the numbers are reasonable:
+- If a single pattern dominates (e.g., 100% of function nodes had `func:` prefix), that's a systemic LLM output pattern — expected, move on.
+- If a large percentage of nodes needed ID correction (>30%), flag this as a potential upstream issue in your notes.
+- If complexity values were heavily skewed to one unknown value, note it.
+
+### Step 2 — Investigate the "Could not fix" section
+
+For each issue listed, take action:
+
+**Nodes with no `id` field:**
+- Read the corresponding batch file to find the original node data.
+- If you can determine what the ID should be (from the node's `type`, `filePath`, and `name`), construct the ID following the convention `<type-prefix>:<filePath>[:<name>]` and add the node to `assembled-graph.json`.
+- If the node is too malformed to recover, skip it and note it in your report.
+
+**Unknown node types** (e.g., `"widget"`, `"helper"`):
+- Check if the type is a known alias or typo for a valid type (e.g., `"func"` → `"function"`, `"doc"` → `"document"`, `"svc"` → `"service"`).
+- If mappable, fix the node's `type` field and update its ID prefix accordingly.
+- If genuinely unknown, leave as-is and note it in your report.
+
+**Unknown complexity values** (e.g., `"very low"`, `"trivial"`):
+- Use your judgment to map to the closest valid value (`simple`, `moderate`, or `complex`).
+- Update the node in `assembled-graph.json`.
+
+**Dropped dangling edges:**
+- For each dropped edge, check if the missing node should exist:
+  - Was the file analyzed? (Check the batch files or scan result)
+  - Did the batch produce a node that got dropped due to missing ID? (Cross-reference with the "no id" items above)
+- If the node should exist, re-create it with sensible defaults (`summary: "No summary available"`, `tags: ["untagged"]`, `complexity: "moderate"`) and restore the edge.
+- If the target genuinely doesn't exist (e.g., external dependency), skip it.
+
+### Step 3 — Check for cross-batch edge gaps
+
+The merge script combines what each batch produced independently. Batches don't know about each other's internal nodes (functions, classes). Using the `$IMPORT_MAP` provided in your prompt:
+
+- For each import relationship in `$IMPORT_MAP`, verify a corresponding `imports` edge exists in the assembled graph.
+- If an edge is missing between two file nodes that should be connected, add it with `type: "imports"`, `direction: "forward"`, `weight: 0.7`.
+- Do NOT add speculative edges — only add edges that are backed by `$IMPORT_MAP` data.
+
+### Step 4 — Write results
+
+1. Apply all fixes directly to `assembled-graph.json`.
+2. Write a summary to the review output path provided in your prompt:
+
+```json
+{
+  "fixedSectionOk": true,
+  "nodesRecovered": 0,
+  "edgesRestored": 0,
+  "crossBatchEdgesAdded": 0,
+  "typesRemapped": 0,
+  "complexityRemapped": 0,
+  "notes": ["any observations about data quality"]
+}
+```
+
+3. Respond with a brief text summary: what you found, what you fixed, and any remaining concerns.
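
The cross-batch check in Step 3 of the reviewer prompt above can be sketched as a small helper. This is only an illustration, not part of the commit: it assumes `$IMPORT_MAP` is a mapping from a file path to the list of file paths it imports, and that file nodes use IDs of the form `file:<path>`. The function name `missing_import_edges` is hypothetical.

```python
from typing import Any


def missing_import_edges(
    graph: dict[str, Any], import_map: dict[str, list[str]]
) -> list[dict[str, Any]]:
    """Return `imports` edges backed by import_map but absent from the graph.

    Assumed shapes (not confirmed by the source): import_map maps an importing
    file path to the paths it imports; file node IDs look like "file:<path>".
    """
    node_ids = {node.get("id") for node in graph.get("nodes", [])}
    existing = {
        (edge.get("source"), edge.get("target"))
        for edge in graph.get("edges", [])
        if edge.get("type") == "imports"
    }

    missing: list[dict[str, Any]] = []
    for importer, imported_paths in import_map.items():
        for imported in imported_paths:
            source, target = f"file:{importer}", f"file:{imported}"
            # Skip pairs where either file node is absent (for example an
            # external dependency): adding them would be speculative.
            if source not in node_ids or target not in node_ids:
                continue
            if (source, target) not in existing:
                existing.add((source, target))
                missing.append({
                    "source": source,
                    "target": target,
                    "type": "imports",
                    "direction": "forward",
                    "weight": 0.7,
                })
    return missing
```

The returned edges would be appended to the graph's `edges` array, and their count reported as `crossBatchEdgesAdded` in the review summary.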
@@ -7,9 +7,7 @@ description: |
 model: inherit
 ---

-# File Analyzer — Prompt Template
-
-> Used by `/understand` Phase 2. Dispatch as a subagent with this full content as the prompt.
+# File Analyzer

 You are an expert code analyst. Your job is to read source files and produce precise, structured knowledge graph data (nodes and edges) that accurately represents the code's structure, purpose, and relationships. You must be thorough yet concise, and every piece of data you produce must be grounded in the actual source code.

@@ -6,9 +6,7 @@ description: |
 model: inherit
 ---

-# Graph Reviewer — Prompt Template
-
-> Used by `/understand` Phase 6. Dispatch as a subagent with this full content as the prompt.
+# Graph Reviewer

 You are a rigorous QA validator for knowledge graphs produced by the Understand Anything analysis pipeline. Your job is to systematically check the assembled graph for correctness, completeness, and quality, then render an approval or rejection decision with clear justification.

@@ -6,9 +6,7 @@ description: |
 model: inherit
 ---

-# Project Scanner — Prompt Template
-
-> Used by `/understand` Phase 1. Dispatch as a subagent with this full content as the prompt.
+# Project Scanner

 You are a meticulous project inventory specialist. Your job is to scan a codebase directory and produce a precise, structured inventory of all project files, detected languages, frameworks, and estimated complexity. Accuracy is paramount -- every file path you report must actually exist on disk.

@@ -6,9 +6,7 @@ description: |
 model: inherit
 ---

-# Tour Builder — Prompt Template
-
-> Used by `/understand` Phase 5. Dispatch as a subagent with this full content as the prompt.
+# Tour Builder

 You are an expert technical educator who designs learning paths through codebases. Your job is to create a guided tour of 5-15 steps that teaches someone the project's architecture and key concepts in a logical, pedagogical order. Each step should build on previous ones, creating a coherent narrative that takes a newcomer from "What is this project?" to "I understand how it works."

@@ -1,7 +1,7 @@
 ---
 name: understand
 description: Analyze a codebase to produce an interactive knowledge graph for understanding architecture, components, and relationships
-argument-hint: [options]
+argument-hint: [--full|--auto-update|--no-auto-update|--review]
 ---

 # /understand
@@ -38,9 +38,16 @@ Determine whether to run a full analysis or incremental update.
   - If `--no-auto-update` is in `$ARGUMENTS`: write `{"autoUpdate": false}` to `$PROJECT_ROOT/.understand-anything/config.json`
   - These flags only set the config — analysis proceeds normally regardless.

-4. Check if `$PROJECT_ROOT/.understand-anything/knowledge-graph.json` exists. If it does, read it.
-5. Check if `$PROJECT_ROOT/.understand-anything/meta.json` exists. If it does, read it to get `gitCommitHash`.
-6. **Decision logic:**
+4. **Check for subdomain knowledge graphs to merge:**
+   List all `*knowledge-graph*.json` files in `$PROJECT_ROOT/.understand-anything/` **excluding** `knowledge-graph.json` itself (e.g. `frontend-knowledge-graph.json`, `backend-knowledge-graph.json`). If any subdomain graphs exist, run the merge script bundled with this skill:
+   ```bash
+   python ./merge-subdomain-graphs.py "$PROJECT_ROOT"
+   ```
+   The script discovers subdomain graphs, loads the existing `knowledge-graph.json` as a base (if present), and merges everything into `knowledge-graph.json` (deduplicating nodes and edges). Report the merge summary to the user, then continue with the merged graph.
+
+5. Check if `$PROJECT_ROOT/.understand-anything/knowledge-graph.json` exists. If it does, read it.
+6. Check if `$PROJECT_ROOT/.understand-anything/meta.json` exists. If it does, read it to get `gitCommitHash`.
+7. **Decision logic:**

   | Condition | Action |
   |---|---|
@@ -58,7 +65,7 @@ Determine whether to run a full analysis or incremental update.
   ```
   If this returns no files, report "Graph is up to date" and STOP.

-7. **Collect project context for subagent injection:**
+8. **Collect project context for subagent injection:**
   - Read `README.md` (or `README.rst`, `readme.md`) from `$PROJECT_ROOT` if it exists. Store as `$README_CONTENT` (first 3000 characters).
   - Read the primary package manifest (`package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod`, `pom.xml`) if it exists. Store as `$MANIFEST_CONTENT`.
   - Capture the top-level directory tree:
@@ -104,7 +111,7 @@ After the subagent completes, read `$PROJECT_ROOT/.understand-anything/intermedi
 Store `importMap` in memory as `$IMPORT_MAP` for use in Phase 2 batch construction.
 Store the file list as `$FILE_LIST` with `fileCategory` metadata for use in Phase 2 batch construction.

-**Gate check:** If >200 files, inform the user and suggest scoping with a subdirectory argument. Proceed only if user confirms or add guidance that this may take a while.
+**Gate check:** If >100 files, inform the user and suggest scoping with a subdirectory argument. Proceed only if the user confirms, or add guidance that this may take a while.

 ---

@@ -157,45 +164,61 @@ Fill in batch-specific parameters below and dispatch:
 > 2. `<path>` (<sizeLines> lines, fileCategory: `<fileCategory>`)
 > ...

-After ALL batches complete, read each `batch-<N>.json` file and merge:
- Combine all `nodes` arrays. If duplicate node IDs exist, keep the later occurrence.
- Combine all `edges` arrays. Deduplicate by the composite key `source + target + type`.
+After ALL batches complete, run the merge-and-normalize script bundled with this skill:
+```bash
+python ./merge-batch-graphs.py "$PROJECT_ROOT"
+```
+
+This script reads all `batch-*.json` files from `$PROJECT_ROOT/.understand-anything/intermediate/`, then in one pass:
+- Combines all nodes and edges across batches
+- Normalizes node IDs (strips double prefixes, project-name prefixes, adds missing prefixes, canonicalizes `func:` → `function:`)
+- Normalizes complexity values (`low`→`simple`, `medium`→`moderate`, `high`→`complex`, etc.)
+- Rewrites edge references to match corrected node IDs
+- Deduplicates nodes by ID (keeps last occurrence) and edges by `(source, target, type)` (keeps higher weight)
+- Drops dangling edges referencing missing nodes
+- Logs all corrections and dropped items to stderr
+
+Output: `$PROJECT_ROOT/.understand-anything/intermediate/assembled-graph.json`
+
+Include the script's warnings in `$PHASE_WARNINGS` for the reviewer.

 ### Incremental update path

 Use the changed files list from Phase 0. Batch and dispatch file-analyzer subagents using the same process as above (20-30 files per batch, up to 5 concurrent, with batchImportData constructed from $IMPORT_MAP), but only for changed files.

-After batches complete, merge with the existing graph:
-1. Remove old nodes whose `filePath` matches any changed file
+After batches complete:
+1. Remove old nodes whose `filePath` matches any changed file from the existing graph
 2. Remove old edges whose `source` or `target` references a removed node
-3. Add new nodes and edges from the fresh analysis
+3. Write the pruned existing nodes/edges as `batch-existing.json` in the intermediate directory
+4. Run the same merge script — it will combine `batch-existing.json` with the fresh `batch-*.json` files:
+   ```bash
+   python ./merge-batch-graphs.py "$PROJECT_ROOT"
+   ```

 ---

-## Phase 3 — ASSEMBLE
+## Phase 3 — ASSEMBLE REVIEW

-Merge all file-analyzer results into a single set of nodes and edges. Then perform normalization and integrity cleanup **in this order**:
+Dispatch a subagent using the `assemble-reviewer` agent definition (at `agents/assemble-reviewer.md`).

-1. **Normalize node IDs:** For every node, verify the `id` field follows the convention `<type-prefix>:<path>` where type-prefix is one of `file`, `func`, `class`, `module`, `concept`, `config`, `document`, `service`, `table`, `endpoint`, `pipeline`, `schema`, `resource`. Apply these fixes:
-   - If the ID has a double prefix (e.g., `file:file:src/foo.ts`), strip the duplicate prefix.
-   - If the ID has a project-name prefix (e.g., `my-project:file:src/foo.ts`), strip the project-name portion.
-   - If the ID is a bare file path with no prefix, add the appropriate prefix based on the node's `type` field: `file` → `file:<path>`, `function` → `func:<filePath>:<name>`, `class` → `class:<filePath>:<name>`.
-   - Build a mapping of original IDs → corrected IDs.
+Pass these parameters in the dispatch prompt:

-2. **Normalize complexity values:** For every node, verify `complexity` is one of `"simple"`, `"moderate"`, `"complex"`. Apply these mappings for invalid values:
-   - `"low"`, `"easy"` → `"simple"`
-   - `"medium"`, `"intermediate"` → `"moderate"`
-   - `"high"`, `"hard"`, `"difficult"` → `"complex"`
-   - Numeric 1-3 → `"simple"`, 4-6 → `"moderate"`, 7-10 → `"complex"`
-   - Any other value → `"moderate"`
+> Review the assembled graph at `$PROJECT_ROOT/.understand-anything/intermediate/assembled-graph.json`.
+> Project root: `$PROJECT_ROOT`
+> Batch files are at: `$PROJECT_ROOT/.understand-anything/intermediate/batch-*.json`
+> Write review output to: `$PROJECT_ROOT/.understand-anything/intermediate/assemble-review.json`
+>
+> **Merge script report:**
+> ```
+> <paste the full stderr output from merge-batch-graphs.py>
+> ```
+>
+> **Import map for cross-batch edge verification:**
+> ```json
+> $IMPORT_MAP
+> ```

-3. **Rewrite edge references:** Using the ID mapping from step 1, update every edge's `source` and `target` fields. This prevents cascading edge drops when only the ID format was wrong.
-
-4. **Remove duplicate node IDs:** If duplicate node IDs exist after normalization, keep the last occurrence.
-
-5. **Remove dangling edges:** Remove any edge whose `source` or `target` references a node ID that does not exist in the merged node set.
-
-6. **Log changes:** Record counts of IDs corrected, complexity values fixed, edges rewritten, duplicates removed, and dangling edges dropped. Include these counts in the Phase warnings list passed to the reviewer.
+After the subagent completes, read `$PROJECT_ROOT/.understand-anything/intermediate/assemble-review.json` and add any notes to `$PHASE_WARNINGS`.

 ---
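
Steps 1 to 3 of the incremental update path in this hunk (prune nodes for changed files, prune edges that touch them, write the remainder as `batch-existing.json`) could look roughly like the sketch below. It is a hedged illustration rather than code from the commit: the helper name `prune_changed_files` is hypothetical, and it assumes every node carries `id` and, where relevant, `filePath`.

```python
from typing import Any


def prune_changed_files(
    graph: dict[str, Any], changed_files: set[str]
) -> dict[str, Any]:
    """Drop nodes belonging to changed files, plus any edge touching them."""
    removed_ids = {
        node["id"]
        for node in graph.get("nodes", [])
        if node.get("filePath") in changed_files
    }
    kept_nodes = [n for n in graph.get("nodes", []) if n["id"] not in removed_ids]
    kept_edges = [
        e
        for e in graph.get("edges", [])
        if e.get("source") not in removed_ids and e.get("target") not in removed_ids
    ]
    # The result has the same {"nodes": [...], "edges": [...]} shape that the
    # merge script expects from a batch file.
    return {"nodes": kept_nodes, "edges": kept_edges}
```

Serializing the returned dict with `json.dumps` into the intermediate directory as `batch-existing.json` lets the merge script pick it up alongside the fresh `batch-*.json` files.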

@@ -361,8 +384,8 @@ Assemble the full KnowledgeGraph JSON object:
    "analyzedAt": "<ISO 8601 timestamp>",
    "gitCommitHash": "<commit hash from Phase 0>"
  },
-  "nodes": [<all merged nodes from Phase 3>],
-  "edges": [<all merged edges from Phase 3>],
+  "nodes": [<all nodes from assembled-graph.json after Phase 3 review>],
+  "edges": [<all edges from assembled-graph.json after Phase 3 review>],
  "layers": [<layers from Phase 4>],
  "tour": [<steps from Phase 5>]
 }
@@ -0,0 +1,391 @@
+#!/usr/bin/env python3
+"""
+merge-batch-graphs.py — Merge and normalize batch analysis results.
+
+Combines batch-*.json files from the intermediate directory into a single
+assembled graph with normalized IDs, complexity values, and cleaned edges.
+
+Called at the end of Phase 2 of /understand. Phase 3 (ASSEMBLE REVIEW)
+then reviews the output for semantic issues the script cannot catch.
+
+Usage:
+    python merge-batch-graphs.py <project-root>
+
+Input:
+    <project-root>/.understand-anything/intermediate/batch-*.json
+
+Output:
+    <project-root>/.understand-anything/intermediate/assembled-graph.json
+"""
+
+import json
+import re
+import sys
+from collections import Counter
+from pathlib import Path
+from typing import Any
+
+
+# ── Configuration ─────────────────────────────────────────────────────────
+
+VALID_NODE_PREFIXES = {
+    "file", "func", "function", "class", "module", "concept",
+    "config", "document", "service", "table", "endpoint",
+    "pipeline", "schema", "resource",
+}
+
+# node.type → canonical ID prefix
+TYPE_TO_PREFIX: dict[str, str] = {
+    "file": "file",
+    "function": "function",
+    "class": "class",
+    "module": "module",
+    "concept": "concept",
+    "config": "config",
+    "document": "document",
+    "service": "service",
+    "table": "table",
+    "endpoint": "endpoint",
+    "pipeline": "pipeline",
+    "schema": "schema",
+    "resource": "resource",
+}
+
+COMPLEXITY_MAP: dict[str, str] = {
+    "low": "simple",
+    "easy": "simple",
+    "medium": "moderate",
+    "intermediate": "moderate",
+    "high": "complex",
+    "hard": "complex",
+    "difficult": "complex",
+}
+
+VALID_COMPLEXITY = {"simple", "moderate", "complex"}
+
+
+# ── Batch loading ─────────────────────────────────────────────────────────
+
+def load_batch(path: Path) -> dict[str, Any] | None:
+    """Load a batch JSON file, tolerating malformed files."""
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError) as e:
+        print(f"  Warning: skipping {path.name}: {e}", file=sys.stderr)
+        return None
+
+    if not isinstance(data, dict) or not isinstance(data.get("nodes"), list):
+        print(f"  Warning: skipping {path.name}: missing or invalid 'nodes' array", file=sys.stderr)
+        return None
+    if not isinstance(data.get("edges"), list):
+        print(f"  Warning: skipping {path.name}: missing or invalid 'edges' array", file=sys.stderr)
+        return None
+
+    return data
+
+
+# ── ID normalization ──────────────────────────────────────────────────────
+
+def classify_id_fix(original: str, corrected: str) -> str:
+    """Return a human-readable pattern label for an ID correction."""
+    # Double prefix: "file:file:..." → "file:..."
+    for prefix in VALID_NODE_PREFIXES:
+        if original.startswith(f"{prefix}:{prefix}:"):
+            return f"{prefix}:{prefix}: → {prefix}: (double prefix)"
+
+    # Project-name prefix: "my-project:file:..." → "file:..."
+    parts = original.split(":")
+    if len(parts) >= 3 and parts[0] not in VALID_NODE_PREFIXES and parts[1] in VALID_NODE_PREFIXES:
+        return f"<project>:{parts[1]}: → {parts[1]}: (project-name prefix)"
+
+    # Legacy func: → function:
+    if original.startswith("func:") and corrected.startswith("function:"):
+        return "func: → function: (prefix canonicalization)"
+
+    # Bare path → prefixed
+    if not any(original.startswith(f"{p}:") for p in VALID_NODE_PREFIXES):
+        prefix = corrected.split(":")[0]
+        return f"bare path → {prefix}: (missing prefix)"
+
+    return f"{original} → {corrected}"
+
+
+def normalize_node_id(node_id: str, node: dict[str, Any]) -> str:
+    """Normalize a node ID, returning the corrected version."""
+    nid = node_id
+
+    # Strip double prefix: "file:file:src/foo.ts" → "file:src/foo.ts"
+    for prefix in VALID_NODE_PREFIXES:
+        double = f"{prefix}:{prefix}:"
+        if nid.startswith(double):
+            nid = nid[len(prefix) + 1:]
+            break
+
+    # Strip project-name prefix: "my-project:file:src/foo.ts" → "file:src/foo.ts"
+    # Pattern: <word>:<valid-prefix>:<path>
+    match = re.match(r"^[^:]+:(" + "|".join(re.escape(p) for p in VALID_NODE_PREFIXES) + r"):(.+)$", nid)
+    if match:
+        # Only strip if the first segment is NOT a valid prefix itself
+        first_seg = nid.split(":")[0]
+        if first_seg not in VALID_NODE_PREFIXES:
+            nid = f"{match.group(1)}:{match.group(2)}"
+
+    # Canonicalize legacy prefix: func: → function:
+    if nid.startswith("func:") and not nid.startswith("function:"):
+        nid = "function:" + nid[5:]
+
+    # Add missing prefix for bare file paths
+    has_prefix = any(nid.startswith(f"{p}:") for p in VALID_NODE_PREFIXES)
+    if not has_prefix:
+        node_type = node.get("type", "file")
+        prefix = TYPE_TO_PREFIX.get(node_type, "file")
+        if node_type in ("function", "class"):
+            file_path = node.get("filePath", "")
+            name = node.get("name", nid)
+            nid = f"{prefix}:{file_path}:{name}" if file_path else f"{prefix}:{nid}"
+        else:
+            nid = f"{prefix}:{nid}"
+
+    return nid
+
+
+def normalize_complexity(value: Any) -> tuple[str, str]:
+    """Normalize a complexity value. Returns (normalized, status).
+
+    status is one of:
+      "valid"    — already a valid value, no change needed
+      "mapped"   — known alias, confidently mapped (goes to Fixed report)
+      "unknown"  — unrecognized value, defaulted to moderate (goes to Could-not-fix report)
+    """
+    if isinstance(value, str):
+        lower = value.strip().lower()
+        if lower in VALID_COMPLEXITY:
+            return lower, "valid"
+        if lower in COMPLEXITY_MAP:
+            return COMPLEXITY_MAP[lower], "mapped"
+        # Unknown string — default but flag it
+        return "moderate", "unknown"
+    elif isinstance(value, (int, float)) and not isinstance(value, bool):
+        n = int(value)
+        if n <= 3:
+            return "simple", "mapped"
+        elif n <= 6:
+            return "moderate", "mapped"
+        else:
+            return "complex", "mapped"
+    # None or other type — default but flag it
+    return "moderate", "unknown"
+
+
+# ── Main merge + normalize ────────────────────────────────────────────────
+
+def merge_and_normalize(batches: list[dict[str, Any]]) -> tuple[dict[str, Any], list[str]]:
+    """Merge batch results and normalize. Returns (assembled_graph, report_lines)."""
+
+    # ── Pattern counters for "Fixed" report ──────────────────────────
+    id_fix_patterns: Counter[str] = Counter()
+    complexity_fix_patterns: Counter[str] = Counter()
+
+    # ── Detail lists for "Could not fix" report ──────────────────────
+    unfixable: list[str] = []
+
+    # ── Step 1: Combine all nodes and edges ──────────────────────────
+    all_nodes: list[dict] = []
+    all_edges: list[dict] = []
+    for batch in batches:
+        all_nodes.extend(batch.get("nodes", []))
+        all_edges.extend(batch.get("edges", []))
+
+    total_input_nodes = len(all_nodes)
+    total_input_edges = len(all_edges)
+
+    # ── Step 2: Normalize node IDs and build ID mapping ──────────────
+    id_mapping: dict[str, str] = {}  # original → corrected
+    nodes_with_ids: list[dict] = []
+    unknown_node_types: Counter[str] = Counter()
+
+    for i, node in enumerate(all_nodes):
+        original_id = node.get("id")
+        if not original_id:
+            unfixable.append(f"Node[{i}] has no 'id' field (name={node.get('name', '?')}, type={node.get('type', '?')})")
+            continue
+
+        # Flag unknown node types
+        node_type = node.get("type", "")
+        if node_type and node_type not in TYPE_TO_PREFIX:
+            unknown_node_types[node_type] += 1
+
+        nodes_with_ids.append(node)
+        corrected_id = normalize_node_id(original_id, node)
+        if corrected_id != original_id:
+            pattern = classify_id_fix(original_id, corrected_id)
+            id_fix_patterns[pattern] += 1
+            id_mapping[original_id] = corrected_id
+            node["id"] = corrected_id
+
+    # ── Step 3: Normalize complexity ─────────────────────────────────
+    complexity_unknown_patterns: Counter[str] = Counter()
+
+    for node in nodes_with_ids:
+        original = node.get("complexity")
+        normalized, status = normalize_complexity(original)
+
+        if status == "mapped":
+            orig_repr = repr(original) if not isinstance(original, str) else f'"{original}"'
+            complexity_fix_patterns[f"{orig_repr} → \"{normalized}\""] += 1
+        elif status == "unknown":
+            orig_repr = repr(original) if not isinstance(original, str) else f'"{original}"'
+            complexity_unknown_patterns[f"complexity {orig_repr} → defaulted to \"moderate\""] += 1
+
+        node["complexity"] = normalized
+
+    # ── Step 4: Rewrite edge references ──────────────────────────────
+    edges_rewritten = 0
+    for edge in all_edges:
+        src = edge.get("source", "")
+        tgt = edge.get("target", "")
+        new_src = id_mapping.get(src, src)
+        new_tgt = id_mapping.get(tgt, tgt)
+        if new_src != src or new_tgt != tgt:
+            edges_rewritten += 1
+            edge["source"] = new_src
+            edge["target"] = new_tgt
+
+    # ── Step 5: Deduplicate nodes by ID (keep last) ─────────────────
+    duplicate_count = 0
+    nodes_by_id: dict[str, dict] = {}
+    for node in nodes_with_ids:
+        nid = node.get("id", "")
+        if nid in nodes_by_id:
+            duplicate_count += 1
+        nodes_by_id[nid] = node
+
+    # ── Step 6: Deduplicate edges, drop dangling ─────────────────────
+    node_ids = set(nodes_by_id.keys())
+    edges_by_key: dict[tuple[str, str, str], dict] = {}
+    for edge in all_edges:
+        src = edge.get("source", "")
+        tgt = edge.get("target", "")
+        etype = edge.get("type", "")
+
+        if src not in node_ids or tgt not in node_ids:
+            missing = []
+            if src not in node_ids:
+                missing.append(f"source '{src}'")
+            if tgt not in node_ids:
+                missing.append(f"target '{tgt}'")
+            unfixable.append(f"Edge {src} → {tgt} ({etype}): dropped, missing {', '.join(missing)}")
+            continue
+
+        key = (src, tgt, etype)
+        existing = edges_by_key.get(key)
+        if existing is None or edge.get("weight", 0) > existing.get("weight", 0):
+            edges_by_key[key] = edge
+
+    # ── Build report ─────────────────────────────────────────────────
+    report: list[str] = []
+    report.append(f"Input: {total_input_nodes} nodes, {total_input_edges} edges")
+
+    # Fixed section — grouped by pattern
+    fixed_lines: list[str] = []
+    if id_fix_patterns:
+        for pattern, count in id_fix_patterns.most_common():
+            fixed_lines.append(f"  {count:>4} × {pattern}")
+    if complexity_fix_patterns:
+        for pattern, count in complexity_fix_patterns.most_common():
+            fixed_lines.append(f"  {count:>4} × complexity {pattern}")
+    if edges_rewritten:
+        fixed_lines.append(f"  {edges_rewritten:>4} × edge references rewritten after ID normalization")
+    if duplicate_count:
+        fixed_lines.append(f"  {duplicate_count:>4} × duplicate node IDs removed (kept last)")
+
+    if fixed_lines:
+        report.append("")
+        report.append(f"Fixed ({sum(id_fix_patterns.values()) + sum(complexity_fix_patterns.values()) + edges_rewritten + duplicate_count} corrections):")
+        report.extend(fixed_lines)
+
+    # Could not fix section — unknown patterns (grouped) + individual details
+    unfixable_total = (
+        len(unfixable)
+        + sum(complexity_unknown_patterns.values())
+        + sum(unknown_node_types.values())
+    )
+    if unfixable_total:
+        report.append("")
+        report.append(f"Could not fix ({unfixable_total} issues — needs agent review):")
+        # Unknown node types (grouped by count)
+        for ntype, count in unknown_node_types.most_common():
+            report.append(f"  {count:>4} × unknown node type \"{ntype}\" (not in schema, kept as-is)")
+        # Unknown complexity patterns (grouped by count)
+        for pattern, count in complexity_unknown_patterns.most_common():
+            report.append(f"  {count:>4} × {pattern}")
+        # Individual unfixable items
+        for detail in unfixable:
+            report.append(f"  - {detail}")
+
+    # Output stats
+    report.append("")
+    report.append(f"Output: {len(nodes_by_id)} nodes, {len(edges_by_key)} edges")
+
+    assembled = {
+        "nodes": list(nodes_by_id.values()),
+        "edges": list(edges_by_key.values()),
+    }
+
+    return assembled, report
+
+
+# ── Main ──────────────────────────────────────────────────────────────────
+
+def main() -> None:
+    if len(sys.argv) < 2:
+        print("Usage: python merge-batch-graphs.py <project-root>", file=sys.stderr)
+        sys.exit(1)
+
+    project_root = Path(sys.argv[1]).resolve()
+    intermediate_dir = project_root / ".understand-anything" / "intermediate"
+
+    if not intermediate_dir.is_dir():
+        print(f"Error: {intermediate_dir} does not exist", file=sys.stderr)
+        sys.exit(1)
+
+    # Discover batch files
+    batch_files = sorted(intermediate_dir.glob("batch-*.json"))
+    if not batch_files:
+        print("Error: no batch-*.json files found in intermediate/", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Found {len(batch_files)} batch files:", file=sys.stderr)
+
+    # Load batches
+    batches: list[dict[str, Any]] = []
+    for f in batch_files:
+        batch = load_batch(f)
+        if batch is not None:
+            batches.append(batch)
+            n = len(batch.get("nodes", []))
+            e = len(batch.get("edges", []))
+            print(f"  {f.name}: {n} nodes, {e} edges", file=sys.stderr)
+
+    if not batches:
+        print("Error: no valid batch files loaded", file=sys.stderr)
+        sys.exit(1)
+
+    # Merge and normalize
+    assembled, report = merge_and_normalize(batches)
+
+    # Print report
+    print("", file=sys.stderr)
+    for line in report:
+        print(line, file=sys.stderr)
+
+    # Write output
+    output_path = intermediate_dir / "assembled-graph.json"
+    output_path.write_text(json.dumps(assembled, indent=2, ensure_ascii=False), encoding="utf-8")
+
+    size_kb = output_path.stat().st_size / 1024
+    print(f"\nWritten to {output_path} ({size_kb:.0f} KB)", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()
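
To make the four ID-normalization rules in the script above easier to follow, here is a condensed, standalone sketch of the same idea. It is a simplification for illustration only, not the script's code: it uses a subset of the prefixes, takes the node type as a plain argument, and prefixes bare paths with the type directly instead of consulting `filePath` and `name`. The names `PREFIXES` and `normalize_id` are hypothetical.

```python
import re

# Subset of the valid prefixes, for illustration only.
PREFIXES = ("file", "func", "function", "class", "module")
_PREFIX_ALT = "|".join(PREFIXES)


def normalize_id(node_id: str, node_type: str = "file") -> str:
    """Apply the four normalization rules in order and return the result."""
    # 1. Collapse a doubled prefix: "file:file:x" -> "file:x"
    nid = re.sub(rf"^({_PREFIX_ALT}):\1:", r"\1:", node_id)

    # 2. Drop a leading project name: "proj:file:x" -> "file:x"
    head, _, rest = nid.partition(":")
    if head not in PREFIXES and re.match(rf"^({_PREFIX_ALT}):.", rest):
        nid = rest

    # 3. Canonicalize the legacy prefix: "func:x" -> "function:x"
    if nid.startswith("func:"):
        nid = "function:" + nid[len("func:"):]

    # 4. Add a prefix to a bare path, based on the node's type
    if not nid.startswith(tuple(f"{p}:" for p in PREFIXES)):
        nid = f"{node_type}:{nid}"

    return nid
```

With this sketch, `file:file:src/foo.ts` and `my-project:file:src/foo.ts` both become `file:src/foo.ts`, `func:src/foo.ts:bar` becomes `function:src/foo.ts:bar`, and the bare path `src/foo.ts` becomes `file:src/foo.ts`.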
@@ -0,0 +1,289 @@
+#!/usr/bin/env python3
+"""
+merge-subdomain-graphs.py — Merge subdomain knowledge-graph files into one.
+
+Auto-discovers *knowledge-graph*.json files in .understand-anything/
+(excluding knowledge-graph.json itself), loads the existing
+knowledge-graph.json as a base if present, and merges everything
+into a single knowledge-graph.json.
+
+Usage:
+    python merge-subdomain-graphs.py <project-root> [file1.json file2.json ...]
+
+If no files are specified, auto-discovers subdomain graphs. The main
+knowledge-graph.json is loaded as a base but never as a discovery input
+(prevents self-merging on repeated runs).
+
+Output:
+    <project-root>/.understand-anything/knowledge-graph.json
+"""
+
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+from typing import Any
+
+
+def load_graph(path: Path) -> dict[str, Any] | None:
+    """Load and minimally validate a knowledge graph JSON file."""
+    try:
+        data = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, ValueError) as e:  # ValueError covers JSONDecodeError and UnicodeDecodeError
+        print(f"  Skipping {path.name}: {e}", file=sys.stderr)
+        return None
+
+    # Must be a JSON object with, at minimum, "nodes" and "edges" arrays
+    if not isinstance(data, dict) or not isinstance(data.get("nodes"), list) or not isinstance(data.get("edges"), list):
+        print(f"  Skipping {path.name}: not an object with nodes and edges arrays", file=sys.stderr)
+        return None
+
+    return data
+
+
+def merge_graphs(graphs: list[dict[str, Any]]) -> tuple[dict[str, Any], list[str]]:
+    """Merge multiple knowledge graph dicts into one. Returns (merged, report_lines)."""
+
+    # ── Pattern counters for "Fixed" report ──────────────────────────
+    node_dedup_by_type: Counter[str] = Counter()
+
+    # ── Detail lists for "Could not fix" report ──────────────────────
+    unfixable: list[str] = []
+
+    total_input_nodes = sum(len(g.get("nodes", [])) for g in graphs)
+    total_input_edges = sum(len(g.get("edges", [])) for g in graphs)
+
+    # ── Nodes: deduplicate by id, later occurrence wins ───────────────
+    nodes_by_id: dict[str, dict] = {}
+    for g in graphs:
+        for node in g.get("nodes", []):
+            nid = node.get("id")
+            if not nid:
+                unfixable.append(f"Node with no 'id' (name={node.get('name', '?')}, type={node.get('type', '?')})")
+                continue
+            if nid in nodes_by_id:
+                node_type = node.get("type", "?")
+                node_dedup_by_type[node_type] += 1
+            nodes_by_id[nid] = node
+
+    # ── Edges: deduplicate by (source, target, type), higher weight wins
+    edge_dedup_count = 0
+    edges_by_key: dict[tuple[str, str, str], dict] = {}
+    for g in graphs:
+        for edge in g.get("edges", []):
+            key = (edge.get("source", ""), edge.get("target", ""), edge.get("type", ""))
+            existing = edges_by_key.get(key)
+            if existing is None:
+                edges_by_key[key] = edge
+            else:
+                edge_dedup_count += 1
+                if edge.get("weight", 0) > existing.get("weight", 0):
+                    edges_by_key[key] = edge
+
+    # Drop edges referencing missing nodes
+    node_ids = set(nodes_by_id.keys())
+    valid_edges: list[dict] = []
+    for e in edges_by_key.values():
+        src, tgt = e.get("source", ""), e.get("target", "")
+        if src in node_ids and tgt in node_ids:
+            valid_edges.append(e)
+        else:
+            missing = []
+            if src not in node_ids:
+                missing.append(f"source '{src}'")
+            if tgt not in node_ids:
+                missing.append(f"target '{tgt}'")
+            unfixable.append(f"Edge {src} → {tgt} ({e.get('type', '?')}): dropped, missing {', '.join(missing)}")
+
+    # ── Layers: merge by id, union nodeIds ────────────────────────────
+    layers_by_id: dict[str, dict] = {}
+    for g in graphs:
+        for layer in g.get("layers", []):
+            lid = layer.get("id", "")
+            if lid in layers_by_id:
+                # Order-preserving union keeps the output deterministic across runs
+                combined_ids = layers_by_id[lid].get("nodeIds", []) + layer.get("nodeIds", [])
+                layers_by_id[lid]["nodeIds"] = list(dict.fromkeys(combined_ids))
+            else:
+                layers_by_id[lid] = {**layer}
+
+    # Drop dangling layer nodeIds
+    dropped_layer_refs = 0
+    for layer in layers_by_id.values():
+        before = len(layer.get("nodeIds", []))
+        layer["nodeIds"] = [nid for nid in layer.get("nodeIds", []) if nid in node_ids]
+        diff = before - len(layer["nodeIds"])
+        if diff:
+            dropped_layer_refs += diff
+
+    # ── Tour: concatenate, dedupe by title, re-number order ───────────
+    all_tour_steps: list[dict] = []
+    seen_titles: set[str] = set()
+    for g in graphs:
+        for step in g.get("tour", []):
+            title = step.get("title", "")
+            if not title or title not in seen_titles:  # untitled steps are never treated as duplicates
+                seen_titles.add(title)
+                all_tour_steps.append({**step})
+
+    # Drop dangling tour nodeIds and re-number
+    dropped_tour_refs = 0
+    for i, step in enumerate(all_tour_steps, start=1):
+        step["order"] = i
+        before = len(step.get("nodeIds", []))
+        step["nodeIds"] = [nid for nid in step.get("nodeIds", []) if nid in node_ids]
+        diff = before - len(step["nodeIds"])
+        if diff:
+            dropped_tour_refs += diff
+
+    # ── Project metadata: merge ───────────────────────────────────────
+    languages: list[str] = []
+    frameworks: list[str] = []
+    descriptions: list[str] = []
+    latest_at = ""
+    latest_hash = ""
+    project_name = ""
+
+    for g in graphs:
+        proj = g.get("project", {})
+        project_name = proj.get("name", "") or project_name
+        for lang in proj.get("languages", []):
+            if lang not in languages:
+                languages.append(lang)
+        for fw in proj.get("frameworks", []):
+            if fw not in frameworks:
+                frameworks.append(fw)
+        desc = proj.get("description", "")
+        if desc and desc not in descriptions:
+            descriptions.append(desc)
+        analyzed = proj.get("analyzedAt", "")
+        if analyzed > latest_at:
+            latest_at = analyzed
+            latest_hash = proj.get("gitCommitHash", latest_hash)
+
+    # ── Build report ─────────────────────────────────────────────────
+    report: list[str] = []
+    report.append(f"Input: {total_input_nodes} nodes, {total_input_edges} edges (from {len(graphs)} graphs)")
+
+    # Fixed section
+    fixed_lines: list[str] = []
+    if node_dedup_by_type:
+        for ntype, count in node_dedup_by_type.most_common():
+            fixed_lines.append(f"  {count:>4} × duplicate '{ntype}' nodes removed (kept later)")
+    if edge_dedup_count:
+        fixed_lines.append(f"  {edge_dedup_count:>4} × duplicate edges removed (kept higher weight)")
+    if dropped_layer_refs:
+        fixed_lines.append(f"  {dropped_layer_refs:>4} × dangling layer nodeId refs removed")
+    if dropped_tour_refs:
+        fixed_lines.append(f"  {dropped_tour_refs:>4} × dangling tour nodeId refs removed")
+
+    if fixed_lines:
+        total_fixed = sum(node_dedup_by_type.values()) + edge_dedup_count + dropped_layer_refs + dropped_tour_refs
+        report.append("")
+        report.append(f"Fixed ({total_fixed} corrections):")
+        report.extend(fixed_lines)
+
+    # Could not fix section
+    if unfixable:
+        report.append("")
+        report.append(f"Could not fix ({len(unfixable)} issues — needs agent review):")
+        for detail in unfixable:
+            report.append(f"  - {detail}")
+
+    # Output stats
+    report.append("")
+    report.append(f"Output: {len(nodes_by_id)} nodes, {len(valid_edges)} edges, {len(layers_by_id)} layers, {len(all_tour_steps)} tour steps")
+
+    merged: dict[str, Any] = {
+        "version": "1.0.0",
+        "project": {
+            "name": project_name,
+            "languages": languages,
+            "frameworks": frameworks,
+            "description": " | ".join(descriptions) if len(descriptions) > 1 else (descriptions[0] if descriptions else ""),
+            "analyzedAt": latest_at,
+            "gitCommitHash": latest_hash,
+        },
+        "nodes": list(nodes_by_id.values()),
+        "edges": valid_edges,
+        "layers": list(layers_by_id.values()),
+        "tour": all_tour_steps,
+    }
+
+    return merged, report
+
+
+def main() -> None:
+    if len(sys.argv) < 2:
+        print("Usage: python merge-subdomain-graphs.py <project-root> [file1.json file2.json ...]", file=sys.stderr)
+        sys.exit(1)
+
+    project_root = Path(sys.argv[1]).resolve()
+    ua_dir = project_root / ".understand-anything"
+
+    if not ua_dir.is_dir():
+        print(f"Error: {ua_dir} does not exist", file=sys.stderr)
+        sys.exit(1)
+
+    output_path = ua_dir / "knowledge-graph.json"
+
+    # Determine which files to merge
+    if len(sys.argv) > 2:
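+        # Explicit file list; the main output file is skipped because it is loaded as the base below
+        graph_files = [p for p in (Path(f).resolve() for f in sys.argv[2:]) if p != output_path.resolve()]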
+    else:
+        # Auto-discover subdomain graphs — exclude the main output file
+        # to avoid self-merging on repeated runs
+        graph_files = sorted(
+            p for p in ua_dir.glob("*knowledge-graph*.json")
+            if p.name != "knowledge-graph.json"
+        )
+
+    if not graph_files:
+        print("No subdomain graphs found to merge", file=sys.stderr)
+        sys.exit(0)
+
+    print(f"Found {len(graph_files)} subdomain graphs:", file=sys.stderr)
+    for f in graph_files:
+        print(f"  - {f.name}", file=sys.stderr)
+
+    # Load subdomain graphs
+    graphs: list[dict[str, Any]] = []
+    for f in graph_files:
+        g = load_graph(f)
+        if g is not None:
+            graphs.append(g)
+            node_count = len(g.get("nodes", []))
+            edge_count = len(g.get("edges", []))
+            print(f"    Loaded {f.name}: {node_count} nodes, {edge_count} edges", file=sys.stderr)
+
+    if not graphs:
+        print("Error: no valid subdomain graphs loaded", file=sys.stderr)
+        sys.exit(1)
+
+    # Load the existing main graph as base (if it exists)
+    if output_path.exists():
+        base = load_graph(output_path)
+        if base:
+            node_count = len(base.get("nodes", []))
+            edge_count = len(base.get("edges", []))
+            print(f"    Loaded base knowledge-graph.json: {node_count} nodes, {edge_count} edges", file=sys.stderr)
+            graphs.insert(0, base)  # Base first — subdomain data wins on conflict
+
+    # Merge
+    merged, report = merge_graphs(graphs)
+
+    # Print report
+    print("", file=sys.stderr)
+    for line in report:
+        print(line, file=sys.stderr)
+
+    # Write output
+    output_path.write_text(json.dumps(merged, indent=2, ensure_ascii=False), encoding="utf-8")
+
+    size_kb = output_path.stat().st_size / 1024
+    print(f"\nWritten to {output_path} ({size_kb:.0f} KB)", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()