Merge pull request #204 from Lum1104/feat/semantic-batching-and-output-chunking

fix(#159): semantic batching + bundled importMap + Phase 1 speedup
Yuxiang Lin
2026-05-24 20:12:14 +08:00
committed by GitHub
Unverified
parent 42d70c3f9c
commit a59a573a1d
30 changed files with 12235 additions and 307 deletions
+1 -1
@@ -1,7 +1,7 @@
{
"name": "understand-anything",
"description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-"version": "2.7.4",
+"version": "2.7.5",
"author": {
"name": "Lum1104"
},
+1 -1
@@ -1,7 +1,7 @@
{
"name": "understand-anything",
"description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-"version": "2.7.4",
+"version": "2.7.5",
"author": {
"name": "Lum1104"
},
+1 -1
@@ -2,7 +2,7 @@
"name": "understand-anything",
"displayName": "Understand Anything",
"description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
-"version": "2.7.4",
+"version": "2.7.5",
"author": {
"name": "Lum1104"
},
+1 -1
@@ -33,4 +33,4 @@ jobs:
run: pnpm --filter @understand-anything/core test
- name: Test skill
run: pnpm --filter @understand-anything/skill test
run: pnpm test
+1 -1
@@ -35,7 +35,7 @@ An open-source tool combining LLM intelligence + static analysis to produce inte
- `pnpm --filter @understand-anything/core build` — Build the core package
- `pnpm --filter @understand-anything/core test` — Run core tests
- `pnpm --filter @understand-anything/skill build` — Build the plugin package
- `pnpm --filter @understand-anything/skill test` — Run plugin tests
- `pnpm test` — Run all tests (skill tests live at repo-root `tests/skill/`, picked up by root `vitest.config.ts`)
- `pnpm --filter @understand-anything/dashboard build` — Build the dashboard
- `pnpm dev:dashboard` — Start dashboard dev server
- `pnpm lint` — Run ESLint across the project
File diff suppressed because it is too large
@@ -0,0 +1,587 @@
# Semantic Batching and Output Chunking Design
**Date:** 2026-05-24
**Status:** Draft
**Branch:** `feat/semantic-batching-and-output-chunking`
**Issue:** [#159](https://github.com/Lum1104/Understand-Anything/issues/159) — Frequently seeing output limit exceeded
---
## Problem
The `/understand` skill's Phase 2 dispatches `file-analyzer` subagents in batches of 20-30 files each (`skills/understand/SKILL.md:282`). Two issues compound on output-constrained LLM backends (notably Bedrock OPUS with default max_tokens of 4096-8192):
1. **Output cap pressure.** Each `file-analyzer` writes one `batch-<N>.json` containing all nodes (file + functions + classes) and edges for its batch. For 25 dense files the JSON content easily exceeds the per-turn `Write(content=...)` token budget. The agent improvises by entering an undefined "minimal output mode" and drops nodes/edges silently. Issue #159 reports this for OPUS on Bedrock at the 100-file scale.
2. **Count-based batching breaks module semantics.** Files are batched by count, not by logical relationship. Files that import each other (and would together form an `auth` module, an `api` module, etc.) get split across batches. The file-analyzer only sees within-batch edges confidently; `calls`/`related`/`inherits`/`implements` edges between modules get dropped at batch boundaries.
The existing `recover_imports_from_scan` in `merge-batch-graphs.py:913` is a deterministic safety net for `imports` edges — but it cannot recover semantic edges (calls / related / inherits / implements). Those are lost.
---
## Goals
- Eliminate "Batch X failed (output limit)" from `/understand` runs on Bedrock OPUS for projects up to 500 files.
- Improve cross-batch semantic edge coverage by replacing count-based batching with Louvain community detection on the import graph.
- Maintain `imports` edge coverage parity (no regression on existing safety net).
- Stay within one PR — defer broader refactors to follow-ups (Section "Out of scope").
## Non-goals
- Refactoring Phase 1 / 2 tree-sitter usage to deduplicate per-batch extraction.
- Adding LLM-generated file summaries to neighborMap.
- Auto-tuning output thresholds per provider.
---
## Architecture
Pipeline before:
```
Phase 1 project-scanner → scan-result.json (files + importMap)
Phase 2 file-analyzer (×N concur) → batch-<i>.json (one per batch; SKILL.md prose batching)
Phase 2 (end) merge-batch-graphs.py → assembled-graph.json
```
Pipeline after:
```
Phase 1 project-scanner → scan-result.json (unchanged)
Phase 1.5 compute-batches.mjs → batches.json (NEW — semantic batching + neighborMap)
Phase 2 file-analyzer (×N concur) → batch-<i>.json (single) OR batch-<i>-part-<k>.json (split)
Phase 2 (end) merge-batch-graphs.py → assembled-graph.json (verified, no code change)
```
**Phase 1.5 single responsibility:** topology decision + neighborMap construction. Pure algorithm — reads `scan-result.json`, writes `batches.json`, no LLM calls.
**Phase 2 changes:** SKILL.md stops doing prose batching; iterates `batches.json` and dispatches one file-analyzer per batch.
**file-analyzer changes:** consumes neighborMap; self-checks output size before writing; splits into `batch-<i>-part-<k>.json` when above thresholds.
**merge-batch-graphs.py:** no code changes are needed for correctness, because the `batch-*.json` glob and sort-key regex already accept multi-part naming. This PR only adds a test fixture and a more detailed stderr report (see Component 4).
---
## Component 1 — `compute-batches.mjs`
**Location:** `understand-anything-plugin/skills/understand/compute-batches.mjs`
**Invocation:** `node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT [--changed-files=<path>]`
**Input:** `$PROJECT_ROOT/.understand-anything/intermediate/scan-result.json`
**Output:** `$PROJECT_ROOT/.understand-anything/intermediate/batches.json`
### Dependencies
Added to `understand-anything-plugin/package.json`:
- `graphology` (~10KB)
- `graphology-communities-louvain` (~30KB)
Reuses `@understand-anything/core`'s `TreeSitterPlugin` and `PluginRegistry` (already imported by `extract-structure.mjs`).
### Algorithm
```
1. Load scan-result.json.
2. Partition files by fileCategory:
- codeFiles = files where fileCategory === "code"
- nonCodeFiles = the rest
3. Code batching (Louvain on import graph):
a. Build undirected graph: nodes = codeFiles, edges = importMap relations
(weight=1, undirected so import and imported-by both count).
b. Run graphology-communities-louvain → community assignment per file.
c. For any community with size > 35 (max): split via edge-betweenness greedy
cut (or simpler weakly-connected-component partition) until each
sub-community ≤ 35. Log warning per split.
(Whether this branch fires is decided by the implementation prototype
step — see "Prototype-first implementation" below.)
d. Communities with size < 5 are kept as-is. Wasted dispatches are
bounded by the 5-concurrent cap, and the alternative ("merge small")
adds edge cases without proportional value.
4. Non-code batching (hardcoded heuristics, moved from SKILL.md prose):
- Group A: For each directory containing a `Dockerfile`, bundle that
directory's `Dockerfile` + any `docker-compose.*` + any
`.dockerignore` → one batch per such directory (so multi-service
repos with several Dockerfiles get one batch per service).
- Group B: `.github/workflows/*.yml` files → one batch.
- Group C: `.gitlab-ci.yml` + files under `.circleci/` → one batch.
- Group D: SQL files under any `migrations/` or `migration/` directory,
sorted by filename → one batch per directory.
- Group E: All other non-code files grouped by their immediate parent
directory, max 20 per batch.
5. Assign batchIndex: code communities first (1..N), non-code groups
second (N+1..M).
6. Exports extraction:
- For each code file, run TreeSitterPlugin.extract() and collect
top-level exports (function names, class names, exported const names).
- Per-file failures: catch, set exports = [], emit warning.
- Non-code files: exports = [].
7. Construct neighborMap (1-hop):
For each file F in batch B:
neighborMap[F.path] = [
{ path: G.path, batchIndex: G.batch, symbols: G.exports }
for G in importMap[F.path] ∪ reverseImportMap[F.path]
where G.batch ≠ B
]
If neighborMap[F.path].length > 50, truncate to top 50 by neighbor
degree (highest-imported neighbors kept), emit warning.
8. Construct batchImportData:
For each batch B:
batchImportData[F.path] = importMap[F.path] for F in B.files
9. Write batches.json.
Fallback (script-internal): If steps 3a-3c throw, catch → emit warning
→ assign batches by alphabetical chunking (12 files per code batch).
Steps 4, 6, 7, 8 still run normally. Set `algorithm: "count-fallback"`
in the output.
```
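The non-code heuristics of step 4 could look roughly like the sketch below. It is an illustration under assumptions, not the shipped script: `groupNonCode`, `dirOf`, and `baseOf` are hypothetical names, paths are assumed to be POSIX-style and relative to the project root, and "under any `migrations/` directory" is read as "any ancestor directory named `migrations` or `migration`".

```javascript
// Hypothetical helpers: split a relative POSIX path into directory and base name.
function dirOf(p) {
  const i = p.lastIndexOf('/');
  return i === -1 ? '.' : p.slice(0, i);
}
function baseOf(p) {
  return p.slice(p.lastIndexOf('/') + 1);
}

// Groups non-code paths per Groups A-E and returns an array of batches (arrays of paths).
function groupNonCode(paths, maxPerBatch = 20) {
  const groups = new Map(); // group key -> paths in that group
  const add = (key, p) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(p);
  };
  // Group A is anchored on directories that actually contain a Dockerfile.
  const dockerDirs = new Set(paths.filter((p) => baseOf(p) === 'Dockerfile').map(dirOf));

  for (const p of [...paths].sort()) {
    const dir = dirOf(p);
    const base = baseOf(p);
    const dockerRelated =
      base === 'Dockerfile' || base.startsWith('docker-compose.') || base === '.dockerignore';
    if (dockerDirs.has(dir) && dockerRelated) add(`A:${dir}`, p);
    else if (/^\.github\/workflows\/[^/]+\.yml$/.test(p)) add('B', p);
    else if (p === '.gitlab-ci.yml' || p.startsWith('.circleci/')) add('C', p);
    else if (base.endsWith('.sql') && /(^|\/)migrations?(\/|$)/.test(dir)) add(`D:${dir}`, p);
    else add(`E:${dir}`, p);
  }

  // Only Group E is capped at maxPerBatch; Groups A-D stay whole.
  const batches = [];
  for (const [key, files] of groups) {
    if (!key.startsWith('E:')) {
      batches.push(files);
      continue;
    }
    for (let i = 0; i < files.length; i += maxPerBatch) {
      batches.push(files.slice(i, i + maxPerBatch));
    }
  }
  return batches;
}
```

Sorting the paths first makes the grouping deterministic and also gives Group D its filename order for free.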
### Louvain implementation
Use `graphology-communities-louvain`'s default modularity-greedy algorithm:
```js
import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';
const graph = new Graph({ type: 'undirected' });
for (const file of codeFiles) graph.addNode(file.path);
for (const [src, targets] of Object.entries(importMap)) {
for (const tgt of targets) {
if (graph.hasNode(src) && graph.hasNode(tgt) && !graph.hasEdge(src, tgt)) {
graph.addEdge(src, tgt);
}
}
}
const communities = louvain(graph); // { nodeId: communityId }
```
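Once every file has a batch assignment, the neighborMap of step 7 can be derived from `importMap` alone. The following is a minimal sketch, assuming `batchOf` is a `Map` from path to batchIndex and `exportsOf` is a `Map` from path to exported symbol names; both names and the function `buildNeighborMap` are hypothetical. "Neighbor degree" is interpreted here as the imported-by count, following the "highest-imported neighbors kept" wording.

```javascript
// importMap: { [path]: string[] } as found in scan-result.json.
// batchOf: Map<path, batchIndex>; exportsOf: Map<path, string[]> (both assumed shapes).
function buildNeighborMap(importMap, batchOf, exportsOf, maxNeighbors = 50) {
  // Reverse index: for each file, the set of files that import it.
  const reverse = new Map();
  for (const [src, targets] of Object.entries(importMap)) {
    for (const tgt of targets) {
      if (!reverse.has(tgt)) reverse.set(tgt, new Set());
      reverse.get(tgt).add(src);
    }
  }
  const importedByCount = (p) => reverse.get(p)?.size ?? 0;

  const neighborMap = {};
  for (const [path, batchIndex] of batchOf) {
    // 1-hop neighbors in both directions, deduplicated.
    const candidates = new Set([...(importMap[path] ?? []), ...(reverse.get(path) ?? [])]);
    let neighbors = [...candidates]
      .filter((p) => batchOf.has(p) && batchOf.get(p) !== batchIndex) // cross-batch only
      .map((p) => ({ path: p, batchIndex: batchOf.get(p), symbols: exportsOf.get(p) ?? [] }));
    if (neighbors.length > maxNeighbors) {
      console.error(
        `Warning: compute-batches: neighborMap for ${path} truncated from ` +
          `${neighbors.length} to top ${maxNeighbors} (by neighbor degree)`,
      );
      neighbors.sort((a, b) => importedByCount(b.path) - importedByCount(a.path));
      neighbors = neighbors.slice(0, maxNeighbors);
    }
    neighborMap[path] = neighbors;
  }
  return neighborMap;
}
```

Files with no cross-batch neighbors still get an empty array, so consumers can index the map without a presence check.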
### Output schema (`batches.json`)
```json
{
"schemaVersion": 1,
"algorithm": "louvain",
"totalFiles": 100,
"totalBatches": 7,
"batches": [
{
"batchIndex": 1,
"files": [
{ "path": "src/auth/login.ts", "language": "typescript",
"sizeLines": 120, "fileCategory": "code" }
],
"batchImportData": {
"src/auth/login.ts": ["src/auth/session.ts", "src/db/users.ts"]
},
"neighborMap": {
"src/auth/login.ts": [
{ "path": "src/db/users.ts", "batchIndex": 3,
"symbols": ["User", "findById", "createUser"] }
]
}
}
]
}
```
`algorithm` is `"louvain"` on the happy path, `"count-fallback"` when the Louvain branch crashed.
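The relationship between the two `algorithm` values can be sketched as a try/catch around the community-detection step. This is an illustration only: `batchCode` and `countFallback` are hypothetical names, and `detect` stands in for whatever function wraps the Louvain call and returns an array of batches.

```javascript
// Alphabetical chunking used when community detection fails (12 files per batch).
function countFallback(codePaths, perBatch = 12) {
  const sorted = [...codePaths].sort();
  const batches = [];
  for (let i = 0; i < sorted.length; i += perBatch) {
    batches.push(sorted.slice(i, i + perBatch));
  }
  return batches;
}

// detect: hypothetical function that returns batches (arrays of paths) or throws.
function batchCode(codePaths, detect) {
  try {
    return { algorithm: 'louvain', batches: detect(codePaths) };
  } catch (err) {
    // \u2014 is the dash separator required by the warning format; never fail silently.
    console.error(
      `Warning: compute-batches: Louvain failed (${err.message}) \u2014 falling back to ` +
        'count-based grouping (12 files/batch) \u2014 module semantic boundaries lost',
    );
    return { algorithm: 'count-fallback', batches: countFallback(codePaths) };
  }
}
```

Returning the `algorithm` tag alongside the batches keeps the fallback visible in `batches.json` as well as on stderr.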
### `--changed-files` mode
When invoked with `--changed-files=<path>`, the script:
- Loads file paths from `<path>` (one per line).
- Still builds the full project import graph (for accurate neighborMap construction).
- Only emits batches containing changed files.
- neighborMap entries reference unchanged files with their batchIndex from the deterministic full-graph Louvain re-run. The seed is fixed so the assignment is reproducible across incremental invocations.
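A minimal sketch of the filtering step follows, assuming batches have the shape shown in the output schema. `parseChangedFiles` and `selectChangedBatches` are hypothetical names, and the sketch keeps each selected batch whole; whether unchanged files should be stripped from an emitted batch is not specified above.

```javascript
// Parses the --changed-files list: one path per line, blank lines ignored.
function parseChangedFiles(text) {
  return new Set(
    text
      .split(/\r?\n/)
      .map((line) => line.trim())
      .filter(Boolean),
  );
}

// Keeps only batches that contain at least one changed file.
// Batches are computed on the full project first, so batchIndex values stay stable.
function selectChangedBatches(batches, changed) {
  return batches.filter((batch) => batch.files.some((f) => changed.has(f.path)));
}
```

Because selection happens after the full-graph batching, the neighborMap entries of the surviving batches can still point at batches that were not emitted.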
### Prototype-first implementation
Before writing the full script, build a minimal skeleton:
1. Load `scan-result.json` from this repo's `.understand-anything/` directory (if absent, generate via `/understand --full`).
2. Run Louvain only — no size enforcement, no neighborMap.
3. Print community size distribution.
4. Decide: do real-world communities cluster in [5, 35]? If yes, the size-enforcement branch may be unnecessary or only trivially defensive. If no, implement the edge-betweenness split.
This gates the more speculative code (size enforcement) on empirical observation rather than upfront design.
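The "print community size distribution" step could be as small as the sketch below. It assumes the `{ nodeId: communityId }` mapping produced by the Louvain call shown earlier; `communitySizes` and `summarize` are hypothetical helper names, and the [5, 35] bounds are the ones discussed in this design.

```javascript
// communities: { [nodeId]: communityId }. Returns community sizes, largest first.
function communitySizes(communities) {
  const sizes = new Map();
  for (const id of Object.values(communities)) {
    sizes.set(id, (sizes.get(id) ?? 0) + 1);
  }
  return [...sizes.values()].sort((a, b) => b - a);
}

// Buckets the sizes against the [min, max] range to support the go/no-go decision.
function summarize(sizes, min = 5, max = 35) {
  return {
    communities: sizes.length,
    largest: sizes[0] ?? 0,
    below: sizes.filter((s) => s < min).length,
    within: sizes.filter((s) => s >= min && s <= max).length,
    above: sizes.filter((s) => s > max).length,
  };
}
```

A non-zero `above` count is the signal that the size-enforcement branch is worth implementing.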
---
## Component 2 — `skills/understand/SKILL.md` changes
### Add — Phase 1.5 section (after Phase 1)
```markdown
## Phase 1.5 — BATCH
Report: `[Phase 1.5/7] Computing semantic batches...`
Run the bundled batching script:
\`\`\`bash
node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT
\`\`\`
Reads `.understand-anything/intermediate/scan-result.json`, writes
`.understand-anything/intermediate/batches.json`.
Capture stderr. Append any line starting with `Warning:` to
$PHASE_WARNINGS for the final report.
If the script exits non-zero, the failure is hard — relay the full
stderr to the user as a Phase 1.5 failure. Do not attempt to recover;
the script's internal fallback (count-based) already handles recoverable
issues. A non-zero exit means a fundamental problem (missing input file,
malformed JSON, etc.).
```
### Replace — Phase 2 ANALYZE section (current SKILL.md:280-332)
Delete the existing "Batch the file list from Phase 1 into groups of 20-30 files each" prose, the non-code grouping prose (now in compute-batches), and the dispatch-time `batchImportData` construction prose (now provided in batches.json). Replace with:
```markdown
## Phase 2 — ANALYZE
### Full analysis path
Load `.understand-anything/intermediate/batches.json` (produced by
Phase 1.5). Iterate the `batches[]` array.
Report: `[Phase 2/7] Analyzing files — <totalFiles> files in
<totalBatches> batches (up to 5 concurrent)...`
For each batch, dispatch a `file-analyzer` subagent (up to 5
concurrent). Dispatch prompt template:
> Analyze these files and produce GraphNode and GraphEdge objects.
> Project root: `$PROJECT_ROOT`
> Project: `<projectName>`
> Languages: `<languages>`
> Batch: `<batchIndex>/<totalBatches>`
> Skill directory: `<SKILL_DIR>`
> Output: write to
> `$PROJECT_ROOT/.understand-anything/intermediate/batch-<batchIndex>.json`
> (single-file mode) OR `batch-<batchIndex>-part-<k>.json` (split mode,
> per Step B of your output protocol).
>
> Pre-resolved import data (use directly — do NOT re-resolve from source):
> \`\`\`json
> <batchImportData JSON inline from batches.json[i].batchImportData>
> \`\`\`
>
> Cross-batch neighbors with their exported symbols (confidence boost
> for cross-batch edges):
> \`\`\`json
> <neighborMap JSON inline from batches.json[i].neighborMap>
> \`\`\`
>
> Files to analyze:
> 1. `<path>` (<sizeLines> lines, language: `<language>`,
> fileCategory: `<fileCategory>`)
> ...
$LANGUAGE_DIRECTIVE
After ALL batches complete, run the merge-and-normalize script:
\`\`\`bash
python <SKILL_DIR>/merge-batch-graphs.py $PROJECT_ROOT
\`\`\`
(Rest of Phase 2 unchanged.)
```
### Replace — Incremental update path (current SKILL.md:355-366)
```markdown
### Incremental update path
Run compute-batches.mjs with `--changed-files=<path>`, where `<path>`
is a temp file listing changed file paths (one per line). The script
reuses the full project's import graph for neighborMap computation
but only emits batches containing changed files. Dispatch file-analyzer
subagents per the same template as the full path.
```
### Line budget
Net added LLM-context prose: Phase 1.5 (~12 lines) + Phase 2 template clarifications (~5 lines) − removed batching prose (~15 lines) − removed batchImportData construction prose (~6 lines) ≈ **−4 lines** (a small net reduction).
---
## Component 3 — `agents/file-analyzer.md` changes
### Add — Cross-batch context section
Insert after "Step 1: Input file construction":
```markdown
### Cross-batch context (neighborMap)
Your dispatch prompt includes a `neighborMap` — for each file in your
batch, it lists project-internal neighbors in OTHER batches (files that
import yours or that you import), with their exported symbols.
Use neighborMap as a confidence boost for cross-batch edges (`calls`,
`related`, `inherits`, `implements` to nodes outside your batch):
- If your source clearly references a symbol that appears in some
`neighbor.symbols`, emit the edge to
`function:<neighbor.path>:<symbol>` or
`class:<neighbor.path>:<symbol>` with confidence.
- If your source references a cross-batch symbol that is NOT in
neighborMap (the project-scanner may not have extracted it), you may
still emit the edge if you saw it explicitly in the imported file's
surface — but prefer matching neighborMap symbols when available.
- Imports continue to use `batchImportData` (fully resolved), not
neighborMap.
The merge script's dangling-edge dropper is the safety net for
genuinely unresolvable targets.
```
### Replace — Writing Results section (current file-analyzer.md:467-475)
```markdown
## Writing Results — single or multi-part
**Step A — Compute totals.**
\`\`\`
nodeCount = nodes.length
edgeCount = edges.length
\`\`\`
**Step B — Decide split.**
- If `nodeCount ≤ 60` AND `edgeCount ≤ 120`: write ONE file to
`.understand-anything/intermediate/batch-<batchIndex>.json`. Done.
Skip to Step E.
- Otherwise: `parts = ceil(max(nodeCount / 60, edgeCount / 120))`.
**Step C — Partition.**
Sort files in your batch alphabetically by path. Chunk them sequentially
into `parts` groups of size `ceil(N / parts)`. For each part:
- All nodes whose `filePath` is in this part's files (for non-file
nodes like `module`/`concept`, use the file they belong to).
- All edges whose `source` is in this part's nodes (target may be
anywhere — same part, different part of same batch, different batch).
**Step D — Write each part.**
Write part `k` (1-indexed) to
`.understand-anything/intermediate/batch-<batchIndex>-part-<k>.json`.
Each part is a valid GraphFragment: `{ "nodes": [...], "edges": [...] }`.
**Step E — Self-validate.**
For each file written, verify:
- Valid JSON.
- `nodes` array exists and is well-formed.
- For every edge: `source` and `target` both appear as either (a) a
node `id` in this part's nodes, OR (b) a `file:<path>` reference
where `<path>` is in `neighborMap` or `batchImportData`, OR (c) a
`function:<path>:<symbol>` / `class:<path>:<symbol>` reference where
`<symbol>` is in some `neighbor.symbols`.
If validation fails on a part, do NOT silently rebuild. Respond with
an explicit error stating which part failed, which edge(s) failed
validation, and why. The dispatching session can then retry.
**Step F — Respond.**
Respond with ONLY a brief text summary: parts written (1 or more),
total nodes/edges across all parts, any files skipped. Do NOT include
JSON content in the response.
```
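Steps B and C are instructions for the agent, but the arithmetic is easier to check as code. The sketch below is illustrative only: `planParts` and `partitionFragment` are hypothetical names, and every node is assumed to carry a `filePath` (for `module`/`concept` nodes, the file they belong to).

```javascript
// Step B: 1 part when under both thresholds, else ceil(max(nodes/60, edges/120)).
function planParts(nodeCount, edgeCount, maxNodes = 60, maxEdges = 120) {
  if (nodeCount <= maxNodes && edgeCount <= maxEdges) return 1;
  return Math.ceil(Math.max(nodeCount / maxNodes, edgeCount / maxEdges));
}

// Step C: chunk alphabetically sorted files, then assign nodes by file and edges by source.
function partitionFragment(files, nodes, edges, parts) {
  const sorted = [...files].sort();
  const size = Math.ceil(sorted.length / parts);
  const out = [];
  for (let i = 0; i < sorted.length; i += size) {
    const partFiles = new Set(sorted.slice(i, i + size));
    const partNodes = nodes.filter((n) => partFiles.has(n.filePath));
    const ids = new Set(partNodes.map((n) => n.id));
    // An edge follows its source node; its target may live in any part or batch.
    out.push({ nodes: partNodes, edges: edges.filter((e) => ids.has(e.source)) });
  }
  return out;
}
```

Note that chunking by file can yield fewer than `parts` groups when the batch has fewer files than requested parts, and a single very dense file can still exceed the thresholds on its own.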
### Threshold rationale
`60 nodes / 120 edges per part` derives from:
- File node JSON serialized ≈ 150-300 chars; function/class ≈ 80-150 chars; edge ≈ 100-150 chars.
- 60 nodes + 120 edges ≈ 25-35KB JSON ≈ 7000-9000 output tokens (JSON tokenization is dense).
- Bedrock OPUS default `max_tokens` 4096-8192 → ~10% safety margin.
These constants live as file-analyzer.md prose for now. Auto-tuning per provider is deferred to follow-up.
---
## Component 4 — `merge-batch-graphs.py` (verify-only)
### Confirmed compatibility
The existing glob and sort-key already handle multi-part files transparently:
- `intermediate_dir.glob("batch-*.json")` matches `batch-3-part-1.json`.
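- `re.search(r"batch-(\d+)", p.stem)` extracts `3` from `batch-3-part-1`, giving the same sort key as `batch-3.json`. Python's `sorted` is stable, so parts that share a key keep their input order (lexicographic only if the glob results were already sorted by name). This is harmless because load order does not affect dedup correctness.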
- `merge_and_normalize` walks `all_nodes.extend(...)` / `all_edges.extend(...)`; load order does not affect dedup correctness.
- `recover_imports_from_scan` operates on the merged graph — transparent to multi-part inputs.
- `link_tests` operates on the merged node pool — transparent.
No code change required for correctness.
### Add — Multi-part awareness in stderr report
`merge-batch-graphs.py:1026` currently prints `Found {N} batch files:`. Enhance:
```python
from collections import defaultdict
by_batch = defaultdict(list)
for f in batch_files:
m = re.match(r"batch-(\d+)(?:-part-(\d+))?\.json", f.name)
if m:
by_batch[int(m.group(1))].append(f.name)
logical_count = len(by_batch)
multi_part = sum(1 for files in by_batch.values() if len(files) > 1)
print(
f"Found {len(batch_files)} batch files "
f"({logical_count} logical batches, {multi_part} multi-part)",
file=sys.stderr,
)
```
### Add — Missing-part warning
After grouping, detect logical batches with non-contiguous part numbers (e.g. parts `{2, 3}` present but `1` missing) and emit:
```
Warning: merge: batch <i> has parts {<set>} but missing part {<missing>}
— possible truncated write — affected nodes/edges may be lost
```
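The gap check itself is a few lines of set arithmetic. The sketch below shows the logic in JavaScript rather than in the merge script's Python, purely as an illustration; `findMissingParts` is a hypothetical name, and only gaps below the highest observed part number can be detected, since the intended total is not recorded anywhere.

```javascript
// fileNames: names found in the intermediate directory.
// Returns one entry per logical batch that has a gap in its part numbers.
function findMissingParts(fileNames) {
  const partsByBatch = new Map();
  for (const name of fileNames) {
    const m = /^batch-(\d+)-part-(\d+)\.json$/.exec(name);
    if (!m) continue; // single-file batches have no parts to check
    const batch = Number(m[1]);
    if (!partsByBatch.has(batch)) partsByBatch.set(batch, new Set());
    partsByBatch.get(batch).add(Number(m[2]));
  }
  const gaps = [];
  for (const [batch, parts] of partsByBatch) {
    const missing = [];
    const highest = Math.max(...parts);
    for (let k = 1; k <= highest; k++) {
      if (!parts.has(k)) missing.push(k);
    }
    if (missing.length > 0) {
      gaps.push({ batch, present: [...parts].sort((a, b) => a - b), missing });
    }
  }
  return gaps;
}
```

A truncated final part (for example parts 1 and 2 present, 3 never written) would pass this check, so the warning is a partial safeguard rather than a completeness guarantee.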
---
## Failure modes & observability
| Failure point | Behavior | Safety net | Required warning text |
|---|---|---|---|
| Louvain library throws | exception | Script-internal: catch → count-based fallback (12 files/batch); neighborMap still built | `Warning: compute-batches: Louvain failed (<msg>) — falling back to count-based grouping (12 files/batch) — module semantic boundaries lost` |
| tree-sitter exports per-file failure | empty exports | symbols=[] in neighborMap | `Warning: compute-batches: exports extraction failed for <path> (<msg>) — symbols=[] in neighborMap — cross-batch edges to this file limited to file-level` |
| Louvain produces oversized community | size > 35 | Edge-betweenness split | `Warning: compute-batches: community size <N> > max 35 — splitting via edge-betweenness — modularity may decrease` |
| compute-batches complete crash | exit non-zero, no batches.json | SKILL.md surfaces full stderr to user; no Phase 2 fallback | (script's own error to stderr; SKILL.md relays verbatim) |
| neighborMap truncation | > 50 neighbors | Top-50 by degree kept | `Warning: compute-batches: neighborMap for <path> truncated from <N> to top 50 (by neighbor degree)` |
| file-analyzer part JSON malformed | `load_batch` skips | Existing `load_batch:139` warns and skips | (existing — verify the warning is not swallowed) |
| Missing part in multi-part batch | gap in parts | merge detects and warns | `Warning: merge: batch <i> has parts {<set>} but missing part {<missing>} — possible truncated write — affected nodes/edges may be lost` |
| file-analyzer dangling edges | source/target missing | merge drops, adds to `unfixable` (existing) | (existing) |
| file-analyzer dispatch fails | subagent error | existing retry-once mechanism | (existing) |
### Observability invariant
Every fallback / degrade / drop MUST:
1. Write a stderr line in `Warning: <component>: <what happened> — <why> — <impact>` format.
2. Bubble up to `$PHASE_WARNINGS` (SKILL.md existing mechanism) → user-facing Phase 7 final report.
3. Never use silent `catch {}` / `except: pass`. Code review treats this as a blocker.
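One way to keep every warning in the required shape is to route them through a single formatter, and to collect them on the consumer side with a prefix filter. This is a sketch under assumptions: `formatWarning` and `collectWarnings` are hypothetical names, not functions confirmed to exist in the codebase.

```javascript
// The dash separator used by the warning format, written as an escape for clarity.
const SEP = ' \u2014 ';

// Builds "Warning: <component>: <what happened> SEP <why> SEP <impact>".
function formatWarning(component, what, why, impact) {
  return `Warning: ${component}: ${[what, why, impact].filter(Boolean).join(SEP)}`;
}

// Consumer side: keep only the lines that should be appended to $PHASE_WARNINGS.
function collectWarnings(stderrText) {
  return stderrText.split(/\r?\n/).filter((line) => line.startsWith('Warning:'));
}
```

A shared formatter also makes the exact-string assertions in the test plan cheaper to maintain, since the separator and prefix live in one place.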
### Invariants
1. **scan-result.json is source of truth.** Any batching/topology change preserves importMap; `recover_imports_from_scan` always restores `imports` edges.
2. **Dangling-edge dropper is final defense.** No batch-generated edge can connect to a nonexistent node in the assembled graph.
3. **No silent fallback.** `batches.json` missing → loud failure. Internal compute-batches fallback → loud warning that bubbles to user.
---
## Testing
### Unit tests — `compute-batches.mjs`
New file: `understand-anything-plugin/skills/understand/test_compute_batches.test.mjs` (Vitest).
Required cases:
- **Louvain basic:** 3 disjoint cliques → 3 batches.
- **Empty importMap:** independent files → count-fallback batches by alphabetical chunking.
- **Oversized community:** 50-node complete graph → split triggered, all sub-batches ≤ 35.
- **Non-code grouping A:** `Dockerfile` + `docker-compose.yml` + `.dockerignore` siblings → one batch per directory cluster.
- **Non-code grouping B:** `.github/workflows/*.yml` → one batch.
- **Non-code grouping D:** SQL migrations under `migrations/` → one batch per directory.
- **Mixed code + non-code:** non-code batchIndex follows code batches.
- **neighborMap correctness:** file A imports file B across batches → `neighborMap[A]` contains `{path: B, batchIndex: B's, symbols: B's exports}`.
- **neighborMap excludes same-batch:** A and C in same batch → `neighborMap[A]` does not contain C.
- **Exports failure tolerance:** mock TreeSitter to throw on one file → `exports = []` for that file, others unaffected.
- **`--changed-files`:** input subset → output contains only batches with changed files; neighborMap may reference unchanged files.
- **Fallback triggers:** mock Louvain throw → `algorithm` field = `"count-fallback"`, warning in stderr.
- **Warning assertion per fallback:** for each of {Louvain crash, exports failure, oversize split, neighborMap truncation}, assert the exact warning string appears in stderr.
### Unit tests — `merge-batch-graphs.py`
New test class `TestMultiPart` in `test_merge_batch_graphs.py`:
- Two parts of one logical batch: `batch-1-part-1.json` + `batch-1-part-2.json` → assembled contains all nodes/edges from both.
- Three parts of one logical batch.
- Cross-part edges: edge with source in part-1, target node in part-2 → connected after merge.
- Malformed part-1 + valid part-2: part-1 skipped with warning, part-2 contents present.
- Mixed single-batch and multi-part inputs.
- Missing part detection: `batch-1-part-2.json` + `batch-1-part-3.json` (no part-1) → warning emitted with exact text.
- stderr format: assert `"X logical batches, Y multi-part"` appears.
### Integration — PR acceptance gate (manual)
Documented in the PR's Test plan:
- [ ] `pnpm install` (graphology installs cleanly).
- [ ] `pnpm --filter @understand-anything/core build`.
- [ ] Run `/understand --full` on this repo (Understand-Anything itself):
- `batches.json` generated; community size distribution sanity-check (mix of small and medium batches).
- At least one batch produces multi-part output.
- `assembled-graph.json` node/edge counts within expected range vs current main.
- Dashboard renders normally.
- Phase 7 final report includes any `$PHASE_WARNINGS` from compute-batches (visually verify warnings reach user-facing output, not just stderr).
- [ ] Run on a ~100-file repo matching ayushghosh's scenario; confirm no "output limit" errors.
- [ ] Run on a 5-10 file small repo: fallback path (all one batch) works correctly.
### Not tested
- Louvain algorithm correctness (trust `graphology-communities-louvain`'s own tests).
- Performance benchmarks (sub-second on 100-500 files is empirical; not gated).
- Multiple LLM provider output-cap variations (thresholds are conservative for Bedrock OPUS; first-party Anthropic is more permissive).
---
## Out of scope (tracked for follow-up)
### Tree-sitter deduplication
Currently Phase 1 (project-scanner), Phase 1.5 (compute-batches), and Phase 2 (file-analyzer per-batch) each run tree-sitter independently. Consolidating into a single Phase 1.5 structure extraction would simplify file-analyzer and save time on large projects. Defer because it requires reorganizing file-analyzer's protocol significantly.
### neighborMap LLM summaries
Adding one-sentence summaries per file to neighborMap would enable file-analyzer to emit `related` edges across batches with semantic justification. Requires a new lightweight summary-pass agent; defer until the tree-sitter dedup lands (Phase 1.5 will already have full structure → cheaper to add).
### Adaptive thresholds
`60 nodes / 120 edges` are conservative for Bedrock OPUS. Anthropic first-party supports much larger output caps. Adding a `--output-cap=<N>` CLI to compute-batches and propagating to file-analyzer would unlock larger parts on permissive backends. Track real-world part counts before implementing.
### Cross-batch edge audit
A post-merge audit comparing neighborMap-suggested edges vs actually-emitted edges would surface gaps. Mirror the existing `recover_imports_from_scan` pattern. Requires preserving `batches.json` for merge-time consumption.
### Multi-language monorepo handling
Multi-language repos (TS + Python) tend to naturally split via Louvain (no cross-language imports). Bridge files (OpenAPI, protobuf) might create odd communities. Address only if real reports surface.
---
## Implementation order
1. **Prototype:** minimal `compute-batches.mjs` skeleton — load scan-result.json, run Louvain, print community sizes. Run against this repo's `scan-result.json` (generate if missing via `/understand --full`). Decide whether size-enforcement branch is needed; if needed, choose between edge-betweenness and weakly-connected-component split.
2. Add exports extraction (reuse TreeSitterPlugin).
3. Add neighborMap construction + batchImportData passthrough.
4. Add non-code grouping heuristics (Groups A-E).
5. Add fallback path + warning emissions for every failure mode listed in the Failure modes table.
6. Write unit tests for compute-batches (per Testing section), including warning-text assertions.
7. Modify `agents/file-analyzer.md` — add Cross-batch context section, replace Writing Results.
8. Modify `skills/understand/SKILL.md` — add Phase 1.5, replace Phase 2 ANALYZE batching prose, replace incremental path.
9. Add multi-part stderr report + missing-part warning to `merge-batch-graphs.py`.
10. Write unit tests for `merge-batch-graphs.py` multi-part handling.
11. Add `graphology` + `graphology-communities-louvain` to `understand-anything-plugin/package.json`.
12. Run integration acceptance gate.
13. Bump version in all five `package.json` / `plugin.json` files per the project's CLAUDE.md versioning rule.
+1 -1
@@ -7,7 +7,7 @@
"scripts": {
"prepare": "pnpm --filter @understand-anything/core build",
"build": "pnpm -r build",
-"test": "vitest",
+"test": "vitest run",
"dev:dashboard": "pnpm --filter @understand-anything/dashboard dev",
"lint": "eslint ."
},
+16
@@ -38,6 +38,12 @@ importers:
'@understand-anything/core':
specifier: workspace:*
version: link:packages/core
graphology:
specifier: ~0.26.0
version: 0.26.0(graphology-types@0.24.8)
graphology-communities-louvain:
specifier: ^2.0.2
version: 2.0.2(graphology-types@0.24.8)
devDependencies:
'@types/node':
specifier: ^22.0.0
@@ -1861,6 +1867,11 @@ packages:
peerDependencies:
graphology-types: '>=0.24.0'
graphology@0.26.0:
resolution: {integrity: sha512-8SSImzgUUYC89Z042s+0r/vMibY7GX/Emz4LDO5e7jYXhuoWfHISPFJYjpRLUSJGq6UQ6xlenvX1p/hJdfXuXg==}
peerDependencies:
graphology-types: '>=0.24.0'
h3@1.15.11:
resolution: {integrity: sha512-L3THSe2MPeBwgIZVSH5zLdBBU90TOxarvhK9d04IDY2AmVS8j2Jz2LIWtwsGOU3lu2I5jCN7FNvVfY2+XyF+mg==}
@@ -4966,6 +4977,11 @@ snapshots:
graphology-types: 0.24.8
obliterator: 2.0.5
graphology@0.26.0(graphology-types@0.24.8):
dependencies:
events: 3.3.0
graphology-types: 0.24.8
h3@1.15.11:
dependencies:
cookie-es: 1.2.3
@@ -0,0 +1,31 @@
{
"name": "fixture-3-cliques",
"description": "Three disjoint import cliques for Louvain testing",
"languages": ["typescript"],
"frameworks": [],
"files": [
{"path": "src/auth/login.ts", "language": "typescript", "sizeLines": 50, "fileCategory": "code"},
{"path": "src/auth/session.ts", "language": "typescript", "sizeLines": 40, "fileCategory": "code"},
{"path": "src/auth/tokens.ts", "language": "typescript", "sizeLines": 60, "fileCategory": "code"},
{"path": "src/api/handlers.ts", "language": "typescript", "sizeLines": 80, "fileCategory": "code"},
{"path": "src/api/middleware.ts", "language": "typescript", "sizeLines": 30, "fileCategory": "code"},
{"path": "src/api/routes.ts", "language": "typescript", "sizeLines": 45, "fileCategory": "code"},
{"path": "src/db/users.ts", "language": "typescript", "sizeLines": 70, "fileCategory": "code"},
{"path": "src/db/queries.ts", "language": "typescript", "sizeLines": 55, "fileCategory": "code"},
{"path": "src/db/migrations.ts", "language": "typescript", "sizeLines": 35, "fileCategory": "code"}
],
"totalFiles": 9,
"filteredByIgnore": 0,
"estimatedComplexity": "small",
"importMap": {
"src/auth/login.ts": ["src/auth/session.ts", "src/auth/tokens.ts"],
"src/auth/session.ts": ["src/auth/tokens.ts"],
"src/auth/tokens.ts": [],
"src/api/handlers.ts": ["src/api/middleware.ts", "src/api/routes.ts"],
"src/api/middleware.ts": ["src/api/routes.ts", "src/auth/session.ts"],
"src/api/routes.ts": [],
"src/db/users.ts": ["src/db/queries.ts", "src/db/migrations.ts"],
"src/db/queries.ts": ["src/db/migrations.ts"],
"src/db/migrations.ts": []
}
}
File diff suppressed because it is too large
@@ -0,0 +1,233 @@
{
"name": "fixture-merge-respects-non-mergeable",
"description": "Regression guard for mergeSmallBatches: a small non-mergeable batch (Dockerfile cluster, marked mergeable=false by buildNonCodeBatches Group A) must NOT be pooled into the misc bucket alongside isolated code singletons, even though its size (1) is well below MIN_BATCH_SIZE=3. Pooling Dockerfiles into misc would destroy the semantic atom — an LLM analyzing the misc batch loses the per-service infra context.",
"languages": [
"typescript",
"dockerfile"
],
"frameworks": [],
"files": [
{
"path": "src/leaf000.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf001.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf002.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf003.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf004.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf005.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf006.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf007.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf008.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf009.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf010.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf011.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf012.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf013.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf014.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf015.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf016.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf017.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf018.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf019.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf020.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf021.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf022.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf023.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf024.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf025.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf026.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf027.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf028.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf029.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "services/api/Dockerfile",
"language": "dockerfile",
"sizeLines": 18,
"fileCategory": "infra"
}
],
"totalFiles": 31,
"filteredByIgnore": 0,
"estimatedComplexity": "moderate",
"importMap": {
"src/leaf000.ts": [],
"src/leaf001.ts": [],
"src/leaf002.ts": [],
"src/leaf003.ts": [],
"src/leaf004.ts": [],
"src/leaf005.ts": [],
"src/leaf006.ts": [],
"src/leaf007.ts": [],
"src/leaf008.ts": [],
"src/leaf009.ts": [],
"src/leaf010.ts": [],
"src/leaf011.ts": [],
"src/leaf012.ts": [],
"src/leaf013.ts": [],
"src/leaf014.ts": [],
"src/leaf015.ts": [],
"src/leaf016.ts": [],
"src/leaf017.ts": [],
"src/leaf018.ts": [],
"src/leaf019.ts": [],
"src/leaf020.ts": [],
"src/leaf021.ts": [],
"src/leaf022.ts": [],
"src/leaf023.ts": [],
"src/leaf024.ts": [],
"src/leaf025.ts": [],
"src/leaf026.ts": [],
"src/leaf027.ts": [],
"src/leaf028.ts": [],
"src/leaf029.ts": [],
"services/api/Dockerfile": []
}
}
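Review note: the fixture description says a small batch marked `mergeable=false` must stay out of the misc bucket even when it is below `MIN_BATCH_SIZE=3`. The real `mergeSmallBatches` is not shown in this part of the diff, so the following is a minimal sketch under assumptions: the function name `mergeSmallBatchesSketch`, the batch shape (`files`, `mergeable`, `label`), and the absence of any size cap on the misc bucket are all hypothetical.

```javascript
const MIN_BATCH_SIZE = 3; // value quoted in the fixture description

// Hypothetical sketch: pool small, mergeable batches into one "misc" batch and
// leave every other batch untouched, including small non-mergeable ones.
function mergeSmallBatchesSketch(batches, minSize = MIN_BATCH_SIZE) {
  const kept = [];
  const miscFiles = [];
  for (const batch of batches) {
    const isSmall = batch.files.length < minSize;
    if (isSmall && batch.mergeable !== false) {
      miscFiles.push(...batch.files);
    } else {
      kept.push(batch);
    }
  }
  if (miscFiles.length > 0) {
    kept.push({ files: miscFiles, mergeable: true, label: 'misc' });
  }
  return kept;
}

const merged = mergeSmallBatchesSketch([
  { files: ['src/leaf000.ts'], mergeable: true },
  { files: ['src/leaf001.ts'], mergeable: true },
  { files: ['services/api/Dockerfile'], mergeable: false },
]);
console.log(merged.map(b => b.files));
```

In this sketch the Dockerfile batch survives as its own single-file batch, while the two isolated leaves end up together in the misc batch.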
@@ -0,0 +1,38 @@
{
"name": "fixture-non-code",
"description": "Mix of non-code files exercising Groups A-E. The src/ clique has 3 mutually-importing files so it survives merge-small (size >= MIN_BATCH_SIZE=3) and stays a pure-code batch — required by the 'non-code batch indices follow code batches' assertion.",
"languages": ["typescript", "dockerfile", "yaml", "sql", "markdown"],
"frameworks": [],
"files": [
{"path": "src/index.ts", "language": "typescript", "sizeLines": 10, "fileCategory": "code"},
{"path": "src/server.ts", "language": "typescript", "sizeLines": 15, "fileCategory": "code"},
{"path": "src/router.ts", "language": "typescript", "sizeLines": 12, "fileCategory": "code"},
{"path": "Dockerfile", "language": "dockerfile", "sizeLines": 20, "fileCategory": "infra"},
{"path": "docker-compose.yml", "language": "yaml", "sizeLines": 15, "fileCategory": "infra"},
{"path": ".dockerignore", "language": "config", "sizeLines": 5, "fileCategory": "config"},
{"path": "services/api/Dockerfile", "language": "dockerfile", "sizeLines": 18, "fileCategory": "infra"},
{"path": "services/api/docker-compose.yml", "language": "yaml", "sizeLines": 12, "fileCategory": "infra"},
{"path": ".github/workflows/ci.yml", "language": "yaml", "sizeLines": 30, "fileCategory": "infra"},
{"path": ".github/workflows/deploy.yml", "language": "yaml", "sizeLines": 25, "fileCategory": "infra"},
{"path": ".gitlab-ci.yml", "language": "yaml", "sizeLines": 20, "fileCategory": "infra"},
{"path": ".circleci/config.yml", "language": "yaml", "sizeLines": 25, "fileCategory": "infra"},
{"path": "migrations/001_init.sql", "language": "sql", "sizeLines": 40, "fileCategory": "data"},
{"path": "migrations/002_users.sql", "language": "sql", "sizeLines": 20, "fileCategory": "data"},
{"path": "docs/getting-started.md", "language": "markdown", "sizeLines": 100, "fileCategory": "docs"},
{"path": "README.md", "language": "markdown", "sizeLines": 200, "fileCategory": "docs"}
],
"totalFiles": 16,
"filteredByIgnore": 0,
"estimatedComplexity": "small",
"importMap": {
"src/index.ts": ["src/server.ts", "src/router.ts"],
"src/server.ts": ["src/router.ts"],
"src/router.ts": [],
"Dockerfile": [], "docker-compose.yml": [], ".dockerignore": [],
"services/api/Dockerfile": [], "services/api/docker-compose.yml": [],
".github/workflows/ci.yml": [], ".github/workflows/deploy.yml": [],
".gitlab-ci.yml": [], ".circleci/config.yml": [],
"migrations/001_init.sql": [], "migrations/002_users.sql": [],
"docs/getting-started.md": [], "README.md": []
}
}
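Review note: the Group A test for this fixture expects Docker-related files to be bundled per directory. A minimal sketch of that grouping follows. The basename list and the helper name `groupDockerFilesByDir` are assumptions inferred from the test expectations, not the matching rules actually used by `buildNonCodeBatches`.

```javascript
// ASSUMPTION: these three basenames are inferred from the Group A test expectations.
const DOCKER_NAMES = new Set(['Dockerfile', 'docker-compose.yml', '.dockerignore']);

// Hypothetical helper: map each directory to the Docker-related files directly inside it.
function groupDockerFilesByDir(paths) {
  const byDir = new Map();
  for (const p of paths) {
    const slash = p.lastIndexOf('/');
    const dir = slash === -1 ? '.' : p.slice(0, slash);
    const base = p.slice(slash + 1);
    if (!DOCKER_NAMES.has(base)) continue; // not part of a Docker cluster
    if (!byDir.has(dir)) byDir.set(dir, []);
    byDir.get(dir).push(p);
  }
  return byDir;
}

const dockerGroups = groupDockerFilesByDir([
  'src/index.ts',
  'Dockerfile',
  'docker-compose.yml',
  '.dockerignore',
  'services/api/Dockerfile',
  'services/api/docker-compose.yml',
  '.github/workflows/ci.yml',
]);
console.log(dockerGroups);
```

The root-level cluster and the `services/api` cluster come out as two separate groups, matching the shape the Group A test checks for.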
@@ -0,0 +1,715 @@
{
"name": "fixture-singletons",
"description": "100 isolated TS files that should merge into ~4 misc batches",
"languages": [
"typescript"
],
"frameworks": [],
"files": [
{
"path": "src/leaf000.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf001.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf002.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf003.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf004.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf005.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf006.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf007.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf008.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf009.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf010.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf011.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf012.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf013.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf014.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf015.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf016.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf017.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf018.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf019.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf020.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf021.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf022.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf023.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf024.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf025.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf026.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf027.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf028.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf029.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf030.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf031.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf032.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf033.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf034.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf035.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf036.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf037.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf038.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf039.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf040.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf041.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf042.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf043.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf044.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf045.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf046.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf047.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf048.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf049.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf050.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf051.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf052.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf053.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf054.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf055.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf056.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf057.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf058.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf059.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf060.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf061.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf062.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf063.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf064.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf065.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf066.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf067.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf068.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf069.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf070.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf071.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf072.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf073.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf074.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf075.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf076.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf077.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf078.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf079.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf080.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf081.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf082.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf083.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf084.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf085.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf086.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf087.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf088.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf089.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf090.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf091.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf092.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf093.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf094.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf095.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf096.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf097.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf098.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
},
{
"path": "src/leaf099.ts",
"language": "typescript",
"sizeLines": 10,
"fileCategory": "code"
}
],
"totalFiles": 100,
"filteredByIgnore": 0,
"estimatedComplexity": "moderate",
"importMap": {
"src/leaf000.ts": [],
"src/leaf001.ts": [],
"src/leaf002.ts": [],
"src/leaf003.ts": [],
"src/leaf004.ts": [],
"src/leaf005.ts": [],
"src/leaf006.ts": [],
"src/leaf007.ts": [],
"src/leaf008.ts": [],
"src/leaf009.ts": [],
"src/leaf010.ts": [],
"src/leaf011.ts": [],
"src/leaf012.ts": [],
"src/leaf013.ts": [],
"src/leaf014.ts": [],
"src/leaf015.ts": [],
"src/leaf016.ts": [],
"src/leaf017.ts": [],
"src/leaf018.ts": [],
"src/leaf019.ts": [],
"src/leaf020.ts": [],
"src/leaf021.ts": [],
"src/leaf022.ts": [],
"src/leaf023.ts": [],
"src/leaf024.ts": [],
"src/leaf025.ts": [],
"src/leaf026.ts": [],
"src/leaf027.ts": [],
"src/leaf028.ts": [],
"src/leaf029.ts": [],
"src/leaf030.ts": [],
"src/leaf031.ts": [],
"src/leaf032.ts": [],
"src/leaf033.ts": [],
"src/leaf034.ts": [],
"src/leaf035.ts": [],
"src/leaf036.ts": [],
"src/leaf037.ts": [],
"src/leaf038.ts": [],
"src/leaf039.ts": [],
"src/leaf040.ts": [],
"src/leaf041.ts": [],
"src/leaf042.ts": [],
"src/leaf043.ts": [],
"src/leaf044.ts": [],
"src/leaf045.ts": [],
"src/leaf046.ts": [],
"src/leaf047.ts": [],
"src/leaf048.ts": [],
"src/leaf049.ts": [],
"src/leaf050.ts": [],
"src/leaf051.ts": [],
"src/leaf052.ts": [],
"src/leaf053.ts": [],
"src/leaf054.ts": [],
"src/leaf055.ts": [],
"src/leaf056.ts": [],
"src/leaf057.ts": [],
"src/leaf058.ts": [],
"src/leaf059.ts": [],
"src/leaf060.ts": [],
"src/leaf061.ts": [],
"src/leaf062.ts": [],
"src/leaf063.ts": [],
"src/leaf064.ts": [],
"src/leaf065.ts": [],
"src/leaf066.ts": [],
"src/leaf067.ts": [],
"src/leaf068.ts": [],
"src/leaf069.ts": [],
"src/leaf070.ts": [],
"src/leaf071.ts": [],
"src/leaf072.ts": [],
"src/leaf073.ts": [],
"src/leaf074.ts": [],
"src/leaf075.ts": [],
"src/leaf076.ts": [],
"src/leaf077.ts": [],
"src/leaf078.ts": [],
"src/leaf079.ts": [],
"src/leaf080.ts": [],
"src/leaf081.ts": [],
"src/leaf082.ts": [],
"src/leaf083.ts": [],
"src/leaf084.ts": [],
"src/leaf085.ts": [],
"src/leaf086.ts": [],
"src/leaf087.ts": [],
"src/leaf088.ts": [],
"src/leaf089.ts": [],
"src/leaf090.ts": [],
"src/leaf091.ts": [],
"src/leaf092.ts": [],
"src/leaf093.ts": [],
"src/leaf094.ts": [],
"src/leaf095.ts": [],
"src/leaf096.ts": [],
"src/leaf097.ts": [],
"src/leaf098.ts": [],
"src/leaf099.ts": []
}
}
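Review note: the description above expects 100 isolated files to end up in roughly four misc batches. The cap that produces this is not visible in this part of the diff, so the sketch below uses a hypothetical cap of 25, chosen only because it yields four chunks. The second call reuses the `MAX_E = 20` value quoted in the Group E test in this diff. `chunk` is an illustrative helper, not the script's own function.

```javascript
// ASSUMPTION: 25 is a hypothetical cap picked to reproduce "~4 misc batches".
const ASSUMED_MISC_CAP = 25;

// Greedy fixed-size chunking: fill each chunk up to `size`, remainder goes last.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const leaves = Array.from(
  { length: 100 },
  (_, i) => `src/leaf${String(i).padStart(3, '0')}.ts`,
);
const miscBatches = chunk(leaves, ASSUMED_MISC_CAP);
console.log(miscBatches.length); // 4

// Same helper with the MAX_E = 20 value from the Group E test: 25 docs -> [20, 5].
const docs = Array.from({ length: 25 }, (_, i) => `docs/page${String(i).padStart(2, '0')}.md`);
const docSizes = chunk(docs, 20).map(c => c.length);
console.log(docSizes); // [ 20, 5 ]
```

Greedy chunking leaves a short tail batch; the size-enforcement test in this diff expects an even 20/20 split for a 40-file community, so the script presumably balances that case differently.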
@@ -0,0 +1,602 @@
import { describe, it, expect, beforeEach, afterEach } from 'vitest';
import { mkdtempSync, mkdirSync, writeFileSync, readFileSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { dirname, join, resolve } from 'node:path';
import { spawnSync } from 'node:child_process';
import { fileURLToPath } from 'node:url';
const __dirname = dirname(fileURLToPath(import.meta.url));
const SCRIPT = resolve(__dirname, '../../../understand-anything-plugin/skills/understand/compute-batches.mjs');
const FIXTURES = resolve(__dirname, 'fixtures');
function runScript(projectRoot, extraArgs = []) {
return spawnSync('node', [SCRIPT, projectRoot, ...extraArgs], {
encoding: 'utf-8',
});
}
function setupProject(fixtureName) {
const root = mkdtempSync(join(tmpdir(), 'ua-cb-test-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
const fixturePath = join(FIXTURES, fixtureName);
const dest = join(root, '.understand-anything', 'intermediate', 'scan-result.json');
writeFileSync(dest, readFileSync(fixturePath, 'utf-8'));
return root;
}
function readBatches(projectRoot) {
const p = join(projectRoot, '.understand-anything', 'intermediate', 'batches.json');
return JSON.parse(readFileSync(p, 'utf-8'));
}
describe('compute-batches.mjs — Louvain basic', () => {
let projectRoot;
beforeEach(() => {
projectRoot = setupProject('scan-result-3-cliques.json');
});
afterEach(() => {
if (projectRoot) rmSync(projectRoot, { recursive: true, force: true });
});
it('produces 3 batches for 3 disjoint cliques', () => {
const result = runScript(projectRoot);
expect(result.status).toBe(0);
const batches = readBatches(projectRoot);
expect(batches.algorithm).toBe('louvain');
expect(batches.totalFiles).toBe(9);
expect(batches.batches.length).toBe(3);
expect(batches.schemaVersion).toBe(1);
expect(batches.totalBatches).toBe(3);
expect(batches.batches.map(b => b.batchIndex)).toEqual([1, 2, 3]);
// Each batch should contain exactly one clique (3 files)
for (const b of batches.batches) {
expect(b.files.length).toBe(3);
const dirs = new Set(b.files.map(f => f.path.split('/')[1]));
expect(dirs.size).toBe(1); // all files in the batch share src/<dir>/
}
});
it('produces deterministic output across runs', () => {
const r1 = runScript(projectRoot);
expect(r1.status).toBe(0);
const json1 = readFileSync(
join(projectRoot, '.understand-anything', 'intermediate', 'batches.json'),
'utf-8',
);
const r2 = runScript(projectRoot);
expect(r2.status).toBe(0);
const json2 = readFileSync(
join(projectRoot, '.understand-anything', 'intermediate', 'batches.json'),
'utf-8',
);
expect(json1).toBe(json2);
});
});
describe('compute-batches.mjs — size enforcement', () => {
let projectRoot;
beforeEach(() => {
projectRoot = setupProject('scan-result-large-community.json');
});
afterEach(() => {
if (projectRoot) rmSync(projectRoot, { recursive: true, force: true });
});
it('splits a 40-node clique into batches ≤ 35', () => {
const result = runScript(projectRoot);
expect(result.status).toBe(0);
const batches = readBatches(projectRoot);
expect(batches.algorithm).toBe('louvain'); // confirm fallback didn't fire
expect(batches.totalFiles).toBe(40);
expect(batches.batches.length).toBe(2);
// Numeric comparator: the default sort() compares elements as strings.
expect(batches.batches.map(b => b.files.length).sort((a, b) => a - b)).toEqual([20, 20]);
// Sum of all batch file counts equals total files
const sum = batches.batches.reduce((acc, b) => acc + b.files.length, 0);
expect(sum).toBe(40);
// Warning was emitted to stderr
expect(result.stderr).toMatch(/Warning: compute-batches: community size 40 > max 35/);
});
});
describe('compute-batches.mjs — exports extraction', () => {
let root;
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('populates exports for code files via tree-sitter', () => {
root = mkdtempSync(join(tmpdir(), 'ua-cb-exp-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
mkdirSync(join(root, 'src'), { recursive: true });
writeFileSync(join(root, 'src', 'a.ts'),
'export function greet(name: string) { return "hi " + name; }\n' +
'export class Greeter { greet(n: string) { return "hi " + n; } }\n');
writeFileSync(join(root, 'src', 'b.ts'),
'import { greet } from "./a";\nexport const helper = () => greet("world");\n');
const scan = {
name: 'exports-test',
description: '',
languages: ['typescript'],
frameworks: [],
files: [
{ path: 'src/a.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
{ path: 'src/b.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
],
totalFiles: 2, filteredByIgnore: 0, estimatedComplexity: 'small',
importMap: { 'src/a.ts': [], 'src/b.ts': ['src/a.ts'] },
};
writeFileSync(
join(root, '.understand-anything', 'intermediate', 'scan-result.json'),
JSON.stringify(scan));
const result = runScript(root);
expect(result.status).toBe(0);
const batches = readBatches(root);
expect(batches.exportsByPath).toBeDefined();
expect(batches.exportsByPath['src/a.ts']).toEqual(
expect.arrayContaining(['greet', 'Greeter']));
expect(batches.exportsByPath['src/b.ts']).toEqual(
expect.arrayContaining(['helper']));
});
it('emits warning when file is missing from disk (read error path)', () => {
root = mkdtempSync(join(tmpdir(), 'ua-cb-exp-err-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
// Note: NOT creating the file on disk — scan-result.json references it,
// but the file doesn't exist, so the read branch fires.
const scan = {
name: 'missing-file-test',
description: '',
languages: ['typescript'],
frameworks: [],
files: [
{ path: 'src/missing.ts', language: 'typescript', sizeLines: 1, fileCategory: 'code' },
],
totalFiles: 1, filteredByIgnore: 0, estimatedComplexity: 'small',
importMap: { 'src/missing.ts': [] },
};
writeFileSync(
join(root, '.understand-anything', 'intermediate', 'scan-result.json'),
JSON.stringify(scan));
const result = runScript(root);
expect(result.status).toBe(0); // script must still succeed
expect(result.stderr).toMatch(
/Warning: compute-batches: exports extraction failed for src\/missing\.ts \(read error:/);
const batches = readBatches(root);
expect(batches.exportsByPath['src/missing.ts']).toEqual([]);
});
});
describe('compute-batches.mjs — non-code grouping', () => {
let root;
let batches;
beforeEach(() => {
root = setupProject('scan-result-non-code.json');
const result = runScript(root);
expect(result.status).toBe(0);
batches = readBatches(root);
});
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('Group A: bundles Dockerfile cluster per directory', () => {
// Root-level cluster: Dockerfile + docker-compose.yml + .dockerignore → one batch
const rootDockerBatch = batches.batches.find(b =>
b.files.some(f => f.path === 'Dockerfile'));
expect(rootDockerBatch).toBeDefined();
const paths = rootDockerBatch.files.map(f => f.path).sort();
expect(paths).toEqual(['.dockerignore', 'Dockerfile', 'docker-compose.yml']);
// services/api cluster is a separate batch
const apiDockerBatch = batches.batches.find(b =>
b.files.some(f => f.path === 'services/api/Dockerfile'));
expect(apiDockerBatch).toBeDefined();
expect(apiDockerBatch).not.toBe(rootDockerBatch);
expect(apiDockerBatch.files.map(f => f.path).sort()).toEqual([
'services/api/Dockerfile', 'services/api/docker-compose.yml',
]);
});
it('Group B: .github/workflows/* all in one batch', () => {
const wfBatch = batches.batches.find(b =>
b.files.some(f => f.path.startsWith('.github/workflows/')));
expect(wfBatch).toBeDefined();
const wfPaths = wfBatch.files.map(f => f.path).filter(p => p.startsWith('.github/workflows/'));
expect(wfPaths.sort()).toEqual([
'.github/workflows/ci.yml', '.github/workflows/deploy.yml',
]);
});
it('Group C: .gitlab-ci.yml + .circleci/* in one batch', () => {
const ciBatch = batches.batches.find(b =>
b.files.some(f => f.path === '.gitlab-ci.yml'));
expect(ciBatch).toBeDefined();
const ciPaths = ciBatch.files.map(f => f.path).sort();
expect(ciPaths).toEqual(['.circleci/config.yml', '.gitlab-ci.yml']);
});
it('Group D: SQL migrations under migrations/ in one batch', () => {
const migBatch = batches.batches.find(b =>
b.files.some(f => f.path.startsWith('migrations/')));
expect(migBatch).toBeDefined();
const migPaths = migBatch.files.map(f => f.path).filter(p => p.startsWith('migrations/'));
expect(migPaths.sort()).toEqual([
'migrations/001_init.sql', 'migrations/002_users.sql',
]);
});
it('non-code batch indices follow code batches', () => {
const codeBatches = batches.batches.filter(b =>
b.files.every(f => f.fileCategory === 'code'));
const nonCodeBatches = batches.batches.filter(b =>
b.files.some(f => f.fileCategory !== 'code'));
expect(codeBatches.length).toBeGreaterThan(0);
expect(nonCodeBatches.length).toBeGreaterThan(0);
const maxCodeIdx = Math.max(...codeBatches.map(b => b.batchIndex));
const minNonCodeIdx = Math.min(...nonCodeBatches.map(b => b.batchIndex));
expect(minNonCodeIdx).toBeGreaterThan(maxCodeIdx);
});
});
describe('compute-batches.mjs — Group E MAX_E split', () => {
let root;
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('splits 25 .md files under docs/ into [20, 5]', () => {
root = mkdtempSync(join(tmpdir(), 'ua-cb-maxe-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
const files = [];
const importMap = {};
for (let i = 0; i < 25; i++) {
const p = `docs/page${String(i).padStart(2, '0')}.md`;
files.push({ path: p, language: 'markdown', sizeLines: 10, fileCategory: 'docs' });
importMap[p] = [];
}
const scan = {
name: 'maxe-test', description: '',
languages: ['markdown'], frameworks: [],
files, totalFiles: 25, filteredByIgnore: 0,
estimatedComplexity: 'small', importMap,
};
writeFileSync(
join(root, '.understand-anything', 'intermediate', 'scan-result.json'),
JSON.stringify(scan));
const result = runScript(root);
expect(result.status).toBe(0);
const batches = readBatches(root);
// All 25 docs/ files go through Group E with MAX_E = 20, split into [20, 5].
const docsBatches = batches.batches.filter(b =>
b.files.every(f => f.path.startsWith('docs/')));
expect(docsBatches.length).toBe(2);
const sizes = docsBatches.map(b => b.files.length).sort((a, b) => b - a);
expect(sizes).toEqual([20, 5]);
});
});
describe('compute-batches.mjs — neighborMap + batchImportData', () => {
let batches;
let batchOf; // path → batchIndex
let projectRoot;
beforeEach(() => {
projectRoot = setupProject('scan-result-3-cliques.json');
const result = runScript(projectRoot);
expect(result.status).toBe(0);
batches = readBatches(projectRoot);
batchOf = new Map();
for (const b of batches.batches) {
for (const f of b.files) batchOf.set(f.path, b.batchIndex);
}
});
afterEach(() => {
if (projectRoot) rmSync(projectRoot, { recursive: true, force: true });
});
it('batchImportData mirrors scan importMap per batch', () => {
for (const b of batches.batches) {
for (const f of b.files) {
expect(b.batchImportData[f.path]).toBeDefined();
expect(Array.isArray(b.batchImportData[f.path])).toBe(true);
}
}
// src/auth/login.ts imports src/auth/session.ts and src/auth/tokens.ts
const loginBatch = batches.batches.find(b =>
b.files.some(f => f.path === 'src/auth/login.ts'));
expect(loginBatch.batchImportData['src/auth/login.ts'].sort()).toEqual([
'src/auth/session.ts', 'src/auth/tokens.ts',
]);
});
it('neighborMap excludes same-batch files', () => {
// The fixture's three cliques each go into one batch. Most imports are
// intra-batch (the one exception is src/api/middleware.ts importing
// src/auth/session.ts), and no neighborMap may reference a same-batch file.
for (const b of batches.batches) {
const sameBatchPaths = new Set(b.files.map(f => f.path));
for (const [, neighbors] of Object.entries(b.neighborMap)) {
for (const n of neighbors) {
expect(sameBatchPaths.has(n.path)).toBe(false);
}
}
}
});
it('neighborMap entries carry symbols when target has exports', () => {
const root = mkdtempSync(join(tmpdir(), 'ua-cb-nbr-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
mkdirSync(join(root, 'src', 'a'), { recursive: true });
mkdirSync(join(root, 'src', 'b'), { recursive: true });
// Cluster A: 3 tightly-imported files. a/core.ts exports symbols.
writeFileSync(join(root, 'src', 'a', 'core.ts'),
'export function findUser(id: string) { return null; }\nexport class User {}\n');
writeFileSync(join(root, 'src', 'a', 'helper1.ts'),
'import { findUser } from "./core";\nexport const h1 = () => findUser("x");\n');
writeFileSync(join(root, 'src', 'a', 'helper2.ts'),
'import { User } from "./core";\nimport { h1 } from "./helper1";\nexport const h2 = () => h1();\n');
// Cluster B: 3 tightly-imported files. b/entry.ts has ONE cross-cluster import to a/core.ts.
writeFileSync(join(root, 'src', 'b', 'entry.ts'),
'import { findUser } from "../a/core";\nexport const entry = () => findUser("y");\n');
writeFileSync(join(root, 'src', 'b', 'middle.ts'),
'import { entry } from "./entry";\nexport const middle = () => entry();\n');
writeFileSync(join(root, 'src', 'b', 'leaf.ts'),
'import { middle } from "./middle";\nexport const leaf = () => middle();\n');
const files = [
{ path: 'src/a/core.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
{ path: 'src/a/helper1.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
{ path: 'src/a/helper2.ts', language: 'typescript', sizeLines: 3, fileCategory: 'code' },
{ path: 'src/b/entry.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
{ path: 'src/b/middle.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
{ path: 'src/b/leaf.ts', language: 'typescript', sizeLines: 2, fileCategory: 'code' },
];
const scan = {
name: 't', description: '',
languages: ['typescript'], frameworks: [],
files,
totalFiles: 6, filteredByIgnore: 0, estimatedComplexity: 'small',
importMap: {
'src/a/core.ts': [],
'src/a/helper1.ts': ['src/a/core.ts'],
'src/a/helper2.ts': ['src/a/core.ts', 'src/a/helper1.ts'],
'src/b/entry.ts': ['src/a/core.ts'], // CROSS-CLUSTER
'src/b/middle.ts': ['src/b/entry.ts'],
'src/b/leaf.ts': ['src/b/middle.ts'],
},
};
writeFileSync(
join(root, '.understand-anything', 'intermediate', 'scan-result.json'),
JSON.stringify(scan));
const result = runScript(root);
expect(result.status).toBe(0);
const out = readBatches(root);
// Expect 2 communities (cluster A and cluster B). Verify that some batch's
// neighborMap entry references src/a/core.ts with its symbols.
let sawSymbols = false;
for (const batch of out.batches) {
for (const neighbors of Object.values(batch.neighborMap)) {
for (const n of neighbors) {
if (n.path === 'src/a/core.ts') {
expect(n.symbols).toEqual(expect.arrayContaining(['findUser', 'User']));
sawSymbols = true;
}
}
}
}
expect(sawSymbols).toBe(true);
rmSync(root, { recursive: true, force: true });
});
});
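The fixtures in these tests treat a file's neighbors as both the files it imports and the files that import it (the hub test below counts 60 importers as the hub's degree). A minimal sketch of deriving such a 1-hop map from an `importMap`, assuming undirected adjacency; `buildNeighborMap` is a hypothetical helper for illustration, not the function in compute-batches.mjs, and it omits the exported-symbol lookup:

```javascript
// Hypothetical helper: derive undirected 1-hop neighbors from an importMap
// shaped like { file: [filesItImports] }.
function buildNeighborMap(importMap) {
  const neighbors = new Map();
  const link = (a, b) => {
    if (!neighbors.has(a)) neighbors.set(a, new Set());
    neighbors.get(a).add(b);
  };
  for (const [file, imports] of Object.entries(importMap)) {
    // Make sure isolated files still get an (empty) entry.
    if (!neighbors.has(file)) neighbors.set(file, new Set());
    for (const target of imports) {
      link(file, target); // file -> what it imports
      link(target, file); // imported file -> its importer
    }
  }
  // Sort each list so the output is deterministic.
  return Object.fromEntries(
    [...neighbors].map(([file, set]) => [file, [...set].sort()]),
  );
}

const neighborMap = buildNeighborMap({
  'src/a/core.ts': [],
  'src/a/helper1.ts': ['src/a/core.ts'],
  'src/b/entry.ts': ['src/a/core.ts'],
});
console.log(neighborMap['src/a/core.ts']);
```

Here `src/a/core.ts` ends up with both of its importers as neighbors even though its own import list is empty.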
describe('compute-batches.mjs — neighborMap truncation', () => {
let root;
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('truncates and warns when neighbors > 50', () => {
root = mkdtempSync(join(tmpdir(), 'ua-cb-trunc-'));
mkdirSync(join(root, '.understand-anything', 'intermediate'), { recursive: true });
// hub.ts imported by 60 other files
const files = [{ path: 'src/hub.ts', language: 'typescript', sizeLines: 1, fileCategory: 'code' }];
const importMap = { 'src/hub.ts': [] };
for (let i = 0; i < 60; i++) {
const p = `src/leaf${i}.ts`;
files.push({ path: p, language: 'typescript', sizeLines: 1, fileCategory: 'code' });
importMap[p] = ['src/hub.ts'];
}
const scan = {
name: 't', description: '', languages: ['typescript'], frameworks: [],
files, totalFiles: files.length, filteredByIgnore: 0,
estimatedComplexity: 'moderate', importMap,
};
writeFileSync(
join(root, '.understand-anything', 'intermediate', 'scan-result.json'),
JSON.stringify(scan));
const result = runScript(root);
expect(result.status).toBe(0);
expect(result.stderr).toMatch(
/neighborMap for src\/hub\.ts has high 1-hop degree 60 — exceeds soft cap of 50/);
const out = readBatches(root);
// Find hub.ts and confirm its neighbor list capped at 50 (in whichever batch it landed)
for (const b of out.batches) {
const nbrs = b.neighborMap['src/hub.ts'];
if (nbrs) expect(nbrs.length).toBeLessThanOrEqual(50);
}
});
});
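The truncation behavior asserted above can be sketched as a soft cap applied per file. This is an illustration only: `capNeighbors` and `NEIGHBOR_SOFT_CAP` are hypothetical names, the warning wording is approximate, and which 50 neighbors the real script keeps is not shown in the test, so the sketch simply keeps the first 50.

```javascript
// Assumed constant; the test above only shows that the cap is 50.
const NEIGHBOR_SOFT_CAP = 50;

// Hypothetical helper: cap a neighbor list and report the original degree.
function capNeighbors(file, neighbors, warn = console.error) {
  if (neighbors.length <= NEIGHBOR_SOFT_CAP) return neighbors;
  warn(
    `Warning: neighborMap for ${file} has high 1-hop degree ${neighbors.length}, ` +
    `exceeds soft cap of ${NEIGHBOR_SOFT_CAP}`,
  );
  // Keep the first NEIGHBOR_SOFT_CAP entries (selection order is an assumption).
  return neighbors.slice(0, NEIGHBOR_SOFT_CAP);
}

const leaves = Array.from({ length: 60 }, (_, i) => `src/leaf${i}.ts`);
const capped = capNeighbors('src/hub.ts', leaves);
console.log(capped.length); // 50
```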
describe('compute-batches.mjs — fallback', () => {
let root;
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('falls back to count-based when Louvain throws (env-injected mock)', () => {
// We can't easily monkey-patch louvain mid-script in Vitest because the
// script runs in a subprocess. Instead, set an env var the script honors:
// UA_COMPUTE_BATCHES_FORCE_LOUVAIN_THROW=1 → script throws inside its
// Louvain branch, exercising the fallback path.
root = setupProject('scan-result-3-cliques.json');
const result = spawnSync('node',
[SCRIPT, root],
{ encoding: 'utf-8', env: { ...process.env, UA_COMPUTE_BATCHES_FORCE_LOUVAIN_THROW: '1' } },
);
expect(result.status).toBe(0);
expect(result.stderr).toMatch(
/Warning: compute-batches: Louvain failed.*falling back to count-based grouping/);
const out = readBatches(root);
expect(out.algorithm).toBe('count-fallback');
expect(out.totalFiles).toBe(9);
// Count-based: 12 files per batch → all 9 fit in one batch
const codeBatchFileCount = out.batches
.filter(b => b.files.every(f => f.fileCategory === 'code'))
.reduce((sum, b) => sum + b.files.length, 0);
expect(codeBatchFileCount).toBe(9);
});
});
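The fallback path exercised above amounts to a try/catch around the clustering step with fixed-size chunking as the backup. A minimal sketch under stated assumptions: `computeBatches` and `countBasedBatches` are hypothetical names, the batch size of 12 comes from the test comment, `'count-fallback'` comes from the assertion, and the `'louvain'` label for the success case is assumed.

```javascript
// Assumed batch size, taken from the "12 files per batch" comment in the test.
const FALLBACK_BATCH_SIZE = 12;

// Split files into fixed-size batches, ignoring import structure.
function countBasedBatches(files, size = FALLBACK_BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < files.length; i += size) {
    batches.push(files.slice(i, i + size));
  }
  return batches;
}

// Hypothetical wrapper: try a clustering function, fall back if it throws.
function computeBatches(files, cluster, warn = console.error) {
  try {
    return { algorithm: 'louvain', batches: cluster(files) };
  } catch (err) {
    warn(`Warning: clustering failed (${err.message}); falling back to count-based grouping`);
    return { algorithm: 'count-fallback', batches: countBasedBatches(files) };
  }
}

const files = Array.from({ length: 9 }, (_, i) => `src/f${i}.ts`);
const result = computeBatches(files, () => { throw new Error('forced'); });
console.log(result.algorithm, result.batches.length);
```

With 9 files and a batch size of 12, the fallback yields a single batch, which matches the count the test sums up.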
describe('compute-batches.mjs — merge-small', () => {
let projectRoot;
beforeEach(() => {
projectRoot = setupProject('scan-result-singletons.json');
});
afterEach(() => {
if (projectRoot) rmSync(projectRoot, { recursive: true, force: true });
});
it('merges 100 isolated singletons into a small number of misc batches', () => {
const result = runScript(projectRoot);
expect(result.status).toBe(0);
const batches = readBatches(projectRoot);
expect(batches.totalFiles).toBe(100);
// Without merge: 100 singletons → 100 batches.
// With merge-small (MAX_MERGE_TARGET=25): ceil(100 / 25) = exactly 4 misc
// batches. Pin the exact count — a loose >=4 && <=8 would mask off-by-one
// regressions in the slice math (e.g., a stride miscalculation that
// splintered the pool into 5-7 underfull buckets).
expect(batches.batches.length).toBe(4);
// All files accounted for
const totalAssigned = batches.batches.reduce((sum, b) => sum + b.files.length, 0);
expect(totalAssigned).toBe(100);
// Bucket-fullness check: 100 singletons evenly divisible by
// MAX_MERGE_TARGET=25, so every bucket must be exactly 25 — not just
// ≤ 25. Drift toward [25, 25, 25, 24, 1] etc. would slip past a
// ≤25 bound while indicating a stride bug.
for (const b of batches.batches) {
expect(b.files.length).toBe(25);
}
// Info: (not Warning:) — merge-small is a routine optimization, not a
// fallback path. See compute-batches.mjs mergeSmallBatches WHY comment.
expect(result.stderr).toMatch(
/Info: compute-batches: merged \d+ small batches \(\d+ files\) into \d+ misc batches/);
expect(result.stderr).not.toMatch(/Warning: compute-batches: merged \d+ small batches/);
});
it('preserves non-mergeable batches: Dockerfile cluster not pooled into misc', () => {
// Dedicated fixture: 30 isolated TS singletons + 1 Dockerfile-only cluster.
// Group A marks the Dockerfile batch mergeable=false; even though its size
// (1) is below MIN_BATCH_SIZE=3, mergeSmallBatches must leave it intact.
const altRoot = setupProject('scan-result-merge-respects-non-mergeable.json');
try {
const result = runScript(altRoot);
expect(result.status).toBe(0);
const out = readBatches(altRoot);
expect(out.totalFiles).toBe(31);
const dockerBatch = out.batches.find(b =>
b.files.some(f => f.path === 'services/api/Dockerfile'));
expect(dockerBatch).toBeDefined();
// Standalone: exactly the Dockerfile, nothing pooled in alongside it.
expect(dockerBatch.files.length).toBe(1);
expect(dockerBatch.files[0].path).toBe('services/api/Dockerfile');
// The TS singletons must still merge into at least one misc batch —
// and that misc batch must NOT contain the Dockerfile.
const miscBatches = out.batches.filter(b =>
b.files.some(f => f.path.startsWith('src/leaf')));
expect(miscBatches.length).toBeGreaterThanOrEqual(1);
for (const m of miscBatches) {
for (const f of m.files) {
expect(f.path).not.toBe('services/api/Dockerfile');
}
}
// Every TS singleton accounted for across the misc bucket(s).
const tsInMisc = miscBatches.flatMap(b => b.files.map(f => f.path))
.filter(p => p.startsWith('src/leaf'));
expect(tsInMisc.length).toBe(30);
} finally {
rmSync(altRoot, { recursive: true, force: true });
}
});
});
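The two merge-small tests above pin down a pool-and-reslice scheme: undersized mergeable batches are pooled, the pool is cut into buckets of at most `MAX_MERGE_TARGET` files, and batches marked `mergeable=false` are left alone. A minimal sketch, assuming that reading; this `mergeSmallBatches` is a hypothetical re-implementation and the real function in compute-batches.mjs may order or label batches differently.

```javascript
// Constants as named in the test comments above.
const MIN_BATCH_SIZE = 3;
const MAX_MERGE_TARGET = 25;

// Hypothetical re-implementation: pool undersized mergeable batches, then
// re-slice the pool into buckets of at most MAX_MERGE_TARGET files.
function mergeSmallBatches(batches) {
  const kept = [];
  const pool = [];
  for (const batch of batches) {
    const small = batch.files.length < MIN_BATCH_SIZE;
    if (small && batch.mergeable !== false) pool.push(...batch.files);
    else kept.push(batch);
  }
  const misc = [];
  for (let i = 0; i < pool.length; i += MAX_MERGE_TARGET) {
    misc.push({ files: pool.slice(i, i + MAX_MERGE_TARGET), mergeable: true });
  }
  return [...kept, ...misc];
}

const singletons = Array.from({ length: 100 }, (_, i) => ({
  files: [`src/leaf${i}.ts`],
}));
const docker = { files: ['services/api/Dockerfile'], mergeable: false };
const merged = mergeSmallBatches([docker, ...singletons]);
console.log(merged.map(b => b.files.length)); // [1, 25, 25, 25, 25]
```

Stepping the slice index by exactly `MAX_MERGE_TARGET` is what makes 100 singletons land in four full buckets; a wrong stride would show up as extra, underfull buckets.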
describe('compute-batches.mjs — --changed-files', () => {
let root;
afterEach(() => {
if (root) rmSync(root, { recursive: true, force: true });
});
it('emits only batches containing changed files', () => {
root = setupProject('scan-result-3-cliques.json');
const changedPath = join(root, 'changed.txt');
// Only the auth clique is changed
writeFileSync(changedPath, ['src/auth/login.ts', 'src/auth/tokens.ts'].join('\n'));
const result = runScript(root, [`--changed-files=${changedPath}`]);
expect(result.status).toBe(0);
const out = readBatches(root);
// Auth files are in batches; other cliques' batches must be omitted
const allPaths = out.batches.flatMap(b => b.files.map(f => f.path));
expect(allPaths).toContain('src/auth/login.ts');
expect(allPaths).toContain('src/auth/tokens.ts');
expect(allPaths).not.toContain('src/api/handlers.ts');
expect(allPaths).not.toContain('src/db/users.ts');
// neighborMap may still reference unchanged files (with their full-graph batchIndex)
const loginBatch = out.batches.find(b =>
b.files.some(f => f.path === 'src/auth/login.ts'));
expect(loginBatch).toBeDefined();
});
});
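The `--changed-files` behavior tested above reduces to filtering the full batch list by membership in a newline-separated path list. A minimal sketch, assuming the file holds one path per line; `filterChangedBatches` is a hypothetical helper, not the script's own code.

```javascript
// Hypothetical helper: keep only batches that contain at least one changed file.
function filterChangedBatches(batches, changedText) {
  const changed = new Set(
    changedText.split('\n').map(line => line.trim()).filter(Boolean),
  );
  return batches.filter(batch => batch.files.some(f => changed.has(f.path)));
}

const batches = [
  { files: [{ path: 'src/auth/login.ts' }, { path: 'src/auth/tokens.ts' }] },
  { files: [{ path: 'src/api/handlers.ts' }] },
  { files: [{ path: 'src/db/users.ts' }] },
];
const kept = filterChangedBatches(batches, 'src/auth/login.ts\nsrc/auth/tokens.ts\n');
console.log(kept.length); // 1
```

Filtering after batching, rather than before, is what lets the emitted batches keep the batch indices computed over the full graph.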
@@ -2,8 +2,8 @@
"""
test_merge_batch_graphs.py: tests for the deterministic tested_by linker.
Run from this directory:
python -m unittest test_merge_batch_graphs.py -v
Run from the repo root:
python -m unittest tests.skill.understand.test_merge_batch_graphs -v
"""
from __future__ import annotations
@@ -20,7 +20,14 @@ from typing import Any
# directly. Load it via importlib so we can call its module-level helpers.
_HERE = Path(__file__).resolve().parent
_MODULE_PATH = _HERE / "merge-batch-graphs.py"
_REPO_ROOT = _HERE.parent.parent.parent
_MODULE_PATH = (
_REPO_ROOT
/ "understand-anything-plugin"
/ "skills"
/ "understand"
/ "merge-batch-graphs.py"
)
def _load_module() -> Any:
@@ -941,5 +948,240 @@ class MergeEdgeDirectionTests(unittest.TestCase):
self.assertEqual(edges[0]["weight"], 0.9)
# ── Multi-part batch handling ─────────────────────────────────────────────
class TestMultiPart(unittest.TestCase):
"""End-to-end tests for batch-<i>-part-<k>.json input handling.
These tests invoke merge-batch-graphs.py as a subprocess in a temp
directory so we exercise the full path: glob, load, merge, write.
"""
def setUp(self) -> None:
import tempfile
self.tmp = Path(tempfile.mkdtemp(prefix="ua-mbg-"))
self.intermediate = self.tmp / ".understand-anything" / "intermediate"
self.intermediate.mkdir(parents=True, exist_ok=True)
def tearDown(self) -> None:
import shutil
shutil.rmtree(self.tmp, ignore_errors=True)
def _write_batch(self, name: str, nodes: list, edges: list) -> None:
import json as _j
(self.intermediate / name).write_text(
_j.dumps({"nodes": nodes, "edges": edges}),
encoding="utf-8",
)
def _run_merge(self) -> tuple[int, str, dict]:
import subprocess
import json as _j
result = subprocess.run(
["python3", str(_MODULE_PATH), str(self.tmp)],
capture_output=True, text=True,
)
out_path = self.intermediate / "assembled-graph.json"
assembled = _j.loads(out_path.read_text(encoding="utf-8")) if out_path.exists() else {}
return result.returncode, result.stderr, assembled
def test_two_parts_of_one_logical_batch_merge(self) -> None:
self._write_batch("batch-1-part-1.json",
[_file_node("src/a.ts")],
[{"source": "file:src/a.ts", "target": "file:src/b.ts",
"type": "imports", "direction": "forward", "weight": 0.7}])
self._write_batch("batch-1-part-2.json",
[_file_node("src/b.ts")],
[])
rc, _stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
node_ids = {n["id"] for n in assembled["nodes"]}
self.assertEqual(node_ids, {"file:src/a.ts", "file:src/b.ts"})
# Cross-part edge survived
edge_keys = {(e["source"], e["target"], e["type"]) for e in assembled["edges"]}
self.assertIn(
("file:src/a.ts", "file:src/b.ts", "imports"), edge_keys)
def test_three_parts_of_one_logical_batch_merge(self) -> None:
for k, path in enumerate(["src/a.ts", "src/b.ts", "src/c.ts"], start=1):
self._write_batch(f"batch-1-part-{k}.json",
[_file_node(path)], [])
rc, _stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
node_ids = {n["id"] for n in assembled["nodes"]}
self.assertEqual(node_ids,
{"file:src/a.ts", "file:src/b.ts", "file:src/c.ts"})
def test_malformed_part_is_skipped_with_warning(self) -> None:
(self.intermediate / "batch-1-part-1.json").write_text(
"{ this is not valid json", encoding="utf-8")
self._write_batch("batch-1-part-2.json",
[_file_node("src/b.ts")], [])
rc, stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
# The skip warning is from existing load_batch logic
self.assertIn("skipping batch-1-part-1.json", stderr)
# part-2 content still made it in
node_ids = {n["id"] for n in assembled["nodes"]}
self.assertEqual(node_ids, {"file:src/b.ts"})
def test_mixed_single_and_multi_part(self) -> None:
self._write_batch("batch-1.json",
[_file_node("src/single.ts")], [])
self._write_batch("batch-2-part-1.json",
[_file_node("src/multi-a.ts")], [])
self._write_batch("batch-2-part-2.json",
[_file_node("src/multi-b.ts")], [])
self._write_batch("batch-3.json",
[_file_node("src/another-single.ts")], [])
rc, _stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
node_ids = {n["id"] for n in assembled["nodes"]}
self.assertEqual(node_ids, {
"file:src/single.ts", "file:src/multi-a.ts",
"file:src/multi-b.ts", "file:src/another-single.ts",
})
def test_missing_part_emits_warning(self) -> None:
# parts {2, 3} present, part-1 missing
self._write_batch("batch-1-part-2.json",
[_file_node("src/b.ts")], [])
self._write_batch("batch-1-part-3.json",
[_file_node("src/c.ts")], [])
rc, stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
self.assertRegex(stderr,
r"Warning: merge: batch 1 has parts \[2, 3\] but "
r"missing part \[1\] — possible truncated write")
def test_stderr_report_format(self) -> None:
self._write_batch("batch-1.json", [_file_node("src/a.ts")], [])
self._write_batch("batch-2-part-1.json", [_file_node("src/b.ts")], [])
self._write_batch("batch-2-part-2.json", [_file_node("src/c.ts")], [])
rc, stderr, _assembled = self._run_merge()
self.assertEqual(rc, 0)
# 3 files on disk, 2 logical batches, 1 multi-part
self.assertIn(
"Found 3 batch files (2 logical batches, 1 multi-part)", stderr)
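The missing-part warning tested above can be derived from the part numbers alone. A minimal JavaScript sketch, assuming parts are numbered contiguously from 1; `missingParts` is a hypothetical helper, and like any such check it can only detect gaps below the highest part number that is present.

```javascript
// Hypothetical helper: given the part numbers found for one logical batch,
// list the parts missing from the contiguous range 1..max.
function missingParts(partNumbers) {
  const present = new Set(partNumbers);
  const max = Math.max(...partNumbers); // -Infinity for an empty list
  const missing = [];
  for (let k = 1; k <= max; k++) {
    if (!present.has(k)) missing.push(k);
  }
  return missing;
}

console.log(missingParts([2, 3])); // [ 1 ]
console.log(missingParts([1, 2, 3])); // []
```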
# ── Unrecognized batch filename handling ───────────────────────────────────
class TestUnrecognizedBatchFilename(unittest.TestCase):
"""When the file-analyzer fuses multiple batches into one output (e.g.,
`batch-fused-8-13.json`, `batch-8-13.json`), the merge script's regex
(which requires `batch-<N>.json` or `batch-<N>-part-<K>.json`) would
otherwise silently drop the contents. The script must warn loudly and
surface the drop in its report so the downstream review step catches it.
"""
def setUp(self) -> None:
import tempfile
self.tmp = Path(tempfile.mkdtemp(prefix="ua-mbg-unrec-"))
self.intermediate = self.tmp / ".understand-anything" / "intermediate"
self.intermediate.mkdir(parents=True, exist_ok=True)
def tearDown(self) -> None:
import shutil
shutil.rmtree(self.tmp, ignore_errors=True)
def _write_batch(self, name: str, nodes: list, edges: list) -> None:
import json as _j
(self.intermediate / name).write_text(
_j.dumps({"nodes": nodes, "edges": edges}),
encoding="utf-8",
)
def _run_merge(self) -> tuple[int, str, dict]:
import subprocess
import json as _j
result = subprocess.run(
["python3", str(_MODULE_PATH), str(self.tmp)],
capture_output=True, text=True,
)
out_path = self.intermediate / "assembled-graph.json"
assembled = _j.loads(out_path.read_text(encoding="utf-8")) if out_path.exists() else {}
return result.returncode, result.stderr, assembled
def test_fused_filename_emits_stderr_warning(self) -> None:
# `batch-fused-3-5.json` does not match the merge regex —
# script must warn on stderr (not silently drop).
self._write_batch("batch-1.json", [_file_node("src/a.ts")], [])
self._write_batch("batch-2.json", [_file_node("src/b.ts")], [])
self._write_batch(
"batch-fused-3-5.json",
[_file_node("src/c.ts"), _file_node("src/d.ts"), _file_node("src/e.ts")],
[],
)
rc, stderr, _assembled = self._run_merge()
self.assertEqual(rc, 0)
self.assertIn("Warning: merge-batch-graphs:", stderr)
self.assertIn("unrecognized filenames", stderr)
self.assertIn("batch-fused-3-5.json", stderr)
# Remediation hint must be present so users know what to fix.
self.assertIn("file-analyzer", stderr)
self.assertIn("batch-<N>.json", stderr)
def test_fused_filename_surfaces_in_report(self) -> None:
# The merge report (printed after the per-file load lines) must
# also flag the drop so Phase 3 review picks it up.
self._write_batch("batch-1.json", [_file_node("src/a.ts")], [])
self._write_batch(
"batch-fused-2-4.json", [_file_node("src/x.ts")], [],
)
rc, stderr, _assembled = self._run_merge()
self.assertEqual(rc, 0)
# "dropped N batch file(s) with unrecognized filenames" appears in the
# report section (printed after "Output: ..." line).
self.assertIn("dropped 1 batch file(s) with unrecognized filenames", stderr)
self.assertIn("batch-fused-2-4.json", stderr)
self.assertIn(
"every node/edge in these files was excluded from the final graph",
stderr,
)
def test_recognized_batches_still_loaded(self) -> None:
# With both recognized and unrecognized files present, recognized
# ones must still produce a valid assembled graph.
self._write_batch("batch-1.json", [_file_node("src/a.ts")], [])
self._write_batch("batch-2.json", [_file_node("src/b.ts")], [])
self._write_batch(
"batch-fused-3-5.json",
[_file_node("src/dropped-c.ts")],
[],
)
rc, _stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
node_ids = {n["id"] for n in assembled["nodes"]}
# batch-1 + batch-2 survive
self.assertIn("file:src/a.ts", node_ids)
self.assertIn("file:src/b.ts", node_ids)
# batch-fused-3-5.json content is excluded
self.assertNotIn("file:src/dropped-c.ts", node_ids)
self.assertEqual(node_ids, {"file:src/a.ts", "file:src/b.ts"})
def test_range_filename_also_unrecognized(self) -> None:
# A bare range like `batch-8-13.json` is just as broken as
# `batch-fused-8-13.json` — both must be flagged. The regex
# `batch-(\d+)(?:-part-(\d+))?\.json` requires the literal
# `-part-` separator before a second number.
self._write_batch("batch-1.json", [_file_node("src/a.ts")], [])
self._write_batch(
"batch-8-13.json",
[_file_node("src/x.ts"), _file_node("src/y.ts")],
[],
)
rc, stderr, assembled = self._run_merge()
self.assertEqual(rc, 0)
self.assertIn("Warning: merge-batch-graphs:", stderr)
self.assertIn("batch-8-13.json", stderr)
# Content is dropped
node_ids = {n["id"] for n in assembled["nodes"]}
self.assertNotIn("file:src/x.ts", node_ids)
self.assertNotIn("file:src/y.ts", node_ids)
if __name__ == "__main__":
unittest.main()
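The filename rule these tests rely on can be checked in isolation. A minimal JavaScript sketch using the pattern quoted in the test comments; anchoring it with `^` and `$` is an assumption about how the merge script applies it, and `classifyBatchFile` is a hypothetical helper.

```javascript
// Pattern quoted in the test comments, anchored here so the whole name must match.
const BATCH_NAME = /^batch-(\d+)(?:-part-(\d+))?\.json$/;

// Hypothetical helper: parse a batch filename or flag it as unrecognized.
function classifyBatchFile(name) {
  const m = BATCH_NAME.exec(name);
  if (!m) return { name, recognized: false };
  return {
    name,
    recognized: true,
    batch: Number(m[1]),
    part: m[2] === undefined ? null : Number(m[2]),
  };
}

for (const name of [
  'batch-1.json',
  'batch-2-part-1.json',
  'batch-fused-3-5.json',
  'batch-8-13.json',
]) {
  console.log(classifyBatchFile(name));
}
```

The last two names fail for different reasons: `fused` is not a number, and `8-13` has a second number without the literal `-part-` separator.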
@@ -0,0 +1,738 @@
import { describe, it, expect, afterEach } from 'vitest';
import {
mkdtempSync,
mkdirSync,
writeFileSync,
readFileSync,
rmSync,
chmodSync,
existsSync,
} from 'node:fs';
import { tmpdir } from 'node:os';
import { join, dirname, resolve } from 'node:path';
import { spawnSync } from 'node:child_process';
import { fileURLToPath } from 'node:url';
const __dirname = dirname(fileURLToPath(import.meta.url));
const SCRIPT = resolve(
__dirname,
'../../../understand-anything-plugin/skills/understand/scan-project.mjs',
);
/**
* Build a project tree from a `{ relPath: contents }` object. Creates parent
* directories as needed. Initializes a real git repo so the script's preferred
* `git ls-files` enumeration path is exercised — tests that need the walker
* fallback can set `gitInit=false`.
*/
function setupTree(files, { gitInit = true } = {}) {
const root = mkdtempSync(join(tmpdir(), 'ua-scan-test-'));
for (const [relPath, contents] of Object.entries(files)) {
const abs = join(root, relPath);
mkdirSync(dirname(abs), { recursive: true });
writeFileSync(abs, contents, 'utf-8');
}
if (gitInit) {
// `git ls-files -co --exclude-standard` lists both cached and untracked
// files (minus gitignored ones), so a `git add` is unnecessary for our
// tests; a plain `git init` is enough for ls-files to enumerate them.
const init = spawnSync('git', ['init', '-q'], { cwd: root, encoding: 'utf-8' });
if (init.status !== 0) {
// CI without git: continue without it; the walker fallback will fire.
}
}
return root;
}
/**
* Tracks every temp output dir created by runScript() so the global
* cleanup can sweep them between tests. The output file must live
* OUTSIDE projectRoot because the project's default ignore patterns
* do NOT exclude `.understand-anything/` (the dir is reserved for
* persistent state, not transient scratch). If we wrote inside
* projectRoot, the second call in the determinism test would
* enumerate the first call's output file and produce drift.
*/
const _runScriptOutputDirs = [];
/**
* Run scan-project.mjs against `projectRoot`. Returns
* { status, stdout, stderr, output } where `output` is the parsed JSON
* written by the script (or null on failure).
*/
function runScript(projectRoot) {
const outputDir = mkdtempSync(join(tmpdir(), 'ua-scan-out-'));
_runScriptOutputDirs.push(outputDir);
const outputPath = join(outputDir, 'scan-output.json');
const result = spawnSync('node', [SCRIPT, projectRoot, outputPath], {
encoding: 'utf-8',
});
let output = null;
try {
output = JSON.parse(readFileSync(outputPath, 'utf-8'));
} catch {
/* output missing on hard failure */
}
return { status: result.status, stdout: result.stdout, stderr: result.stderr, output };
}
/**
* Look up the `files[]` entry for a given path. Returns undefined if not
* present — callers should `expect(byPath('x')).toBeDefined()` first.
*/
function byPath(output, path) {
return output.files.find(f => f.path === path);
}
// Remove every output dir created during a test so none leak into the next
// one. The top-level afterEach fires after each `it()` regardless of which
// describe block it lives in, so a single hook covers the whole file.
afterEach(() => {
while (_runScriptOutputDirs.length) {
const d = _runScriptOutputDirs.pop();
rmSync(d, { recursive: true, force: true });
}
});
describe('scan-project.mjs — language detection', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('maps TypeScript/JavaScript extensions to typescript/javascript', () => {
projectRoot = setupTree({
'a.ts': 'export const a = 1;\n',
'b.tsx': 'export const B = () => null;\n',
'c.js': 'module.exports = {};\n',
'd.jsx': 'export default () => null;\n',
'e.mjs': 'export const e = 1;\n',
'f.cjs': 'module.exports = 1;\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.ts').language).toBe('typescript');
expect(byPath(r.output, 'b.tsx').language).toBe('typescript');
expect(byPath(r.output, 'c.js').language).toBe('javascript');
expect(byPath(r.output, 'd.jsx').language).toBe('javascript');
expect(byPath(r.output, 'e.mjs').language).toBe('javascript');
expect(byPath(r.output, 'f.cjs').language).toBe('javascript');
});
it('maps Python, Go, Rust, Java, Kotlin, C# to their language ids', () => {
projectRoot = setupTree({
'a.py': 'x = 1\n',
'b.go': 'package main\n',
'c.rs': 'fn main() {}\n',
'd.java': 'class D {}\n',
'e.kt': 'fun main() {}\n',
'f.cs': 'class F {}\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.py').language).toBe('python');
expect(byPath(r.output, 'b.go').language).toBe('go');
expect(byPath(r.output, 'c.rs').language).toBe('rust');
expect(byPath(r.output, 'd.java').language).toBe('java');
expect(byPath(r.output, 'e.kt').language).toBe('kotlin');
expect(byPath(r.output, 'f.cs').language).toBe('csharp');
});
it('maps Ruby, PHP, C, C++ to their language ids', () => {
projectRoot = setupTree({
'a.rb': 'puts 1\n',
'b.php': '<?php echo 1;\n',
'c.c': 'int main() { return 0; }\n',
'd.h': 'void f();\n',
'e.cpp': 'int main() {}\n',
'f.hpp': 'class F {};\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.rb').language).toBe('ruby');
expect(byPath(r.output, 'b.php').language).toBe('php');
expect(byPath(r.output, 'c.c').language).toBe('c');
expect(byPath(r.output, 'd.h').language).toBe('c');
expect(byPath(r.output, 'e.cpp').language).toBe('cpp');
expect(byPath(r.output, 'f.hpp').language).toBe('cpp');
});
it('maps web markup (HTML, CSS) to their language ids', () => {
projectRoot = setupTree({
'a.html': '<!doctype html><html></html>\n',
'b.htm': '<html></html>\n',
'c.css': '.a { }\n',
'd.scss': '$x: 1;\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.html').language).toBe('html');
expect(byPath(r.output, 'b.htm').language).toBe('html');
expect(byPath(r.output, 'c.css').language).toBe('css');
expect(byPath(r.output, 'd.scss').language).toBe('css');
});
it('maps configuration formats (YAML, JSON, JSONC, TOML, XML, Markdown) to their language ids', () => {
projectRoot = setupTree({
'a.yaml': 'x: 1\n',
'b.yml': 'x: 1\n',
'c.json': '{}\n',
'd.jsonc': '{ /* c */ }\n',
'e.toml': 'x = 1\n',
'f.xml': '<x/>\n',
'g.md': '# h\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.yaml').language).toBe('yaml');
expect(byPath(r.output, 'b.yml').language).toBe('yaml');
expect(byPath(r.output, 'c.json').language).toBe('json');
expect(byPath(r.output, 'd.jsonc').language).toBe('jsonc');
expect(byPath(r.output, 'e.toml').language).toBe('toml');
expect(byPath(r.output, 'f.xml').language).toBe('xml');
expect(byPath(r.output, 'g.md').language).toBe('markdown');
});
it('maps shell + batch + Dockerfile (no extension) to their language ids', () => {
projectRoot = setupTree({
'a.sh': 'echo 1\n',
'b.bat': '@echo off\n',
Dockerfile: 'FROM node:22\n',
'Dockerfile.dev': 'FROM node:22\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'a.sh').language).toBe('shell');
expect(byPath(r.output, 'b.bat').language).toBe('batch');
expect(byPath(r.output, 'Dockerfile').language).toBe('dockerfile');
expect(byPath(r.output, 'Dockerfile.dev').language).toBe('dockerfile');
});
it('falls back to "unknown" for files with no extension and no filename match', () => {
projectRoot = setupTree({
WEIRD_FILE: 'mystery contents\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'WEIRD_FILE').language).toBe('unknown');
});
it('falls back to bare extension (without dot) for unknown extensions', () => {
projectRoot = setupTree({
'data.weirdext': 'some data\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'data.weirdext').language).toBe('weirdext');
});
});
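The lookup order these tests imply is: filename rules first, then the extension table, then the bare extension, then `unknown`. A minimal sketch under those assumptions; the table here is a small illustrative subset, `detectLanguage` is a hypothetical helper, and dotfiles such as `.env` need a separate rule that this sketch does not cover.

```javascript
// Small illustrative subset; the real table in scan-project.mjs is larger.
const LANGUAGE_BY_EXT = new Map([
  ['.ts', 'typescript'],
  ['.tsx', 'typescript'],
  ['.js', 'javascript'],
  ['.py', 'python'],
  ['.h', 'c'],
  ['.yml', 'yaml'],
]);

// Hypothetical helper mirroring the fallback order the tests describe.
function detectLanguage(relPath) {
  const base = relPath.slice(relPath.lastIndexOf('/') + 1);
  // 1. Filename rules: "Dockerfile" and "Dockerfile.<suffix>" variants.
  if (base === 'Dockerfile' || base.startsWith('Dockerfile.')) return 'dockerfile';
  // 2. No extension (or a leading-dot name): nothing to look up.
  const dot = base.lastIndexOf('.');
  if (dot <= 0) return 'unknown';
  // 3. Known extension, else the bare extension without its dot.
  const ext = base.slice(dot);
  return LANGUAGE_BY_EXT.get(ext) ?? ext.slice(1);
}

for (const p of ['src/a.ts', 'Dockerfile.dev', 'WEIRD_FILE', 'data.weirdext']) {
  console.log(p, '->', detectLanguage(p));
}
```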
describe('scan-project.mjs — category assignment (project-scanner.md Step 4)', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('assigns code to TypeScript, JavaScript, Python, Go, Rust source files', () => {
projectRoot = setupTree({
'src/a.ts': 'export const a = 1;\n',
'src/b.py': 'def b(): pass\n',
'src/c.go': 'package main\n',
'src/d.rs': 'fn main() {}\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'src/a.ts').fileCategory).toBe('code');
expect(byPath(r.output, 'src/b.py').fileCategory).toBe('code');
expect(byPath(r.output, 'src/c.go').fileCategory).toBe('code');
expect(byPath(r.output, 'src/d.rs').fileCategory).toBe('code');
});
it('assigns config to JSON/YAML/TOML/INI/XML', () => {
projectRoot = setupTree({
'package.json': '{}\n',
'tsconfig.json': '{}\n',
'pyproject.toml': '[project]\nname = "p"\n',
'config.yaml': 'x: 1\n',
'app.ini': '[s]\nk=v\n',
'data.xml': '<x/>\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'package.json').fileCategory).toBe('config');
expect(byPath(r.output, 'tsconfig.json').fileCategory).toBe('config');
expect(byPath(r.output, 'pyproject.toml').fileCategory).toBe('config');
expect(byPath(r.output, 'config.yaml').fileCategory).toBe('config');
expect(byPath(r.output, 'app.ini').fileCategory).toBe('config');
expect(byPath(r.output, 'data.xml').fileCategory).toBe('config');
});
it('assigns docs to .md / .rst / .txt (but NOT to LICENSE)', () => {
projectRoot = setupTree({
'README.md': '# x\n',
'docs/guide.rst': 'Guide\n=====\n',
'NOTES.txt': 'notes\n',
LICENSE: 'Apache-2.0\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'README.md').fileCategory).toBe('docs');
expect(byPath(r.output, 'docs/guide.rst').fileCategory).toBe('docs');
expect(byPath(r.output, 'NOTES.txt').fileCategory).toBe('docs');
// LICENSE exception: must NOT be docs. The default ignore filter
// normally drops LICENSE entirely, so we re-include it via
// `!LICENSE` so the category test can fire.
writeFileSync(join(projectRoot, '.understandignore'), '!LICENSE\n');
const r2 = runScript(projectRoot);
expect(r2.status).toBe(0);
const license = byPath(r2.output, 'LICENSE');
expect(license).toBeDefined();
expect(license.fileCategory).not.toBe('docs');
});
it('assigns infra to Dockerfile, docker-compose, .gitlab-ci.yml, .tf, .github/workflows/, Makefile, Jenkinsfile, k8s paths', () => {
projectRoot = setupTree({
Dockerfile: 'FROM node:22\n',
'docker-compose.yml': 'services: {}\n',
'.gitlab-ci.yml': 'stages: []\n',
'infra/main.tf': 'resource "x" "y" {}\n',
'.github/workflows/ci.yml': 'name: ci\n',
Makefile: 'all:\n\t@echo hi\n',
Jenkinsfile: 'pipeline { }\n',
'k8s/deploy.yaml': 'kind: Deployment\n',
'kubernetes/svc.yaml': 'kind: Service\n',
'foo.k8s.yaml': 'kind: ConfigMap\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'Dockerfile').fileCategory).toBe('infra');
expect(byPath(r.output, 'docker-compose.yml').fileCategory).toBe('infra');
expect(byPath(r.output, '.gitlab-ci.yml').fileCategory).toBe('infra');
expect(byPath(r.output, 'infra/main.tf').fileCategory).toBe('infra');
expect(byPath(r.output, '.github/workflows/ci.yml').fileCategory).toBe('infra');
expect(byPath(r.output, 'Makefile').fileCategory).toBe('infra');
expect(byPath(r.output, 'Jenkinsfile').fileCategory).toBe('infra');
expect(byPath(r.output, 'k8s/deploy.yaml').fileCategory).toBe('infra');
expect(byPath(r.output, 'kubernetes/svc.yaml').fileCategory).toBe('infra');
expect(byPath(r.output, 'foo.k8s.yaml').fileCategory).toBe('infra');
});
it('assigns data to SQL, GraphQL, Proto, Prisma, CSV', () => {
projectRoot = setupTree({
'db/schema.sql': 'CREATE TABLE x (id INT);\n',
'api/schema.graphql': 'type X { id: ID! }\n',
'api/types.proto': 'syntax = "proto3";\n',
'prisma/schema.prisma': 'model X { id Int @id }\n',
'data/seed.csv': 'a,b\n1,2\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'db/schema.sql').fileCategory).toBe('data');
expect(byPath(r.output, 'api/schema.graphql').fileCategory).toBe('data');
expect(byPath(r.output, 'api/types.proto').fileCategory).toBe('data');
expect(byPath(r.output, 'prisma/schema.prisma').fileCategory).toBe('data');
expect(byPath(r.output, 'data/seed.csv').fileCategory).toBe('data');
});
it('assigns script to shell + batch files (.sh, .bash, .ps1, .bat)', () => {
projectRoot = setupTree({
'scripts/build.sh': '#!/bin/bash\necho 1\n',
'scripts/run.bash': '#!/bin/bash\necho run\n',
'scripts/build.ps1': 'Write-Output 1\n',
'scripts/setup.bat': '@echo off\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'scripts/build.sh').fileCategory).toBe('script');
expect(byPath(r.output, 'scripts/run.bash').fileCategory).toBe('script');
expect(byPath(r.output, 'scripts/build.ps1').fileCategory).toBe('script');
expect(byPath(r.output, 'scripts/setup.bat').fileCategory).toBe('script');
});
it('assigns markup to HTML + CSS variants', () => {
projectRoot = setupTree({
'public/index.html': '<!doctype html>\n',
'public/page.htm': '<html></html>\n',
'styles/app.css': 'body { }\n',
'styles/app.scss': '$x: 1;\n',
'styles/app.less': '@x: 1;\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'public/index.html').fileCategory).toBe('markup');
expect(byPath(r.output, 'public/page.htm').fileCategory).toBe('markup');
expect(byPath(r.output, 'styles/app.css').fileCategory).toBe('markup');
expect(byPath(r.output, 'styles/app.scss').fileCategory).toBe('markup');
expect(byPath(r.output, 'styles/app.less').fileCategory).toBe('markup');
});
it('priority: docker-compose.yml maps to infra, not config', () => {
// The .yml extension would normally route to `config`, but the
// docker-compose.* filename rule fires first per Step 4 priority.
projectRoot = setupTree({
'docker-compose.yml': 'services: {}\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'docker-compose.yml').fileCategory).toBe('infra');
expect(byPath(r.output, 'docker-compose.yml').language).toBe('yaml');
});
// Regression: path.extname returns '' for `.env` and the second segment
// for `.env.local` — neither hits CATEGORY_BY_EXT['.env']. Dotfile-style
// configs were falling through to `code` / `unknown`. Caught by Codex
// review on PR #204.
it('dotfile configs (.env, .env.local, .env.production) map to config category + config language', () => {
projectRoot = setupTree({
'.env': 'API_KEY=abc\n',
'.env.local': 'LOCAL=1\n',
'.env.production': 'PROD=1\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
for (const p of ['.env', '.env.local', '.env.production']) {
expect(byPath(r.output, p).fileCategory).toBe('config');
// LANGUAGE_BY_EXT['.env'] -> 'config' (the language id itself; not
// a typo — the language for env files is the 'config' bucket).
expect(byPath(r.output, p).language).toBe('config');
}
});
});
describe('scan-project.mjs — .understandignore handling', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('respects .understandignore patterns and increments filteredByIgnore', () => {
// `*.log` is already covered by the hardcoded defaults, so use a custom
// pattern (`fixtures/`) to exercise user-driven drops.
projectRoot = setupTree({
'.understandignore': 'fixtures/\n',
'src/index.ts': 'export const x = 1;\n',
'fixtures/snap1.json': '{ "a": 1 }\n',
'fixtures/snap2.json': '{ "b": 2 }\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
// fixtures/ files dropped
expect(byPath(r.output, 'fixtures/snap1.json')).toBeUndefined();
expect(byPath(r.output, 'fixtures/snap2.json')).toBeUndefined();
// Counted as user-driven
expect(r.output.filteredByIgnore).toBe(2);
});
it('supports `!pattern` negation to re-include defaults-excluded files', () => {
// `*.log` is in the hardcoded defaults; the user re-includes a
// specific file with `!keep.log`. After the override, keep.log MUST
// appear in the output. It is NOT counted in filteredByIgnore (it
// was re-included, not additionally filtered).
projectRoot = setupTree({
'.understandignore': '!keep.log\n',
'src/index.ts': 'export const x = 1;\n',
'keep.log': 'important diagnostic\n',
'drop.log': 'noise\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(byPath(r.output, 'keep.log')).toBeDefined();
// drop.log still excluded by defaults (no negation for it)
expect(byPath(r.output, 'drop.log')).toBeUndefined();
// The defaults dropped drop.log — that's a baseline default drop,
// NOT a user-driven drop. filteredByIgnore should be 0.
expect(r.output.filteredByIgnore).toBe(0);
});
});
describe('scan-project.mjs — special-file recognition', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('Dockerfile (no extension) is language=dockerfile, category=infra', () => {
projectRoot = setupTree({
Dockerfile: 'FROM alpine:3\nCMD ["sh"]\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
const entry = byPath(r.output, 'Dockerfile');
expect(entry).toBeDefined();
expect(entry.language).toBe('dockerfile');
expect(entry.fileCategory).toBe('infra');
});
});
describe('scan-project.mjs — determinism', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('produces byte-identical output across runs for the same input tree', () => {
projectRoot = setupTree({
'README.md': '# project\n',
'src/a.ts': 'export const a = 1;\n',
'src/b.ts': 'export const b = 2;\n',
'src/lib/c.ts': 'export const c = 3;\n',
'package.json': '{}\n',
'tsconfig.json': '{}\n',
});
const r1 = runScript(projectRoot);
const r2 = runScript(projectRoot);
expect(r1.status).toBe(0);
expect(r2.status).toBe(0);
expect(JSON.stringify(r1.output)).toBe(JSON.stringify(r2.output));
});
});
describe('scan-project.mjs — empty repo', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('handles a project with zero files without crashing', () => {
projectRoot = setupTree({}, { gitInit: true });
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.scriptCompleted).toBe(true);
expect(r.output.totalFiles).toBe(0);
expect(r.output.files).toEqual([]);
expect(r.output.filteredByIgnore).toBe(0);
expect(r.output.estimatedComplexity).toBe('small');
});
});
describe('scan-project.mjs — per-file failure resilience', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
// Restore permissions on any chmod'd file before delete, so cleanup
// succeeds even when a test left a 000-permission file behind.
try {
const f = join(projectRoot, 'src/unreadable.ts');
if (existsSync(f)) chmodSync(f, 0o644);
} catch { /* best-effort */ }
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('emits a Warning: and skips a file with unreadable permissions; other files survive', () => {
if (process.platform === 'win32') {
// chmod permission bits don't apply on Windows the same way; skip.
return;
}
if (process.getuid && process.getuid() === 0) {
// Running as root bypasses permission checks; the test cannot exercise
// its failure mode. Skip rather than emit a false pass.
return;
}
projectRoot = setupTree({
'src/good.ts': 'export const good = 1;\n',
'src/unreadable.ts': 'export const bad = 2;\n',
});
// Strip read permission on the synthetic file.
chmodSync(join(projectRoot, 'src/unreadable.ts'), 0o000);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.scriptCompleted).toBe(true);
// The good file is in the output.
expect(byPath(r.output, 'src/good.ts')).toBeDefined();
// The unreadable file is dropped.
expect(byPath(r.output, 'src/unreadable.ts')).toBeUndefined();
// A visible warning was emitted with the documented prefix.
expect(r.stderr).toMatch(
/Warning: scan-project: src\/unreadable\.ts — line count failed/,
);
expect(r.stderr).toMatch(/file skipped from output/);
// Final summary line still fires.
expect(r.stderr).toMatch(
/scan-project: filesScanned=1 filteredByIgnore=0 complexity=small/,
);
});
});
describe('scan-project.mjs — estimatedComplexity thresholds', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
/**
* Build a tree with exactly N .ts files at the top level. Used to
* lock in the complexity-tier boundary points from project-scanner.md
* Step 7: small (≤30), moderate (31-150), large (151-500), very-large
* (>500).
*/
function setupNFiles(n) {
const tree = {};
for (let i = 0; i < n; i++) {
// Pad indices so localeCompare gives the natural order for any N.
tree[`f${String(i).padStart(4, '0')}.ts`] = 'export const x = 1;\n';
}
return setupTree(tree);
}
it('30 files -> small (upper boundary of small)', () => {
projectRoot = setupNFiles(30);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.totalFiles).toBe(30);
expect(r.output.estimatedComplexity).toBe('small');
});
it('31 files -> moderate (lower boundary of moderate)', () => {
projectRoot = setupNFiles(31);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.totalFiles).toBe(31);
expect(r.output.estimatedComplexity).toBe('moderate');
});
it('150 files -> moderate (upper boundary of moderate)', () => {
projectRoot = setupNFiles(150);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.totalFiles).toBe(150);
expect(r.output.estimatedComplexity).toBe('moderate');
});
it('151 files -> large (lower boundary of large)', () => {
projectRoot = setupNFiles(151);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.totalFiles).toBe(151);
expect(r.output.estimatedComplexity).toBe('large');
});
it('501 files -> very-large (lower boundary of very-large)', () => {
projectRoot = setupNFiles(501);
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output.totalFiles).toBe(501);
expect(r.output.estimatedComplexity).toBe('very-large');
});
});
describe('scan-project.mjs — CLI entry guard + invocation', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('invokes successfully via subprocess and produces a parseable output file', () => {
projectRoot = setupTree({
'README.md': '# proj\n',
'src/index.ts': 'export const x = 1;\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
expect(r.output).not.toBeNull();
expect(r.output.scriptCompleted).toBe(true);
// Stats summary line fires on stderr.
expect(r.stderr).toMatch(
/scan-project: filesScanned=2 filteredByIgnore=0 complexity=small/,
);
// Two files captured.
expect(r.output.totalFiles).toBe(2);
});
it('fails fast with usage message when projectRoot is missing', () => {
const result = spawnSync('node', [SCRIPT], { encoding: 'utf-8' });
expect(result.status).toBe(1);
expect(result.stderr).toMatch(/Usage: node scan-project\.mjs/);
});
});
describe('scan-project.mjs — output schema invariants', () => {
let projectRoot;
afterEach(() => {
if (projectRoot) {
rmSync(projectRoot, { recursive: true, force: true });
projectRoot = null;
}
});
it('emits the documented top-level fields with correct shapes', () => {
projectRoot = setupTree({
'src/a.ts': 'export const a = 1;\n',
'README.md': '# x\n',
'package.json': '{}\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
const out = r.output;
expect(out.scriptCompleted).toBe(true);
expect(Array.isArray(out.files)).toBe(true);
expect(typeof out.totalFiles).toBe('number');
expect(out.totalFiles).toBe(out.files.length);
expect(typeof out.filteredByIgnore).toBe('number');
expect(['small', 'moderate', 'large', 'very-large']).toContain(
out.estimatedComplexity,
);
expect(out.stats).toBeDefined();
expect(out.stats.filesScanned).toBe(out.files.length);
expect(typeof out.stats.byCategory).toBe('object');
expect(typeof out.stats.byLanguage).toBe('object');
// Per-file shape
for (const f of out.files) {
expect(typeof f.path).toBe('string');
expect(typeof f.language).toBe('string');
expect(typeof f.sizeLines).toBe('number');
expect([
'code', 'config', 'docs', 'infra', 'data', 'script', 'markup',
]).toContain(f.fileCategory);
}
});
it('files[] is sorted by path.localeCompare', () => {
projectRoot = setupTree({
'zzz.ts': '\n',
'aaa.ts': '\n',
'mmm.ts': '\n',
'subdir/file.ts': '\n',
});
const r = runScript(projectRoot);
expect(r.status).toBe(0);
const paths = r.output.files.map(f => f.path);
const sortedPaths = [...paths].sort((a, b) => a.localeCompare(b));
expect(paths).toEqual(sortedPaths);
});
});
@@ -1,7 +1,7 @@
{
"name": "understand-anything",
"description": "AI-powered codebase understanding — analyze, visualize, and explain any project",
"version": "2.7.4",
"version": "2.7.5",
"author": {
"name": "Lum1104"
},
@@ -52,6 +52,18 @@ cat > $PROJECT_ROOT/.understand-anything/tmp/ua-file-analyzer-input-<batchIndex>
ENDJSON
```
### Cross-batch context (neighborMap)
Your dispatch prompt includes a `neighborMap` — for each file in your batch, it lists project-internal neighbors in OTHER batches (files that import yours or that you import), with their exported symbols.
Use neighborMap as a confidence boost for cross-batch edges (`calls`, `related`, `inherits`, `implements` to nodes outside your batch):
- If your source clearly references a symbol that appears in some `neighbor.symbols`, emit the edge to `function:<neighbor.path>:<symbol>` or `class:<neighbor.path>:<symbol>` with confidence.
- If your source references a cross-batch symbol that is NOT in neighborMap (the project-scanner may not have extracted it), you may still emit the edge if you saw it explicitly in the imported file's surface — but prefer matching neighborMap symbols when available.
- Imports continue to use `batchImportData` (fully resolved), not neighborMap.
The merge script's dangling-edge dropper is the safety net for genuinely unresolvable targets.
### Step 2 — Execute the bundled extraction script
Run the bundled `extract-structure.mjs` script. The `<SKILL_DIR>` path is provided in your dispatch prompt.
@@ -464,12 +476,46 @@ Use these hints for common edge patterns:
- NEVER create self-referencing edges (where source equals target).
- Trust the script's structural extraction. Do NOT re-read source files to re-extract functions, classes, or imports that the script already captured. Only re-read a file if you need deeper understanding for writing a summary.
## Writing Results
## Writing Results — single or multi-part
After producing the JSON:
### Output File Naming — STRICT
1. Write the JSON to: `<project-root>/.understand-anything/intermediate/batch-<batchIndex>.json`
2. The project root and batch index will be provided in your prompt.
3. Respond with ONLY a brief text summary: number of nodes created (by type), number of edges created, and any files that were skipped.
**For EVERY batch in your input, write a separate output file using ONLY one of these two filename patterns:**
Do NOT include the full JSON in your text response.
- `batch-<batchIndex>.json` — single-part output for batch `<batchIndex>`
- `batch-<batchIndex>-part-<k>.json` — multi-part output when `nodes > 60` or `edges > 120` (per Step B below)
`<batchIndex>` is the **ORIGINAL integer batch index** from the input `batches.json`. Even if your dispatch prompt fused multiple batches into one call (e.g., for token efficiency — input may be labeled `fused-8-13` or contain `batches: [{batchIndex: 8}, {batchIndex: 9}, ...]`), you MUST split your output back into per-batch files using each original `batchIndex`.
**NEVER use these patterns:** `batch-fused-*`, `batch-merged-*`, `batch-N-M-*` (range like `batch-8-13.json`), `batches-*`, or any other variant. The downstream merge script (`merge-batch-graphs.py`) requires the regex `batch-(\d+)(?:-part-(\d+))?\.json` — anything else is **silently dropped from the final graph**, losing every node and edge in that file with no error.
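A minimal sketch of how a filename can be checked against the quoted pattern. It assumes the merge script matches the whole filename (the regex is anchored at both ends here); `parseBatchFilename` is a hypothetical helper for illustration, not a function of `merge-batch-graphs.py`.

```javascript
// Pattern quoted from the rule above, anchored so the whole name must match
// (the anchoring is an assumption about how the merge script applies it).
const BATCH_FILE = /^batch-(\d+)(?:-part-(\d+))?\.json$/;

// Hypothetical helper: returns the parsed indices, or null when the
// filename would not be recognised by the merge step.
function parseBatchFilename(name) {
  const m = BATCH_FILE.exec(name);
  if (!m) return null;
  return {
    batchIndex: Number(m[1]),
    part: m[2] === undefined ? null : Number(m[2]),
  };
}

console.log(parseBatchFilename('batch-8.json'));          // → { batchIndex: 8, part: null }
console.log(parseBatchFilename('batch-8-part-2.json'));   // → { batchIndex: 8, part: 2 }
console.log(parseBatchFilename('batch-fused-8-13.json')); // → null
console.log(parseBatchFilename('batch-8-13.json'));       // → null
```

Running candidate filenames through a check like this before writing them is a cheap way to catch a name that would otherwise be dropped without an error.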
**Example.** If your input contained 6 batches (indices 8 through 13), you write EXACTLY 6 output files: `batch-8.json`, `batch-9.json`, `batch-10.json`, `batch-11.json`, `batch-12.json`, `batch-13.json`. Not one combined `batch-fused-8-13.json`. Not one `batch-8-13.json`. Six files, one per original `batchIndex`. Run Steps A through F below independently for each batch's nodes/edges.
**Step A — Compute totals.**
```
nodeCount = nodes.length
edgeCount = edges.length
```
**Step B — Decide split.**
- If `nodeCount ≤ 60` AND `edgeCount ≤ 120`: write ONE file to `.understand-anything/intermediate/batch-<batchIndex>.json`. Done. Skip to Step F.
- Otherwise: `parts = ceil(max(nodeCount / 60, edgeCount / 120))`.
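The split decision in Steps A and B can be sketched as follows. The limits come from the text above; the function and constant names are illustrative only.

```javascript
// Limits quoted from Step B; names are hypothetical.
const MAX_NODES_PER_PART = 60;
const MAX_EDGES_PER_PART = 120;

// Returns how many output files a batch needs.
function computeParts(nodeCount, edgeCount) {
  if (nodeCount <= MAX_NODES_PER_PART && edgeCount <= MAX_EDGES_PER_PART) {
    return 1; // single file: batch-<batchIndex>.json
  }
  return Math.ceil(
    Math.max(nodeCount / MAX_NODES_PER_PART, edgeCount / MAX_EDGES_PER_PART),
  );
}

console.log(computeParts(60, 120)); // → 1
console.log(computeParts(61, 10));  // → 2
console.log(computeParts(50, 250)); // → 3
```

Whichever of the two ratios is larger drives the part count, so an edge-heavy batch can split even when its node count is under the limit.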
**Step C — Partition.**
Sort files in your batch alphabetically by path. Chunk them sequentially into `parts` groups of size `ceil(N / parts)`. For each part:
- All nodes whose `filePath` is in this part's files (for non-file nodes like `module`/`concept`, use the file they belong to).
- All edges whose `source` is in this part's nodes (target may be anywhere — same part, different part of same batch, different batch).
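One way to picture the partition in Step C is the sketch below. It assumes every node carries `id` and `filePath` and every edge carries `source` and `target`; those shapes, the function name, and the use of plain string ordering for "alphabetically" are assumptions for illustration.

```javascript
// Hypothetical shapes: node = { id, filePath }, edge = { source, target }.
// `files` is the list of file paths that belong to this batch.
function partitionBatch(files, nodes, edges, parts) {
  const sorted = [...files].sort(); // plain code-unit order (an assumption)
  const size = Math.ceil(sorted.length / parts);
  const result = [];
  for (let start = 0; start < sorted.length; start += size) {
    const partFiles = new Set(sorted.slice(start, start + size));
    const partNodes = nodes.filter((n) => partFiles.has(n.filePath));
    const ids = new Set(partNodes.map((n) => n.id));
    // An edge follows its source; the target may live in any part or batch.
    const partEdges = edges.filter((e) => ids.has(e.source));
    result.push({ nodes: partNodes, edges: partEdges });
  }
  return result;
}
```

Because the chunk size is rounded up, the loop can produce fewer groups than `parts` (for example 4 files with `parts = 3` gives two groups of 2), so the number of files actually written should be taken from the result length.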
**Step D — Write each part.**
Write part `k` (1-indexed) to `.understand-anything/intermediate/batch-<batchIndex>-part-<k>.json`. Each part is a valid GraphFragment: `{ "nodes": [...], "edges": [...] }`.
**Step E — Self-validate.**
For each file written, verify:
- Valid JSON.
- `nodes` array exists and is well-formed.
- For every edge: `source` and `target` both appear as either (a) a node `id` in this part's nodes, OR (b) a `file:<path>` reference where `<path>` is in `neighborMap` or `batchImportData`, OR (c) a `function:<path>:<symbol>` / `class:<path>:<symbol>` reference where `<symbol>` is in some `neighbor.symbols`.
If validation fails on a part, do NOT silently rebuild. Respond with an explicit error stating which part failed, which edge(s) failed validation, and why. The dispatching session can then retry.
**Step F — Respond.**
Respond with ONLY a brief text summary: parts written (1 or more), total nodes/edges across all parts, any files skipped. Do NOT include JSON content in the response.
@@ -12,246 +12,59 @@ You are a meticulous project inventory specialist. Your job is to scan a codebas
## Task
Scan the project directory provided in the prompt and produce a JSON inventory. You will accomplish this in two phases: first, write and execute a discovery script that performs all deterministic file scanning; second, review the script's results and add a human-readable project description.
Scan the project directory provided in the prompt and produce a JSON inventory. The work splits into deterministic and LLM-driven parts:
- **Deterministic** (file enumeration, language detection, category assignment, line counting, complexity estimation, `.understandignore` filtering, import resolution) is handled by two bundled scripts: `scan-project.mjs` and `extract-import-map.mjs`. Do NOT re-implement any of this logic.
- **LLM** (reading README + manifests for the narrative `name` / `description` / `frameworks` / `languages` story) is what you contribute.
**Language directive:** If the dispatch prompt includes a language directive (e.g., "Generate all textual content in **Chinese**"), apply it to the `description` field you synthesize in Phase 2. Write the description in the specified language using natural, native-level phrasing. Keep technical terms in English when no standard translation exists (e.g., "middleware", "hook", "barrel").
---
## Phase 1 -- Discovery Script
## Phase 1 -- Discovery (bundled scan + LLM narrative)
Write a script that discovers all project files (including non-code files like configs, docs, and infrastructure), detects languages and frameworks, counts lines, and produces structured JSON. Prefer Node.js for the script; fall back to Python if Node.js is unavailable. Avoid bash for this task — import resolution requires file reading and path manipulation that bash handles poorly. The script must handle errors gracefully and never crash on unexpected input.
Phase 1 has three orchestrated steps. Steps **B** and **C** run bundled scripts; step **A** is the only LLM work in this phase.
### Script Requirements
### Step A (LLM) -- Read manifests and README for narrative fields
1. **Accept** the project root directory as `$1` (bash) or `process.argv[2]` (Node.js) or `sys.argv[1]` (Python).
2. **Write** results JSON to the path given as `$2` / `process.argv[3]` / `sys.argv[2]`.
3. **Exit 0** on success.
4. **Exit 1** on fatal error (cannot access directory, etc.). Print the error to stderr.
Read the top-level project files to gather narrative metadata. Do NOT walk the file tree or count files yourself — that is Step B's job.
### What the Script Must Do
Read whichever of these exist at the project root:
- `README.md` (or `README.rst`, `README`) — capture the first ~10 lines for narrative grounding
- `package.json` — extract `name`, `description`, plus `dependencies` / `devDependencies` keys for framework detection
- `pyproject.toml`, `setup.py`, `setup.cfg`, `Pipfile`, `requirements.txt` — Python framework signals
- `Cargo.toml` — Rust project name + `[dependencies]`
- `go.mod` — Go module name + `require` block
- `Gemfile` — Ruby framework signals
- `pom.xml`, `build.gradle`, `build.gradle.kts` — JVM project signals
- `composer.json` — PHP project signals
**Step 1 -- File Discovery**
From these, synthesize:
Discover all tracked files. In order of preference:
- Run `git ls-files` in the project root (most reliable for git repos)
- Fall back to a recursive file listing with exclusions if not a git repo
- **`name`** -- in priority order: `package.json` `name`, `Cargo.toml` `[package].name`, `go.mod` module path's last segment, `pyproject.toml` `[project].name` or `[tool.poetry].name`, else the directory name of the project root.
- **`rawDescription`** -- the `description` field from `package.json` (or its equivalent in the matching manifest), or `""` if none.
- **`readmeHead`** -- the first ~10 lines of `README.md` (or equivalent), or `""` if no README exists.
- **`frameworks`** -- match dependency names against known frameworks: `react`, `vue`, `svelte`, `@angular/core`, `express`, `fastify`, `koa`, `next`, `nuxt`, `vite`, `vitest`, `jest`, `mocha`, `tailwindcss`, `prisma`, `typeorm`, `sequelize`, `mongoose`, `redux`, `zustand`, `mobx`; Python: `django`, `djangorestframework`, `fastapi`, `flask`, `sqlalchemy`, `alembic`, `celery`, `pydantic`, `uvicorn`, `gunicorn`, `aiohttp`, `tornado`, `starlette`, `pytest`, `hypothesis`, `channels`; Ruby: `rails`, `railties`, `sinatra`, `grape`, `rspec`, `sidekiq`, `activerecord`, `actionpack`, `devise`, `pundit`; Go: `github.com/gin-gonic/gin`, `github.com/labstack/echo`, `github.com/gofiber/fiber`, `github.com/go-chi/chi`, `gorm.io/gorm`; Rust: `actix-web`, `axum`, `rocket`, `diesel`, `tokio`, `serde`, `warp`; JVM: `spring-boot`, `spring-web`, `spring-data`, `quarkus`, `micronaut`, `hibernate`, `jakarta`, `junit`, `ktor`. Also infer infrastructure tools from manifest presence: add `Docker` if `Dockerfile` exists in the file list, `Docker Compose` if `docker-compose.yml`/`docker-compose.yaml` exists, `Terraform` if any `*.tf`, `GitHub Actions` if `.github/workflows/*.yml`, `GitLab CI` if `.gitlab-ci.yml`, `Jenkins` if `Jenkinsfile`.
- **`languages`** -- the deduplicated, alphabetically-sorted top-level language set you observe across the manifests + the bundled script's per-file language tally (you will read this from Step B's output).
**Step 2 -- Exclusion Filtering**
If the manifest is missing or malformed, leave the corresponding field empty rather than guessing.
Remove ALL files matching these patterns:
- **Dependency directories:** paths containing `node_modules/`, `.git/`, `vendor/`, `venv/`, `.venv/`, `__pycache__/`
- **Build output:** paths with a directory segment matching `dist/`, `build/`, `out/`, `coverage/`, `.next/`, `.cache/`, `.turbo/`, `target/` (Rust), `obj/` (.NET) — match full directory segments only, not substrings (e.g., `buildSrc/` should NOT be excluded). Note: `bin/` is NOT excluded by default because Node.js and Ruby projects use `bin/` for CLI launchers; .NET users can add `bin/` to `.understandignore`.
- **Lock files:** `*.lock`, `package-lock.json`, `yarn.lock`, `pnpm-lock.yaml`
- **Binary/asset files:** `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`, `.ico`, `.woff`, `.woff2`, `.ttf`, `.eot`, `.mp3`, `.mp4`, `.pdf`, `.zip`, `.tar`, `.gz`
- **Generated files:** `*.min.js`, `*.min.css`, `*.map`, `*.generated.*` (note: do NOT exclude `*.d.ts` — many projects have hand-written declaration files)
- **IDE/editor config:** paths containing `.idea/`, `.vscode/`
- **Misc non-source:** `LICENSE`, `.gitignore`, `.editorconfig`, `.prettierrc`, `.eslintrc*`, `*.log`
### Step B (bundled `scan-project.mjs`) -- File enumeration + language + category + lines
**IMPORTANT:** Do NOT exclude non-code project files. The following MUST be kept:
- Documentation: `*.md`, `*.rst`, `*.txt` (except `LICENSE`)
- Configuration: `*.yaml`, `*.yml`, `*.json`, `*.toml`, `*.xml`, `*.cfg`, `*.ini`, `*.env`, `*.env.example` (include `.env` in the file list but downstream agents should NEVER include `.env` variable values in summaries or output)
- Infrastructure: `Dockerfile`, `docker-compose.*`, `*.tf`, `Makefile`, `Jenkinsfile`, `Procfile`, `Vagrantfile`
- CI/CD: `.github/workflows/*`, `.gitlab-ci.yml`, `.circleci/*`, `Jenkinsfile`
- Data/Schema: `*.sql`, `*.graphql`, `*.gql`, `*.proto`, `*.prisma`, `*.schema.json`
- Web markup: `*.html`, `*.css`, `*.scss`, `*.sass`, `*.less`
- Shell scripts: `*.sh`, `*.bash`, `*.ps1`, `*.bat`
- Kubernetes: `*.k8s.yaml`, `*.k8s.yml`, paths containing `k8s/`, paths containing `kubernetes/`
Invoke the bundled scan script. It walks the project (preferring `git ls-files`, falling back to a recursive walk for non-git directories), applies `.understandignore` filtering (defaults + user patterns), assigns `language` and `fileCategory` per the canonical tables, counts lines, and writes deterministic JSON. You do not see or maintain those tables — they live in the script.
**Note on package manifests:** Config files read for framework detection (`package.json`, `tsconfig.json`, `Cargo.toml`, `go.mod`, `pyproject.toml`, etc.) should also appear in the file list with `fileCategory: "config"`.
**Step 2.5 -- User-Configured Filtering (.understandignore)**
When `.understandignore` files exist, **replace** Step 2's hardcoded filtering with a unified filter that combines defaults and user patterns in a single pass. This ensures `!` negation patterns can override defaults.
1. Check if `$PROJECT_ROOT/.understand-anything/.understandignore` exists. If so, read it.
2. Check if `$PROJECT_ROOT/.understandignore` exists. If so, read it.
3. If neither file exists, skip this step entirely — Step 2's hardcoded filtering is sufficient.
4. If at least one file exists, re-filter the **original file list from Step 1** (not the Step 2 output) using the `createIgnoreFilter` function from `@understand-anything/core`, which merges hardcoded defaults and user patterns into a single `.gitignore`-compatible matcher. This ensures `!` negation in user files can override hardcoded defaults (e.g., `!dist/` force-includes dist/ files).
5. Track the count of additional files removed beyond Step 2's baseline as `filteredByIgnore`.
This filtering must be deterministic (not LLM-based). Use a Node.js script with the `ignore` npm package from `@understand-anything/core`.
**Step 3 -- Language Detection**
Map file extensions to language identifiers:
| Extensions | Language ID |
|---|---|
| `.ts`, `.tsx` | `typescript` |
| `.js`, `.jsx` | `javascript` |
| `.py` | `python` |
| `.go` | `go` |
| `.rs` | `rust` |
| `.java` | `java` |
| `.rb` | `ruby` |
| `.cpp`, `.cc`, `.cxx`, `.h`, `.hpp` | `cpp` |
| `.c` | `c` |
| `.cs` | `csharp` |
| `.swift` | `swift` |
| `.kt` | `kotlin` |
| `.php` | `php` |
| `.vue` | `vue` |
| `.svelte` | `svelte` |
| `.sh`, `.bash` | `shell` |
| `.ps1` | `powershell` |
| `.bat`, `.cmd` | `batch` |
| `.md`, `.rst` | `markdown` |
| `.yaml`, `.yml` | `yaml` |
| `.json` | `json` |
| `.jsonc` | `jsonc` |
| `.toml` | `toml` |
| `.sql` | `sql` |
| `.graphql`, `.gql` | `graphql` |
| `.proto` | `protobuf` |
| `.tf`, `.tfvars` | `terraform` |
| `.html`, `.htm` | `html` |
| `.css`, `.scss`, `.sass`, `.less` | `css` |
| `.xml` | `xml` |
| `.cfg`, `.ini`, `.env` | `config` |
| `Dockerfile` (no extension) | `dockerfile` |
| `Makefile` (no extension) | `makefile` |
| `Jenkinsfile` (no extension) | `jenkinsfile` |
**Fallback:** If a file's extension is not in the table above, set `language` to the lowercased extension (without the leading dot), or `"unknown"` if there is no extension. Never emit `null` — downstream consumers rely on this field being a string.
Collect unique languages, sorted alphabetically.
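The lookup-plus-fallback rule can be sketched as below. The two maps are a small excerpt of the table above, the function names are hypothetical, and the sketch does not cover multi-segment dotfiles such as `.env.local`.

```javascript
// Excerpt of the extension table above; entries beyond these are omitted.
const LANGUAGE_BY_EXT = new Map([
  ['.ts', 'typescript'], ['.py', 'python'], ['.yml', 'yaml'], ['.env', 'config'],
]);
const LANGUAGE_BY_NAME = new Map([
  ['Dockerfile', 'dockerfile'], ['Makefile', 'makefile'],
]);

// Hypothetical helper: always returns a string, never null.
function detectLanguage(filePath) {
  const base = filePath.slice(filePath.lastIndexOf('/') + 1);
  if (LANGUAGE_BY_NAME.has(base)) return LANGUAGE_BY_NAME.get(base);
  const dot = base.lastIndexOf('.');
  if (dot === -1) return 'unknown'; // no extension at all
  const ext = base.slice(dot).toLowerCase(); // for `.env` this is the whole name
  // Fallback: the lowercased extension without its leading dot.
  return LANGUAGE_BY_EXT.get(ext) ?? (ext.slice(1) || 'unknown');
}

// Unique languages across a file list, sorted alphabetically.
function uniqueLanguages(paths) {
  return [...new Set(paths.map(detectLanguage))].sort();
}
```

Returning a string in every branch is what lets downstream consumers skip null checks on `language`.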
**Step 4 -- File Category Detection**
Assign a `fileCategory` to each discovered file based on its extension and path:
| Pattern | Category |
|---|---|
| `.md`, `.rst`, `.txt` (except `LICENSE`) | `docs` |
| `.yaml`, `.yml`, `.json`, `.jsonc`, `.toml`, `.xml`, `.cfg`, `.ini`, `.env`, `tsconfig.json`, `package.json`, `pyproject.toml`, `Cargo.toml`, `go.mod` | `config` |
| `Dockerfile`, `docker-compose.*`, `.tf`, `.tfvars`, `Makefile`, `Jenkinsfile`, `Procfile`, `Vagrantfile`, `.github/workflows/*`, `.gitlab-ci.yml`, `.circleci/*`, `*.k8s.yaml`, `*.k8s.yml`, paths in `k8s/` or `kubernetes/` | `infra` |
| `.sql`, `.graphql`, `.gql`, `.proto`, `.prisma`, `*.schema.json`, `.csv` | `data` |
| `.sh`, `.bash`, `.ps1`, `.bat` | `script` |
| `.html`, `.htm`, `.css`, `.scss`, `.sass`, `.less` | `markup` |
| All other extensions (`.ts`, `.tsx`, `.js`, `.py`, `.go`, `.rs`, etc.) | `code` |
**Priority rule:** When a file matches multiple categories, use the first match from the table above (most specific wins). For example, `docker-compose.yml` is `infra`, not `config`.
**Step 5 -- Line Counting**
For each file, count lines using `wc -l`. For efficiency:
- If fewer than 500 files, count all of them
- If 500+ files, count all of them but batch the `wc -l` calls (pass multiple files per invocation to avoid spawning thousands of processes)
**Step 6 -- Framework Detection**
Read config files (if they exist) and extract framework information:
- `package.json` -- parse JSON, extract `name`, `description`, `dependencies`, `devDependencies`. Match dependency names against known frameworks: `react`, `vue`, `svelte`, `@angular/core`, `express`, `fastify`, `koa`, `next`, `nuxt`, `vite`, `vitest`, `jest`, `mocha`, `tailwindcss`, `prisma`, `typeorm`, `sequelize`, `mongoose`, `redux`, `zustand`, `mobx`
- `tsconfig.json` -- if present, confirms TypeScript usage
- `Cargo.toml` -- if present, confirms Rust project; extract `[package].name`
- `go.mod` -- if present, confirms Go project; extract module name
- `requirements.txt` -- if present, confirms Python project; read line by line and match package names (strip version specifiers) against known Python frameworks: `django`, `djangorestframework`, `fastapi`, `flask`, `sqlalchemy`, `alembic`, `celery`, `pydantic`, `uvicorn`, `gunicorn`, `aiohttp`, `tornado`, `starlette`, `pytest`, `hypothesis`, `channels`
- `pyproject.toml` -- if present, confirms Python project; parse the `[project].dependencies` or `[tool.poetry.dependencies]` section and apply the same Python framework keyword matching as above. Also check for `[tool.pytest.ini_options]` (confirms pytest) and `[tool.django]` (confirms Django).
- `setup.py` / `setup.cfg` / `Pipfile` -- if present, confirms Python project; read and apply Python framework keyword matching
- `Gemfile` -- if present, confirms Ruby project; read and match gem names against known Ruby frameworks: `rails`, `railties`, `sinatra`, `grape`, `rspec`, `sidekiq`, `activerecord`, `actionpack`, `devise`, `pundit`
- `go.mod` dependencies -- if present, read the `require` block and match module paths against known Go frameworks: `github.com/gin-gonic/gin`, `github.com/labstack/echo`, `github.com/gofiber/fiber`, `github.com/go-chi/chi`, `gorm.io/gorm`
- `Cargo.toml` dependencies -- if present, read `[dependencies]` and match crate names against known Rust frameworks: `actix-web`, `axum`, `rocket`, `diesel`, `tokio`, `serde`, `warp`
- `pom.xml` / `build.gradle` / `build.gradle.kts` -- if present, confirms Java/Kotlin project; match dependency names against known JVM frameworks: `spring-boot`, `spring-web`, `spring-data`, `quarkus`, `micronaut`, `hibernate`, `jakarta`, `junit`, `ktor`
Also detect infrastructure tooling from discovered files:
- Presence of `Dockerfile` -> add `Docker` to frameworks
- Presence of `docker-compose.yml` or `docker-compose.yaml` -> add `Docker Compose` to frameworks
- Presence of `*.tf` files -> add `Terraform` to frameworks
- Presence of `.github/workflows/*.yml` -> add `GitHub Actions` to frameworks
- Presence of `.gitlab-ci.yml` -> add `GitLab CI` to frameworks
- Presence of `Jenkinsfile` -> add `Jenkins` to frameworks
**Step 7 -- Complexity Estimation**
Classify by total file count (including non-code files):
- `small`: 1-30 files
- `moderate`: 31-150 files
- `large`: 151-500 files
- `very-large`: >500 files
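The tiers can be expressed as a simple threshold chain. This is a sketch with a hypothetical function name; it treats an empty project as `small`, which is what the empty-repo test earlier in this diff expects.

```javascript
// Hypothetical helper mirroring the tier boundaries listed above.
function estimateComplexity(totalFiles) {
  if (totalFiles <= 30) return 'small'; // includes 0 files
  if (totalFiles <= 150) return 'moderate';
  if (totalFiles <= 500) return 'large';
  return 'very-large';
}

console.log(estimateComplexity(30));  // → small
console.log(estimateComplexity(31));  // → moderate
console.log(estimateComplexity(501)); // → very-large
```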
**Step 8 -- Project Name**
Extract from (in priority order):
1. `package.json` `name` field
2. `Cargo.toml` `[package].name`
3. `go.mod` module path (last segment)
4. `pyproject.toml` -- check `[project].name` first, then `[tool.poetry].name`
5. Directory name of project root
**Step 9 -- Import Resolution**
For each **code-category** file in the discovered list (`fileCategory === "code"`), extract and resolve relative import statements. The goal is to produce a map from each file's path to the list of project-internal files it imports. External package imports are ignored.
**Non-code files** (config, docs, infra, data, script, markup) should have an empty array `[]` in the import map — they do not participate in code-level import resolution.
For each code file, read its content and extract import paths using language-appropriate patterns:
| Language | Import patterns to match |
|---|---|
| TypeScript/JavaScript | Relative: `import ... from './...'` or `'../'`, `require('./...')` or `require('../...')`. **Plus path aliases** from `tsconfig.json` `compilerOptions.paths` and `baseUrl` (e.g. `@/foo` → `<baseUrl>/foo`, `~/foo` → `<baseUrl>/foo`). Read tsconfig.json (if present) and resolve every alias prefix against the discovered file list with the standard extension probes. |
| Python | Both relative AND absolute. Relative: `from .x import y`, `from ..x import y`, `from . import x`. Absolute: `import a.b.c`, `from a.b.c import x[, y, ...]` — try every dotted path against the discovered file list (see resolution algorithm below) and keep matches; non-matches are external packages and are dropped. |
| Go | Paths in `import (...)` blocks that start with the module path from `go.mod` |
| Rust | `use crate::`, `use super::`, `mod x` (within the same crate) |
| Java | `import com.example.foo.Bar;` — try `**/com/example/foo/Bar.java` against the discovered file list; keep matches |
| Kotlin | `import com.example.foo.Bar` — try `**/com/example/foo/Bar.kt` against the discovered file list; keep matches |
| Ruby | Relative: `require_relative '...'` paths. **Plus** `require 'foo/bar'` (load-path) — try `lib/foo/bar.rb`, `app/foo/bar.rb`, `foo/bar.rb` against the discovered file list. |
| PHP | `use Vendor\Pkg\Class;` — read `composer.json` `autoload.psr-4` map (e.g. `"App\\": "src/"`), translate the namespace prefix to its directory, then try `<dir>/Pkg/Class.php` against the discovered file list. Skip imports whose namespace prefix isn't in the autoload map. |
| C / C++ | `#include "foo.h"` (relative to the includer's directory) and `#include <foo.h>` — for both, also probe `include/foo.h`, `src/foo.h`, and the bare path against the discovered file list. Match `.h`, `.hpp`, `.hxx`, `.cuh`. |
For each extracted import path:
1. Compute the resolved file path relative to project root:
- For relative imports (`./x`, `../x`): resolve from the importing file's directory
- Try these extension variants in order if the import has no extension: `.ts`, `.tsx`, `.js`, `.jsx`, `/index.ts`, `/index.js`, `/index.tsx`, `/index.jsx`, `.py`, `.go`, `.rs`, `.rb`
2. Check if the resolved path exists in the discovered file list
3. If yes: add to this file's resolved imports list
4. If no: skip (external, unresolvable, or dynamic import)
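The four steps above can be sketched roughly as follows. The function names and the `discovered` set are hypothetical, and the sketch makes one simplifying assumption: it tries the literal path first and then every probe in the listed order, without first checking whether the specifier already carries an extension.

```javascript
// Extension probes, in the order listed in step 1.
const EXTENSION_PROBES = [
  '.ts', '.tsx', '.js', '.jsx',
  '/index.ts', '/index.js', '/index.tsx', '/index.jsx',
  '.py', '.go', '.rs', '.rb',
];

// Collapse "." and ".." segments; returns null if the path leaves the project root.
function normalizePath(path) {
  const out = [];
  for (const seg of path.split('/')) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') {
      if (out.length === 0) return null; // escapes the project root
      out.pop();
    } else {
      out.push(seg);
    }
  }
  return out.join('/');
}

// Hypothetical resolver: returns a project-relative path from `discovered`
// (a Set of project-relative paths), or null when the import is skipped.
function resolveRelativeImport(importerPath, specifier, discovered) {
  if (!specifier.startsWith('./') && !specifier.startsWith('../')) return null;
  const slash = importerPath.lastIndexOf('/');
  const dir = slash >= 0 ? importerPath.slice(0, slash) : '';
  const base = normalizePath(`${dir}/${specifier}`);
  if (base === null) return null;
  if (discovered.has(base)) return base; // specifier already names a real file
  for (const probe of EXTENSION_PROBES) {
    if (discovered.has(base + probe)) return base + probe;
  }
  return null; // external, unresolvable, or dynamic import
}
```

Returning `null` instead of throwing keeps the caller's loop simple: anything that does not land on a discovered file is dropped from the import list.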
**Python absolute imports — resolution algorithm.** This is the dominant import style in real Python projects, so it MUST be handled:
For `import a.b.c`, try (in order, take first match in the discovered file list):
- `a/b/c.py`
- `a/b/c/__init__.py`
For `from a.b.c import x, y, z`, try (in order, take first match for the module path):
- `a/b/c.py`
- `a/b/c/__init__.py`
If the module path matched as a package (`__init__.py`), additionally probe each imported name `x`/`y`/`z` against:
- `a/b/c/x.py`
- `a/b/c/x/__init__.py`
so that `from package import submodule` resolves to the submodule file. Skip names that don't match (they're class/function imports from inside the package, already covered by the `__init__.py` match).
If NO probe matches, the import is external — drop it.
**Worked example.** Discovered files include `src/utils/formatter.py`, `src/utils/__init__.py`. The line `from src.utils import formatter` resolves to `src/utils/__init__.py` (module match) AND `src/utils/formatter.py` (submodule probe). Both are added to the importer's resolved list.
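A minimal sketch of the probe order described above, reproducing the worked example. The function name `resolvePythonAbsolute` and its parameters are hypothetical; `discovered` is assumed to be a Set of project-relative paths, and `import a.b.c` is modeled by passing an empty list of imported names.

```javascript
// Hypothetical resolver for Python absolute imports.
// modulePath:    dotted module path, e.g. "src.utils"
// importedNames: names after `import` in a `from ... import ...` line ([] for `import a.b.c`)
// discovered:    Set of project-relative file paths
function resolvePythonAbsolute(modulePath, importedNames, discovered) {
  const base = modulePath.split('.').join('/');
  const resolved = [];
  if (discovered.has(`${base}.py`)) {
    // Plain module match: imported names are symbols inside that file.
    resolved.push(`${base}.py`);
  } else if (discovered.has(`${base}/__init__.py`)) {
    // Package match: additionally probe each imported name as a submodule.
    resolved.push(`${base}/__init__.py`);
    for (const name of importedNames) {
      const candidates = [`${base}/${name}.py`, `${base}/${name}/__init__.py`];
      const hit = candidates.find(p => discovered.has(p));
      if (hit) resolved.push(hit);
    }
  }
  return resolved; // an empty array means the import is external and is dropped
}

const discovered = new Set(['src/utils/formatter.py', 'src/utils/__init__.py']);
console.log(resolvePythonAbsolute('src.utils', ['formatter'], discovered));
// prints [ 'src/utils/__init__.py', 'src/utils/formatter.py' ]
```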
Output format in the script result:
```json
"importMap": {
"src/index.ts": ["src/utils.ts", "src/config.ts"],
"src/utils.ts": [],
"README.md": [],
"Dockerfile": [],
"src/components/App.tsx": ["src/hooks/useAuth.ts", "src/store/index.ts"]
}
```bash
mkdir -p $PROJECT_ROOT/.understand-anything/tmp
node $PLUGIN_ROOT/skills/understand/scan-project.mjs \
"$PROJECT_ROOT" \
"$PROJECT_ROOT/.understand-anything/tmp/ua-scan-files.json"
```
Keys are project-relative paths. Values are arrays of resolved project-relative paths. Every key in the file list must appear in `importMap` (use an empty array `[]` if no imports were resolved). External packages and unresolvable imports are omitted entirely.
### Script Output Format
The script must write this exact JSON structure to the output file:
Output JSON shape (you will read this verbatim and merge into the final scan-result):
```json
{
"scriptCompleted": true,
"name": "project-name",
"rawDescription": "Description from package.json or empty string",
"readmeHead": "First 10 lines of README.md or empty string",
"languages": ["javascript", "markdown", "typescript", "yaml"],
"frameworks": ["React", "Vite", "Vitest", "Docker"],
"files": [
{"path": "src/index.ts", "language": "typescript", "sizeLines": 150, "fileCategory": "code"},
{"path": "README.md", "language": "markdown", "sizeLines": 45, "fileCategory": "docs"},
@@ -261,50 +74,106 @@ The script must write this exact JSON structure to the output file:
"totalFiles": 42,
"filteredByIgnore": 0,
"estimatedComplexity": "moderate",
"importMap": {
"src/index.ts": ["src/utils.ts", "src/config.ts"],
"src/utils.ts": [],
"README.md": [],
"Dockerfile": [],
"package.json": []
"stats": {
"filesScanned": 42,
"byCategory": {"code": 28, "config": 6, "docs": 4, "infra": 2, "script": 2},
"byLanguage": {"typescript": 22, "javascript": 6, "json": 5, "markdown": 4, "yaml": 3, "shell": 2}
}
}
```
- `scriptCompleted` (boolean) -- always `true` when the script finishes normally
- `name` (string) -- project name extracted from config or directory name
- `rawDescription` (string) -- raw description from `package.json` or empty string
- `readmeHead` (string) -- first 10 lines of `README.md` or empty string if no README exists
- `languages` (string[]) -- deduplicated, sorted alphabetically
- `frameworks` (string[]) -- only confirmed frameworks; empty array if none detected
- `files` (object[]) -- every discovered file, sorted by `path` alphabetically
- `files[].fileCategory` (string) -- one of: `code`, `config`, `docs`, `infra`, `data`, `script`, `markup`
- `totalFiles` (integer) -- must equal `files.length`
- `filteredByIgnore` (integer) -- count of files removed by `.understandignore` patterns in Step 2.5; 0 if no `.understandignore` file exists
- `estimatedComplexity` (string) -- one of `small`, `moderate`, `large`, `very-large`
- `importMap` (object) -- map from every file path to its list of resolved project-internal import paths; empty array for non-code files and files with no resolved imports; external packages excluded
The script:
- sorts `files` by `path.localeCompare` (deterministic)
- emits `fileCategory ∈ {code, config, docs, infra, data, script, markup}` per file (priority-ordered per the rules below)
- emits `language` as a non-null string for every file (canonical id for known extensions, lowercased extension for unknowns, `"unknown"` for no-extension files that don't match `Dockerfile` / `Makefile` / `Jenkinsfile`)
- counts `filteredByIgnore` as the delta beyond hardcoded defaults — `!`-negation in `.understandignore` correctly re-includes files
- emits `Warning: scan-project: <path> — <reason> — file skipped from output` on stderr for per-file failures (permission denied, malformed unicode, vanished file). Capture these and append to phase warnings.
- emits `scan-project: filesScanned=… filteredByIgnore=… complexity=…` as the final stderr summary line; informational only.
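Although the script is authoritative, a consumer may still want a cheap sanity check before merging its output. The sketch below is an assumption about what such a check could look like; `validateScanOutput` is a hypothetical name, and it tests only the invariants stated above (count, sort order, category set, non-empty `language`).

```javascript
const FILE_CATEGORIES = new Set(['code', 'config', 'docs', 'infra', 'data', 'script', 'markup']);

// Hypothetical check: returns a list of human-readable problems (empty when the output looks sane).
function validateScanOutput(out) {
  const problems = [];
  if (out.totalFiles !== out.files.length) {
    problems.push(`totalFiles=${out.totalFiles} but files.length=${out.files.length}`);
  }
  for (let i = 1; i < out.files.length; i++) {
    // Same comparator the script is described as using for its sort.
    if (out.files[i - 1].path.localeCompare(out.files[i].path) > 0) {
      problems.push(`files not sorted at index ${i}: ${out.files[i].path}`);
    }
  }
  for (const f of out.files) {
    if (!FILE_CATEGORIES.has(f.fileCategory)) {
      problems.push(`unknown fileCategory for ${f.path}: ${f.fileCategory}`);
    }
    if (typeof f.language !== 'string' || f.language.length === 0) {
      problems.push(`missing language for ${f.path}`);
    }
  }
  return problems;
}
```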
### Executing the Script
**Canonical category table** (for the record — the script is authoritative; do NOT re-derive these rules in your prompt):
After writing the script, execute it. `$PROJECT_ROOT` is the project root directory provided in your dispatch prompt:
| Pattern | Category |
|---|---|
| `LICENSE` | `code` (exception — not docs) |
| `Dockerfile`, `Dockerfile.*`, `docker-compose.*`, `compose.yml`/`compose.yaml`, `Makefile`, `Jenkinsfile`, `Procfile`, `Vagrantfile`, `.gitlab-ci.yml`, `.dockerignore`, `.github/workflows/*`, `.circleci/*`, paths in `k8s/` or `kubernetes/`, `*.k8s.yml`/`*.k8s.yaml` | `infra` |
| `.md`, `.mdx`, `.rst`, `.txt`, `.text` (except `LICENSE`) | `docs` |
| `.yaml`, `.yml`, `.json`, `.jsonc`, `.toml`, `.xml`, `.xsl`, `.xsd`, `.plist`, `.cfg`, `.ini`, `.env`, `.properties`, `.csproj`, `.sln`, `.mod`, `.sum`, `.gradle` | `config` |
| `.tf`, `.tfvars` | `infra` |
| `.sql`, `.graphql`, `.gql`, `.proto`, `.prisma`, `.csv`, `.tsv` | `data` |
| `.sh`, `.bash`, `.zsh`, `.ps1`, `.psm1`, `.psd1`, `.bat`, `.cmd` | `script` |
| `.html`, `.htm`, `.css`, `.scss`, `.sass`, `.less` | `markup` |
| Everything else | `code` |
**Priority rule:** most-specific wins. Filename / path rules fire before extension rules — e.g., `docker-compose.yml` is `infra` (not `config`); `.github/workflows/ci.yml` is `infra` (not `config`); `LICENSE` is `code` (not `docs`).
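For illustration only, the table and priority rule can be read as the following lookup. This is not the bundled script's code and should not replace it: the name `categorize` is hypothetical, and lowercasing the extension and matching `k8s/` or `kubernetes/` at any depth are assumptions.

```javascript
const INFRA_NAMES = new Set([
  'Dockerfile', 'Makefile', 'Jenkinsfile', 'Procfile', 'Vagrantfile',
  '.gitlab-ci.yml', '.dockerignore', 'compose.yml', 'compose.yaml',
]);

const EXTENSION_CATEGORIES = {
  docs: ['.md', '.mdx', '.rst', '.txt', '.text'],
  config: ['.yaml', '.yml', '.json', '.jsonc', '.toml', '.xml', '.xsl', '.xsd', '.plist',
    '.cfg', '.ini', '.env', '.properties', '.csproj', '.sln', '.mod', '.sum', '.gradle'],
  infra: ['.tf', '.tfvars'],
  data: ['.sql', '.graphql', '.gql', '.proto', '.prisma', '.csv', '.tsv'],
  script: ['.sh', '.bash', '.zsh', '.ps1', '.psm1', '.psd1', '.bat', '.cmd'],
  markup: ['.html', '.htm', '.css', '.scss', '.sass', '.less'],
};

// Hypothetical lookup: filename and path rules fire before extension rules.
function categorize(path) {
  const base = path.slice(path.lastIndexOf('/') + 1);
  if (base === 'LICENSE') return 'code'; // documented exception
  if (
    INFRA_NAMES.has(base)
    || base.startsWith('Dockerfile.')
    || base.startsWith('docker-compose.')
    || path.startsWith('.github/workflows/')
    || path.startsWith('.circleci/')
    || /(^|\/)(k8s|kubernetes)\//.test(path) // assumption: directory at any depth
    || /\.k8s\.ya?ml$/.test(base)
  ) {
    return 'infra';
  }
  const dot = base.lastIndexOf('.');
  const ext = dot >= 0 ? base.slice(dot).toLowerCase() : '';
  for (const [category, exts] of Object.entries(EXTENSION_CATEGORIES)) {
    if (exts.includes(ext)) return category;
  }
  return 'code'; // everything else
}
```

Checking the name and path rules first is what makes `docker-compose.yml` land in `infra` even though `.yml` alone would map to `config`.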
**`.understandignore` behavior:** the bundled script reads `.understandignore` and `.understand-anything/.understandignore` if present and merges them with the hardcoded defaults via `createIgnoreFilter`. `!`-negation overrides defaults (`!dist/` would re-include `dist/` files). The `filteredByIgnore` counter measures only user-driven drops, not baseline default drops.
If the script exits with a non-zero status, read stderr to diagnose. You have up to 2 retry attempts (re-invocations) before failing the phase. Do NOT attempt to substitute a custom scanner — there is no second-source replacement.
### Step C -- Import Resolution (bundled `extract-import-map.mjs`)
After Step B has produced the file list, invoke the bundled `extract-import-map.mjs` script for deterministic import extraction across all supported code languages. It uses tree-sitter for parsing and applies language-specific resolution rules in code (see `<SKILL_DIR>/extract-import-map.mjs`).
**Do not** attempt to re-implement import patterns. Step B emits `path`/`language`/`fileCategory` for every file; this script consumes that list and produces the `importMap`.
Write the input JSON for the bundled script (the `files[]` array is exactly Step B's `files[]` — pass it through verbatim):
```bash
node $PROJECT_ROOT/.understand-anything/tmp/ua-project-scan.js "$PROJECT_ROOT" "$PROJECT_ROOT/.understand-anything/tmp/ua-scan-results.json"
mkdir -p $PROJECT_ROOT/.understand-anything/tmp
cat > $PROJECT_ROOT/.understand-anything/tmp/ua-import-map-input.json << 'ENDJSON'
{
"projectRoot": "<absolute-project-root>",
"files": [
{"path": "src/index.ts", "language": "typescript", "fileCategory": "code"},
{"path": "README.md", "language": "markdown", "fileCategory": "docs"}
]
}
ENDJSON
```
(Or the equivalent for Python, depending on which language you chose.)
Then run:
If the script exits with a non-zero code, read stderr, diagnose the issue, fix the script, and re-run. You have up to 2 retry attempts.
```bash
node $PLUGIN_ROOT/skills/understand/extract-import-map.mjs \
$PROJECT_ROOT/.understand-anything/tmp/ua-import-map-input.json \
$PROJECT_ROOT/.understand-anything/tmp/ua-import-map-output.json
```
The output JSON has shape:
```json
{
"scriptCompleted": true,
"stats": { "filesScanned": 314, "filesWithImports": 142, "totalEdges": 487 },
"importMap": {
"src/index.ts": ["src/utils.ts", "src/config.ts"],
"src/utils.ts": [],
"README.md": [],
"Dockerfile": []
}
}
```
Read the output JSON and merge the `importMap` field directly into your final scan-result.json (under the same key — `importMap`). The format matches the project-scanner contract: every input file has an entry; non-code files have empty arrays; resolved internal paths only (external packages are dropped).
**Capture stderr** when you run the bundled script. Any line starting with `Warning:` should be appended to phase warnings — the SKILL.md orchestrator captures these for the final report. The script also writes a one-line summary `extract-import-map: filesScanned=… filesWithImports=… totalEdges=…` on completion; you can ignore that line or surface it as informational.
**Languages supported.** The bundled script natively handles import resolution for: TypeScript, JavaScript (including CJS `require()`), Python (relative + absolute + `__init__.py`), Go (go.mod prefix stripping), Rust (`use crate::`, `use super::`, `use self::`, and `mod x;` declarations), Java, Kotlin, C#, Ruby (`require` + `require_relative`), PHP (composer.json PSR-4 autoload), C, and C++ (`#include` with relative + include/ + src/ probes). Languages outside this set get empty arrays — there is no LLM-based fallback.
---
## Phase 2 -- Description and Final Assembly
After the script completes, read `$PROJECT_ROOT/.understand-anything/tmp/ua-scan-results.json`. Do NOT re-run file discovery commands or re-count lines -- trust the script's results entirely.
After Steps A + B + C have all completed, read:
1. `$PROJECT_ROOT/.understand-anything/tmp/ua-scan-files.json` — output of `scan-project.mjs` (file list with language, sizeLines, fileCategory; plus `totalFiles`, `filteredByIgnore`, `estimatedComplexity`).
2. `$PROJECT_ROOT/.understand-anything/tmp/ua-import-map-output.json` — output of `extract-import-map.mjs` (the `importMap` field).
3. Your Step A in-memory notes (`name`, `rawDescription`, `readmeHead`, `frameworks`, `languages` narrative).
**IMPORTANT:** The final output must NOT contain the `scriptCompleted`, `rawDescription`, or `readmeHead` fields. These are intermediate script fields only. Strip them when assembling the final JSON. All other fields — including `importMap` — MUST be preserved exactly as output by the script.
Do NOT re-walk the file tree, re-count lines, or re-derive categories — trust `scan-project.mjs` entirely. Do NOT re-implement import resolution — trust `extract-import-map.mjs` entirely.
Your only task in this phase is to produce the final `description` field:
**IMPORTANT:** The final output must NOT contain the `scriptCompleted` or `stats` fields from either bundled script, nor your transient `rawDescription` / `readmeHead` work-strings. Strip them when assembling the final JSON. The final `importMap` MUST equal the `importMap` field from `extract-import-map.mjs` verbatim (do not edit, re-sort, or filter it). The final `files` array MUST equal Step B's `files` array verbatim (do not re-order, drop, or augment it).
Your only synthesis task in this phase is the final `description` field:
1. If `rawDescription` is non-empty, use it as the basis. Clean it up if needed (remove marketing fluff, ensure it is 1-2 sentences).
2. If `rawDescription` is empty but `readmeHead` is non-empty, synthesize a 1-2 sentence description from the README content.
@@ -334,25 +203,25 @@ Then assemble the final output JSON:
```
**Field requirements:**
- `name` (string): directly from script output
- `name` (string): from your Step A narrative work
- `description` (string): your synthesized 1-2 sentence description
- `languages` (string[]): directly from script output
- `frameworks` (string[]): directly from script output
- `files` (object[]): directly from script output, including `fileCategory` per file
- `totalFiles` (integer): directly from script output
- `filteredByIgnore` (integer): directly from script output
- `estimatedComplexity` (string): directly from script output
- `importMap` (object): directly from script output
- `languages` (string[]): from your Step A narrative work (deduplicated, sorted alphabetically; cross-checked against Step B's `stats.byLanguage` keys)
- `frameworks` (string[]): from your Step A narrative work; only confirmed frameworks (empty array if none detected)
- `files` (object[]): directly from Step B's `files[]` (verbatim, including `fileCategory`)
- `totalFiles` (integer): directly from Step B
- `filteredByIgnore` (integer): directly from Step B
- `estimatedComplexity` (string): directly from Step B
- `importMap` (object): directly from Step C's `importMap` field
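A minimal sketch of the assembly, assuming the three inputs have already been parsed into objects. The function name `assembleScanResult`, its parameter names, and the key order are hypothetical; the point is that `files` and `importMap` are passed through by reference and that `scriptCompleted`, `stats`, `rawDescription`, and `readmeHead` are never copied.

```javascript
// Hypothetical assembler.
// scanFiles:       parsed Step B output (scan-project.mjs)
// importMapOutput: parsed Step C output (extract-import-map.mjs)
// stepA:           { name, languages, frameworks } from the Step A notes
// description:     the synthesized 1-2 sentence description
function assembleScanResult(scanFiles, importMapOutput, stepA, description) {
  if (scanFiles.totalFiles !== scanFiles.files.length) {
    throw new Error('totalFiles does not match files.length');
  }
  return {
    name: stepA.name,
    description,
    languages: [...new Set(stepA.languages)].sort(), // deduplicated, sorted
    frameworks: stepA.frameworks,
    files: scanFiles.files,               // verbatim from Step B
    totalFiles: scanFiles.totalFiles,
    filteredByIgnore: scanFiles.filteredByIgnore,
    estimatedComplexity: scanFiles.estimatedComplexity,
    importMap: importMapOutput.importMap, // verbatim from Step C
  };
}
```

Building the result from an explicit list of keys, rather than spreading the script outputs and deleting fields afterwards, means a new intermediate field added to either script cannot leak into the final JSON by accident.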
## Critical Constraints
- NEVER invent or guess file paths. Every `path` in the `files` array must come from the script's file discovery, which in turn comes from `git ls-files` or a real directory listing.
- NEVER invent or guess file paths. Every `path` in the `files` array must come from `scan-project.mjs`'s output (which itself comes from `git ls-files` or a real directory listing).
- NEVER include files that do not exist on disk.
- ALWAYS validate that `totalFiles` matches the actual length of the `files` array.
- ALWAYS sort `files` by `path` for deterministic output.
- Include ALL discovered project files in `files` -- code, configs, docs, infrastructure, and data files. Only exclude binaries, lock files, generated files, and dependency directories.
- Every file MUST have a `fileCategory` field with one of: `code`, `config`, `docs`, `infra`, `data`, `script`, `markup`.
- Trust the script's output for all structural data. Your only contribution is the `description` field.
- Trust Step B for file enumeration + language detection + category assignment + line counts + complexity. Trust Step C for `importMap`. Your only synthesis is the `description` field (plus the Step A narrative fields: `name`, `frameworks`, `languages`).
- Do NOT re-implement file enumeration, language detection, or category assignment in your discovery script. Use the bundled `scan-project.mjs`. If the table doesn't cover your project type, file an issue rather than ad-hoc handling.
- Do NOT attempt to re-implement import resolution. The bundled `extract-import-map.mjs` handles all 12 supported code languages (TS, JS, Python, Go, Rust, Java, Kotlin, C#, Ruby, PHP, C, C++) deterministically via tree-sitter + per-language resolvers.
- Every file MUST have a `fileCategory` field with one of: `code`, `config`, `docs`, `infra`, `data`, `script`, `markup`. `scan-project.mjs` guarantees this; just don't strip it.
## Writing Results
+5 -3
@@ -1,15 +1,17 @@
{
"name": "@understand-anything/skill",
"version": "2.7.4",
"version": "2.7.5",
"type": "module",
"main": "dist/index.js",
"types": "dist/index.d.ts",
"scripts": {
"build": "tsc",
"test": "vitest run"
"test": "node -e \"console.log('skill tests live at <repo-root>/tests/skill — run via root \\`pnpm test\\`')\""
},
"dependencies": {
"@understand-anything/core": "workspace:*"
"@understand-anything/core": "workspace:*",
"graphology": "~0.26.0",
"graphology-communities-louvain": "^2.0.2"
},
"devDependencies": {
"@types/node": "^22.0.0",
@@ -0,0 +1,7 @@
import { defineConfig } from 'vitest/config';
export default defineConfig({
test: {
include: ['src/**/*.test.{ts,tsx,mjs}'],
},
});
@@ -275,26 +275,32 @@ If the scan result includes `filteredByIgnore > 0`, report:
---
## Phase 1.5 — BATCH
Report: `[Phase 1.5/7] Computing semantic batches...`
Run the bundled batching script:
```bash
node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT
```
Reads `.understand-anything/intermediate/scan-result.json`, writes `.understand-anything/intermediate/batches.json`.
Capture stderr. Append any line starting with `Warning:` to `$PHASE_WARNINGS` for the final report.
If the script exits non-zero, the failure is hard — relay the full stderr to the user as a Phase 1.5 failure. Do not attempt to recover; the script's internal fallback (count-based) already handles recoverable issues. A non-zero exit means a fundamental problem (missing input file, malformed JSON, etc.).
---
## Phase 2 — ANALYZE
### Full analysis path
Batch the file list from Phase 1 into groups of **20-30 files each** (aim for ~25 files per batch for balanced sizes).
Load `.understand-anything/intermediate/batches.json` (produced by Phase 1.5). Iterate the `batches[]` array.
**Batching strategy for non-code files:**
- Group related non-code files together in the same batch when possible:
- Dockerfile + docker-compose.yml + .dockerignore → same batch
- SQL migration files → same batch (ordered by filename)
- CI/CD config files (.github/workflows/*) → same batch
- Documentation files (docs/*.md) → same batch
- This allows the file-analyzer to create cross-file edges (e.g., docker-compose `depends_on` Dockerfile)
- Non-code files can be mixed with code files in the same batch if batch sizes are small
- Each file's `fileCategory` from Phase 1 must be included in the batch file list
Report: `[Phase 2/7] Analyzing files — <totalFiles> files in <totalBatches> batches (up to 5 concurrent)...`
After batching, report the plan to the user:
> `[Phase 2/7] Analyzing files — <totalFiles> files in <totalBatches> batches (up to 5 concurrent)...`
For each batch, dispatch a subagent using the `file-analyzer` agent definition (at `agents/file-analyzer.md`). Run up to **5 subagents concurrently** using parallel dispatch. Append the following additional context:
For each batch, dispatch a subagent using the `file-analyzer` agent definition (at `agents/file-analyzer.md`). Run up to **5 subagents concurrently**. Append the following additional context:
> **Additional context from main session:**
>
@@ -303,14 +309,7 @@ For each batch, dispatch a subagent using the `file-analyzer` agent definition (
>
> $LANGUAGE_DIRECTIVE
Before dispatching each batch, construct `batchImportData` from `$IMPORT_MAP`:
```js
// `batch.files` is this batch's file list; `importMap` holds the parsed $IMPORT_MAP.
const batchImportData = {};
for (const file of batch.files) {
  batchImportData[file.path] = importMap[file.path] ?? [];
}
```
Fill in batch-specific parameters below and dispatch:
Dispatch prompt template (fill in batch-specific values from `batches.json[i]`):
> Analyze these files and produce GraphNode and GraphEdge objects.
> Project root: `$PROJECT_ROOT`
@@ -318,11 +317,16 @@ Fill in batch-specific parameters below and dispatch:
> Languages: `<languages>`
> Batch: `<batchIndex>/<totalBatches>`
> Skill directory (for bundled scripts): `<SKILL_DIR>`
> Write output to: `$PROJECT_ROOT/.understand-anything/intermediate/batch-<batchIndex>.json`
> Output: write to `$PROJECT_ROOT/.understand-anything/intermediate/batch-<batchIndex>.json` (single-file mode) OR `batch-<batchIndex>-part-<k>.json` (split mode, per Step B of your output protocol).
>
> Pre-resolved import data for this batch (use this for all import edge creation — do NOT re-resolve imports from source):
> Pre-resolved import data for this batch (use directly — do NOT re-resolve imports from source):
> ```json
> <batchImportData JSON>
> <batchImportData JSON from batches.json[i].batchImportData>
> ```
>
> Cross-batch neighbors with their exported symbols (confidence boost for cross-batch edges):
> ```json
> <neighborMap JSON from batches.json[i].neighborMap>
> ```
>
> Files to analyze in this batch (every entry MUST be passed through to `batchFiles` with all four fields — `path`, `language`, `sizeLines`, `fileCategory`):
@@ -330,6 +334,8 @@ Fill in batch-specific parameters below and dispatch:
> 2. `<path>` (<sizeLines> lines, language: `<language>`, fileCategory: `<fileCategory>`)
> ...
**Output naming is per-batchIndex — no fusion.** If you fuse multiple small batches into a single file-analyzer dispatch for token efficiency, the dispatched agent must STILL write one output file per original `batchIndex` using `batch-<batchIndex>.json` or `batch-<batchIndex>-part-<k>.json`. The merge script's regex (`batch-(\d+)(?:-part-(\d+))?\.json`) silently drops any other naming (e.g., `batch-fused-8-13.json`, `batch-8-13.json`), losing every node and edge in that file. After each dispatch returns, verify each `batchIndex` in the dispatched input has a corresponding `batch-<batchIndex>.json` (or `batch-<batchIndex>-part-*.json`) on disk before proceeding to the next dispatch.
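The post-dispatch verification could be sketched as below. The helper name `missingBatchOutputs` is hypothetical, and anchoring the pattern with `^` and `$` is an assumption about how the merge script applies its regex; the pattern body itself is the one quoted above.

```javascript
// Pattern body quoted from the merge script; the ^...$ anchors are an assumption.
const BATCH_FILE_RE = /^batch-(\d+)(?:-part-(\d+))?\.json$/;

// Hypothetical check: given the batchIndex values that were dispatched and the
// file names found in the intermediate directory, return the indices with no output.
function missingBatchOutputs(expectedIndices, fileNames) {
  const seen = new Set();
  for (const name of fileNames) {
    const match = BATCH_FILE_RE.exec(name);
    if (match) seen.add(Number(match[1]));
  }
  return expectedIndices.filter(index => !seen.has(index));
}

console.log(missingBatchOutputs(
  [8, 9, 13],
  ['batch-8.json', 'batch-9-part-1.json', 'batch-fused-8-13.json'],
)); // prints [ 13 ]
```

In practice the file names would come from listing `.understand-anything/intermediate/`; a non-empty result means at least one batch must be re-dispatched before merging.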
After ALL batches complete, report to the user: `Phase 2 complete. All <totalBatches> batches analyzed.`
Run the merge-and-normalize script bundled with this skill (located next to this SKILL.md file — use the skill directory path, not the project root):
@@ -337,7 +343,7 @@ Run the merge-and-normalize script bundled with this skill (located next to this
python <SKILL_DIR>/merge-batch-graphs.py $PROJECT_ROOT
```
This script reads all `batch-*.json` files from `$PROJECT_ROOT/.understand-anything/intermediate/`, then in one pass:
This script reads all `batch-*.json` files (including `batch-<i>-part-<k>.json` produced by file-analyzers that split their output) from `$PROJECT_ROOT/.understand-anything/intermediate/`, then in one pass:
- Combines all nodes and edges across batches
- Normalizes node IDs (strips double prefixes, project-name prefixes, adds missing prefixes)
- Normalizes complexity values (`low`→`simple`, `medium`→`moderate`, `high`→`complex`, etc.)
@@ -346,7 +352,7 @@ This script reads all `batch-*.json` files from `$PROJECT_ROOT/.understand-anyth
- Drops dangling edges referencing missing nodes
- Logs all corrections and dropped items to stderr
The merge script also runs a `tested_by` linker that canonicalizes test-coverage edges in two passes. **Pass 1** walks LLM-emitted `tested_by` edges and flips inverted ones in place (the LLM systematically emits `test → production` because it sees the import only when analyzing the test file); semantically broken edges (test↔test, prod↔prod, orphan endpoints) are dropped. **Pass 2** supplements with path-convention pairings (`X.ts` ↔ `X.test.ts`, JS/TS `__tests__/` and `<dir>/test/` walk-out, Python in-package `tests/`, Go `_test.go` sibling, Maven/Gradle `src/test/...` ↔ `src/main/...`, .NET `<svc>/tests/` ↔ `<svc>/src/...` and `<App>.Tests/` ↔ `<App>/`). Production nodes that end up sourcing any `tested_by` edge get a `"tested"` tag. All resulting edges run `production → test`.
The merge script also runs a `tested_by` linker that canonicalizes test-coverage edges in two passes. **Pass 1** walks LLM-emitted `tested_by` edges and flips inverted ones in place; semantically broken edges (test↔test, prod↔prod, orphan endpoints) are dropped. **Pass 2** supplements with path-convention pairings. Production nodes that end up sourcing any `tested_by` edge get a `"tested"` tag. All resulting edges run `production → test`.
Output: `$PROJECT_ROOT/.understand-anything/intermediate/assembled-graph.json`
@@ -354,7 +360,20 @@ Include the script's warnings in `$PHASE_WARNINGS` for the reviewer.
### Incremental update path
Use the changed files list from Phase 0. Batch and dispatch file-analyzer subagents using the same process as above (20-30 files per batch, up to 5 concurrent, with batchImportData constructed from $IMPORT_MAP), but only for changed files.
Write the changed-files list (one path per line) to a temp file:
```bash
git diff <lastCommitHash>..HEAD --name-only > $PROJECT_ROOT/.understand-anything/tmp/changed-files.txt
```
Run compute-batches with `--changed-files`:
```bash
node <SKILL_DIR>/compute-batches.mjs $PROJECT_ROOT \
--changed-files=$PROJECT_ROOT/.understand-anything/tmp/changed-files.txt
```
This produces a `batches.json` that contains only batches with changed files, but neighborMap entries still reference unchanged files (with their full-graph batchIndex) so cross-batch edges remain emittable.
Then dispatch file-analyzer subagents per the same template as the full path.
After batches complete:
1. Remove old nodes whose `filePath` matches any changed file from the existing graph
@@ -0,0 +1,555 @@
#!/usr/bin/env node
/**
* compute-batches.mjs — Phase 1.5 of /understand
*
* Reads scan-result.json, runs Louvain community detection on the import
* graph, and writes batches.json containing batches + neighborMap.
*
* Usage:
* node compute-batches.mjs <project-root> [--changed-files=<path>]
*
* Input: <project-root>/.understand-anything/intermediate/scan-result.json
* Output: <project-root>/.understand-anything/intermediate/batches.json
*/
import { readFileSync, writeFileSync, existsSync, realpathSync } from 'node:fs';
import { dirname, join, resolve } from 'node:path';
import { fileURLToPath, pathToFileURL } from 'node:url';
import { createRequire } from 'node:module';
const __filename = fileURLToPath(import.meta.url);
const PLUGIN_ROOT = resolve(dirname(__filename), '../..');
const require = createRequire(resolve(PLUGIN_ROOT, 'package.json'));
let core;
try {
core = await import(pathToFileURL(require.resolve('@understand-anything/core')).href);
} catch {
core = await import(pathToFileURL(resolve(PLUGIN_ROOT, 'packages/core/dist/index.js')).href);
}
const { TreeSitterPlugin, PluginRegistry, builtinLanguageConfigs, registerAllParsers } = core;
import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';
/**
* For each code file, returns its top-level exported symbol names (functions,
* classes, exported consts). Per-file errors are swallowed into [] with a
* visible warning so a single bad file does not abort batching.
*
* Returns Map<path, string[]>.
*/
async function extractExports(projectRoot, codeFiles) {
let registry;
try {
const tsConfigs = builtinLanguageConfigs.filter(c => c.treeSitter);
const tsPlugin = new TreeSitterPlugin(tsConfigs);
await tsPlugin.init();
registry = new PluginRegistry();
registry.register(tsPlugin);
registerAllParsers(registry);
} catch (err) {
process.stderr.write(
`Warning: compute-batches: tree-sitter init failed (${err.message}) ` +
`— all symbols=[] in neighborMap — cross-batch edges limited to file-level\n`,
);
return new Map(codeFiles.map(f => [f.path, []]));
}
const exportsByPath = new Map();
for (const file of codeFiles) {
const abs = join(projectRoot, file.path);
let content;
try {
content = readFileSync(abs, 'utf-8');
} catch (err) {
process.stderr.write(
`Warning: compute-batches: exports extraction failed for ${file.path} ` +
`(read error: ${err.message}) — symbols=[] in neighborMap — ` +
`cross-batch edges to this file limited to file-level\n`,
);
exportsByPath.set(file.path, []);
continue;
}
try {
const analysis = registry.analyzeFile(file.path, content);
const names = (analysis?.exports || []).map(e => e.name).filter(Boolean);
exportsByPath.set(file.path, names);
} catch (err) {
process.stderr.write(
`Warning: compute-batches: exports extraction failed for ${file.path} ` +
`(analyze error: ${err.message}) — symbols=[] in neighborMap — ` +
`cross-batch edges to this file limited to file-level\n`,
);
exportsByPath.set(file.path, []);
}
}
return exportsByPath;
}
/**
* Build batches for non-code files per Groups A-E in the design spec.
* Returns Array<{ files: FileMeta[], mergeable: boolean }> — caller assigns
* batchIndex. `mergeable=false` for semantic Groups A-D (Dockerfile clusters,
* .github/workflows, .gitlab-ci/.circleci, SQL migrations) preserves their
* boundary intent across the merge-small pass; Group E (catch-all parent-dir
* grouping) is `mergeable=true` so its tiny singletons can be pooled.
*/
function buildNonCodeBatches(nonCodeFiles) {
const byPath = new Map(nonCodeFiles.map(f => [f.path, f]));
const consumed = new Set();
const groups = [];
const dirOf = p => p.includes('/') ? p.slice(0, p.lastIndexOf('/')) : '';
const baseOf = p => p.includes('/') ? p.slice(p.lastIndexOf('/') + 1) : p;
// Group A: per-directory Dockerfile clusters.
const dirsWithDockerfile = new Set(
[...byPath.keys()]
.filter(p => baseOf(p) === 'Dockerfile')
.map(dirOf),
);
for (const dir of [...dirsWithDockerfile].sort()) {
const inDir = [...byPath.keys()].filter(p => dirOf(p) === dir);
const cluster = inDir.filter(p => {
const b = baseOf(p);
return b === 'Dockerfile'
|| b === '.dockerignore'
|| b.startsWith('docker-compose.');
});
if (cluster.length) {
groups.push({ files: cluster.map(p => byPath.get(p)), mergeable: false });
cluster.forEach(p => consumed.add(p));
}
}
// Group B: .github/workflows/*
const ghWorkflows = [...byPath.keys()].filter(
p => p.startsWith('.github/workflows/') && (p.endsWith('.yml') || p.endsWith('.yaml')),
).filter(p => !consumed.has(p));
if (ghWorkflows.length) {
groups.push({ files: ghWorkflows.map(p => byPath.get(p)), mergeable: false });
ghWorkflows.forEach(p => consumed.add(p));
}
// Group C: .gitlab-ci.yml + .circleci/*
const ciFiles = [...byPath.keys()].filter(
p => (p === '.gitlab-ci.yml' || p.startsWith('.circleci/'))
&& !consumed.has(p),
);
if (ciFiles.length) {
groups.push({ files: ciFiles.map(p => byPath.get(p)), mergeable: false });
ciFiles.forEach(p => consumed.add(p));
}
// Group D: SQL migrations per migrations/ or migration/ directory.
// Defensive consumed.has check: no upstream group consumes SQL today, but
// future Group additions could; keep the check for forward-compat.
const migrationDirs = new Set(
[...byPath.keys()]
.filter(p => p.endsWith('.sql'))
.map(dirOf)
.filter(d => /(^|\/)migrations?$/.test(d)),
);
for (const dir of migrationDirs) {
const sqls = [...byPath.keys()]
.filter(p => dirOf(p) === dir && p.endsWith('.sql') && !consumed.has(p))
.sort();
if (sqls.length) {
groups.push({ files: sqls.map(p => byPath.get(p)), mergeable: false });
sqls.forEach(p => consumed.add(p));
}
}
// Group E: all remaining grouped by immediate parent dir, max 20 per batch
const remainingByDir = new Map();
for (const p of [...byPath.keys()].sort()) {
if (consumed.has(p)) continue;
const dir = dirOf(p);
if (!remainingByDir.has(dir)) remainingByDir.set(dir, []);
remainingByDir.get(dir).push(p);
}
// Per design spec: max files per parent-dir batch for Group E.
const MAX_E = 20;
for (const [, paths] of remainingByDir) {
for (let i = 0; i < paths.length; i += MAX_E) {
const slice = paths.slice(i, i + MAX_E);
groups.push({ files: slice.map(p => byPath.get(p)), mergeable: true });
}
}
return groups;
}
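// Illustrative grouping sketch (hypothetical input paths; assumes `dirOf`
// returns the parent directory without a trailing slash, as the helpers
// defined earlier in this file appear to do). Given these non-code files:
//   services/api/Dockerfile, services/api/.dockerignore,
//   .github/workflows/ci.yml, db/migrations/001_init.sql, README.md
// the groups would be expected to contain:
//   A: services/api/Dockerfile, services/api/.dockerignore   mergeable: false
//   B: .github/workflows/ci.yml                              mergeable: false
//   D: db/migrations/001_init.sql                            mergeable: false
//   E: README.md                                             mergeable: true
// Only the Group E batch is eligible for pooling in mergeSmallBatches().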
/**
* Build a lookup map from file path → batchIndex across all batches (code +
* non-code). Used to resolve cross-batch neighbor references in neighborMap.
*/
function buildBatchOfMap(allBatches) {
const m = new Map();
for (const b of allBatches) {
for (const f of b.files) m.set(f.path, b.batchIndex);
}
return m;
}
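// Usage sketch (illustrative; the batch contents are hypothetical):
//   const batchOf = buildBatchOfMap([
//     { batchIndex: 1, files: [{ path: 'src/a.js' }] },
//     { batchIndex: 2, files: [{ path: 'src/b.js' }, { path: 'README.md' }] },
//   ]);
//   batchOf.get('README.md'); // => 2
//   batchOf.has('src/c.js');  // => false (path not in any batch)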
/**
* Returns Map<path, communityId> via Louvain. May throw — caller must catch
* and fall back if it does. Honors UA_COMPUTE_BATCHES_FORCE_LOUVAIN_THROW=1
* to allow tests to exercise the fallback path.
*/
function runLouvain(codeFiles, importMap) {
if (process.env.UA_COMPUTE_BATCHES_FORCE_LOUVAIN_THROW === '1') {
throw new Error('forced throw via UA_COMPUTE_BATCHES_FORCE_LOUVAIN_THROW');
}
const g = new Graph({ type: 'undirected', allowSelfLoops: false });
for (const f of codeFiles) g.addNode(f.path);
for (const [src, targets] of Object.entries(importMap)) {
if (!g.hasNode(src)) continue;
for (const tgt of targets) {
if (!g.hasNode(tgt) || src === tgt || g.hasEdge(src, tgt)) continue;
g.addEdge(src, tgt);
}
}
const cs = louvain(g); // { nodeId: communityId }
return new Map(Object.entries(cs));
}
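// Edge-construction sketch (illustrative; file names are hypothetical, and
// this assumes `Graph` is graphology's Graph, where hasEdge() on an
// undirected graph ignores direction). With codeFiles = a.js, b.js, c.js and
//   importMap = { 'a.js': ['b.js', 'c.js'], 'b.js': ['a.js'], 'x.js': ['a.js'] }
// the loop above adds two undirected edges, a.js-b.js and a.js-c.js:
//   - 'b.js' -> 'a.js' is skipped because that undirected edge already exists;
//   - 'x.js' is skipped because it is not a node (it is not in codeFiles).
// The community ids returned depend on the Louvain run and are not shown here.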
/**
* Returns Map<path, communityId> via alphabetical chunking of `batchSize`
* files per batch. Deterministic, used as fallback when Louvain fails.
*/
function countBasedAssignment(codeFiles, batchSize = 12) {
const out = new Map();
const sorted = [...codeFiles].map(f => f.path).sort();
for (let i = 0; i < sorted.length; i++) {
out.set(sorted[i], `count_${Math.floor(i / batchSize)}`);
}
return out;
}
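// Usage sketch (illustrative; the three-file input is hypothetical):
//   countBasedAssignment([{ path: 'b.js' }, { path: 'a.js' }, { path: 'c.js' }], 2)
//   // => Map { 'a.js' => 'count_0', 'b.js' => 'count_0', 'c.js' => 'count_1' }
// Paths are sorted first, so the assignment does not depend on input order.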
/**
* Pool small mergeable batches into "misc" batches to reduce dispatch overhead.
* Preserves semantic groupings (non-code Groups A-D, marked `mergeable=false`)
* regardless of size; only merges code Louvain singletons / orphans and
* Group E parent-dir batches that fall below MIN_BATCH_SIZE.
*
* On a 314-file microservices-demo run, vanilla Louvain produced 87 singleton
* communities → 87 dispatch tasks of size 1. This pass collapses them into
* ceil(N / MAX_MERGE_TARGET) misc batches, drastically cutting orchestration
* overhead while leaving the high-modularity communities untouched.
*
* Returns the rewritten batch list with reassigned batchIndex (1-based,
* keepers first preserving their relative order, misc batches appended).
*/
function mergeSmallBatches(bareBatches) {
// MIN_BATCH_SIZE=3: below this, file-analyzer dispatch overhead (subagent
// spin-up, prompt setup) dwarfs the per-file analysis cost — not worth a
// standalone batch.
const MIN_BATCH_SIZE = 3;
// MAX_MERGE_TARGET=25: stays below MAX_COMMUNITY_SIZE=35 so the misc-batch
// agent retains headroom for neighborMap context without overflowing.
const MAX_MERGE_TARGET = 25;
const keepers = [];
const smallMergeable = [];
for (const b of bareBatches) {
if (b.mergeable && b.files.length < MIN_BATCH_SIZE) {
smallMergeable.push(b);
} else {
keepers.push(b);
}
}
if (smallMergeable.length === 0) {
// Nothing to merge — strip mergeable flag and renumber for cleanliness.
return keepers.map((b, i) => ({
batchIndex: i + 1,
files: b.files,
}));
}
// Pool and sort deterministically by path so repeated runs match byte-for-byte.
const pooledFiles = smallMergeable
.flatMap(b => b.files)
.sort((a, b) => a.path.localeCompare(b.path));
const miscBatches = [];
for (let i = 0; i < pooledFiles.length; i += MAX_MERGE_TARGET) {
miscBatches.push({ files: pooledFiles.slice(i, i + MAX_MERGE_TARGET) });
}
// Use `Info:` rather than `Warning:` — singleton consolidation is a
// routine optimization, not a fallback/degrade path. Per
// [[feedback_visible_warnings]] only fallbacks should bubble as Warning:
// to the Phase 7 final report. Real warnings would get drowned out if
// every normal Louvain run with singletons (i.e. almost every run) added
// a Warning: line.
process.stderr.write(
`Info: compute-batches: merged ${smallMergeable.length} small batches ` +
`(${pooledFiles.length} files) into ${miscBatches.length} misc batches ` +
`— singletons and orphans consolidated\n`,
);
const final = [...keepers, ...miscBatches];
return final.map((b, i) => ({
batchIndex: i + 1,
files: b.files,
}));
}
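// Worked sketch (illustrative; file objects are hypothetical and reduced to
// their `path`). With MIN_BATCH_SIZE = 3, an input of
//   [ { files: [a.js, b.js, c.js], mergeable: true  },   // kept (size 3)
//     { files: [d.js],             mergeable: true  },   // pooled
//     { files: [Dockerfile],       mergeable: false },   // kept (semantic group)
//     { files: [f.js, e.js],       mergeable: true  } ]  // pooled
// would be expected to return
//   [ { batchIndex: 1, files: [a.js, b.js, c.js] },
//     { batchIndex: 2, files: [Dockerfile] },
//     { batchIndex: 3, files: [d.js, e.js, f.js] } ]     // misc batch, path-sorted
// and to write one Info line to stderr describing the merge.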
// ── Main: load → Louvain (or count-fallback) → enrich → write batches.json ─
async function main() {
const projectRoot = process.argv[2];
if (!projectRoot) {
process.stderr.write('Usage: node compute-batches.mjs <project-root> [--changed-files=<path>]\n');
process.exit(1);
}
let changedFiles = null;
for (const arg of process.argv.slice(3)) {
const m = arg.match(/^--changed-files=(.+)$/);
if (m) {
const p = m[1];
let content;
try {
content = readFileSync(p, 'utf-8');
} catch (err) {
process.stderr.write(
`Error: compute-batches: --changed-files path not readable: ${p} (${err.message})\n`,
);
process.exit(1);
}
const lines = content
.split('\n')
.map(s => s.trim())
.filter(Boolean);
changedFiles = new Set(lines);
}
}
const scanPath = join(projectRoot, '.understand-anything', 'intermediate', 'scan-result.json');
if (!existsSync(scanPath)) {
process.stderr.write(`Error: scan-result.json not found at ${scanPath}\n`);
process.exit(1);
}
const scan = JSON.parse(readFileSync(scanPath, 'utf-8'));
const files = scan.files || [];
const codeFiles = files.filter(f => f.fileCategory === 'code');
const nonCodeFiles = files.filter(f => f.fileCategory !== 'code');
const importMap = scan.importMap || {};
process.stderr.write(`Loaded ${files.length} files (${codeFiles.length} code).\n`);
const exportsByPath = await extractExports(projectRoot, codeFiles);
let algorithm = 'louvain';
let perFileCommunity;
try {
perFileCommunity = runLouvain(codeFiles, importMap);
} catch (err) {
process.stderr.write(
`Warning: compute-batches: Louvain failed (${err.message}) ` +
`— falling back to count-based grouping (12 files/batch) ` +
`— module semantic boundaries lost\n`,
);
perFileCommunity = countBasedAssignment(codeFiles, 12);
algorithm = 'count-fallback';
}
// Group files by community id
const filesByCommunity = new Map();
for (const [path, cid] of perFileCommunity) {
if (!filesByCommunity.has(cid)) filesByCommunity.set(cid, []);
filesByCommunity.get(cid).push(path);
}
// Size enforcement only on louvain output. count-fallback already chunked.
const MAX_COMMUNITY_SIZE = 35;
const splitCommunities = new Map();
let nextSyntheticId = 0;
if (algorithm === 'louvain') {
for (const [cid, paths] of filesByCommunity) {
if (paths.length <= MAX_COMMUNITY_SIZE) {
splitCommunities.set(cid, paths);
continue;
}
process.stderr.write(
`Warning: compute-batches: community size ${paths.length} > max ${MAX_COMMUNITY_SIZE} ` +
`— splitting via alphabetical chunking — modularity may decrease\n`,
);
const sorted = [...paths].sort();
const parts = Math.ceil(paths.length / MAX_COMMUNITY_SIZE);
const perPart = Math.ceil(paths.length / parts);
for (let i = 0; i < parts; i++) {
const slice = sorted.slice(i * perPart, (i + 1) * perPart);
const synthId = `__split_${cid}_${nextSyntheticId++}`;
splitCommunities.set(synthId, slice);
}
}
} else {
for (const [cid, paths] of filesByCommunity) splitCommunities.set(cid, paths);
}
// Sort communities by size desc, then by min-path asc for determinism
const sortedCommunities = [...splitCommunities.entries()]
.sort((a, b) => {
if (b[1].length !== a[1].length) return b[1].length - a[1].length;
const minA = [...a[1]].sort()[0];
const minB = [...b[1]].sort()[0];
return minA.localeCompare(minB);
});
// Build per-batch file list with full file metadata from scan
const fileMetaByPath = new Map(files.map(f => [f.path, f]));
// Safe: every path in a community is a graph node, and graph nodes are a
// subset of files (see addNode loop above). fileMetaByPath.get() can
// never return undefined here.
// First-pass: assemble bare batches (no batchImportData/neighborMap yet).
// All Louvain communities are mergeable=true so the merge-small pass can
// collapse singletons / 2-file orphans. Non-code groups carry per-group
// mergeable flags from buildNonCodeBatches (false for semantic Groups A-D,
// true for Group E catch-all).
const codeBatchObjsBare = sortedCommunities.map(([, paths], idx) => ({
batchIndex: idx + 1,
files: paths.sort().map(p => fileMetaByPath.get(p)),
mergeable: true,
}));
const nonCodeGroups = buildNonCodeBatches(nonCodeFiles);
const nonCodeBatchObjsBare = nonCodeGroups.map((g, i) => ({
batchIndex: codeBatchObjsBare.length + i + 1,
files: g.files,
mergeable: g.mergeable,
}));
const bareBatches = [...codeBatchObjsBare, ...nonCodeBatchObjsBare];
const mergedBareBatches = mergeSmallBatches(bareBatches);
const batchOf = buildBatchOfMap(mergedBareBatches);
// Build reverse import map: target → [sources that import target]
const reverseImportMap = new Map();
for (const [src, targets] of Object.entries(importMap)) {
for (const tgt of targets) {
if (!reverseImportMap.has(tgt)) reverseImportMap.set(tgt, []);
reverseImportMap.get(tgt).push(src);
}
}
// Compute neighbor degree (number of import relations) per path, used for
// truncation when neighborMap[file] has > MAX_NEIGHBORS entries.
const NEIGHBOR_DEGREE = new Map();
for (const f of codeFiles) {
const outDeg = (importMap[f.path] || []).length;
const inDeg = (reverseImportMap.get(f.path) || []).length;
NEIGHBOR_DEGREE.set(f.path, outDeg + inDeg);
}
const MAX_NEIGHBORS = 50;
// Second-pass: enrich each batch with batchImportData + neighborMap
const batches = mergedBareBatches.map(b => {
const batchPaths = new Set(b.files.map(f => f.path));
const batchImportData = {};
const neighborMap = {};
for (const f of b.files) {
batchImportData[f.path] = (importMap[f.path] || []).slice();
// 1-hop neighbors: imports out + imported-by in, excluding same batch.
// Note on truncation: we measure "popularity" by total raw 1-hop neighbor
// count (rawCount), not kept.length. A widely-imported hub like a logger
// module may have N>50 inbound imports but, after Louvain + size
// enforcement, only some land in other batches — kept.length can be < 50
// while the file is still a high-degree hub whose missing relationships
// matter for downstream cross-batch edge confidence. Warning on rawCount
// surfaces this; truncation on kept ensures the JSON stays bounded.
const outNeighbors = importMap[f.path] || [];
const inNeighbors = reverseImportMap.get(f.path) || [];
const all = new Set([...outNeighbors, ...inNeighbors]);
const rawCount = all.size;
const filtered = [...all].filter(p => batchOf.has(p) && !batchPaths.has(p));
let kept = filtered.map(p => ({
path: p,
batchIndex: batchOf.get(p),
symbols: exportsByPath.get(p) || [],
}));
if (rawCount > MAX_NEIGHBORS) {
kept.sort((a, b2) => (NEIGHBOR_DEGREE.get(b2.path) || 0)
- (NEIGHBOR_DEGREE.get(a.path) || 0)
|| a.path.localeCompare(b2.path)); // deterministic tiebreak
const beforeSlice = kept.length;
kept = kept.slice(0, MAX_NEIGHBORS);
process.stderr.write(
`Warning: compute-batches: neighborMap for ${f.path} has high 1-hop degree ${rawCount} ` +
`— exceeds soft cap of ${MAX_NEIGHBORS} — keeping top ${kept.length} cross-batch entries ` +
`(${beforeSlice - kept.length} dropped by degree sort)\n`,
);
}
if (kept.length) neighborMap[f.path] = kept;
}
return { batchIndex: b.batchIndex, files: b.files, batchImportData, neighborMap };
});
let finalBatches = batches;
if (changedFiles) {
finalBatches = batches.filter(b => b.files.some(f => changedFiles.has(f.path)));
// batchIndex on filtered batches retains the full-graph assignment
// (the design says neighborMap should still reference unchanged files'
// full-graph batchIndex). No renumbering.
}
// Note: under --changed-files mode, totalFiles is the FULL project file
// count (unchanged from the input scan) while totalBatches reflects only
// the filtered set written to disk. batchIndex values on the kept batches
// preserve the full-graph assignment so neighborMap references resolve.
const output = {
schemaVersion: 1,
algorithm,
totalFiles: files.length,
totalBatches: finalBatches.length,
exportsByPath: Object.fromEntries(exportsByPath),
batches: finalBatches,
};
const outPath = join(projectRoot, '.understand-anything', 'intermediate', 'batches.json');
writeFileSync(outPath, JSON.stringify(output, null, 2), 'utf-8');
const batchSizes = finalBatches.map(b => b.files.length);
const maxSize = batchSizes.length ? Math.max(...batchSizes) : 0;
const minSize = batchSizes.length ? Math.min(...batchSizes) : 0;
process.stderr.write(
`Wrote ${finalBatches.length} batches (sizes: max=${maxSize}, min=${minSize}) to ${outPath}\n`,
);
}
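// Shape of the batches.json written above, as a sketch (all values are
// illustrative; the symbol names in particular are hypothetical):
//   {
//     "schemaVersion": 1,
//     "algorithm": "louvain",              // or "count-fallback"
//     "totalFiles": 2,
//     "totalBatches": 2,
//     "exportsByPath": { "src/b.js": ["helper"] },
//     "batches": [
//       {
//         "batchIndex": 1,
//         "files": [ { "path": "src/a.js" } ],
//         "batchImportData": { "src/a.js": ["src/b.js"] },
//         "neighborMap": {
//           "src/a.js": [ { "path": "src/b.js", "batchIndex": 2, "symbols": ["helper"] } ]
//         }
//       }
//     ]
//   }
// Under --changed-files, "batches" may omit indices while "totalFiles" stays
// at the full project count, so "totalBatches" can be smaller than the
// highest batchIndex referenced from a neighborMap.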
// ---------------------------------------------------------------------------
// Run only when executed directly as a CLI; importing the module (e.g. from
// tests) must not trigger main().
//
// Canonicalize both sides through realpathSync. Node ESM resolves
// import.meta.url through symlinks but pathToFileURL(process.argv[1]) preserves
// them, so a raw equality check silently no-ops when the script is invoked via
// a symlinked plugin install path (the default in Claude Code / Copilot CLI
// caches). See GitHub issue #162.
// ---------------------------------------------------------------------------
function isCliEntry() {
if (!process.argv[1]) return false;
try {
const modulePath = realpathSync(fileURLToPath(import.meta.url));
const argvPath = realpathSync(process.argv[1]);
return modulePath === argvPath;
} catch {
return false;
}
}
if (isCliEntry()) {
try {
await main();
} catch (err) {
process.stderr.write(`compute-batches.mjs failed: ${err.message}\n${err.stack}\n`);
process.exit(1);
}
}
@@ -1023,11 +1023,74 @@ def main() -> None:
print("Error: no batch-*.json files found in intermediate/", file=sys.stderr)
sys.exit(1)
print(f"Found {len(batch_files)} batch files:", file=sys.stderr)
# Group by logical batch index so the report distinguishes single-batch
# files from multi-part file-analyzer outputs. Files that don't match the
# `batch-<N>.json` / `batch-<N>-part-<K>.json` pattern (e.g. fused
# `batch-fused-8-13.json`, range `batch-8-13.json`) would otherwise be
# silently dropped during load — flag them loudly instead so the user
# can fix the file-analyzer agent.
from collections import defaultdict as _dd
by_batch = _dd(list)
unrecognized_batch_files: list[str] = []
for f in batch_files:
m = re.fullmatch(r"batch-(\d+)(?:-part-(\d+))?\.json", f.name)
if m:
by_batch[int(m.group(1))].append((f.name, int(m.group(2)) if m.group(2) else None))
else:
unrecognized_batch_files.append(f.name)
# Warn up front about filenames that will be skipped during load.
if unrecognized_batch_files:
preview = ", ".join(unrecognized_batch_files[:5])
suffix = (
f" (+{len(unrecognized_batch_files) - 5} more)"
if len(unrecognized_batch_files) > 5
else ""
)
print(
f"Warning: merge-batch-graphs: {len(unrecognized_batch_files)} "
f"batch file(s) with unrecognized filenames will be DROPPED — "
f"files: {preview}{suffix} — fix the file-analyzer agent to use "
f"only batch-<N>.json or batch-<N>-part-<K>.json patterns",
file=sys.stderr,
)
logical_count = len(by_batch)
multi_part = sum(1 for entries in by_batch.values() if len(entries) > 1)
print(
f"Found {len(batch_files)} batch files "
f"({logical_count} logical batches, {multi_part} multi-part):",
file=sys.stderr,
)
# Missing-part detection: for any logical batch with at least one part file,
# the set of part numbers MUST be contiguous starting at 1. Gaps suggest a
# truncated write, so emit a visible warning for the user to investigate.
# Collect into `missing_part_warnings` so they also surface in the final
# phase report; stderr alone gets buried under the per-batch load lines.
missing_part_warnings: list[str] = []
for idx, entries in by_batch.items():
part_nums = [p for (_n, p) in entries if p is not None]
if not part_nums:
continue
present = set(part_nums)
expected = set(range(1, max(part_nums) + 1))
missing = sorted(expected - present)
if missing:
msg = (
f"batch {idx} has parts {sorted(present)} but "
f"missing part {missing} — possible truncated write — "
f"affected nodes/edges may be lost"
)
print(f"Warning: merge: {msg}", file=sys.stderr)
missing_part_warnings.append(msg)
# Load batches — skip unrecognized filenames so they don't pollute the
# merged graph with content the agent labeled incorrectly.
unrecognized_set = set(unrecognized_batch_files)
batches: list[dict[str, Any]] = []
for f in batch_files:
if f.name in unrecognized_set:
continue
batch = load_batch(f)
if batch is not None:
batches.append(batch)
@@ -1042,6 +1105,38 @@ def main() -> None:
# Merge and normalize
assembled, report = merge_and_normalize(batches)
# Surface missing multi-part files to the phase report (parallel to
# unrecognized-filename handling below). Stderr lines emitted during
# batch discovery get buried under per-batch load output — re-emitting
# via the report list ensures the Phase 4 review and final summary see
# the data-loss signal.
if missing_part_warnings:
report.append("")
report.append(
f"Warning: {len(missing_part_warnings)} batch(es) with missing parts "
f"— some nodes/edges silently dropped:"
)
for w in missing_part_warnings:
report.append(f" - {w}")
# Surface unrecognized-filename drops to the phase report so the
# downstream review step sees them, not just stderr.
if unrecognized_batch_files:
preview = ", ".join(unrecognized_batch_files[:5])
suffix = (
f" (+{len(unrecognized_batch_files) - 5} more)"
if len(unrecognized_batch_files) > 5
else ""
)
report.append("")
report.append(
f"Warning: dropped {len(unrecognized_batch_files)} batch file(s) "
f"with unrecognized filenames — files: {preview}{suffix} — "
f"fix the file-analyzer agent to use only batch-<N>.json or "
f"batch-<N>-part-<K>.json patterns (every node/edge in these "
f"files was excluded from the final graph)"
)
# Recover any imports edges file-analyzer batches dropped despite
# `batchImportData` containing them. The project-scanner's importMap
# is the deterministic source of truth.
@@ -0,0 +1,794 @@
#!/usr/bin/env node
/**
* scan-project.mjs
*
* Deterministic file enumeration + language/category detection for the
* project-scanner agent. Replaces the LLM-written prose scanner that used to
* (a) author a per-run Node.js script (`tmp/ua-project-scan.js`), (b) walk the
* file tree, and (c) classify each file via lookup tables in LLM context — a
* pure rule-lookup pass that was being billed at LLM rates and adding many
* minutes of per-run latency on mid-sized monorepos.
*
* What the LLM still owns (Step A of project-scanner.md Phase 1):
* - Reading README + top-level manifests to synthesize `name`,
* `rawDescription`, `readmeHead`, `frameworks`, and the high-level
* `languages` narrative.
*
* What this script owns:
* - File enumeration (git ls-files preferred, recursive walk fallback)
* - `.understandignore` filtering (delegated to core's createIgnoreFilter)
* - Per-file language detection (extension + filename table)
* - Per-file category assignment (priority-ordered rules from
* project-scanner.md Step 4)
* - Line counting
* - Complexity estimation (project-scanner.md Step 7 thresholds)
*
* Usage:
* node scan-project.mjs <projectRoot> <outputPath>
*
* Output JSON (subset of what project-scanner.md Phase 1 expects — the LLM
* agent merges this with Step A's narrative fields and Step C's importMap to
* produce the final scan-result.json):
* {
* "scriptCompleted": true,
* "files": [{ "path": "...", "language": "...", "sizeLines": N, "fileCategory": "..." }, ...],
* "totalFiles": N,
* "filteredByIgnore": M,
* "estimatedComplexity": "small" | "moderate" | "large" | "very-large",
* "stats": { "filesScanned": N, "byCategory": {...}, "byLanguage": {...} }
* }
*
* Logging: stderr only (stdout reserved for piped tooling).
* Per-file resilience: read/stat failures emit
* `Warning: scan-project: <path> — <reason> — file skipped from output`
* to stderr and the file is dropped; the rest of the scan completes.
*
* Determinism: files are sorted by `path.localeCompare` before emission, and
* the underlying enumeration is deterministic (git ls-files returns a stable
* order; the fallback walker sorts each directory's entries).
*/
import { createRequire } from 'node:module';
import { dirname, resolve, join, basename, extname, relative, sep } from 'node:path';
import { fileURLToPath, pathToFileURL } from 'node:url';
import {
existsSync,
readFileSync,
readdirSync,
realpathSync,
statSync,
writeFileSync,
} from 'node:fs';
import { spawnSync } from 'node:child_process';
const __dirname = dirname(fileURLToPath(import.meta.url));
// skills/understand/ -> plugin root is two dirs up
const pluginRoot = resolve(__dirname, '../..');
const require = createRequire(resolve(pluginRoot, 'package.json'));
// ---------------------------------------------------------------------------
// Resolve @understand-anything/core
//
// Two-step resolution: try the workspace-linked package first, fall back to
// the installed plugin cache layout. pathToFileURL() is required on Windows
// because dynamic import() of raw "C:\..." paths throws
// ERR_UNSUPPORTED_ESM_URL_SCHEME (Node parses "C:" as a URL scheme).
// ---------------------------------------------------------------------------
let core;
try {
core = await import(pathToFileURL(require.resolve('@understand-anything/core')).href);
} catch {
core = await import(pathToFileURL(resolve(pluginRoot, 'packages/core/dist/index.js')).href);
}
const { createIgnoreFilter } = core;
// ---------------------------------------------------------------------------
// Language detection
//
// Mirrors the canonical extension list from
// understand-anything-plugin/packages/core/src/languages/configs/* and the
// project-scanner.md Step 3 table. Extensions are matched lowercase;
// filenames (Dockerfile, Makefile, etc.) are matched case-sensitively because
// projects in the wild use the canonical capitalization.
//
// Where the core configs and project-scanner.md diverge (rare), project-
// scanner.md wins because it is the user-facing contract.
// ---------------------------------------------------------------------------
/**
* Extension -> language id. Lowercase keys; lookup is `.ext.toLowerCase()`.
* Includes the legacy Step-3 mapping (.cfg/.ini/.env -> `config`) — note
* that `config` is a language id here, not a category. Category routing
* for these extensions is handled separately in CATEGORY_BY_EXT.
*/
const LANGUAGE_BY_EXT = Object.freeze({
// TypeScript / JavaScript
'.ts': 'typescript',
'.tsx': 'typescript',
'.js': 'javascript',
'.jsx': 'javascript',
'.mjs': 'javascript',
'.cjs': 'javascript',
// Python
'.py': 'python',
'.pyi': 'python',
// Go / Rust / Java / Kotlin / C# / Swift / Lua
'.go': 'go',
'.rs': 'rust',
'.java': 'java',
'.kt': 'kotlin',
'.kts': 'kotlin',
'.cs': 'csharp',
'.swift': 'swift',
'.lua': 'lua',
// Ruby / PHP
'.rb': 'ruby',
'.rake': 'ruby',
'.php': 'php',
// C / C++
'.c': 'c',
'.h': 'c',
'.cpp': 'cpp',
'.cc': 'cpp',
'.cxx': 'cpp',
'.hpp': 'cpp',
'.hxx': 'cpp',
// Vue / Svelte (no tree-sitter extractor, but project-scanner contract
// lists them as code languages — downstream import map will return [])
'.vue': 'vue',
'.svelte': 'svelte',
// Shell / Batch / PowerShell
'.sh': 'shell',
'.bash': 'shell',
'.zsh': 'shell',
'.ps1': 'powershell',
'.psm1': 'powershell',
'.psd1': 'powershell',
'.bat': 'batch',
'.cmd': 'batch',
// Markup / docs
'.html': 'html',
'.htm': 'html',
'.css': 'css',
'.scss': 'css',
'.sass': 'css',
'.less': 'css',
'.md': 'markdown',
'.mdx': 'markdown',
'.rst': 'markdown',
// Config / data
'.yaml': 'yaml',
'.yml': 'yaml',
'.json': 'json',
'.jsonc': 'jsonc',
'.toml': 'toml',
'.xml': 'xml',
'.xsl': 'xml',
'.xsd': 'xml',
'.plist': 'xml',
'.cfg': 'config',
'.ini': 'config',
'.env': 'config',
// Data / schema
'.sql': 'sql',
'.graphql': 'graphql',
'.gql': 'graphql',
'.proto': 'protobuf',
'.prisma': 'prisma',
'.csv': 'csv',
'.tsv': 'csv',
// Infra
'.tf': 'terraform',
'.tfvars': 'terraform',
// JVM build files (categorized via filename-or-extension)
'.gradle': 'gradle',
// .NET project files (mapped to extension-derived ids; downstream
// treats them as config — see CATEGORY_BY_EXT)
'.csproj': 'csproj',
'.sln': 'sln',
'.properties': 'properties',
'.mod': 'mod',
'.sum': 'sum',
});
/**
* Filename (no extension) -> language id. Compared case-sensitively against
* basename(path). Includes the most common no-extension conventions; anything
* NOT in this table with no extension falls back to `unknown`.
*
* Dockerfile.* variants (Dockerfile.dev, Dockerfile.prod) are handled by a
* startsWith check in `detectLanguage()` so we don't have to enumerate every
* possible suffix.
*/
const LANGUAGE_BY_FILENAME = Object.freeze({
Dockerfile: 'dockerfile',
Makefile: 'makefile',
GNUmakefile: 'makefile',
makefile: 'makefile',
Jenkinsfile: 'jenkinsfile',
Procfile: 'procfile',
Vagrantfile: 'vagrantfile',
});
/**
* Detect the language of a file by its path. Lowercase extension lookup,
* then no-extension filename lookup. Never returns null — falls back to
* the lowercased extension (without dot) or 'unknown' if there is no
* extension. Downstream consumers rely on this field always being a string
* (see project-scanner.md Step 3 "Fallback" note).
*/
export function detectLanguage(filePath) {
const base = basename(filePath);
const ext = extname(filePath).toLowerCase();
// Dockerfile.dev, Dockerfile.prod, etc. — common variant form.
if (base === 'Dockerfile' || base.startsWith('Dockerfile.')) return 'dockerfile';
// Dotfile names like .env, .env.local — path.extname returns '' for
// single-segment dotfiles (e.g. '.env') and the SECOND segment for
// compound dotfiles (e.g. '.local' for '.env.local'). Neither hits the
// intended LANGUAGE_BY_EXT['.env'] mapping. Try the leading dotfile
// portion first so `.env`, `.env.local`, `.env.production` all map.
const dotKey = dotfileKey(base);
if (dotKey && LANGUAGE_BY_EXT[dotKey]) return LANGUAGE_BY_EXT[dotKey];
if (ext) {
const byExt = LANGUAGE_BY_EXT[ext];
if (byExt) return byExt;
// Unknown extension → drop the leading dot, lowercase. Never null.
return ext.slice(1);
}
// No-extension file — try filename table.
const byFilename = LANGUAGE_BY_FILENAME[base];
if (byFilename) return byFilename;
return 'unknown';
}
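// Usage sketch (illustrative; the paths are hypothetical examples):
//   detectLanguage('src/App.TSX')     // => 'typescript' (extension lowercased)
//   detectLanguage('Dockerfile.dev')  // => 'dockerfile'
//   detectLanguage('.env.local')      // => 'config'     (dotfileKey yields '.env')
//   detectLanguage('Makefile')        // => 'makefile'
//   detectLanguage('notes.xyz')       // => 'xyz'        (unknown extension, dot dropped)
//   detectLanguage('LICENSE')         // => 'unknown'    (no extension, not in the table)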
/**
* Extract the canonical dotfile "extension" from a basename, or null.
*
* `.env` -> `.env`
* `.env.local` -> `.env`
* `.bashrc` -> `.bashrc`
* `package.json` -> null (not a dotfile)
*
* Used by both detectLanguage and detectCategory so dotfile-style configs
* (e.g., `.env`, `.env.local`, `.env.production`) get their leading
* segment treated as the implicit extension instead of falling through
* to `unknown` / `code`.
*/
function dotfileKey(base) {
if (!base.startsWith('.')) return null;
const m = base.match(/^(\.[a-z0-9]+)/i);
return m ? m[1].toLowerCase() : null;
}
// ---------------------------------------------------------------------------
// Category detection
//
// Implements the priority-ordered rules from project-scanner.md Step 4.
// Order matters: more specific rules must run before more general ones
// (e.g. `docker-compose.yml` is infra, not config).
//
// Categories: code | config | docs | infra | data | script | markup
// ---------------------------------------------------------------------------
/**
* Extension -> category. Used only after the higher-priority path-based
* checks (infra/docs exclusions) in `detectCategory()`. Plain extension
* lookup is intentionally last-resort — many configs need their full path
* inspected first.
*/
const CATEGORY_BY_EXT = Object.freeze({
// docs
'.md': 'docs',
'.mdx': 'docs',
'.rst': 'docs',
'.txt': 'docs',
'.text': 'docs',
// config
'.yaml': 'config',
'.yml': 'config',
'.json': 'config',
'.jsonc': 'config',
'.toml': 'config',
'.xml': 'config',
'.xsl': 'config',
'.xsd': 'config',
'.plist': 'config',
'.cfg': 'config',
'.ini': 'config',
'.env': 'config',
'.properties': 'config',
'.csproj': 'config',
'.sln': 'config',
'.mod': 'config',
'.sum': 'config',
'.gradle': 'config',
// infra
'.tf': 'infra',
'.tfvars': 'infra',
// data
'.sql': 'data',
'.graphql': 'data',
'.gql': 'data',
'.proto': 'data',
'.prisma': 'data',
'.csv': 'data',
'.tsv': 'data',
// script
'.sh': 'script',
'.bash': 'script',
'.zsh': 'script',
'.ps1': 'script',
'.psm1': 'script',
'.psd1': 'script',
'.bat': 'script',
'.cmd': 'script',
// markup
'.html': 'markup',
'.htm': 'markup',
'.css': 'markup',
'.scss': 'markup',
'.sass': 'markup',
'.less': 'markup',
});
/**
* Filenames (no extension or full filename with extension) that always
* map to `infra` regardless of their extension. Compared case-sensitively
* against basename(path).
*/
const INFRA_FILENAMES = new Set([
'Dockerfile',
'.dockerignore',
'Makefile',
'GNUmakefile',
'makefile',
'Jenkinsfile',
'Procfile',
'Vagrantfile',
'.gitlab-ci.yml',
]);
/**
* Detect the project-scanner category for a file. Priority order matches
* project-scanner.md Step 4 "Priority rule" — most specific wins.
*
* 1. LICENSE -> code (per the spec note "except LICENSE"). The Step-2
* exclusion table normally removes LICENSE, but if a project chooses to
* re-include it via `.understandignore` negation, it should NOT land in
* docs. We classify as `code` rather than inventing a new bucket.
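* 2. Filename-based infra (Dockerfile and Dockerfile.* variants, Makefile,
* Jenkinsfile, docker-compose.*, compose.yml/compose.yaml, Vagrantfile,
* Procfile, .gitlab-ci.yml, .dockerignore).
* 3. Path-based infra (.github/workflows/, .circleci/, k8s/, kubernetes/,
* *.k8s.yml, *.k8s.yaml).
* 4. Extension-based mapping (CATEGORY_BY_EXT), then the same table keyed by
* the leading dotfile segment (so `.env.local` maps like `.env`).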
* 5. Fallback: `code` (matches the spec — "All other extensions").
*/
export function detectCategory(filePath) {
const base = basename(filePath);
const ext = extname(filePath).toLowerCase();
const posix = filePath.split(sep).join('/');
// Rule 1: LICENSE exception (project-scanner.md Step 4 table comment).
if (base === 'LICENSE') return 'code';
// Rule 2: infra by filename — Dockerfile + variants, Makefile,
// Jenkinsfile, docker-compose.*, Procfile, Vagrantfile, .gitlab-ci.yml,
// .dockerignore.
if (INFRA_FILENAMES.has(base)) return 'infra';
if (base === 'Dockerfile' || base.startsWith('Dockerfile.')) return 'infra';
if (base.startsWith('docker-compose.')) return 'infra';
if (base === 'compose.yml' || base === 'compose.yaml') return 'infra';
// Rule 3: infra by path.
if (posix.startsWith('.github/workflows/')) return 'infra';
if (posix.startsWith('.circleci/')) return 'infra';
// Match a `k8s/` or `kubernetes/` segment anywhere in the path.
if (/(^|\/)(k8s|kubernetes)\//.test(posix)) return 'infra';
// `*.k8s.yml` and `*.k8s.yaml` — Kubernetes-flavored YAML.
if (/\.k8s\.(ya?ml)$/i.test(base)) return 'infra';
// Rule 4: extension-based lookup.
if (ext) {
const byExt = CATEGORY_BY_EXT[ext];
if (byExt) return byExt;
}
// Rule 4.5: dotfile-style configs (.env, .env.local, .env.production).
// path.extname misses these — see dotfileKey docstring.
const dotKey = dotfileKey(base);
if (dotKey) {
const byDot = CATEGORY_BY_EXT[dotKey];
if (byDot) return byDot;
}
// Rule 5: fallback. No filename-based config catch-all exists for the
// extensionless config files commonly seen in JVM/Go/.NET projects (only
// the infra filenames above are special-cased). We don't enumerate every
// possible config filename here; the language map's no-extension entries
// handle those names for language detection. Anything not matched above
// falls through to `code`.
return 'code';
}
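// Usage sketch (illustrative; the paths are hypothetical examples):
//   detectCategory('LICENSE')                     // => 'code'   (Rule 1)
//   detectCategory('docker-compose.prod.yml')     // => 'infra'  (Rule 2, beats '.yml' => config)
//   detectCategory('compose.yaml')                // => 'infra'  (Rule 2)
//   detectCategory('.github/workflows/ci.yml')    // => 'infra'  (Rule 3)
//   detectCategory('deploy/k8s/api.yaml')         // => 'infra'  (Rule 3, k8s/ segment)
//   detectCategory('docs/guide.md')               // => 'docs'   (Rule 4)
//   detectCategory('db/migrations/001_init.sql')  // => 'data'   (Rule 4)
//   detectCategory('.env.production')             // => 'config' (Rule 4.5, dotfileKey '.env')
//   detectCategory('src/index.ts')                // => 'code'   (fallback)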
// ---------------------------------------------------------------------------
// Complexity estimation (project-scanner.md Step 7)
// ---------------------------------------------------------------------------
/**
* Map a total file count to a complexity tier. Tier ranges are inclusive on
* both ends:
* - small: 1-30
* - moderate: 31-150
* - large: 151-500
* - very-large: >500
*
* Edge case: 0 files maps to `small` (the lowest tier) so the field is
* always set even on empty repos. Downstream consumers treat 0 files as
* a sentinel for "nothing to analyze" via `totalFiles`, not complexity.
*/
export function estimateComplexity(totalFiles) {
if (totalFiles <= 30) return 'small';
if (totalFiles <= 150) return 'moderate';
if (totalFiles <= 500) return 'large';
return 'very-large';
}
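/*
Example (illustrative only): the tier boundaries are easiest to check at the edges of each range. `tierFor` below is a hypothetical local copy of the comparison chain above, repeated so the snippet runs on its own.

```javascript
// Hypothetical copy of the threshold chain in estimateComplexity.
function tierFor(totalFiles) {
  if (totalFiles <= 30) return 'small';
  if (totalFiles <= 150) return 'moderate';
  if (totalFiles <= 500) return 'large';
  return 'very-large';
}

for (const n of [0, 30, 31, 150, 151, 500, 501]) {
  console.log(n, tierFor(n));
}
// 0 small
// 30 small
// 31 moderate
// 150 moderate
// 151 large
// 500 large
// 501 very-large
```
*/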
// ---------------------------------------------------------------------------
// File enumeration
// ---------------------------------------------------------------------------
/**
* Normalize a path to forward-slash POSIX. The project-scanner contract
* emits POSIX paths; we re-normalize so the output is stable across
* Windows/macOS/Linux.
*/
function toPosix(p) {
return p.split(sep).join('/');
}
/**
* Enumerate all files in `projectRoot` via `git ls-files`. Returns an
* array of project-relative POSIX paths, or null if the directory is not
* a git repository (or git is not installed). Caller falls back to the
* recursive walker.
*
* Why git ls-files first: with `-co --exclude-standard` it lists tracked
* files plus untracked files that are not ignored, so the repo's
* `.gitignore` is respected, submodules are handled sensibly, and the
* listing is fast and deterministic. The walker has no .gitignore awareness
* and usually returns many more paths, so the ignore filter has to do more
* work in the fallback path.
*/
function enumerateViaGit(projectRoot) {
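// `-z` emits NUL-terminated paths verbatim. Without it, git C-quotes paths
// that contain non-ASCII or special characters (see core.quotePath), and
// the quoted strings would not match real files on disk.
const result = spawnSync('git', ['ls-files', '-z', '-co', '--exclude-standard'], {
cwd: projectRoot,
encoding: 'utf-8',
maxBuffer: 256 * 1024 * 1024, // 256MB: huge monorepos can produce >10MB of paths
});
if (result.status !== 0 || !result.stdout) return null;
// Each entry is one project-relative path. Git emits forward slashes on
// every OS, so toPosix is only a safety net. Entries are not trimmed:
// leading and trailing whitespace is legal in file names.
return result.stdout
.split('\0')
.filter(Boolean)
.map(toPosix);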
}
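/*
Example (illustrative only): a sketch of splitting a file listing, assuming it was produced with git's `-z` option, which terminates each path with a NUL byte and leaves the path unquoted. `splitNulSeparated` is a hypothetical helper name; the sample string stands in for real git output.

```javascript
// Hypothetical helper: split NUL-terminated `git ls-files -z` output.
function splitNulSeparated(stdout) {
  // Every entry ends with NUL, so the last split element is empty.
  return stdout.split('\0').filter((entry) => entry.length > 0);
}

const sample = 'src/a.js\0docs/read me.md\0 leading-space.txt\0';
console.log(splitNulSeparated(sample));
// [ 'src/a.js', 'docs/read me.md', ' leading-space.txt' ]
```

Splitting on NUL rather than on newlines keeps unusual names intact, including names with surrounding whitespace, because nothing needs to be trimmed or unquoted.
*/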
/**
* Recursive directory walker — fallback when `git ls-files` is unavailable
* (no git, not a repo, or git refused). Skips hard-coded "obviously bad"
* directory names BEFORE invoking the ignore filter so we don't waste cycles
* descending into `node_modules/` etc. on huge trees.
*
* Returns project-relative POSIX paths. Entries are visited in name-sorted
* order within each directory, so the result does not depend on OS-specific
* readdir order; main() still applies the final path sort.
*/
function enumerateViaWalk(projectRoot) {
// Hard skip — these directories are universally non-source and skipping
// at the walker level avoids materializing thousands of node_modules
// paths before the ignore filter drops them. The ignore filter still
// runs on everything else.
const HARD_SKIP_DIRS = new Set([
'node_modules',
'.git',
'.svn',
'.hg',
'__pycache__',
]);
const out = [];
function walk(absDir) {
let entries;
try {
entries = readdirSync(absDir, { withFileTypes: true });
} catch (err) {
process.stderr.write(
`Warning: scan-project: ${toPosix(relative(projectRoot, absDir)) || '.'} ` +
`— directory read failed (${err.message}) — subtree skipped\n`,
);
return;
}
// Sort deterministically by name; mix files and dirs together so the
// final output (after the path sort) is identical regardless of
// OS-specific readdir order.
entries.sort((a, b) => a.name.localeCompare(b.name));
for (const ent of entries) {
if (ent.isDirectory()) {
if (HARD_SKIP_DIRS.has(ent.name)) continue;
walk(join(absDir, ent.name));
} else if (ent.isFile()) {
const rel = toPosix(relative(projectRoot, join(absDir, ent.name)));
if (rel) out.push(rel);
}
// Symlinks intentionally ignored — git ls-files doesn't follow them
// either, and following them is a classic recursion-bomb footgun.
}
}
walk(projectRoot);
return out;
}
/**
* Enumerate all candidate files in `projectRoot`. Tries git ls-files first;
* falls back to a recursive walk if git is unavailable or this is not a
* repo. Returns an array of project-relative POSIX paths in unspecified
* order — caller is responsible for sorting + filtering.
*/
function enumerateFiles(projectRoot) {
const fromGit = enumerateViaGit(projectRoot);
if (fromGit !== null) return fromGit;
process.stderr.write(
`scan-project: git ls-files unavailable — falling back to recursive walk\n`,
);
return enumerateViaWalk(projectRoot);
}
// ---------------------------------------------------------------------------
// Filter accounting
//
// The project-scanner.md contract requires `filteredByIgnore` to count files
// dropped *specifically* by user `.understandignore` patterns (the delta
// beyond what the hardcoded defaults would have removed). We accomplish this
// by building TWO filters:
// - `defaultOnly`: defaults only, no user patterns
// - `combined`: defaults + user patterns (createIgnoreFilter)
// and counting paths that the combined filter excludes but the defaults-only
// filter would have kept.
//
// Negation (`!pattern`) is correctly handled by the combined filter — a file
// re-included via `!` won't be in the combined-excluded set, so it WON'T be
// counted in filteredByIgnore (it's "kept", not "additionally filtered").
// ---------------------------------------------------------------------------
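/*
Example (illustrative only): the dual-filter accounting described above, reduced to plain predicates. `countUserDrops` and the two arrow functions are hypothetical stand-ins for the real IgnoreFilter objects; each predicate returns true when a path is ignored.

```javascript
// Hypothetical sketch: count drops attributable only to user patterns.
function countUserDrops(paths, isIgnoredCombined, isIgnoredByDefaults) {
  let filteredByIgnore = 0;
  const kept = [];
  for (const p of paths) {
    if (!isIgnoredCombined(p)) {
      kept.push(p);
    } else if (!isIgnoredByDefaults(p)) {
      // Excluded by the combined filter but not by the defaults alone.
      filteredByIgnore++;
    }
  }
  return { kept, filteredByIgnore };
}

// Defaults drop dist/. The user adds docs/ and re-includes docs/api.md.
const defaults = (p) => p.startsWith('dist/');
const combined = (p) =>
  defaults(p) || (p.startsWith('docs/') && p !== 'docs/api.md');

const result = countUserDrops(
  ['src/a.js', 'dist/a.js', 'docs/guide.md', 'docs/api.md'],
  combined,
  defaults,
);
console.log(result);
// { kept: [ 'src/a.js', 'docs/api.md' ], filteredByIgnore: 1 }
```

Only `docs/guide.md` is counted: `dist/a.js` is a baseline default drop, and the negated `docs/api.md` is kept rather than filtered.
*/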
// Static import (hoisted in ES modules) used in place of require(), which
// is not defined in an ES module unless one is created via createRequire.
import { tmpdir as osTmpdir } from 'node:os';
/**
* Build a defaults-only IgnoreFilter: the same patterns createIgnoreFilter
* would apply, minus any user .understandignore content. We synthesize this
* via a temp directory path with no .understandignore files so the core
* function still drives the matcher. (Re-implementing the ignore-package
* wiring here would risk subtle behavior drift from core's matcher.)
*/
function buildDefaultsOnlyFilter() {
// Call createIgnoreFilter with a path that we KNOW has no .understandignore.
// A fresh `os.tmpdir()`-based path guarantees no user patterns leak in.
// The directory doesn't need to exist on disk because createIgnoreFilter
// only checks existsSync() before reading.
const fakeProjectRoot = join(
osTmpdir(),
`ua-scan-defaults-${process.pid}-${Date.now()}`,
);
return createIgnoreFilter(fakeProjectRoot);
}
/**
* Determine whether `projectRoot` has any user .understandignore files.
* When neither file exists, the combined and defaults-only filters are
* identical, so we can skip the dual-filter accounting entirely.
*/
function hasUserIgnoreFile(projectRoot) {
return (
existsSync(join(projectRoot, '.understandignore'))
|| existsSync(join(projectRoot, '.understand-anything', '.understandignore'))
);
}
// ---------------------------------------------------------------------------
// Line counting
// ---------------------------------------------------------------------------
/**
* Count newline-delimited lines in a file. Returns the number of `\n`
* characters; this matches `wc -l` semantics (which counts newlines, not
* "lines of content"). Files without a trailing newline therefore report
* one fewer than the visible line count — same behavior as wc.
*
* Per-file failure: emits a Warning: and returns null. Caller decides
* whether to drop the file or keep it with sizeLines=0.
*/
function countLines(absPath, posixPath) {
try {
const buf = readFileSync(absPath);
// Manual newline count beats split('\n').length on large files — no
// intermediate array allocation. We count the `\n` byte (0x0a) directly.
let count = 0;
for (let i = 0; i < buf.length; i++) {
if (buf[i] === 0x0a) count++;
}
return count;
} catch (err) {
process.stderr.write(
`Warning: scan-project: ${posixPath} — line count failed ` +
`(${err.message}) — file skipped from output\n`,
);
return null;
}
}
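/*
Example (illustrative only): the `wc -l` semantics described above, shown on in-memory buffers. `countNewlines` is a hypothetical helper that repeats the byte loop from countLines without the file read or error handling.

```javascript
// Hypothetical helper: counts 0x0a bytes, like the loop in countLines.
function countNewlines(buf) {
  let count = 0;
  for (let i = 0; i < buf.length; i++) {
    if (buf[i] === 0x0a) count++;
  }
  return count;
}

console.log(countNewlines(Buffer.from('a\nb\n')));     // 2
console.log(countNewlines(Buffer.from('a\nb')));       // 1 (two visible lines)
console.log(countNewlines(Buffer.from('')));           // 0
console.log(countNewlines(Buffer.from('a\r\nb\r\n'))); // 2 (CRLF still ends in \n)
```
*/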
// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------
async function main() {
const [, , projectRoot, outputPath] = process.argv;
if (!projectRoot || !outputPath) {
process.stderr.write(
'Usage: node scan-project.mjs <projectRoot> <outputPath>\n',
);
process.exit(1);
}
if (!existsSync(projectRoot)) {
process.stderr.write(
`scan-project.mjs failed: projectRoot does not exist: ${projectRoot}\n`,
);
process.exit(1);
}
const projectRootStat = statSync(projectRoot);
if (!projectRootStat.isDirectory()) {
process.stderr.write(
`scan-project.mjs failed: projectRoot is not a directory: ${projectRoot}\n`,
);
process.exit(1);
}
// 1. Enumerate. Either git ls-files or recursive walk.
const candidates = enumerateFiles(projectRoot);
// 2. Filter via createIgnoreFilter (defaults + user .understandignore).
// Build a defaults-only filter alongside it to count user-driven drops.
const combined = createIgnoreFilter(projectRoot);
const userIgnoresPresent = hasUserIgnoreFile(projectRoot);
const defaultsOnly = userIgnoresPresent ? buildDefaultsOnlyFilter() : combined;
let filteredByIgnore = 0;
const kept = [];
for (const rel of candidates) {
const isIgnoredCombined = combined.isIgnored(rel);
if (!isIgnoredCombined) {
kept.push(rel);
continue;
}
// Dropped by combined filter. If defaults-only would have ALSO dropped
// it, this is a baseline default drop — not counted. If defaults-only
// would have KEPT it, this drop is attributable to the user's
// .understandignore content.
if (userIgnoresPresent && !defaultsOnly.isIgnored(rel)) {
filteredByIgnore++;
}
}
// 3. Per-file: language + category + line count.
// Drop files that fail line counting (per-file resilience).
const fileEntries = [];
for (const rel of kept) {
const absPath = join(projectRoot, rel);
// Stat first — git ls-files could include paths that vanished between
// listing and processing; the walker shouldn't but defensive anyway.
try {
const st = statSync(absPath);
if (!st.isFile()) {
// Symlinks-to-dir, special files, etc. — skip silently. Not a
// warning condition because git wouldn't have tracked it as a file.
continue;
}
} catch (err) {
process.stderr.write(
`Warning: scan-project: ${rel} — stat failed (${err.message}) ` +
`— file skipped from output\n`,
);
continue;
}
const sizeLines = countLines(absPath, rel);
if (sizeLines === null) {
// countLines already emitted the Warning: line.
continue;
}
fileEntries.push({
path: rel,
language: detectLanguage(rel),
sizeLines,
fileCategory: detectCategory(rel),
});
}
// 4. Determinism: sort by path using String.prototype.localeCompare
// (ordering follows the runtime's default locale).
fileEntries.sort((a, b) => a.path.localeCompare(b.path));
// 5. Stats.
const byCategory = {};
const byLanguage = {};
for (const f of fileEntries) {
byCategory[f.fileCategory] = (byCategory[f.fileCategory] || 0) + 1;
byLanguage[f.language] = (byLanguage[f.language] || 0) + 1;
}
const estimatedComplexity = estimateComplexity(fileEntries.length);
const output = {
scriptCompleted: true,
files: fileEntries,
totalFiles: fileEntries.length,
filteredByIgnore,
estimatedComplexity,
stats: {
filesScanned: fileEntries.length,
byCategory,
byLanguage,
},
};
writeFileSync(outputPath, JSON.stringify(output, null, 2), 'utf-8');
if (!existsSync(outputPath)) {
throw new Error(`output file missing after write: ${outputPath}`);
}
process.stderr.write(
`scan-project: filesScanned=${fileEntries.length} ` +
`filteredByIgnore=${filteredByIgnore} ` +
`complexity=${estimatedComplexity}\n`,
);
}
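/*
Example (illustrative only): step 4 of main() sorts with localeCompare, whose ordering follows the runtime's locale data. The sketch below shows a locale-independent alternative that compares UTF-16 code units. `compareCodeUnits` is a hypothetical name, and this is not the ordering the scanner uses.

```javascript
// Hypothetical comparator: plain code-unit ordering, no locale involved.
function compareCodeUnits(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

const paths = ['src/b.js', 'README.md', 'src/a.js', 'Makefile'];
console.log([...paths].sort(compareCodeUnits));
// [ 'Makefile', 'README.md', 'src/a.js', 'src/b.js' ]
```

Code-unit order places uppercase ASCII before lowercase, which can differ from locale-aware order, so switching comparators would change the emitted file order.
*/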
// ---------------------------------------------------------------------------
// Run only when executed directly as a CLI; importing the module (e.g. from
// tests) must not trigger main().
//
// Canonicalize both sides through realpathSync. Node ESM resolves
// import.meta.url through symlinks but pathToFileURL(process.argv[1]) preserves
// them, so a raw equality check silently no-ops when the script is invoked via
// a symlinked plugin install path (the default in Claude Code / Copilot CLI
// caches). See GitHub issue #162.
// ---------------------------------------------------------------------------
function isCliEntry() {
if (!process.argv[1]) return false;
try {
const modulePath = realpathSync(fileURLToPath(import.meta.url));
const argvPath = realpathSync(process.argv[1]);
return modulePath === argvPath;
} catch {
return false;
}
}
if (isCliEntry()) {
try {
await main();
} catch (err) {
process.stderr.write(`scan-project.mjs failed: ${err.message}\n${err.stack}\n`);
process.exit(1);
}
}
// Default export of helpers for testability.
export default {
detectLanguage,
detectCategory,
estimateComplexity,
};
@@ -0,0 +1,14 @@
import { defineConfig } from 'vitest/config';
// The plugin package no longer ships any test files — they were relocated
// to the repo-root `tests/` tree so they no longer ride along with the
// plugin marketplace bundle. This config exists solely to shadow the
// repo-root vitest.config.ts (which would otherwise be inherited via
// upward config discovery from this cwd) and explicitly resolve no tests.
//
// Run skill tests from the repo root with `pnpm test` instead.
export default defineConfig({
test: {
include: [],
},
});
+25
@@ -0,0 +1,25 @@
import { defineConfig } from 'vitest/config';
// Single-config aggregation for the whole monorepo. Picks up:
// - tests/** — relocated skill tests (out-of-plugin so they
// do not ship via the marketplace bundle)
// - understand-anything-plugin/src/** — skill TS source tests
// - understand-anything-plugin/packages/dashboard/** — dashboard utils tests
//
// The `@understand-anything/core` package owns its own vitest.config.ts and is
// invoked separately via `pnpm --filter @understand-anything/core test`; its
// files are excluded here to avoid double-counting.
export default defineConfig({
test: {
include: [
'tests/**/*.test.{js,mjs,ts}',
'understand-anything-plugin/src/**/*.test.{js,mjs,ts}',
'understand-anything-plugin/packages/dashboard/**/*.test.{js,mjs,ts,tsx}',
],
exclude: [
'**/node_modules/**',
'**/dist/**',
'understand-anything-plugin/packages/core/**',
],
},
});