Mirror of https://github.com/Egonex-AI/Understand-Anything.git,
synced 2026-06-22 10:58:03 +08:00
c49c46d974
A deep audit of the project-scanner → file-analyzer → merge pipeline
turned up a wide range of silent data-loss bugs. Each one alone is
small; together they were producing graphs with very few import edges,
missing sub-file nodes for non-code formats, and inconsistent metrics.

Root-cause fixes (high impact):
- project-scanner.md: extend import-pattern table to resolve absolute
imports for Python (`from a.b.c import x`), TS/JS (tsconfig.json
paths/baseUrl aliases), Java/Kotlin (`com.foo.Bar` → file paths),
Ruby (`require 'foo/bar'` load-path), PHP (composer PSR-4 namespaces),
and C/C++ (`#include` headers). Resolution was previously relative-only,
which produced empty importMap entries for the majority of real projects.
- project-scanner.md: add `.ps1`, `.bat`, `.cmd`, `.jsonc` to language
table; require non-null `language` field with an explicit fallback.
- file-analyzer.md: document `sections`, `definitions`, `services`,
`endpoints`, `steps`, `resources` in the extraction-output schema and
spell out the sub-file node-creation rules per category. These fields
were previously undocumented, so per-table / endpoint / resource nodes
were never created from SQL / OpenAPI / Terraform / K8s / Dockerfile
parser output.
- file-analyzer.md: add explicit source-reading fallback rules for
PowerShell, Batch, Bash, Swift, Kotlin (no tree-sitter coverage).
- yaml-parser: declare `kubernetes`, `docker-compose`, `github-actions`,
`openapi` languages so files the language-registry tags with those
ids actually get section extraction. Recognize quoted top-level keys
(e.g. `"on":` in GitHub Actions). Emit one section per entry for
array-root YAML documents.
- json-parser: declare `json-schema`, `openapi`; add `stripJsoncSyntax`
helper that removes line / block comments and trailing commas before
parse so `.jsonc` files (wrangler, tsconfig with comments) parse cleanly.
- shell-parser: declare `jenkinsfile`. Tighten function-detection regex
to require a reachable `{` brace so `name() echo hi` and patterns
appearing inside heredocs are no longer false-positives.
- markdown-parser: track fenced-code-block state and skip headings
inside ``` / ~~~ blocks (`# install` shell comments were being
emitted as level-1 sections).
- merge-batch-graphs.py: add `article`, `entity`, `topic`, `claim`,
`source` to VALID_NODE_PREFIXES and TYPE_TO_PREFIX so knowledge-base
node types stop being flagged unknown / coerced to `file:`. Add
`direction` to the edge dedup key so `forward` and `bidirectional`
variants of the same (src, tgt, type) don't overwrite each other.
Use a placeholder in bare-id fallback when `filePath` is missing on
function/class nodes so unrelated `parse()` functions don't merge.
- typescript-extractor: actually compute `isDefault` for default
exports (was always emitted as `false` from buildResult).
- extract-structure.mjs: match `wc -l` semantics for `totalLines` so
the scanner's `sizeLines` and the extractor's `totalLines` agree on
POSIX text files. Filter the parser-imports fallback to relative-only
so `importCount` semantics stay *internal-import* whether the scanner
resolved them or not. Drop unused `isCode` local.
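
The `direction` change to the edge dedup key in merge-batch-graphs.py can be sketched as below. This is a minimal illustration, not the script's actual code: the function name `dedup_edges`, the dict field names (`source`, `target`, `type`, `direction`), the `forward` default, and the last-one-wins policy are all assumptions.

```python
from typing import Any, Dict, List, Tuple

Edge = Dict[str, Any]


def dedup_edges(edges: List[Edge]) -> List[Edge]:
    """Deduplicate edges on (source, target, type, direction).

    Hypothetical sketch: field names and the "forward" default are
    assumptions, not taken from merge-batch-graphs.py.
    """
    seen: Dict[Tuple[str, str, str, str], Edge] = {}
    for edge in edges:
        key = (
            edge["source"],
            edge["target"],
            edge["type"],
            # Including direction keeps forward and bidirectional
            # variants of the same triple as separate entries.
            edge.get("direction", "forward"),
        )
        # A later edge with an identical key replaces the earlier one.
        seen[key] = edge
    return list(seen.values())


edges = [
    {"source": "file:a.py", "target": "file:b.py", "type": "imports",
     "direction": "forward"},
    {"source": "file:a.py", "target": "file:b.py", "type": "imports",
     "direction": "bidirectional"},
    {"source": "file:a.py", "target": "file:b.py", "type": "imports",
     "direction": "forward"},
]
print(len(dedup_edges(edges)))  # 2
```

With a key of only (source, target, type), the three edges above would collapse to one and the surviving `direction` would depend on input order.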

Tests: +19 cases covering JSONC parsing, markdown fenced-code skip,
YAML quoted-keys / array-root, shell function false-positives,
extract-structure import fallback semantics + totalLines off-by-one.
764 passing (was 745).
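
The `totalLines` off-by-one covered by these tests comes down to how lines are counted: `wc -l` counts newline characters. The sketch below is in Python for illustration only (extract-structure.mjs itself is JavaScript). The function names are hypothetical, and the split-based variant is just one plausible source of such a discrepancy, not necessarily what the extractor did before.

```python
def count_lines_wc(text: str) -> int:
    """Count lines the way `wc -l` does: one per newline character."""
    return text.count("\n")


def count_lines_split(text: str) -> int:
    """Count the pieces produced by splitting on newlines.

    For a POSIX text file (which ends with a newline) this is one
    higher than the `wc -l` count, because of the trailing empty piece.
    """
    return len(text.split("\n"))


posix_file = "a\nb\n"             # two lines, newline-terminated
print(count_lines_wc(posix_file))     # 2
print(count_lines_split(posix_file))  # 3
```

Note that under `wc -l` semantics a final line without a trailing newline is not counted, so `count_lines_wc("a\nb")` returns 1.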

Bumps version to 2.6.2 across the five tracked manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c49c46d974 · 2026-05-07 21:44:24 +08:00