3 Commits

  • fix(lib): retry rename on Windows EPERM/EACCES/EBUSY in writeBridgeAtomic
    PR #1983 round 1 introduced unique-suffix tmp paths so two concurrent
    writers no longer share a single `.tmp` file. That fix is correct
    under POSIX semantics — `rename(2)` is atomic between source and
    destination, so each writer renames onto the same target without
    conflict.
    
    Windows `MoveFileExW` is not the same. It fails with
    EPERM / EACCES / EBUSY when the target is currently being renamed
    by *another* process — a short race window that fires reliably under
    this hook's PostToolUse + statusline concurrency. Round 1's CI run
    made this visible:
    
      Test (windows-latest, Node 18.x, npm) — FAILURE
      Error: EPERM: operation not permitted, rename
        'C:\…\ecc-metrics-test-bridge-race-….json.9504.4aef575a.tmp' ->
        'C:\…\ecc-metrics-test-bridge-race-….json'
          at writeBridgeAtomic (scripts/lib/session-bridge.js:79:8)
    
    All nine Windows matrix cells (Node 18 / 20 / 22 × npm / pnpm / yarn)
    hit the same path. POSIX matrices (Linux + macOS) passed unchanged.
    
    Fix: extract a `renameWithRetry(tmp, target)` helper that retries
    `fs.renameSync` up to 5 times on EPERM / EACCES / EBUSY with
    exponential backoff (20 ms → 320 ms total). Other error codes
    (ENOENT, ENOSPC, EROFS, …) re-throw on the first attempt — they are
    not transient. POSIX runs hit the first try and exit immediately.
    
    The backoff uses `Atomics.wait` on a throwaway `SharedArrayBuffer`
    so the retry path does not busy-spin the CPU; verified on Node ≥ 17
    that this works on the main thread. There is a `try/catch` fallback
    to a brief busy-wait for older runtimes where `Atomics.wait` is
    restricted to workers.
    
    `writeBridgeAtomic` calls the helper instead of `fs.renameSync` and
    keeps its existing best-effort tmp cleanup on terminal failure.
    
    `renameWithRetry` is added to `module.exports` so the companion
    `writeWarnState` in `scripts/hooks/ecc-context-monitor.js` can
    adopt the same retry policy without duplicating the helper. That
    adoption lands in the next commit.
    
    Local: `node tests/lib/session-bridge.test.js` 14/14, `yarn test`
    green, `yarn lint` clean. The round-1 test (two concurrent child
    writers, 200 iterations each) now passes on macOS without retrying
    at all (POSIX path) and is expected to pass on Windows via the new
    retry loop.
  • fix(lib): use unique tmp suffix in writeBridgeAtomic to eliminate ENOENT race
    `writeBridgeAtomic` wrote to a fixed `${target}.tmp` path before
    calling `renameSync`. When two processes write to the same session
    bridge concurrently (e.g. PostToolUse `ecc-metrics-bridge` + the
    background `ecc-statusline`, both calling `writeBridgeAtomic(sessionId, ...)`),
    the canonical atomic-rename race fires:
    
      1. Process A: writeFileSync(target.tmp, JSON_A) — tmp file exists.
      2. Process B: writeFileSync(target.tmp, JSON_B) — tmp file overwritten.
      3. Process A: renameSync(target.tmp, target) — succeeds; target = JSON_B
         (A's payload silently corrupted en-route).
      4. Process B: renameSync(target.tmp, target) — throws ENOENT (the
         rename consumed the file).
    
    Every caller in the repo wraps `writeBridgeAtomic` in `try {} catch {}`,
    so the ENOENT exception is swallowed and the user-visible symptom is
    just "the bridge file occasionally contains the wrong process's
    payload" with no diagnostic.
    
    Reproduced before this commit:
    
      $ # two concurrent writers, each calling writeBridgeAtomic 500 times
      $ # against the same session ID
      [A] errors=244   # 244 ENOENT exceptions swallowed
      [B] errors=248   # ditto
    
    After this commit the same workload reports 0 errors in both
    subprocesses: tmp paths no longer collide.
    
    Fix: change `${target}.tmp` to
    `${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`,
    matching the pattern already used by `writeCostWarningIfChanged` in
    `scripts/hooks/ecc-metrics-bridge.js` (commit 9b1d8918). The pid +
    4-byte nonce gives each writer process a distinct tmp path, so step 2
    above no longer overwrites step 1's payload and step 4 no longer
    races step 3.
    
    Also added: on `renameSync` failure, attempt `fs.unlinkSync(tmp)` so
    a writer that fails (disk full, permission, parent dir gone) does
    not leak its tmp file. The cleanup is best-effort and the original
    error is still re-thrown.
    
    **Scope clarification.** This commit closes the atomic-rename
    primitive's race only. The *read-modify-write* race in callers —
    two writers each read the same bridge state, increment, and write
    back, the second clobbering the first — is a separate concern that
    needs locking or per-writer logs, and is intentionally out of scope
    for this PR. The cost-tracker / metrics-bridge callers tolerate
    last-writer-wins on their cumulative aggregates today and this
    commit does not change that contract.
    
    The companion `writeWarnState` in `ecc-context-monitor.js` has the
    same fixed-suffix pattern and the same race; that fix lands in the
    next commit so each can be reviewed against its own diff.
  • feat: add ECC statusline observability hooks
    Salvages the useful statusline/context monitor work from stale PR #1504 while preserving the current continuous-learning hook runner wiring.
    
    Adds the metrics bridge, context monitor, statusline script, shared cost/session bridge utilities, and tests. Fixes the reviewed false loop-detection hash collision for non-file tools, avoids default-session cost inflation, sanitizes statusline task lookup, and records hook payload session IDs in cost-tracker.