# BLADE-SWARM Simulation Audit
## Defense-Grade and Aerospace-Grade Verification and Validation Review

**Artifact under review**: `blade-swarm-sim.html` (browser-based AUTHREX-SWARM simulator, v1.0, 1,250 lines)
**Companion artifacts referenced**: WP-2026-08, ICD-SWARM-001, AUTHREX_SWARM.tla
**Reviewer role**: Senior simulation engineer / DoD systems reviewer / DARPA technical evaluator / mission assurance / V&V
**Review date**: May 2026
**Review method**: Direct source inspection of the decision engine, tick loop, scenario model, ledger, and rendering path. Findings marked **[VERIFIED]** are tied to specific line numbers where a code location exists and were confirmed in code, not inferred from documentation.

> **Evidence classification used throughout:**
> **[VERIFIED]** confirmed by reading the source.
> **[ASSUMPTION]** my working assumption about intent, stated as such.
> **[SPECULATION]** plausible but unconfirmed.
> Nothing in this audit is asserted as correct merely because the simulation runs.

---

## Executive Technical Assessment

The artifact is a **high-quality illustrative demonstrator** and a **low-rigor simulation**. Those are two different things, and the gap between them is the central finding of this audit.

The single most important finding, stated plainly:

> **[VERIFIED] The headline metric — the per-tier decision distribution that every scenario reports — is not emergent. It is drawn directly from a hardcoded probability table (`SCENARIOS[x].expected`) by a single `Math.random()` call (lines 914–919). The simulated node states (healthy / compromised / lost) and the sub-quorum structure are computed and animated, but they are never read by the decision function. `runDecision()` consults no node state, no vote, and no quorum threshold (verified: lines 904–961 reference none of `status`, `healthy`, `compromised`, `lost`, `members`, or any 2f+1 tally).**

In other words, the simulator *replays* the results the paper claims rather than *producing* them. You can lose every node in S2 or compromise every node in S4 and the reported tier distribution will be statistically unchanged, because the outcome is sampled from `expected`, not derived from the swarm. The attrition and Byzantine logic change what is drawn on the canvas; they do not change what the simulation concludes.

For a website figure, a conference booth, or an explainer embedded in the paper, this is acceptable and even good. For a defense-grade or aerospace-grade simulation that a DARPA or NASA reviewer would open and inspect, it is not, because the quantitative claims it appears to substantiate are in fact pre-loaded constants. A reviewer who reads the source will discount every number the tool displays.

This does not mean the project is weak. It means the *browser simulator* is currently a visualization, and the rigorous simulation described in WP-2026-08 (agent-based, model-checked) must either (a) be the actual evidence base, with the browser tool clearly labeled as an illustration, or (b) be ported into the browser tool so that outcomes emerge from the protocol. Option (b) is the higher-value path and is the spine of the roadmap below.

**Bottom line:** strong communication artifact, not yet a simulation. The fixes are well-defined and achievable.

---

## Architecture Review

**[VERIFIED] Structure.** Single-file HTML/CSS/JS, no external dependencies. Clear separation: scenario model (`SCENARIOS`), global `state`, node initializer (`initNodes`), decision engine (`runDecision`), physics/animation (`tick`, `draw`, `animatePipeline`), and DOM sync (`updateMetrics`, `renderSubQuorums`, `addLedgerEntry`). The code is readable and the module boundaries are sensible.

**[VERIFIED] Control flow.** A single `requestAnimationFrame` loop (`tick`, line 1099) advances sim time by a fixed 0.05 s per frame and fires a decision every 6 ticks (0.3 s of sim time). The step size is fixed, which is appropriate for a demonstrator, but the step rate follows the display frame rate (see weakness 2).

**Weaknesses:**

1. **[VERIFIED] No coupling between state and decisions.** This is the architectural defect. The node and sub-quorum data structures exist (lines 740–788) and are mutated by attrition (1106–1117) and compromise (1119–1127), but the decision path ignores them entirely. The architecture has a "model" and a "view" but no data flow from model to outcome. A coherent simulation architecture would route node health into the consensus tally and let the tally drive the tier.

2. **[VERIFIED] Frame-rate-coupled time.** Sim time advances per animation frame (line 1102). On a throttled or backgrounded tab, `requestAnimationFrame` slows, so wall-clock and sim-clock diverge and the "60 s mission" becomes variable. A simulation clock should be decoupled from the render clock.

3. **[VERIFIED] Sub-quorum sizing contradicts the formal artifacts.** `initNodes` sets sub-quorum size to `round(sqrt(N))` (line 772), giving 3 for N=10. The paper, ICD, and TLA+ specify the Byzantine sub-quorum as 2f+1, i.e. 7 for N=10 (f = 3). The simulator and the formal specification disagree on the core parameter of the protocol. A reviewer cross-checking the artifacts will catch this.

4. **[VERIFIED] Scale has no functional effect.** Because outcomes are sampled from `expected`, N=10 and N=500 produce the same statistical distribution. The headline capability claim ("scales from N=10 to N=500") is demonstrated visually but not functionally. Scaling N changes the picture, not the result.

**Verdict:** internally tidy, but not coherent as a simulation because the model never informs the outcome.
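
The size gap in weakness 3 can be tabulated directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes the standard BFT bound N ≥ 3f+1 (so f = ⌊(N−1)/3⌋), which reproduces the 2f+1 = 7 figure quoted above for N=10. If the formal artifacts define f differently, the second function should follow them instead.

```javascript
// Sub-quorum size as the audit reports the simulator computing it: round(sqrt(N)).
function simulatorQuorumSize(n) {
  return Math.round(Math.sqrt(n));
}

// Sub-quorum size under the 2f+1 rule.
// ASSUMPTION: f is the largest fault count tolerable with N >= 3f+1.
function bftQuorumSize(n) {
  const f = Math.floor((n - 1) / 3);
  return 2 * f + 1;
}

for (const n of [10, 50, 100, 500]) {
  console.log(`N=${n}: sqrt rule=${simulatorQuorumSize(n)}, 2f+1 rule=${bftQuorumSize(n)}`);
}
```

Under these assumptions the two rules diverge further as N grows (22 versus 333 at N=500), so the disagreement is not confined to the N=10 case.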

---

## Functional Verification Findings

| # | Issue | Impact | Root cause | Exact correction |
|---|-------|--------|------------|------------------|
| F1 | **[VERIFIED]** Tier outcome sampled from `expected` table, not computed | All quantitative claims are predetermined | `runDecision` lines 914–919 use only `rnd()` and `sc.expected` | Replace the sampling block with a real tally: collect votes from healthy in-quorum nodes, reject compromised votes, require 2f+1 to commit at T3, downgrade tier when the healthy quorum margin shrinks |
| F2 | **[VERIFIED]** `packetLoss` defined in all 5 scenarios but never read in any logic (only in descriptions) | S5's defining condition ("40% packet loss") changes nothing measurable | No code path consumes `sc.packetLoss` | Model message delivery per link: drop each vote with probability `packetLoss`; recompute whether quorum is still reachable |
| F3 | **[VERIFIED]** Ledger hash is `hashish(decisions*0x9e3779b9 + tick)` (line 952), not chained and not SHA-256 | The audit ledger's headline property (tamper-evident chain) is absent | `hashish` (line 734) is a truncated integer scramble; no entry incorporates the prior entry's hash | Compute `hash_k = SHA256(hash_{k-1} || payload_k)` via Web Crypto `crypto.subtle.digest`; store and display the real chain |
| F4 | **[VERIFIED]** No ECDSA signing anywhere | The "ECDSA-signed audit ledger" claim is unbacked in the tool | Not implemented | Generate a P-256 keypair per node with Web Crypto `crypto.subtle.generateKey`, sign each commit, verify on aggregation |
| F5 | **[VERIFIED]** Byzantine compromise has no consensus consequence | S4's "MAIVA rejects compromised votes / CARA isolates" is asserted, not simulated | Compromise sets `.status` (1124) but no vote rejection or isolation logic exists in the decision path | Implement vote rejection and CARA isolation that actually removes compromised nodes from the tally and logs an isolation event |
| F6 | **[VERIFIED]** FLAME latency derives from already-chosen tier (line 940, `A` keyed on `tier`) | Causality is backwards: deliberation time should influence the decision, not be a cosmetic function of it | `A = {T3:1.0,...}[tier]` after tier is already fixed | Compute the FLAME window first from threat density, use it as the deliberation budget that bounds how many votes can be gathered, then let the achieved quorum drive the tier |
| F7 | **[VERIFIED]** Attrition can drop the swarm below quorum with no effect | Quorum infeasibility is never detected; T3 commits continue on a dead swarm | No 2f+1 feasibility check (verified: none present) | After each loss, check `healthy_in_quorum >= 2f+1`; if not, force tier downgrade or abort |

**Internal consistency that does hold [VERIFIED]:** the tier-ceiling clamp (lines 922–924) correctly prevents a decision from exceeding the commander's ceiling, the metrics counters increment consistently, and the ledger sequence numbers are monotonic. These are correct as written.
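
The corrections for F1, F2, F5, and F7 share one shape: count only delivered votes from healthy nodes, then let that count set the tier. The following is a minimal sketch of that shape, not the simulator's code. The node fields, the `tallyDecision` name, and especially the downgrade ladder (2f+1 for T3, f+1 for T2, at least one vote for T1) are assumptions made for illustration; the real thresholds should come from the ICD and the TLA+ model. It also assumes compromised nodes are already identified, which sidesteps the detection problem.

```javascript
// Minimal emergent-tally sketch. All names here are hypothetical and do not
// appear in blade-swarm-sim.html.
// node.status is one of "healthy", "compromised", "lost".
function tallyDecision(quorum, f, packetLoss, rng = Math.random) {
  const needed = 2 * f + 1;
  let valid = 0;    // delivered votes from healthy nodes
  let rejected = 0; // delivered votes from compromised nodes (F5)
  let dropped = 0;  // votes lost in transit (F2)

  for (const node of quorum) {
    if (node.status === "lost") continue; // lost nodes send nothing
    if (rng() < packetLoss) { dropped++; continue; }
    if (node.status === "compromised") { rejected++; continue; }
    valid++;
  }

  // ASSUMPTION: illustrative downgrade ladder, not taken from the ICD.
  let tier;
  if (valid >= needed) tier = "T3";
  else if (valid >= f + 1) tier = "T2";
  else if (valid >= 1) tier = "T1";
  else tier = "T0"; // quorum infeasible: nothing to commit on (F7)

  return { tier, valid, needed, rejected, dropped };
}

// Quorum of seven (f = 3): four healthy, one compromised, two lost.
const quorum = [
  ...Array(4).fill({ status: "healthy" }),
  ...Array(1).fill({ status: "compromised" }),
  ...Array(2).fill({ status: "lost" }),
];
console.log(tallyDecision(quorum, 3, 0.0));
```

With no packet loss, this example yields T2 with four valid votes and one rejected vote. The existing tier-ceiling clamp would still be applied to the returned tier afterwards. The point of the sketch is only that attrition, compromise, and `packetLoss` now change the result.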

---

## Simulation Realism Analysis

**[VERIFIED] What is realistic:**
- The FLAME contraction *formula* itself (line 941, `W = max(80, 800·A·exp(-1.5·D))`) is a defensible functional form for a deliberation window that shrinks under threat density, and the 80 ms floor matches the ICD. This is the one genuine model equation in the tool.
- Attrition is gated to begin after 6 s (line 1108), which loosely mimics a mission ramp.
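
To make the shape of the FLAME formula concrete, the following sketch evaluates it exactly as quoted, with W in milliseconds. The function names are hypothetical, and A = 1.0 is used because it is the only tier weight visible in the audited excerpt (the T3 entry). The units of D are whatever the simulator uses for threat density.

```javascript
// W = max(80, 800 * A * exp(-1.5 * D)), as quoted from line 941 (milliseconds).
function flameWindowMs(A, D) {
  return Math.max(80, 800 * A * Math.exp(-1.5 * D));
}

// Threat density at which the 80 ms floor takes over:
// 800 * A * exp(-1.5 * D) = 80  =>  D = ln(10 * A) / 1.5
function floorOnsetDensity(A) {
  return Math.log(10 * A) / 1.5;
}

for (const D of [0, 0.5, 1, 1.5, 2]) {
  console.log(`D=${D}: W=${flameWindowMs(1.0, D).toFixed(1)} ms`);
}
console.log(`floor onset at D=${floorOnsetDensity(1.0).toFixed(2)}`);
```

For A = 1.0 the floor takes over once D exceeds ln(10)/1.5, roughly 1.54. Beyond that point the window no longer responds to threat density, which is worth stating alongside the formula.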

**[VERIFIED] What is not defensible:**
1. **No network model.** Real swarm consensus is dominated by message delivery, latency tails, and partition behavior. The tool has none. `packetLoss` is inert (F2). There is no propagation delay, no jitter, no partition, no reconvergence.
2. **No physical model.** Node positions drift by a tiny random velocity (line 761) but there is no mobility model, no range-dependent link budget, no LoRa-vs-WiFi plane distinction in logic (only in the description). The two communication planes from the ICD are not modeled.
3. **Compromise is instantaneous and upfront** (line 1120, fires only at `tick === 1`). Real adversarial compromise is progressive and detection has a delay distribution. ADARA detection latency is not modeled at all.
4. **Attrition is memoryless and uniform** (uniform `pick` of a victim, line 1112). Real attrition is spatially and temporally correlated (a jammer affects a region; a salvo removes several nodes at once). No correlation is modeled.
5. **Latency is a flat scaled uniform** (line 942, `W·(0.4 + rnd()·0.4)`). Real consensus latency is heavy-tailed (p99 >> p50). A uniform distribution understates tail risk, which is exactly the risk a mission-assurance reviewer cares about.

**[ASSUMPTION]** The `expected` distributions were chosen to match WP-2026-08's reported aggregate results. If so, the realism concern is compounded: the tool cannot serve as independent confirmation of the paper because it was tuned to the paper's answers.
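
The tail-risk point in item 5 can be shown numerically. The sketch below compares the uniform latency model as quoted from line 942 with a lognormal alternative. The lognormal parameters (median 0.6·W, sigma 0.5) and all function names are assumptions chosen for illustration, not values from the paper or ICD.

```javascript
// Uniform model as quoted from line 942: W * (0.4 + rnd() * 0.4).
function uniformLatency(W, rng) {
  return W * (0.4 + rng() * 0.4);
}

// Standard normal sample via the Box-Muller transform.
function standardNormal(rng) {
  const u1 = 1 - rng(); // in (0, 1], keeps the logarithm finite
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// ASSUMPTION: lognormal with median 0.6 * W (the uniform model's midpoint)
// and sigma = 0.5. Both values are placeholders.
function lognormalLatency(W, rng, sigma = 0.5) {
  return 0.6 * W * Math.exp(sigma * standardNormal(rng));
}

// Nearest-rank percentile of a sample array.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Ratio of p99 to p50 over n draws from the given sampler.
function tailRatio(sampler, n = 10000) {
  const xs = Array.from({ length: n }, () => sampler());
  return percentile(xs, 99) / percentile(xs, 50);
}

console.log("uniform   p99/p50:", tailRatio(() => uniformLatency(800, Math.random)).toFixed(2));
console.log("lognormal p99/p50:", tailRatio(() => lognormalLatency(800, Math.random)).toFixed(2));
```

Analytically, the uniform model's p99/p50 ratio is 0.796/0.6, about 1.33, while a lognormal with sigma 0.5 gives about 3.2. Whatever distribution is eventually adopted, reporting this ratio makes the tail visible.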

---

## Verification and Validation Review

This is the weakest area and the one most likely to fail a formal review.

| V&V requirement | Status | Finding |
|---|---|---|
| Reproducibility | **[VERIFIED] FAIL** | `Math.random()` cannot be seeded, and no substitute PRNG, seed, or run ID was found anywhere. No run can be reproduced or replayed. |
| Traceability | **[VERIFIED] PARTIAL** | The ledger gives a per-decision trace within a run, but with no seed the trace cannot be regenerated, and the hash is not a real chain (F3). |
| Meaningful metrics | **[VERIFIED] FAIL as evidence** | The tier distribution is an input constant, not a measurement, so it cannot validate anything. |
| Test objectives / pass-fail criteria | **[VERIFIED] ABSENT** | No declared objective, no expected output independent of `expected`, no pass/fail gate. |
| Benchmark comparison | **[VERIFIED] ABSENT** | No comparison against a reference consensus implementation, a known BFT result, or the TLA+ model's reachable-state predictions. |
| Cross-artifact agreement | **[VERIFIED] FAIL** | Sub-quorum size disagrees with the paper/ICD/TLA+ (sqrt(N) vs 2f+1). |

**Required to reach V&V credibility:**
1. A seeded PRNG (e.g. mulberry32 / xorshift128+) with the seed shown in the UI and embedded in every exported run.
2. A declared test matrix: for each scenario, the objective, the independent expected behavior (e.g. "S4: zero compromised votes appear in any commit"), and an automatic pass/fail check.
3. A reference oracle: a minimal, separately written PBFT-style tally to compare against (a Raft-style tally would cover crash faults only, not Byzantine ones), so the simulator's emergent distribution can be checked rather than asserted.
4. A replay file format (seed + scenario + N + ceiling + event log) that regenerates a run bit-for-bit.
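
Items 1 and 4 can be sketched together. The block below uses mulberry32, a small public-domain 32-bit PRNG, and a toy run record. The record field names follow the replay format suggested above, but `makeRunRecord` and `replayMatches` are hypothetical helpers, and the "events" here are just raw PRNG draws standing in for a real event log.

```javascript
// mulberry32: small seedable PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical run record: seed + scenario + N + ceiling + event log.
function makeRunRecord(seed, scenario, n, ceiling, steps) {
  const rng = mulberry32(seed);
  const events = [];
  for (let i = 0; i < steps; i++) events.push(rng()); // stand-in for real events
  return { seed, scenario, n, ceiling, events };
}

// Regenerate the run from its own header and compare event by event.
function replayMatches(record) {
  const again = makeRunRecord(
    record.seed, record.scenario, record.n, record.ceiling, record.events.length
  );
  return again.events.every((v, i) => v === record.events[i]);
}

const record = makeRunRecord(20260501, "S4", 10, "T3", 100);
console.log("replay identical:", replayMatches(record));
```

Every `Math.random()` call in the simulator would need to be routed through the seeded generator for this to hold; a single unseeded call breaks bit-for-bit replay.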

---

## Failure Mode Review (FMEA)

Mission-style FMEA of the **modeled system** (the swarm) and, where noted, of the **simulator itself**. Severity 1 = catastrophic to mission, 5 = negligible.

| ID | Failure mode | Cause | Effect | Sev | Detection (current) | Mitigation (recommended) |
|----|--------------|-------|--------|-----|---------------------|--------------------------|
| FM1 | Quorum loss undetected | Attrition below 2f+1 | T3 commits on a swarm that cannot form a valid quorum | 1 | **[VERIFIED] None** | Add 2f+1 feasibility gate (F7); auto-downgrade to T1/T0 |
| FM2 | Byzantine vote accepted | Compromised node votes counted | Adversary steers a commit | 1 | **[VERIFIED] None in logic** | Implement vote rejection + signature verification (F4, F5) |
| FM3 | Ledger tamper | Entry altered post-hoc | Audit trail no longer trustworthy | 1 | **[VERIFIED] None (hash not chained)** | Real SHA-256 chain + per-entry ECDSA signature (F3, F4) |
| FM4 | Communication partition | Network split | Sub-swarms diverge, double-commit | 2 | **[VERIFIED] Not modeled** | Add partition model + reconvergence + conflict detection |
| FM5 | Latency tail breach | Heavy-tailed consensus delay | Decision misses its FLAME window | 2 | Partial (latency shown, but uniform) | Heavy-tailed latency model + window-breach counter |
| FM6 | Sensor corruption | Spoofed observation | Bad input enters SATA/IFF | 2 | **[VERIFIED] Not modeled** | Add input-validation stage + sensor confidence scoring |
| FM7 | Operator error | Wrong tier ceiling set | Over- or under-authorization | 3 | Ceiling clamp works | Add confirm step for T3 ceiling at high threat density |
| FM8 | Detection delay | ADARA lag | Compromise acts before isolation | 2 | **[VERIFIED] Not modeled (compromise instant)** | Model ADARA detection-latency distribution |
| FM9 | Cascading loss | Correlated attrition (jammer/salvo) | Rapid multi-node loss | 1 | **[VERIFIED] Not modeled (uniform)** | Spatially/temporally correlated attrition model |
| FM10 | Simulator non-reproducibility | Unseeded RNG | Cannot reproduce a finding | 1 (to V&V) | **[VERIFIED] None** | Seeded PRNG + replay file |

FM1, FM2, FM3, FM9, and FM10 are severity-1 and currently undetected. Those five are the priority set.

---

## Stress Test Findings

Based on source analysis (these are predicted from the code paths; **[VERIFIED]** where the path is unambiguous):

1. **Total attrition [VERIFIED]:** lose all nodes in S2 and the tool keeps issuing decisions sampled from `expected`; it does not detect that the swarm is gone (FM1).
2. **Full compromise [VERIFIED]:** raise `byzantineRate` to 1.0 and every node turns red, yet the tier distribution is unchanged because compromise never reaches the tally (F5).
3. **Maximum packet loss [VERIFIED]:** S5 at 40% loss is indistinguishable in metrics from S1 at 0% loss, because `packetLoss` is inert (F2).
4. **N=500 overload [SPECULATION]:** the canvas draws all nodes plus pairwise sub-quorum lines each frame; at N=500 this is likely to drop frame rate, which (because sim time is frame-coupled) silently slows the mission clock (architecture weakness 2). Worth profiling.
5. **Backgrounded tab [VERIFIED by mechanism]:** `requestAnimationFrame` throttles, so the "60 s mission" stretches in wall-clock time; results are unaffected but timing claims become unstable.
6. **Simultaneous attrition + compromise [VERIFIED]:** S4/S5 run both, but since neither feeds the outcome, the combination has no compounding effect, which is itself unrealistic (real systems degrade non-linearly under simultaneous stress).

**Where false confidence lives:** the tool looks most convincing exactly where it is least real. The dramatic red-node Byzantine scene (S4) and the lossy denied-environment scene (S5) are the two most visually persuasive scenarios and the two whose defining stressors have zero effect on the reported numbers.

---

## Telemetry and Observability Review

| Capability | Present? | Note |
|---|---|---|
| Live metrics display | **[VERIFIED] Yes** | Tier counts, decisions, CARA events, latency, ledger rate |
| Per-decision ledger view | **[VERIFIED] Yes** | Seq, tier, outcome, hash, timestamp (but hash not chained) |
| Event tracing | Partial | Decisions are logged; node-state transitions (loss, compromise, isolation) are not logged as events |
| Decision tracing | **[VERIFIED] No** | No record of *why* a tier was chosen (because nothing computes it) |
| Replay capability | **[VERIFIED] No** | No seed, no run file |
| Log export | **[VERIFIED] No** | Metrics and ledger are display-only; no download |
| Anomaly detection | **[VERIFIED] No** | ADARA is named in the pipeline animation but does no detection |
| Diagnostics / health panel | Partial | Compromised/lost counts shown; no quorum-health or window-breach indicators |

**Missing observability that a reviewer expects:** a structured event log (JSON lines) covering every state transition and decision with its causal inputs; a downloadable run record; a quorum-health time series; a latency histogram with p50/p95/p99; and a window-breach counter. None of these are present.
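
A structured event log of the kind described above needs very little machinery. The following sketch is one possible shape, under stated assumptions: the `EventLog` class, the event type strings, and the field names are all hypothetical, and the causal inputs shown are placeholders for whatever the real decision function consumes.

```javascript
// Append-only JSON-lines event log. All names are illustrative.
class EventLog {
  constructor(runId) {
    this.runId = runId;
    this.seq = 0;
    this.lines = [];
  }

  // Record one event; `data` carries the causal inputs for that event.
  record(simTime, type, data = {}) {
    const entry = { runId: this.runId, seq: this.seq++, t: simTime, type, data };
    this.lines.push(JSON.stringify(entry));
    return entry;
  }

  // One JSON object per line, newline-terminated.
  toJsonLines() {
    return this.lines.length ? this.lines.join("\n") + "\n" : "";
  }
}

const log = new EventLog("run-0001");
log.record(6.05, "node_lost", { nodeId: 4 });
log.record(6.3, "decision", { tier: "T2", validVotes: 4, needed: 7, rejected: 1 });
console.log(log.toJsonLines());
```

In the browser, the resulting string could be wrapped in a `Blob` and offered as a download, which would also close the log-export gap in the table above.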

---

## Technical Credibility Findings

Flagged claims, with the gap between claim and implementation:

1. **[VERIFIED] "MAIVA consensus" / "Byzantine-fault-tolerant":** no consensus or BFT logic exists in the decision path. The terms label an animation, not a mechanism. *This is the highest-credibility-risk item:* the protocol the project is named for is not running in the tool.
2. **[VERIFIED] "ECDSA-signed audit ledger":** no signing; hash is a non-cryptographic scramble and is not chained.
3. **[VERIFIED] "Scales to N=500":** scale changes the visualization, not the result.
4. **[VERIFIED] Scenario stressors:** attrition and compromise mutate node state but never reach the decision path, and packet loss is not modeled at all. All three are inert with respect to outcomes (F2, F5, F7).
5. **[ASSUMPTION] Result tuning:** the `expected` tables appear tuned to the paper's reported numbers; if so, the tool cannot independently corroborate the paper.
6. **[VERIFIED] Cross-artifact disagreement:** sub-quorum size (sqrt(N)) contradicts the formal 2f+1 used in the paper/ICD/TLA+.

**Language to avoid until the mechanism exists:** "demonstrates," "validates," "proves," and "shows that AUTHREX achieves." Until outcomes are emergent, the honest verb is "illustrates." A reviewer who sees "demonstrates" over a tool that samples from a constant table will lose confidence in the surrounding claims too.

**What is credible as-is:** the FLAME window formula, the tier-ceiling enforcement, the pipeline staging, and the overall conceptual model. These are sound; they are just not yet wired to the outcome.

---

## Advanced Capability Recommendations

| Capability | Operational value | Implementation idea | Research value |
|---|---|---|---|
| Real BFT tally | Outcomes become emergent and defensible | Per-node vote with 2f+1 threshold; reject non-verified signatures | Converts the tool from figure to evidence |
| Seeded PRNG + replay | Reproducibility, the precondition for V&V | mulberry32 seeded from a UI field; export seed+events | Enables Monte Carlo studies and peer replication |
| Network/partition model | Realistic denied-environment behavior | Per-link delivery prob, delay, partition events; reconverge | Lets S5 mean something; supports CAP-style analysis |
| Heavy-tailed latency | Honest tail-risk reporting | Lognormal or Pareto consensus delay; p99 tracking | Mission-assurance relevance |
| ADARA detection-latency model | Realistic compromise dynamics | Detection delay distribution; progressive compromise | Studies the detect-isolate race |
| Correlated attrition | Realistic jammer/salvo loss | Spatial kernel + burst events | Resilience envelope estimation |
| Monte Carlo batch mode | Distributions instead of single runs | Headless loop over N seeds; CSV out | Confidence intervals on every metric |
| TLA+ cross-check harness | Ties sim to the formal model | Assert sim never violates S1–S5 invariants at runtime | Connects the two strongest artifacts |
| Sensor confidence scoring | Input-quality awareness | Per-observation confidence into SATA/IFF | Explainability and degraded-input handling |
| Structured event log + viewer | Auditability and debugging | JSON-lines event stream + timeline scrubber | Observability to review standard |

---

## Required Improvement Roadmap

### Critical Fixes (do first; these are correctness/realism blockers)

1. **Make outcomes emergent.** Replace the `expected`-table sampling (lines 914–919) with a real per-decision tally over healthy, in-quorum, signature-verified nodes; require 2f+1 for a T3 commit; downgrade as the healthy margin erodes. (Fixes F1, F5, F7, FM1, FM2.)
2. **Wire the stressors to the outcome.** Make attrition, compromise, and `packetLoss` actually affect the tally. (Fixes F2; a first step toward FM4, which item 7 completes.)
3. **Seed the PRNG and add a run ID.** Show the seed; embed it in every export. (Fixes FM10; precondition for all V&V.)
4. **Reconcile sub-quorum size to 2f+1** across the simulator, paper, ICD, and TLA+. (Fixes the cross-artifact contradiction.)
5. **Make the ledger a real chain.** SHA-256 via Web Crypto, each entry over the previous hash; relabel until then. (Fixes F3, FM3.)
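
For fix 5, a chained ledger in the sense of F3 can be sketched with the Web Crypto `digest` call. This is a minimal illustration under stated assumptions: the all-zero genesis value, the JSON payload encoding, and the function names are hypothetical choices, and a real ledger would need a canonical serialization so that equal payloads always hash equally.

```javascript
// Web Crypto is global in browsers; the fallback is only so the sketch
// also runs under older Node.js.
const subtle = (globalThis.crypto ?? require("node:crypto").webcrypto).subtle;

const GENESIS = "0".repeat(64); // ASSUMPTION: all-zero genesis hash

function toHex(buffer) {
  return Array.from(new Uint8Array(buffer), (b) => b.toString(16).padStart(2, "0")).join("");
}

// hash_k = SHA256(hash_{k-1} || payload_k)
async function linkHash(prevHash, payload) {
  const bytes = new TextEncoder().encode(prevHash + JSON.stringify(payload));
  return toHex(await subtle.digest("SHA-256", bytes));
}

async function appendEntry(chain, payload) {
  const prevHash = chain.length ? chain[chain.length - 1].hash : GENESIS;
  chain.push({ prevHash, payload, hash: await linkHash(prevHash, payload) });
  return chain;
}

// Recompute every link; any edited payload or broken link fails the check.
async function verifyChain(chain) {
  let prevHash = GENESIS;
  for (const entry of chain) {
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== (await linkHash(prevHash, entry.payload))) return false;
    prevHash = entry.hash;
  }
  return true;
}
```

Because `digest` is asynchronous, ledger appends would need to be awaited or queued in order; appending out of order would break the chain.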

### High Priority

6. ECDSA P-256 signing/verification per commit (Web Crypto). (F4.)
7. Network model: per-link delivery probability, delay, and partition with reconvergence. (FM4.)
8. Heavy-tailed latency + p50/p95/p99 tracking and a window-breach counter. (FM5.)
9. Declared test matrix with independent pass/fail criteria per scenario, plus a reference oracle for cross-check.
10. Structured event log (JSON lines) + downloadable run record.
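
Item 6 maps onto three Web Crypto calls: `generateKey`, `sign`, and `verify`. The sketch below shows them for ECDSA over P-256 with SHA-256. The helper names and the JSON encoding of a commit are assumptions for illustration; key distribution and storage are out of scope here.

```javascript
// Web Crypto is global in browsers; the fallback is only so the sketch
// also runs under older Node.js.
const subtle = (globalThis.crypto ?? require("node:crypto").webcrypto).subtle;

const SIGN_PARAMS = { name: "ECDSA", hash: "SHA-256" };

// One P-256 keypair per node; the private key is marked non-extractable.
async function makeNodeKeys() {
  return subtle.generateKey({ name: "ECDSA", namedCurve: "P-256" }, false, ["sign", "verify"]);
}

// ASSUMPTION: commits are serialized with JSON.stringify before signing.
function encodeCommit(commit) {
  return new TextEncoder().encode(JSON.stringify(commit));
}

async function signCommit(privateKey, commit) {
  return subtle.sign(SIGN_PARAMS, privateKey, encodeCommit(commit));
}

// Resolves to true only if the signature matches this key and this commit.
async function verifyCommit(publicKey, signature, commit) {
  return subtle.verify(SIGN_PARAMS, publicKey, signature, encodeCommit(commit));
}
```

On aggregation, a vote whose signature fails `verifyCommit` would simply be excluded from the tally, which is what gives the rejection path in F5 a concrete mechanism.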

### Medium Priority

11. Decouple sim clock from render clock (fixed-step integrator independent of `requestAnimationFrame`).
12. ADARA detection-latency model and progressive (not instantaneous) compromise.
13. Correlated/burst attrition model.
14. Quorum-health and latency-histogram panels in the UI.
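
Item 11 is usually handled with a fixed-step accumulator: the render callback reports how much wall-clock time has passed, and the simulation consumes it in whole steps. The sketch below is one such arrangement. The names, the 1:1 mapping of sim time to wall time, and the catch-up cap are assumptions; only the 0.05 s step comes from the audited code.

```javascript
const STEP_MS = 50;            // fixed sim step: 0.05 s, as in the audited tick loop
const MAX_CATCHUP_STEPS = 200; // ASSUMPTION: cap on steps per call after a long stall

function makeClock() {
  return { ticks: 0, simTime: 0, accumulatorMs: 0, lastWallMs: null };
}

// Consume elapsed wall-clock time in whole fixed steps.
// Returns the number of sim steps executed during this call.
function advance(clock, nowMs, stepFn) {
  if (clock.lastWallMs === null) clock.lastWallMs = nowMs;
  clock.accumulatorMs += nowMs - clock.lastWallMs;
  clock.lastWallMs = nowMs;

  let steps = 0;
  while (clock.accumulatorMs >= STEP_MS && steps < MAX_CATCHUP_STEPS) {
    stepFn(STEP_MS / 1000); // step size never depends on frame rate
    clock.accumulatorMs -= STEP_MS;
    clock.ticks++;
    clock.simTime = (clock.ticks * STEP_MS) / 1000;
    steps++;
  }
  return steps; // any remainder stays in the accumulator for the next call
}
```

In the browser, `advance(clock, timestamp, step)` would be called from the `requestAnimationFrame` callback with the timestamp it provides, followed by a single `draw()`. A slow or throttled frame then runs several sim steps instead of stretching the mission clock.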

### Long-Term Enhancements

15. Monte Carlo batch mode with confidence intervals.
16. Runtime TLA+ invariant checking (assert S1–S5 never violated during a run).
17. Mobility + range-dependent dual-plane (LoRa/WiFi) link model.
18. Explainability layer: per-decision causal trace ("committed at T2 because 4 of 7 quorum healthy, 1 vote rejected as unsigned").
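
Item 15 depends on a seeded, headless run function existing first. Assuming one does, the batch layer itself is small. In the sketch below, `runOnce(seed)` is a hypothetical function returning one scalar metric per run (for example the T3 commit fraction), and the interval uses the normal approximation, which is an assumption that weakens for small batches.

```javascript
// Mean and approximate 95% confidence interval for a batch of run results.
// Requires at least two samples.
function summarize(samples) {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const mean = samples.reduce((s, x) => s + x, 0) / n;
  const variance = samples.reduce((s, x) => s + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n); // normal approximation
  return { n, mean, low: mean - halfWidth, high: mean + halfWidth };
}

// Run the (hypothetical) simulation once per seed and summarize the metric.
function runBatch(runOnce, seeds) {
  return summarize(seeds.map((seed) => runOnce(seed)));
}

// Toy stand-in for a real seeded run: NOT a simulation, just a placeholder.
const toyRun = (seed) => (seed % 4) / 4;
console.log(runBatch(toyRun, [1, 2, 3, 4, 5, 6, 7, 8]));
```

Reporting the interval alongside the mean for every metric is what lets two runs of the tool be compared without over-reading seed-to-seed noise.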

---

## Final Readiness Assessment

**As a visualization / communication artifact:** strong. Clean, legible, and persuasive; suitable for the website and as an illustrative figure for the paper, *if labeled as illustrative*.

**As a simulation, judged against high-assurance research standards:**

> ### Early Prototype — Needs Major Revision

This grade reflects, specifically and without inflation:
- outcomes are predetermined, not emergent (F1) — the disqualifying issue for any quantitative claim;
- the named core mechanisms (BFT consensus, ECDSA ledger) are not implemented in the decision path (F4, F5, FM2, FM3);
- zero reproducibility (unseeded RNG, no replay) — a hard V&V failure;
- defining scenario stressors are inert (F2);
- a core parameter contradicts the formal artifacts (sub-quorum size).

None of these is a dead end. The conceptual model, the pipeline staging, the FLAME formula, and the tier-ceiling logic are sound foundations. Executing the five Critical Fixes would move the tool to **Strong Research Simulation**, and adding the High-Priority and V&V items (seeded Monte Carlo, reference oracle, runtime invariant checks, real crypto) would bring it to **Near High-Assurance Review Quality**.

**The most important single action:** make the tier distribution emerge from the consensus tally instead of being sampled from a constant. Until that is done, the tool illustrates the claim; after it is done, the tool tests the claim. That distinction is the entire difference between a figure and a simulation, and it is what a NASA, DoD, or DARPA reviewer will look for first.

---

*Prepared as an independent V&V review. Findings are tied to specific source locations and were verified by inspection. Quality was not assumed from the fact that the artifact runs.*
