Standardized evaluation methodology for the authority-governed autonomy research program. This document defines all scenarios, metrics, baselines, assumptions, and reproducibility criteria used across the governance architecture experiments.
This evaluation protocol applies to all simulation-based experiments across seven governance architectures (SATA, HMAA, CARA, MAIVA, FLAME, ADARA, ERAM) and six physical research platforms (Rover Testbed, UAV Platform, BLADE-EDGE, BLADE-AV, BLADE-MARITIME, BLADE-INFRA). It defines the methodology for assessing governance correctness, safety performance, and system behavior under adversarial conditions.
Current evaluation is simulation-based. Hardware-in-the-loop validation and physical testing are planned as future work. All claims are scoped to the simulation environment unless explicitly stated otherwise.
Seven standardized adversarial scenarios are used across all evaluations. Each scenario targets a specific sensor modality or combination to test governance response:
Each scenario runs 50 times with varied fault injection timing (uniform random within the first 30 s of each 90 s simulation run) and intensity (parameterized within scenario-specific bounds). Total: 350 runs (7 scenarios × 50) for the Rover Testbed and 250 runs (5 scenarios × 50) for the UAV Platform.
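A minimal sketch of how such a seeded fault-injection schedule could be generated. The PRNG here is mulberry32 as a stand-in for whatever seeded generator the benchmarks actually use, and the seed, intensity bounds, and field names are illustrative assumptions, not the protocol's actual values:

```typescript
// Seeded PRNG (mulberry32) so that the same seed always yields the same
// schedule. Illustrative stand-in for the benchmark's actual seeded PRNG.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t = (t + Math.imul(t ^ (t >>> 7), t | 61)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

interface FaultRun {
  run: number;
  injectionTimeS: number; // uniform in [0, 30) s of the 90 s run
  intensity: number;      // scaled into scenario-specific bounds
}

// Hypothetical schedule generator: one timing/intensity pair per run.
function scheduleRuns(
  seed: number,
  runs: number,
  intensityMin: number,
  intensityMax: number
): FaultRun[] {
  const rand = mulberry32(seed);
  const out: FaultRun[] = [];
  for (let run = 0; run < runs; run++) {
    out.push({
      run,
      injectionTimeS: rand() * 30,
      intensity: intensityMin + rand() * (intensityMax - intensityMin),
    });
  }
  return out;
}
```

Because the generator is seeded, re-running `scheduleRuns` with the same seed reproduces the full 50-run schedule exactly, which is what makes the per-scenario results replayable.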
Five primary metrics assess governance performance. All metrics are computed per-run and aggregated with mean and standard deviation across each scenario:
Exact definitions used to compute each metric. These formulas are applied identically across all governance methods and baselines:
"Exceeding authority envelope" means any actuator command that violates the speed, turn rate, or action constraints defined for the current authority level. A command at A2 speed limits executed while authority is A1 counts as one unsafe action for that run.
Three baseline governance approaches are simulated using the same sensor fault profiles to provide fair comparison. Baselines are reimplemented in the same simulation framework:
Simple threshold on fused sensor confidence. Above threshold: full autonomy. Below threshold: complete halt. No intermediate authority states. No structured recovery. Represents current practice in many deployed systems.
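The threshold-halt baseline reduces to a single comparison. A minimal sketch, with an assumed (illustrative) threshold value:

```typescript
type Mode = "FULL_AUTONOMY" | "HALT";

// Threshold-halt baseline: no intermediate authority states and no
// structured recovery. The 0.7 default is illustrative, not the
// evaluated configuration.
function thresholdBaseline(fusedConfidence: number, threshold = 0.7): Mode {
  return fusedConfidence >= threshold ? "FULL_AUTONOMY" : "HALT";
}
```

The all-or-nothing output is exactly what the graduated authority levels in the governance pipeline are meant to replace.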
Trained anomaly classifier on sensor data features. Binary output (normal/anomalous) triggers authority restriction. Supervised model trained on 70% of fault scenarios, tested on 30%. Represents ML-based fault detection without formal governance.
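The 70/30 scenario split for the ML baseline can be done with any seeded shuffle; this sketch takes the random source as a parameter so the split is reproducible (the shuffle-then-cut approach is an assumption, not a description of the actual training code):

```typescript
// Seeded 70/30 split: Fisher-Yates shuffle driven by a caller-supplied
// deterministic rand() in [0, 1), then a fixed cut point.
function splitScenarios<T>(
  items: T[],
  rand: () => number
): { train: T[]; test: T[] } {
  const shuffled = [...items];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * 0.7);
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}
```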
Binary switching between a complex controller (full autonomy) and a verified simple controller (safe-stop). Based on Sha (2001). No intermediate authority states, but provides verified safe fallback. Represents the state of the art in runtime verification-based safety.
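The Simplex-style switch is a one-line decision between two controllers. A sketch under the assumption that both controllers and the safety predicate are opaque functions (their internals are placeholders here, not the evaluated implementation):

```typescript
type State = number[];
type Controller = (state: State) => number[];

// Simplex-style binary switching: run the complex controller while the
// state is judged safe, otherwise fall back to the verified safe-stop
// controller. No intermediate authority states exist.
function makeSimplex(
  complex: Controller,
  safeStop: Controller,
  isSafe: (state: State) => boolean
): Controller {
  return (state) => (isSafe(state) ? complex(state) : safeStop(state));
}
```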
Exact parameterization of each baseline. Baselines were implemented using standard configurations and may not represent optimal tuning for each method:
All baselines use the same simulated sensor data streams as the full governance pipeline, ensuring that performance differences are attributable to the governance method rather than to sensor input variation. A more sophisticated ML model or an optimized Simplex configuration could potentially improve baseline performance.
Results aggregated across all 350 Rover Testbed runs (7 scenarios x 50 runs each). Values are mean ± standard deviation:
Key tradeoff: recovery time for the governance pipeline (~31 s) is significantly longer than for Simplex (~8 s) because recovery is structured and graduated rather than a binary restart. This is an intentional design tradeoff: safety verification during recovery takes time. The remaining 2.1% of unsafe actions occur in edge cases where fault injection timing coincides exactly with authority transition boundaries.
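The per-scenario aggregation behind these tables is a plain mean ± standard deviation over the 50 runs. A minimal sketch, assuming the sample (n−1) standard deviation; whether the published tables use sample or population std is not stated here:

```typescript
// Mean and sample standard deviation over one scenario's per-run values.
function meanStd(values: number[]): { mean: number; std: number } {
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  const variance =
    values.reduce((s, v) => s + (v - mean) ** 2, 0) / (n - 1);
  return { mean, std: Math.sqrt(variance) };
}
```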
All published benchmark results are generated using fixed seeds. All stochastic elements (fault injection timing, noise profiles) are controlled via seeded PRNG. Math.random() is not used in benchmark-critical computation paths. The governance pipeline itself (SATA trust fusion, HMAA authority computation, CARA recovery logic) contains zero stochastic components: identical inputs always produce identical outputs regardless of execution environment or timing.
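The determinism property is simply that each governance stage is a pure function of its inputs. A toy illustration (the trust thresholds and level names are invented for this sketch and collapse the SATA/HMAA/CARA stages into one mapping):

```typescript
// Toy stand-in for the governance pipeline: a pure function with no
// Math.random(), no clocks, no hidden state. Thresholds are illustrative.
function governanceStep(fusedTrust: number, faultActive: boolean): string {
  if (faultActive || fusedTrust < 0.3) return "A0"; // safe-stop authority
  if (fusedTrust < 0.7) return "A1";               // restricted authority
  return "A2";                                     // full authority
}
```

Because nothing in the function depends on time or randomness, replaying a recorded input trace through it reproduces the published authority trajectory bit-for-bit.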
All claims in this research are explicitly classified by evidence type:
To reproduce: open any simulation page, use the preset scenario buttons matching the 7 defined scenarios, and verify that the authority trajectory matches the documented behavior. All simulations are deterministic: identical inputs always produce identical outputs.
Each simulation supports a single-architecture mode (an individual governance module only) and a full pipeline mode (all governance modules integrated). Both configurations are available in every simulation and are covered by the evaluation protocol.
This evaluation protocol is part of the authority-governed autonomy research program by Burak Oktenli at Georgetown University (M.P.S. Applied Intelligence). All evaluation artifacts, simulation code, and result data are published on Zenodo under CC BY 4.0.
Related: Full Research Portfolio · HMAA Architecture · Rover Testbed · UAV Platform · BLADE-EDGE · BLADE-AV · BLADE-MARITIME · BLADE-INFRA · All Repositories