Research Methodology

Evaluation Protocol

Standardized evaluation methodology for the authority-governed autonomy research program. This document defines all scenarios, metrics, baselines, assumptions, and reproducibility criteria used across the governance architecture experiments.

What This Document Covers

This evaluation protocol applies to all simulation-based experiments across seven governance architectures (SATA, HMAA, CARA, MAIVA, FLAME, ADARA, ERAM) and six physical research platforms (Rover Testbed, UAV Platform, BLADE-EDGE, BLADE-AV, BLADE-MARITIME, BLADE-INFRA). It defines the methodology for assessing governance correctness, safety performance, and system behavior under adversarial conditions.

Current evaluation is simulation-based. Hardware-in-the-loop validation and physical testing are planned as future work. All claims are scoped to the simulation environment unless explicitly stated otherwise.

Assumptions and Constraints

A1. Sensor model fidelity. Simulated sensors approximate physical sensor behavior but do not capture all real-world noise characteristics. Sensor fault injection uses parameterized models (Gaussian noise, step faults, drift ramps) rather than physics-based sensor simulation.
A2. Computation timing. Simulation assumes idealized computation timing (~100Hz control loop). Real hardware may introduce jitter, communication delays, and processing bottlenecks not modeled in simulation.
A3. Adversary model. Adversarial scenarios use predefined attack patterns (spoofing, jamming, drift injection). Adaptive adversaries that modify their strategy based on system response are not modeled in the current evaluation.
A4. Single-platform focus. MAIVA multi-agent evaluation uses simulated agent populations. Physical multi-agent experiments (actual drone swarms) are future work.
A5. Deterministic governance. The governance pipeline is fully deterministic: identical inputs always produce identical outputs. No stochastic components exist in authority computation, recovery, or escalation control.

Adversarial Scenario Definitions

Seven standardized adversarial scenarios are used across all evaluations. Each scenario targets a specific sensor modality or combination to test governance response:

| Scenario | Target | Attack Description | Expected Response | n (runs) | Severity |
|---|---|---|---|---|---|
| S1. Camera Occlusion | Camera | Progressive visual obstruction (0% → 100% over 5s) | A3 → A2 → A1 | 50 | Medium |
| S2. LiDAR Spoofing | LiDAR | Phantom point cloud injection (false obstacles) | A3 → A1, cross-val alert | 50 | High |
| S3. IMU Drift | IMU | Gradual orientation drift (+0.5 deg/s) | Slow trust decay | 50 | Low-Medium |
| S4. RF Jamming | GPS, LoRa | Communication link disruption (complete blackout) | A3 → A2, local nav | 50 | Medium |
| S5. Compound Attack | Camera + LiDAR | Simultaneous multi-sensor degradation | A3 → A0, CARA | 50 | Critical |
| S6. Cross-Sensor | LiDAR vs Camera | Single sensor disagrees with all others | Trust penalty, A3 → A2 | 50 | Medium |
| S7. Recovery | Post-lockout | Fault clearance after A0, monitor GREP | CARA GREP → A3 | 50 | N/A |

Each scenario runs 50 times with varied fault injection timing (uniform random within the first 30 s of each 90 s simulation run) and intensity (parameterized within scenario-specific bounds). Total: 350 runs for the Rover Testbed and 250 runs (5 scenarios × 50) for the UAV Platform.
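The seeded injection-timing scheme described above can be sketched as follows. mulberry32 is a common small deterministic PRNG; the faultSchedule function and its per-run seeding rule are illustrative assumptions, not the published simulation code.

```javascript
// mulberry32: a small deterministic PRNG; identical seeds always
// reproduce the identical sequence of values in [0, 1).
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One injection time per run, drawn uniformly from the first 30 s of the
// 90 s run. The seed combination (scenarioSeed * 1000 + run) is a
// hypothetical convention for per-run reproducibility.
function faultSchedule(scenarioSeed, runs = 50, windowS = 30) {
  const times = [];
  for (let run = 0; run < runs; run++) {
    const rng = mulberry32(scenarioSeed * 1000 + run);
    times.push(rng() * windowS); // injection time in [0, 30) s
  }
  return times;
}
```

Because every run has its own fixed seed, a scenario's 50 injection times can be regenerated exactly for replay.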

Performance Metrics

Five primary metrics assess governance performance. All metrics are computed per-run and aggregated with mean and standard deviation across each scenario:

| Metric | Definition | Target | Unit |
|---|---|---|---|
| Unsafe Action Rate | Percentage of runs where the system executed a command that violated the authority envelope for the current trust level | < 5% | % |
| False Lockout Rate | Percentage of runs where authority was revoked (A0) despite all sensors reporting accurate data | < 10% | % |
| Mean Recovery Time | Average time from A0 lockout to full authority restoration (A3) through CARA GREP phases | Report | seconds |
| Authority Transition Correctness | Percentage of authority state transitions that match the specification (correct direction, hysteresis enforced, no skips) | > 99% | % |
| Detection Latency | Time from fault injection to first authority degradation response | < 2 s | seconds |

Metric Computation Formulas

Exact definitions used to compute each metric. These formulas are applied identically across all governance methods and baselines:

Unsafe Action Rate = (runs with ≥1 command exceeding authority envelope) / (total runs)
False Lockout Rate = (runs where A=0 triggered AND all sensors reporting valid data) / (total runs)
Mean Recovery Time = mean(t_A3_restored - t_A0_triggered) across all runs with lockout events
Transition Correctness = (transitions matching spec: correct direction + hysteresis enforced + no skips) / (total transitions)
Detection Latency = mean(t_first_authority_change - t_fault_injected) across all fault-injected runs

"Exceeding authority envelope" means any actuator command that violates the speed, turn rate, or action constraints defined for the current authority level. A command at A2 speed limits executed while authority is A1 counts as one unsafe action for that run.
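The five formulas above can be computed per scenario from per-run logs. The log field names below (envelopeViolations, tA0, tA3Restored, and so on) are hypothetical for this sketch, not the published artifact schema.

```javascript
// Compute the five primary metrics from an array of per-run log objects.
function computeMetrics(runs) {
  const n = runs.length;
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

  // Unsafe Action Rate: runs with >= 1 envelope-violating command.
  const unsafe = runs.filter(r => r.envelopeViolations >= 1).length / n;

  // False Lockout Rate: A0 triggered while all sensors reported valid data.
  const falseLockout =
    runs.filter(r => r.lockedOut && r.allSensorsValid).length / n;

  // Mean Recovery Time: only over runs that actually experienced a lockout.
  const lockoutRuns = runs.filter(r => r.tA0 != null && r.tA3Restored != null);
  const recovery = lockoutRuns.length
    ? mean(lockoutRuns.map(r => r.tA3Restored - r.tA0))
    : null;

  // Transition Correctness: spec-conforming transitions over all transitions.
  const correctness =
    runs.reduce((a, r) => a + r.transitionsOk, 0) /
    runs.reduce((a, r) => a + r.transitions, 0);

  // Detection Latency: fault injection to first authority change.
  const latency = mean(
    runs.filter(r => r.tFirstChange != null)
        .map(r => r.tFirstChange - r.tFaultInjected));

  return { unsafe, falseLockout, recovery, correctness, latency };
}
```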

Baseline Methods

Three baseline governance approaches are simulated using the same sensor fault profiles to provide fair comparison. Baselines are reimplemented in the same simulation framework:

Baseline 1: Binary Threshold

Simple threshold on fused sensor confidence. Above threshold: full autonomy. Below threshold: complete halt. No intermediate authority states. No structured recovery. Represents current practice in many deployed systems.

Baseline 2: ML Anomaly Detection

Trained anomaly classifier on sensor data features. Binary output (normal/anomalous) triggers authority restriction. Supervised model trained on 70% of fault scenarios, tested on 30%. Represents ML-based fault detection without formal governance.

Baseline 3: Simplex Switching

Binary switching between a complex controller (full autonomy) and a verified simple controller (safe-stop). Based on Sha (2001). No intermediate authority states, but provides verified safe fallback. Represents the state of the art in runtime verification-based safety.

Baseline Configurations

Exact parameterization of each baseline. Baselines were implemented using standard configurations and may not represent optimal tuning for each method:

Binary Threshold: if fused_confidence > 0.6 then FULL_AUTO else HALT
   confidence = mean(sensor_health_scores), no cross-validation
   no hysteresis, no recovery protocol, no intermediate states

ML Anomaly Detection: logistic regression on feature vector [τ, Δτ, cross_agreement, temporal_var]
   trained on 70% of scenario data (245 runs), tested on 30% (105 runs)
   binary output: normal → full auto, anomaly → halt + restart after 5s cooldown

Simplex Switching: verified safe controller: immediate halt + return-to-start at 10% speed
   switch trigger: any sensor health < 0.3
   switch-back: all sensors > 0.7 for 3s, binary (no graduated recovery)
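The Simplex switch-back rule above (trip on any sensor health below 0.3, return only after all sensors stay above 0.7 for 3 s) can be sketched as a small state machine. Class and method names are illustrative, not from the published baseline code.

```javascript
// Binary Simplex switch: full autonomy vs. verified safe controller,
// with a 3 s all-healthy dwell requirement before switching back.
class SimplexSwitch {
  constructor(dtS = 0.01) {   // ~100 Hz control loop step
    this.dtS = dtS;
    this.safeMode = false;
    this.healthyS = 0;        // time all sensors have stayed > 0.7
  }
  step(sensorHealth) {        // per-sensor health scores in [0, 1]
    if (!this.safeMode) {
      if (sensorHealth.some(h => h < 0.3)) {
        this.safeMode = true; // switch trigger: any sensor below 0.3
        this.healthyS = 0;
      }
    } else if (sensorHealth.every(h => h > 0.7)) {
      this.healthyS += this.dtS;
      if (this.healthyS >= 3) this.safeMode = false; // binary switch-back
    } else {
      this.healthyS = 0;      // any dip resets the 3 s dwell timer
    }
    return this.safeMode ? 'SAFE_STOP' : 'FULL_AUTO';
  }
}
```

Note the contrast with the graduated pipeline: there are exactly two modes and no intermediate authority levels.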

All baselines use the same simulated sensor data streams as the full governance pipeline, ensuring that performance differences are attributable to the governance method rather than to sensor input variation. A more sophisticated ML model or an optimized Simplex configuration could potentially improve baseline performance.

Aggregate Performance

Results aggregated across all 350 Rover Testbed runs (7 scenarios × 50 runs each). Values are mean ± standard deviation:

| Method | Unsafe Actions | False Lockouts | Recovery Time | Transition Correctness | Detection Latency |
|---|---|---|---|---|---|
| Binary Threshold | 42.3% ± 4.1 | 28.1% ± 3.8 | N/A | N/A | 0.8s ± 0.3 |
| ML Anomaly Detection | 27.4% ± 5.2 | 12.3% ± 2.9 | 14.8s ± 6.1 | N/A | 2.1s ± 1.4 |
| Simplex Switching | 18.7% ± 3.4 | 15.6% ± 3.1 | 7.9s ± 2.3 | 100% | 0.5s ± 0.2 |
| SATA-HMAA-CARA | 3.4% ± 1.2 | 4.8% ± 1.6 | 31.2s ± 8.4 | 99.7% | 0.6s ± 0.2 |
| Full Pipeline (+ADARA+FLAME) | 2.1% ± 0.9 | 4.8% ± 1.6 | 31.2s ± 8.4 | 99.7% | 0.6s ± 0.2 |

Key tradeoff: recovery time for the governance pipeline (~31 s) is roughly four times that of Simplex (~8 s) because recovery is structured and graduated rather than a binary restart. This is an intentional design tradeoff: safety verification during recovery takes time. The remaining 2.1% of unsafe actions occur in edge cases where fault injection timing coincides exactly with authority transition boundaries.

How to Reproduce These Results

Deterministic Guarantee

All published benchmark results are generated using fixed seeds. All stochastic elements (fault injection timing, noise profiles) are controlled via seeded PRNG. Math.random() is not used in benchmark-critical computation paths. The governance pipeline itself (SATA trust fusion, HMAA authority computation, CARA recovery logic) contains zero stochastic components: identical inputs always produce identical outputs regardless of execution environment or timing.
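The "identical inputs, identical outputs" claim can be illustrated with a toy pure-function pipeline. The fusion rule and thresholds below are stand-ins for exposition only, not the actual SATA trust fusion or HMAA authority computation.

```javascript
// Toy governance computation as a pure function: no hidden state, no
// randomness, no timing dependence, so replaying a sensor trace always
// yields the identical authority trajectory.
function authorityLevel(trustScores) {
  const tau = Math.min(...trustScores); // illustrative fusion: min trust
  if (tau > 0.8) return 3;              // A3: full autonomy
  if (tau > 0.6) return 2;              // A2
  if (tau > 0.4) return 1;              // A1
  return 0;                             // A0: lockout
}

// Deterministic replay: map each tick's trust scores to an authority level.
function replay(trace) {
  return trace.map(authorityLevel);
}
```

Replaying the same trace twice produces element-for-element identical authority trajectories, which is the property the benchmark determinism depends on.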

Evidence Classification

All claims in this research are explicitly classified by evidence type:

Formal Guarantees (THEORETICAL): Properties verified by TLA+ model checking over finite state spaces. Valid within the model; physical validity requires hardware confirmation.

Simulation Results (EMPIRICAL): Performance metrics from deterministic simulation with fixed seeds. Sensor models are parameterized approximations, not physics-based.

Hardware Validation (EXPERIMENTAL, PLANNED): Physical testbed results. Currently design-complete with hardware assembly in progress. No physical experimental data yet published.

Simulation Environment

Runtime: Browser (client-side JS)
Control loop: ~100Hz
Run duration: 90s per scenario
Deterministic: Yes (fixed-seed PRNG)
Monte Carlo: No (deterministic replay)
Variance source: Fault injection timing
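The run structure implied by these parameters (90 s at ~100 Hz, a single seeded fault-injection time as the only variance source) can be sketched as below; simulateRun, step, and injectFault are placeholder names, not the published API.

```javascript
// One deterministic run: 90 s stepped at 100 Hz (9000 ticks), with the
// fault injected at the first tick at or past the scheduled time.
function simulateRun({ durationS = 90, hz = 100, faultTimeS, step, injectFault }) {
  const ticks = durationS * hz;
  let injected = false;
  for (let i = 0; i < ticks; i++) {
    const tS = i / hz;                 // simulation time of this tick
    if (!injected && tS >= faultTimeS) {
      injectFault(tS);                 // single injection per run
      injected = true;
    }
    step(tS);                          // advance sensors + governance
  }
  return ticks;
}
```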

Available Artifacts

9 interactive simulations (browser-based)
8 Zenodo DOI-registered artifact packages
Configuration files (JSON) per platform
Hardware BOMs (CSV) for both testbeds
TLA+ specifications (HMAA, MAIVA)
All published under CC BY 4.0

To reproduce: open any simulation page, use the preset scenario buttons matching the 7 defined scenarios, and verify that the authority trajectory matches the documented behavior. All simulations are deterministic: identical inputs always produce identical outputs.

Each simulation supports single-architecture mode (individual governance module only) and full pipeline mode (all governance modules integrated). Both configurations are available in every simulation and are covered by this evaluation protocol.

What This Evaluation Does NOT Cover

Physical hardware validation. All current results are simulation-based. Real-world sensor noise, timing jitter, and hardware-specific failure modes are not captured.
Adaptive adversaries. Current scenarios use predefined attack patterns. Adversaries that observe and adapt to governance responses are not modeled.
Multi-platform multi-agent. MAIVA evaluation uses simulated agent populations, not physically distributed systems.
Long-duration missions. Evaluation runs are 90s each. Long-duration effects (sensor drift over hours, Bayesian prior accumulation) are not assessed.
Formal correctness proof. TLA+ model checking verifies properties over finite state spaces. Full mathematical proof of all guarantees under all conditions is future work.

About

This evaluation protocol is part of the authority-governed autonomy research program by Burak Oktenli at Georgetown University (M.P.S. Applied Intelligence). All evaluation artifacts, simulation code, and result data are published on Zenodo under CC BY 4.0.
