Research Methodology

Evaluation Protocol

Standardized evaluation methodology for the authority-governed autonomy research program. This document defines all scenarios, metrics, baselines, assumptions, and reproducibility criteria used across the governance architecture experiments.

What This Document Covers

This evaluation protocol applies to all simulation-based experiments across seven governance architectures (SATA, HMAA, CARA, MAIVA, FLAME, ADARA, ERAM) and six physical research platforms (Rover Testbed, UAV Platform, BLADE-EDGE, BLADE-AV, BLADE-MARITIME, BLADE-INFRA). It defines the methodology for assessing governance correctness, safety performance, and system behavior under adversarial conditions.

Current evaluation is simulation-based. Hardware-in-the-loop validation and physical testing are planned as future work. All claims are scoped to the simulation environment unless explicitly stated otherwise.

Assumptions and Constraints

A1. Sensor model fidelity. Simulated sensors approximate physical sensor behavior but do not capture all real-world noise characteristics. Sensor fault injection uses parameterized models (Gaussian noise, step faults, drift ramps) rather than physics-based sensor simulation.
A2. Computation timing. Simulation assumes idealized computation timing (~100Hz control loop). Real hardware may introduce jitter, communication delays, and processing bottlenecks not modeled in simulation.
A3. Adversary model. Adversarial scenarios use predefined attack patterns (spoofing, jamming, drift injection). Adaptive adversaries that modify their strategy based on system response are not modeled in the current evaluation.
A4. Single-platform focus. MAIVA multi-agent evaluation uses simulated agent populations. Physical multi-agent experiments (actual drone swarms) are future work.
A5. Deterministic governance. The governance pipeline is fully deterministic: identical inputs always produce identical outputs. No stochastic components exist in authority computation, recovery, or escalation control.

Adversarial Scenario Definitions

Seven standardized adversarial scenarios are used across all evaluations. Each scenario targets a specific sensor modality or combination to test governance response:

| Scenario | Target | Attack Description | Expected Response | n (runs) | Severity |
|---|---|---|---|---|---|
| S1. Camera Occlusion | Camera | Progressive visual obstruction (0% → 100% over 5s) | A3 → A2 → A1 | 50 | Medium |
| S2. LiDAR Spoofing | LiDAR | Phantom point cloud injection (false obstacles) | A3 → A1, cross-val alert | 50 | High |
| S3. IMU Drift | IMU | Gradual orientation drift (+0.5 deg/s) | Slow trust decay | 50 | Low-Medium |
| S4. RF Jamming | GPS, LoRa | Communication link disruption (complete blackout) | A3 → A2, local nav | 50 | Medium |
| S5. Compound Attack | Camera + LiDAR | Simultaneous multi-sensor degradation | A3 → A0, CARA | 50 | Critical |
| S6. Cross-Sensor | LiDAR vs Camera | Single sensor disagrees with all others | Trust penalty, A3 → A2 | 50 | Medium |
| S7. Recovery | Post-lockout | Fault clearance after A0, monitor GREP | CARA GREP → A3 | 50 | N/A |

Each scenario runs 50 times with varied fault injection timing (uniform random within the first 30 s of each 90 s simulation run) and intensity (parameterized within scenario-specific bounds). Total: 350 runs for the Rover Testbed and 250 runs (5 scenarios × 50) for the UAV Platform.
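The seeded injection-timing scheme described above can be sketched as follows. mulberry32 is a common small deterministic PRNG; the faultSchedule function and its per-run seeding rule are illustrative assumptions, not the published simulation code.

```javascript
// mulberry32: a small deterministic PRNG; identical seeds always
// reproduce the identical sequence of values in [0, 1).
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One injection time per run, drawn uniformly from the first 30 s of the
// 90 s run. The seed combination (scenarioSeed * 1000 + run) is a
// hypothetical convention for per-run reproducibility.
function faultSchedule(scenarioSeed, runs = 50, windowS = 30) {
  const times = [];
  for (let run = 0; run < runs; run++) {
    const rng = mulberry32(scenarioSeed * 1000 + run);
    times.push(rng() * windowS); // injection time in [0, 30) s
  }
  return times;
}
```

Because every run has its own fixed seed, a scenario's 50 injection times can be regenerated exactly for replay.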

Performance Metrics

Five primary metrics assess governance performance. All metrics are computed per-run and aggregated with mean and standard deviation across each scenario:

| Metric | Definition | Target | Unit |
|---|---|---|---|
| Unsafe Action Rate | Percentage of runs where the system executed a command that violated the authority envelope for the current trust level | < 5% | % |
| False Lockout Rate | Percentage of runs where authority was revoked (A0) despite all sensors reporting accurate data | < 10% | % |
| Mean Recovery Time | Average time from A0 lockout to full authority restoration (A3) through CARA GREP phases | Report | seconds |
| Authority Transition Correctness | Percentage of authority state transitions that match the specification (correct direction, hysteresis enforced, no skips) | > 99% | % |
| Detection Latency | Time from fault injection to first authority degradation response | < 2 s | seconds |

Metric Computation Formulas

Exact definitions used to compute each metric. These formulas are applied identically across all governance methods and baselines:

Unsafe Action Rate = (runs with ≥1 command exceeding authority envelope) / (total runs)
False Lockout Rate = (runs where A=0 triggered AND all sensors reporting valid data) / (total runs)
Mean Recovery Time = mean(t_A3_restored - t_A0_triggered) across all runs with lockout events
Transition Correctness = (transitions matching spec: correct direction + hysteresis enforced + no skips) / (total transitions)
Detection Latency = mean(t_first_authority_change - t_fault_injected) across all fault-injected runs

"Exceeding authority envelope" means any actuator command that violates the speed, turn rate, or action constraints defined for the current authority level. A command at A2 speed limits executed while authority is A1 counts as one unsafe action for that run.
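The five formulas above can be computed per scenario from per-run logs. The log field names below (envelopeViolations, tA0, tA3Restored, and so on) are hypothetical for this sketch, not the published artifact schema.

```javascript
// Compute the five primary metrics from an array of per-run log objects.
function computeMetrics(runs) {
  const n = runs.length;
  const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

  // Unsafe Action Rate: runs with >= 1 envelope-violating command.
  const unsafe = runs.filter(r => r.envelopeViolations >= 1).length / n;

  // False Lockout Rate: A0 triggered while all sensors reported valid data.
  const falseLockout =
    runs.filter(r => r.lockedOut && r.allSensorsValid).length / n;

  // Mean Recovery Time: only over runs that actually experienced a lockout.
  const lockoutRuns = runs.filter(r => r.tA0 != null && r.tA3Restored != null);
  const recovery = lockoutRuns.length
    ? mean(lockoutRuns.map(r => r.tA3Restored - r.tA0))
    : null;

  // Transition Correctness: spec-conforming transitions over all transitions.
  const correctness =
    runs.reduce((a, r) => a + r.transitionsOk, 0) /
    runs.reduce((a, r) => a + r.transitions, 0);

  // Detection Latency: fault injection to first authority change.
  const latency = mean(
    runs.filter(r => r.tFirstChange != null)
        .map(r => r.tFirstChange - r.tFaultInjected));

  return { unsafe, falseLockout, recovery, correctness, latency };
}
```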

Baseline Methods

Three baseline governance approaches are simulated using the same sensor fault profiles to provide fair comparison. Baselines are reimplemented in the same simulation framework:

Baseline 1: Binary Threshold

Simple threshold on fused sensor confidence. Above threshold: full autonomy. Below threshold: complete halt. No intermediate authority states. No structured recovery. Represents current practice in many deployed systems.

Baseline 2: ML Anomaly Detection

Trained anomaly classifier on sensor data features. Binary output (normal/anomalous) triggers authority restriction. Supervised model trained on 70% of fault scenarios, tested on 30%. Represents ML-based fault detection without formal governance.

Baseline 3: Simplex Switching

Binary switching between a complex controller (full autonomy) and a verified simple controller (safe-stop). Based on Sha (2001). No intermediate authority states, but provides verified safe fallback. Represents the state of the art in runtime verification-based safety.

Baseline Configurations

Exact parameterization of each baseline. Baselines were implemented using standard configurations and may not represent optimal tuning for each method:

Binary Threshold: if fused_confidence > 0.6 then FULL_AUTO else HALT
   confidence = mean(sensor_health_scores), no cross-validation
   no hysteresis, no recovery protocol, no intermediate states

ML Anomaly Detection: logistic regression on feature vector [τ, Δτ, cross_agreement, temporal_var]
   trained on 70% of scenario data (245 runs), tested on 30% (105 runs)
   binary output: normal → full auto, anomaly → halt + restart after 5s cooldown

Simplex Switching: verified safe controller: immediate halt + return-to-start at 10% speed
   switch trigger: any sensor health < 0.3
   switch-back: all sensors > 0.7 for 3s, binary (no graduated recovery)
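The Simplex switch-back rule above (trip on any sensor health below 0.3, return only after all sensors stay above 0.7 for 3 s) can be sketched as a small state machine. Class and method names are illustrative, not from the published baseline code.

```javascript
// Binary Simplex switch: full autonomy vs. verified safe controller,
// with a 3 s all-healthy dwell requirement before switching back.
class SimplexSwitch {
  constructor(dtS = 0.01) {   // ~100 Hz control loop step
    this.dtS = dtS;
    this.safeMode = false;
    this.healthyS = 0;        // time all sensors have stayed > 0.7
  }
  step(sensorHealth) {        // per-sensor health scores in [0, 1]
    if (!this.safeMode) {
      if (sensorHealth.some(h => h < 0.3)) {
        this.safeMode = true; // switch trigger: any sensor below 0.3
        this.healthyS = 0;
      }
    } else if (sensorHealth.every(h => h > 0.7)) {
      this.healthyS += this.dtS;
      if (this.healthyS >= 3) this.safeMode = false; // binary switch-back
    } else {
      this.healthyS = 0;      // any dip resets the 3 s dwell timer
    }
    return this.safeMode ? 'SAFE_STOP' : 'FULL_AUTO';
  }
}
```

Note the contrast with the graduated pipeline: there are exactly two modes and no intermediate authority levels.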

All baselines use the same simulated sensor data streams as the full governance pipeline, ensuring that performance differences are attributable to the governance method rather than to sensor input variation. A more sophisticated ML model or an optimized Simplex configuration could potentially improve baseline performance.

Aggregate Performance

Results aggregated across all 350 Rover Testbed runs (7 scenarios × 50 runs each). Values are mean ± standard deviation:

| Method | Unsafe Actions | False Lockouts | Recovery Time | Transition Correctness | Detection Latency |
|---|---|---|---|---|---|
| Binary Threshold | 42.3% ± 4.1 | 28.1% ± 3.8 | N/A | N/A | 0.8s ± 0.3 |
| ML Anomaly Detection | 27.4% ± 5.2 | 12.3% ± 2.9 | 14.8s ± 6.1 | N/A | 2.1s ± 1.4 |
| Simplex Switching | 18.7% ± 3.4 | 15.6% ± 3.1 | 7.9s ± 2.3 | 100% | 0.5s ± 0.2 |
| SATA-HMAA-CARA | 3.4% ± 1.2 | 4.8% ± 1.6 | 31.2s ± 8.4 | 99.7% | 0.6s ± 0.2 |
| Full Pipeline (+ADARA+FLAME) | 2.1% ± 0.9 | 4.8% ± 1.6 | 31.2s ± 8.4 | 99.7% | 0.6s ± 0.2 |

Key tradeoff: recovery time for the governance pipeline (~31 s) is roughly four times that of Simplex (~8 s) because recovery is structured and graduated rather than a binary restart. This is an intentional design tradeoff: safety verification during recovery takes time. The remaining 2.1% of unsafe actions occur in edge cases where fault injection timing coincides exactly with authority transition boundaries.

How to Reproduce These Results

Deterministic Guarantee

All published benchmark results are generated using fixed seeds. All stochastic elements (fault injection timing, noise profiles) are controlled via seeded PRNG. Math.random() is not used in benchmark-critical computation paths. The governance pipeline itself (SATA trust fusion, HMAA authority computation, CARA recovery logic) contains zero stochastic components: identical inputs always produce identical outputs regardless of execution environment or timing.
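The "identical inputs, identical outputs" claim can be illustrated with a toy pure-function pipeline. The fusion rule and thresholds below are stand-ins for exposition only, not the actual SATA trust fusion or HMAA authority computation.

```javascript
// Toy governance computation as a pure function: no hidden state, no
// randomness, no timing dependence, so replaying a sensor trace always
// yields the identical authority trajectory.
function authorityLevel(trustScores) {
  const tau = Math.min(...trustScores); // illustrative fusion: min trust
  if (tau > 0.8) return 3;              // A3: full autonomy
  if (tau > 0.6) return 2;              // A2
  if (tau > 0.4) return 1;              // A1
  return 0;                             // A0: lockout
}

// Deterministic replay: map each tick's trust scores to an authority level.
function replay(trace) {
  return trace.map(authorityLevel);
}
```

Replaying the same trace twice produces element-for-element identical authority trajectories, which is the property the benchmark determinism depends on.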

Evidence Classification

All claims in this research are explicitly classified by evidence type:

Formal Guarantees (THEORETICAL): Properties verified by TLA+ model checking over finite state spaces. Valid within the model; physical validity requires hardware confirmation.

Simulation Results (EMPIRICAL): Performance metrics from deterministic simulation with fixed seeds. Sensor models are parameterized approximations, not physics-based.

Hardware Validation (EXPERIMENTAL, PLANNED): Physical testbed results. Currently design-complete with hardware assembly in progress. No physical experimental data yet published.

Simulation Environment

Runtime: Browser (client-side JS)
Control loop: ~100Hz
Run duration: 90s per scenario
Deterministic: Yes (fixed-seed PRNG)
Monte Carlo: No (deterministic replay)
Variance source: Fault injection timing
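The run structure implied by these parameters (90 s at ~100 Hz, a single seeded fault-injection time as the only variance source) can be sketched as below; simulateRun, step, and injectFault are placeholder names, not the published API.

```javascript
// One deterministic run: 90 s stepped at 100 Hz (9000 ticks), with the
// fault injected at the first tick at or past the scheduled time.
function simulateRun({ durationS = 90, hz = 100, faultTimeS, step, injectFault }) {
  const ticks = durationS * hz;
  let injected = false;
  for (let i = 0; i < ticks; i++) {
    const tS = i / hz;                 // simulation time of this tick
    if (!injected && tS >= faultTimeS) {
      injectFault(tS);                 // single injection per run
      injected = true;
    }
    step(tS);                          // advance sensors + governance
  }
  return ticks;
}
```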

Available Artifacts

9 interactive simulations (browser-based)
8 Zenodo DOI-registered artifact packages
Configuration files (JSON) per platform
Hardware BOMs (CSV) for both testbeds
TLA+ specifications (HMAA, MAIVA)
All published under CC BY 4.0

To reproduce: open any simulation page, use the preset scenario buttons matching the 7 defined scenarios, and verify that the authority trajectory matches the documented behavior. All simulations are deterministic: identical inputs always produce identical outputs.

Each simulation supports single-architecture mode (individual governance module only) and full pipeline mode (all governance modules integrated). Both configurations are available in every simulation and are covered by this evaluation protocol.

What This Evaluation Does NOT Cover

Physical hardware validation. All current results are simulation-based. Real-world sensor noise, timing jitter, and hardware-specific failure modes are not captured.
Adaptive adversaries. Current scenarios use predefined attack patterns. Adversaries that observe and adapt to governance responses are not modeled.
Multi-platform multi-agent. MAIVA evaluation uses simulated agent populations, not physically distributed systems.
Long-duration missions. Evaluation runs are 90s each. Long-duration effects (sensor drift over hours, Bayesian prior accumulation) are not assessed.
Formal correctness proof. TLA+ model checking verifies properties over finite state spaces. Full mathematical proof of all guarantees under all conditions is future work.

About

This evaluation protocol is part of the authority-governed autonomy research program by Burak Oktenli at Georgetown University (M.P.S. Applied Intelligence). All evaluation artifacts, simulation code, and result data are published on Zenodo under CC BY 4.0.
