Benchmark Methodology

This document describes how AstraWeave performance measurements are collected, validated, and reported.

Measurement Philosophy

“Prove it, don’t hype it.”

Every performance claim in AstraWeave documentation:

Has a command that reproduces it
Captures raw logs for auditing
Uses statistical analysis for reliability
Is validated against real workloads, not only synthetic benchmarks

Tools & Infrastructure

Criterion.rs

All microbenchmarks use Criterion.rs for statistical rigor.

Why Criterion:

Statistical analysis with confidence intervals
Outlier detection and classification (outliers are flagged, not removed)
Baseline comparison (catch regressions)
HTML report generation

Location: target/criterion/**/base/estimates.json

Odyssey Runner

For full-suite benchmarking, use the automation script:

./scripts/benchmark_odyssey.ps1 -OutDir benchmark_results/$(Get-Date -Format 'yyyy-MM-dd')

Outputs:

environment.txt - OS/CPU/RAM, rustc/cargo version, git SHA
packages_with_benches.txt - Inventory of benchmarked crates
run_order.txt - Execution order
bench_<package>.log - Raw benchmark output per crate
run_results.json - Success/fail status

Statistical Practices

Confidence Intervals

Criterion provides 95% confidence intervals for all measurements. We report:

Point estimate: The measured mean
Lower bound: 95% CI lower
Upper bound: 95% CI upper

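Example: 1.34 ns [1.33 ns, 1.35 ns] is a point estimate of 1.34 ns with a 95% confidence interval from 1.33 ns to 1.35 ns. Intervals constructed this way are expected to contain the true mean in about 95% of repeated runs.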
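Criterion derives its intervals from bootstrap resampling. The sketch below shows a percentile bootstrap for the mean using only the standard library. It is an illustration of the idea, not Criterion's implementation: the function name bootstrap_mean_ci, the xorshift generator, the resample count, and the sample timings are all assumptions made for this example.

```rust
/// Small deterministic xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in 0..n (the slight modulo bias is acceptable for an illustration).
    fn index(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

/// Percentile-bootstrap confidence interval for the mean.
/// Returns (point_estimate, lower_bound, upper_bound).
fn bootstrap_mean_ci(
    samples: &[f64],
    resamples: usize,
    confidence: f64,
    seed: u64,
) -> (f64, f64, f64) {
    assert!(!samples.is_empty() && resamples > 0);
    let n = samples.len();
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut means = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        // Draw n values with replacement and record the mean of the resample.
        let total: f64 = (0..n).map(|_| samples[rng.index(n)]).sum();
        means.push(total / n as f64);
    }
    means.sort_by(|a, b| a.total_cmp(b));

    // Take the (alpha, 1 - alpha) percentiles of the resampled means.
    let alpha = (1.0 - confidence) / 2.0;
    let last = (resamples - 1) as f64;
    let lower = means[(alpha * last).round() as usize];
    let upper = means[((1.0 - alpha) * last).round() as usize];
    let point = samples.iter().sum::<f64>() / n as f64;
    (point, lower, upper)
}

fn main() {
    // Hypothetical timings in nanoseconds, for illustration only.
    let samples = [1.33, 1.34, 1.35, 1.34, 1.33, 1.35, 1.34, 1.34];
    let (point, lower, upper) = bootstrap_mean_ci(&samples, 1_000, 0.95, 42);
    println!("{point:.2} ns [{lower:.2} ns, {upper:.2} ns]");
}
```

The fixed seed keeps the sketch reproducible. The interval narrows as the number of samples grows, which is one reason a large sample count matters.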

Warm-Up & Iterations

Default Criterion settings:

Warm-up: 3 seconds (reduces cold-start artifacts such as cold caches)
Measurement: 5 seconds (target measurement time per benchmark)
Sample size: 100 samples
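To make the two phases concrete, here is a deliberately simplified harness built on std::time::Instant. It is only a sketch of the warm-up-then-sample pattern: Criterion's real sampling strategy is more sophisticated, and the function name measure and its parameters are assumptions for this example.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `routine` repeatedly for `warm_up`, then collects `sample_count`
/// samples. Each sample is the mean time per call, in nanoseconds,
/// over `iters_per_sample` calls.
fn measure<F, R>(
    mut routine: F,
    warm_up: Duration,
    sample_count: usize,
    iters_per_sample: u32,
) -> Vec<f64>
where
    F: FnMut() -> R,
{
    // Warm-up phase: run the routine but discard all timings.
    let start = Instant::now();
    while start.elapsed() < warm_up {
        black_box(routine());
    }

    // Measurement phase: time batches of calls and average within each batch.
    (0..sample_count)
        .map(|_| {
            let batch_start = Instant::now();
            for _ in 0..iters_per_sample {
                // black_box keeps the compiler from optimizing the call away.
                black_box(routine());
            }
            batch_start.elapsed().as_nanos() as f64 / f64::from(iters_per_sample)
        })
        .collect()
}

fn main() {
    // Short warm-up so the sketch finishes quickly; the workload is arbitrary.
    let samples = measure(
        || (0..64u64).sum::<u64>(),
        Duration::from_millis(50),
        100,
        1_000,
    );
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    println!("{} samples, mean {:.1} ns per call", samples.len(), mean);
}
```

Batching many calls per sample matters for nanosecond-scale operations, where a single call is shorter than the timer's resolution.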

Outlier Handling

Criterion automatically detects and reports outliers:

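Mild outliers: more than 1.5× IQR below the first quartile or above the third quartile
Severe outliers: more than 3× IQR below the first quartile or above the third quartile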

Outliers are flagged in reports but included in analysis (not discarded).
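The classification rule can be sketched with Tukey's fences as follows. This is a minimal illustration, assuming a linear-interpolation percentile; the names Outlier, percentile, and classify are made up for this example, and Criterion's own quartile computation may differ in detail.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Outlier {
    Normal,
    Mild,
    Severe,
}

/// Linear-interpolation percentile on an already sorted slice; `p` is in 0.0..=1.0.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = p * (sorted.len() - 1) as f64;
    let lo = rank.floor() as usize;
    let hi = rank.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo as f64)
}

/// Labels every sample using fences at 1.5x and 3x the interquartile range.
/// Samples are labeled only; none are removed.
fn classify(samples: &[f64]) -> Vec<Outlier> {
    assert!(!samples.is_empty());
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let q1 = percentile(&sorted, 0.25);
    let q3 = percentile(&sorted, 0.75);
    let iqr = q3 - q1;

    samples
        .iter()
        .map(|&x| {
            if x < q1 - 3.0 * iqr || x > q3 + 3.0 * iqr {
                Outlier::Severe
            } else if x < q1 - 1.5 * iqr || x > q3 + 1.5 * iqr {
                Outlier::Mild
            } else {
                Outlier::Normal
            }
        })
        .collect()
}

fn main() {
    // Hypothetical timings, for illustration only.
    let samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 18.0, 30.0];
    let labels = classify(&samples);
    let mild = labels.iter().filter(|l| **l == Outlier::Mild).count();
    let severe = labels.iter().filter(|l| **l == Outlier::Severe).count();
    println!("{mild} mild, {severe} severe out of {} samples", samples.len());
}
```

For this input the quartiles are 3.5 and 8.5, so the IQR is 5, the mild fence sits at 16, and the severe fence at 23.5.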

Benchmark Categories

1. Microbenchmarks

Single-operation measurements (e.g., “how long does vec3_lerp take?”).

Location: crates/*/benches/*.rs

Example:

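use std::hint::black_box;
use criterion::{criterion_group, criterion_main, Criterion};
// Vec3 must also be imported from the math crate under test.

fn bench_vec3_lerp(c: &mut Criterion) {
    let a = Vec3::new(0.0, 0.0, 0.0);
    let b = Vec3::new(1.0, 1.0, 1.0);
    c.bench_function("vec3_lerp", |bencher| {
        // black_box stops the compiler from constant-folding the fixed inputs.
        bencher.iter(|| black_box(a).lerp(black_box(b), black_box(0.5)))
    });
}

criterion_group!(benches, bench_vec3_lerp);
criterion_main!(benches);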

2. Adversarial Benchmarks

Stress tests for edge cases and worst-case scenarios.

Categories (22 sections):

Gameplay adversarial (massive damage, rapid hits)
Input adversarial (input storms, frame clear)
Math adversarial (IEEE-754 edge cases: infinity, NaN, denormals)
Navigation adversarial (sliver triangles, impossible paths)
Security adversarial (script sandboxing, anti-cheat)
And 17 more…

Purpose: Ensure production stability under extreme conditions.

3. Integration Benchmarks

End-to-end measurements of complete systems.

Example: “Full game loop with 5,000 entities”

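use criterion::{criterion_group, criterion_main, Criterion};

fn bench_full_game_loop(c: &mut Criterion) {
    // Build the world once, outside the timed closure, so setup cost is not measured.
    let mut world = setup_world_with_entities(5000);
    c.bench_function("full_game_loop/5000_entities", |bencher| {
        // Each iteration advances the simulation by one 60 Hz frame.
        bencher.iter(|| world.tick(1.0 / 60.0))
    });
}

criterion_group!(benches, bench_full_game_loop);
criterion_main!(benches);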

4. Scalability Benchmarks

Measure performance across varying input sizes.

Example: Entity spawn at 10, 100, 1000, 10000 entities.

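use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_entity_spawn(c: &mut Criterion) {
    let mut group = c.benchmark_group("entity_spawn");
    // One benchmark per input size, reported as entity_spawn/<size>.
    for size in [10, 100, 1000, 10000] {
        group.bench_with_input(
            BenchmarkId::from_parameter(size),
            &size,
            |b, &size| b.iter(|| spawn_entities(size)),
        );
    }
    group.finish();
}

criterion_group!(benches, bench_entity_spawn);
criterion_main!(benches);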

Environment Standardization

Hardware Requirements

Every benchmark run should record the machine and toolchain it ran on:

CPU: Model, cores, clock speed
RAM: Size, speed
OS: Windows/Linux/macOS version
Rust: rustc --version
Profile: Always optimized (--release; cargo bench already defaults to the bench profile, which inherits from release)

Isolation Practices

For reliable measurements:

Close unnecessary applications
Disable turbo boost (optional, for consistency)
Run multiple times to verify reproducibility
Use cargo bench -- --noplot to skip HTML generation (faster)

Reporting Standards

Master Benchmark Report

All benchmark results are consolidated in:

docs/masters/MASTER_BENCHMARK_REPORT.md

Update triggers:

Any benchmark result changes by more than 10%
New benchmarks added
Performance regressions discovered

Version Tracking

Each report version documents:

Version number (e.g., v5.55)
Date of measurement
Key changes since last version
Critical fixes applied

Regression Detection

Baseline Comparison

# Save current as baseline
cargo bench -p astraweave-ecs -- --save-baseline main

# Compare against baseline
cargo bench -p astraweave-ecs -- --baseline main

CI Integration

GitHub Actions workflow (benchmark.yml) runs benchmarks on:

Pull requests (compare against main)
Nightly builds (detect gradual regressions)

Alert Thresholds

Change	Action
< 5%	Normal variance, no action
5-10%	Flag for review
10-20%	Investigate root cause
> 20%	Block merge, fix required
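The table can be expressed as a small classification function. This is a sketch under stated assumptions, not the logic in benchmark.yml: the names Action, percent_change, and action_for are hypothetical, boundary values are resolved so that exactly 10% and exactly 20% both map to "investigate", and improvements (negative change) are treated as needing no action.

```rust
#[derive(Debug, PartialEq)]
enum Action {
    NoAction,
    FlagForReview,
    InvestigateRootCause,
    BlockMerge,
}

/// Percentage change of `new` relative to `baseline`; positive means slower.
fn percent_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline * 100.0
}

/// Maps a slowdown percentage to an action.
/// Assumptions: 10% and 20% both fall in the "investigate" band,
/// and speedups (negative values) require no action.
fn action_for(change_pct: f64) -> Action {
    if change_pct < 5.0 {
        Action::NoAction
    } else if change_pct < 10.0 {
        Action::FlagForReview
    } else if change_pct <= 20.0 {
        Action::InvestigateRootCause
    } else {
        Action::BlockMerge
    }
}

fn main() {
    // Hypothetical mean times in microseconds, for illustration only.
    let (baseline, new) = (100.0, 112.0);
    let change = percent_change(baseline, new);
    println!("{change:+.1}% -> {:?}", action_for(change));
}
```

Making the boundary handling explicit in code avoids ambiguity at the edges of the bands, where the table's ranges touch.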

Coverage Methodology

Test coverage is measured using cargo-llvm-cov:

# Generate coverage report
cargo llvm-cov --workspace --html

# View report (open is macOS-only; use xdg-open on Linux or start on Windows)
open target/llvm-cov/html/index.html

Coverage by Tier

Tier	Crates	Target	Actual
Tier 1 (Critical)	ecs, core, ai, render	80%	75.3%
Tier 2 (Important)	physics, nav, gameplay	75%	72.6%
Tier 3 (Supporting)	audio, scene, terrain	70%	71.8%
Tier 4 (Specialized)	fluids, llm, prompts	65%	71.5%

Per-Crate Coverage (verified January 2026)

Crate	Coverage	Status
astraweave-ecs	83.2%	✅
astraweave-core	79.1%	✅
astraweave-ai	71.3%	✅
astraweave-render	67.4%	✅
astraweave-physics	76.8%	✅
astraweave-fluids	94.2%	✅ A+
astraweave-nav	72.1%	✅
astraweave-gameplay	68.9%	✅
astraweave-terrain	71.5%	✅
astraweave-audio	69.2%	✅
astraweave-scene	74.6%	✅
astraweave-llm	58.3%	⚠️ Beta
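The tier figures are consistent with an unweighted mean of the per-crate values, although the source does not state how they are computed. Under that assumption, the check against each tier's target can be sketched as follows. The helper names tier_average and meets_target are hypothetical, and Tier 4 is omitted because the prompts crate has no per-crate entry.

```rust
// Per-crate coverage percentages copied from the tables above.
const TIER_1: &[(&str, f64)] = &[
    ("astraweave-ecs", 83.2),
    ("astraweave-core", 79.1),
    ("astraweave-ai", 71.3),
    ("astraweave-render", 67.4),
];
const TIER_2: &[(&str, f64)] = &[
    ("astraweave-physics", 76.8),
    ("astraweave-nav", 72.1),
    ("astraweave-gameplay", 68.9),
];
const TIER_3: &[(&str, f64)] = &[
    ("astraweave-audio", 69.2),
    ("astraweave-scene", 74.6),
    ("astraweave-terrain", 71.5),
];

/// Unweighted mean of per-crate coverage (an assumption about how tiers aggregate).
fn tier_average(crates: &[(&str, f64)]) -> f64 {
    assert!(!crates.is_empty(), "a tier needs at least one crate");
    crates.iter().map(|(_, pct)| pct).sum::<f64>() / crates.len() as f64
}

fn meets_target(average: f64, target: f64) -> bool {
    average >= target
}

fn main() {
    // (tier name, target %, crates in the tier)
    let tiers = [
        ("Tier 1", 80.0, TIER_1),
        ("Tier 2", 75.0, TIER_2),
        ("Tier 3", 70.0, TIER_3),
    ];
    for (name, target, crates) in tiers {
        let avg = tier_average(crates);
        let status = if meets_target(avg, target) {
            "meets target"
        } else {
            "below target"
        };
        println!("{name}: average {avg:.1}% against a {target:.0}% target ({status})");
    }
}
```

With the numbers above, Tiers 1 and 2 fall short of their targets while Tier 3 meets its target.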

Reproducing Results

Quick Verification

# Verify ECS benchmarks match documentation
cargo bench -p astraweave-ecs -- entity_spawn/empty/10000

# Expected: ~645µs (±10%)

Full Reproduction

Clone repository at documented commit
Run ./scripts/benchmark_odyssey.ps1
Compare benchmark_results/*/ against documented values
A deviation of more than 20% from the documented values usually indicates an environment difference (hardware, OS, or toolchain)

AstraWeave: AI-Native Game Engine Documentation