Methodology · ship-gate math · refreshed 2026-05-20
Mejepa ships when prediction-oracle Pearson correlation ρ ≥ 0.95, stable across four rolling windows, per cell. Current ρ = 0.866667. Here is how the number is computed, what it is measured against, and how an outside auditor can replay it.
Computed across all 300 SWE-bench Lite tasks. Single training window.
Four training cycles each must independently exceed the threshold. One high window does not fire the gate.
11 mutation categories × N supported languages. Every cell must independently pass. Ship with Python; expand language by language.
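As a rough sketch of the gate logic described above: every cell must show ρ at or above 0.95 in each of four windows before the gate fires. The function names, the `(category, language)` cell key, and the choice to read "four rolling windows" as the four most recent ones are illustrative assumptions, not Mejepa's actual implementation.

```python
THRESHOLD = 0.95        # ship-gate threshold on rho, from the text above
REQUIRED_WINDOWS = 4    # number of windows that must each pass

def cell_passes(window_rhos, threshold=THRESHOLD, required=REQUIRED_WINDOWS):
    """Return True only if the last `required` windows each meet the threshold.

    Assumption: "rolling windows" means the most recent `required` entries.
    """
    recent = window_rhos[-required:]
    return len(recent) == required and all(r >= threshold for r in recent)

def gate_fires(cells):
    """`cells` maps a hypothetical (category, language) key to its rho history.

    The gate fires only when every cell passes independently.
    """
    return bool(cells) and all(cell_passes(rhos) for rhos in cells.values())
```

Under these assumptions, a single window at ρ = 0.866667 keeps its cell, and therefore the whole gate, closed regardless of how strong the other three windows are.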
p_i ∈ [0, 1].
y_i ∈ {0, 1} is determined by running the patched repository against the official Docker test suite, the exact procedure published by Princeton NLP for SWE-bench.
ρ = Cov(p, y) / (σ_p σ_y) is computed across the full 300-task set.
Mejepa stratifies SWE-bench Lite tasks into binary-doctrine mutation categories so the predictor cannot pass the gate by being strong on the common categories and weak on the rare ones. Q4 surfaces (performance regressions, reasoning-class, latent-bug subjective grading) are formally retired as wontfix-ambiguity-boundary. The list below is the FSV-bounded subset:
The full canonical mutation taxonomy resolves through the panel-slot ↔ failure-mode mapping in the FSV plan §1.1 + §1.4. The list above is illustrative; the binding taxonomy is the registry, not this enumeration.
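For readers who want to check the arithmetic, the correlation ρ = Cov(p, y) / (σ_p σ_y) between predicted probabilities and binary test outcomes, computed per mutation category, might be sketched as follows. The record layout `(category, p, y)` and the function names are assumptions made for illustration; the binding category names live in the registry and are not reproduced here.

```python
import math
from collections import defaultdict

def pearson_rho(p, y):
    """Pearson correlation between predictions p and outcomes y."""
    n = len(p)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(p) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y)) / n
    var_p = sum((a - mean_p) ** 2 for a in p) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_p == 0 or var_y == 0:
        # rho is undefined when either side is constant
        raise ValueError("rho is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_y)

def rho_by_category(records):
    """Group hypothetical (category, p, y) records and compute rho per group."""
    groups = defaultdict(lambda: ([], []))
    for category, p, y in records:
        groups[category][0].append(p)
        groups[category][1].append(y)
    return {cat: pearson_rho(ps, ys) for cat, (ps, ys) in groups.items()}
```

Computing ρ per category rather than only over the pooled set is what keeps a predictor from hiding a weak rare category behind a strong common one.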
SWE-bench has three published variants and we picked one deliberately:
SWE-bench Lite has wider mutation-category coverage, a published community baseline, and a clear failure-mode taxonomy. It is the benchmark a deterministic verifier must beat to claim it is not just another LLM judge.
The ship-gate ρ measures point-prediction accuracy. The verdict itself uses a separate calibration step: split conformal prediction.
|p_i − y_i|.
[p − q_0.95, p + q_0.95].
Abstain: neither Pass nor Fail can be claimed with 95% coverage.
OutOfDistribution regardless of the interval.
Conformal prediction is what gives Abstain a mathematical floor. It is not a heuristic; it is a coverage guarantee. Reference: Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005).
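A minimal sketch of split conformal calibration in this setting, under stated assumptions: residuals |p_i − y_i| are collected on a held-out calibration set, their 0.95 quantile q sets the interval half-width, and a verdict is claimed only when the whole interval sits on one side of a decision cutoff. The cutoff of 0.5, the `in_distribution` flag, and the function names are hypothetical; the text does not specify how Pass and Fail are read off the interval or how out-of-distribution inputs are detected. The quantile uses the standard finite-sample rank ceil((n + 1)(1 − α)).

```python
import math

def conformal_quantile(p_cal, y_cal, alpha=0.05):
    """Finite-sample (1 - alpha) quantile of absolute residuals |p_i - y_i|."""
    scores = sorted(abs(p - y) for p, y in zip(p_cal, y_cal))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few calibration points to guarantee coverage at this level.
        return math.inf
    return scores[k - 1]

def verdict(p, q, cutoff=0.5, in_distribution=True):
    """Map a prediction p and half-width q to a verdict.

    `cutoff` and `in_distribution` are illustrative assumptions.
    """
    if not in_distribution:
        return "OutOfDistribution"
    lo, hi = p - q, p + q
    if lo > cutoff:
        return "Pass"
    if hi < cutoff:
        return "Fail"
    return "Abstain"  # interval straddles the cutoff
```

With too few calibration points the quantile is infinite, so every in-distribution prediction abstains. That behavior follows from the coverage guarantee rather than from a tuned heuristic.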
mejepa replay --window N against the audited training window.
No part of this requires trusting Mejepa. The corpus is public, the key is public, the procedure is published. Trust is the chain, not the company.
Panel A is measured weekly on the 8/30 holdout. Panel B (cross-panel, #405) is the sole p0 blocker. There is no "soon" — the gate fires when the number fires.