Methodology · ship-gate math · refreshed 2026-05-20

The math behind the gate.

Mejepa ships when prediction-oracle Pearson correlation ρ ≥ 0.95, stable across four rolling windows, per cell. Current ρ = 0.866667. Here is how the number is computed, what it is measured against, and how an outside auditor can replay it.

THE GATE

One metric. Four windows. Eleven cells.

ρ ≥ 0.95
Pearson correlation threshold

Computed across all 300 SWE-bench Lite tasks. Single training window.

4
Consecutive rolling windows

Each of four consecutive training cycles must independently meet the threshold. One high window does not fire the gate.

11 × N
Per-cell stratification

11 mutation categories × N supported languages. Every cell must independently pass. Ship with Python; expand language by language.

The computation

  1. For each SWE-bench Lite task i, Mejepa emits a predicted oracle pass probability p_i ∈ [0, 1].
  2. The actual oracle outcome y_i ∈ {0, 1} is determined by running the patched repository against the official Docker test suite, following the exact procedure published by Princeton NLP for SWE-bench.
  3. Pearson ρ = Cov(p, y) / (σ_p σ_y) is computed across the full 300-task set.
  4. The same ρ is computed for each (mutation_category × language) cell independently — 11 mutation categories, 1 language at ship time.
  5. The gate fires when the aggregate ρ and every per-cell ρ are all at least 0.95 in each of four consecutive training windows.
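The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not Mejepa's implementation: the record layout (category, language, p_i, y_i) and the function names are assumptions, and so is the choice to treat an undefined ρ (a cell whose predictions or outcomes are all identical) as a failing cell.

```python
from collections import defaultdict
from math import sqrt

THRESHOLD = 0.95       # gate threshold stated in the text
REQUIRED_WINDOWS = 4   # consecutive windows stated in the text


def pearson(p, y):
    """Pearson rho = Cov(p, y) / (sigma_p * sigma_y); None when undefined."""
    n = len(p)
    if n < 2 or n != len(y):
        return None
    mean_p = sum(p) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_p == 0 or var_y == 0:
        return None  # constant predictions or outcomes: rho is undefined
    # The 1/n factors in covariance and variances cancel.
    return cov / sqrt(var_p * var_y)


def window_rhos(records):
    """records: (category, language, p_i, y_i) tuples for one training window.

    Returns (aggregate_rho, {(category, language): rho}).
    """
    records = list(records)
    cells = defaultdict(lambda: ([], []))
    for category, language, p_i, y_i in records:
        cells[(category, language)][0].append(p_i)
        cells[(category, language)][1].append(y_i)
    aggregate = pearson([r[2] for r in records], [r[3] for r in records])
    per_cell = {cell: pearson(ps, ys) for cell, (ps, ys) in cells.items()}
    return aggregate, per_cell


def window_passes(aggregate, per_cell, threshold=THRESHOLD):
    """A window passes only if the aggregate and every cell reach the threshold.

    Assumption: an undefined rho (None) counts as a failure.
    """
    values = [aggregate, *per_cell.values()]
    return all(v is not None and v >= threshold for v in values)


def gate_fires(windows, required=REQUIRED_WINDOWS):
    """windows: chronological list of per-window record lists."""
    if len(windows) < required:
        return False
    return all(window_passes(*window_rhos(w)) for w in windows[-required:])
```

Because y is binary, this ρ is the point-biserial correlation between the predicted probability and the oracle outcome. Only the most recent four windows are inspected, so a single failing window resets the count.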

Mutation-category stratification

Mejepa stratifies SWE-bench Lite tasks into binary-doctrine mutation categories so the predictor cannot pass the gate by being strong on the common categories and weak on the rare ones. Q4 surfaces (performance regressions, reasoning-class, latent-bug subjective grading) are formally retired as wontfix-ambiguity-boundary. The list below is the FSV-bounded subset:

  1. TypeError mutations — wrong type, None handling, type-coercion bugs
  2. ImportError mutations — missing imports, circular imports, renamed modules
  3. Off-by-one — boundary, slicing, range, len() math
  4. Hidden state mutations — class-level state, default-arg mutation, closure captures
  5. Logic errors — wrong condition, inverted boolean, short-circuit
  6. API misuse — wrong call signature, deprecated method, parameter order
  7. Concurrency / async — race conditions, missing await, GIL assumptions
  8. Edge cases — empty input, single-element, max-size, unicode
  9. Security regressions — injection, traversal, deserialization, secret exposure
  10. Spec drift — docstring divergence, schema change, breaking interface

The full canonical mutation taxonomy resolves through the panel-slot ↔ failure-mode mapping in the FSV plan §1.1 and §1.4. The list above is illustrative and shows ten entries, while the gate counts 11 categories; the binding taxonomy is the registry, not this enumeration.
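A small worked example shows why per-cell stratification matters. The data below is synthetic and invented purely for illustration: one large cell where predictions track the oracle closely, and one four-task cell where they point the wrong way.

```python
from math import sqrt


def pearson(p, y):
    """Pearson rho = Cov(p, y) / (sigma_p * sigma_y) for non-constant inputs."""
    n = len(p)
    mean_p, mean_y = sum(p) / n, sum(y) / n
    cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_p * var_y)


# Synthetic data, not drawn from SWE-bench Lite.
# Common cell: 196 tasks, predictions track the outcome closely.
common_p = [0.99] * 98 + [0.01] * 98
common_y = [1] * 98 + [0] * 98
# Rare cell: 4 tasks, predictions point the wrong way.
rare_p = [0.9, 0.9, 0.1, 0.1]
rare_y = [0, 0, 1, 1]

aggregate_rho = pearson(common_p + rare_p, common_y + rare_y)
common_rho = pearson(common_p, common_y)
rare_rho = pearson(rare_p, rare_y)

print(f"aggregate {aggregate_rho:.3f}, common {common_rho:.3f}, rare {rare_rho:.3f}")
```

On this data the aggregate ρ clears 0.95 even though the rare cell is negatively correlated. An aggregate-only gate would fire; the per-cell gate does not.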

WHY THIS NUMBER, NOT ANOTHER

SWE-bench Lite, not Verified, not SWE-bench+.

SWE-bench has three published variants and we picked one deliberately.

SWE-bench Lite has wider mutation-category coverage, a published community baseline, and a clear failure-mode taxonomy. It is the benchmark a deterministic verifier must beat to claim it is not just another LLM judge.

CONFORMAL CALIBRATION

An honest verdict, not a confident one.

The ship-gate ρ measures point-prediction accuracy. The verdict itself uses a separate calibration step: split conformal prediction.

  1. Reserve 20% of the corpus as a calibration set, stratified per cell.
  2. For each calibration point, compute the residual |p_i − y_i|.
  3. Sort residuals; pick the 95th percentile q_0.95.
  4. At inference time, emit the interval [p − q_0.95, p + q_0.95].
  5. If the interval contains 0.5, the verdict is Abstain — neither Pass nor Fail can be claimed with 95% coverage.
  6. If the patch's embedding sits more than τ standard deviations from every cluster centroid, including its nearest one, the verdict is OutOfDistribution regardless of the interval.

Conformal prediction is what gives Abstain a mathematical floor. It is not a heuristic; it is a marginal coverage guarantee that holds whenever calibration and inference points are exchangeable. Reference: Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005).
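The calibration steps can be sketched as follows. This is a hedged illustration with hypothetical function names, and it makes three assumptions. It uses the standard finite-sample rank ⌈(n + 1) · 0.95⌉ instead of a plain 95th percentile; the two nearly coincide on large calibration sets. It maps an interval entirely above 0.5 to Pass and one entirely below to Fail. It takes the out-of-distribution decision as a precomputed flag, because the embedding and clustering are not specified here.

```python
from math import ceil


def conformal_quantile(residuals, coverage=0.95):
    """Split-conformal quantile of calibration residuals |p_i - y_i|."""
    n = len(residuals)
    if n == 0:
        raise ValueError("calibration set is empty")
    rank = ceil((n + 1) * coverage)  # 1-based rank with the (n + 1) correction
    if rank > n:
        # Too few calibration points for this coverage level. Residuals of
        # probabilities against 0/1 outcomes never exceed 1, so 1.0 is the
        # widest interval that is ever needed.
        return 1.0
    return sorted(residuals)[rank - 1]


def calibrate(pairs, coverage=0.95):
    """pairs: iterable of (p_i, y_i) from the held-out calibration set."""
    return conformal_quantile([abs(p - y) for p, y in pairs], coverage)


def verdict(p, q, out_of_distribution=False):
    """Map a prediction p and calibrated half-width q to a verdict."""
    if out_of_distribution:
        return "OutOfDistribution"  # overrides the interval, per step 6
    low, high = p - q, p + q
    if low <= 0.5 <= high:
        return "Abstain"
    # Assumed mapping: interval entirely above 0.5 is Pass, below is Fail.
    return "Pass" if low > 0.5 else "Fail"
```

For example, with q = 0.2 a prediction of 0.9 yields Pass, 0.6 yields Abstain, and 0.1 yields Fail. Calibrating per cell, as step 1 describes, would mean calling calibrate once per (mutation_category × language) cell and looking up the matching q at inference time.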

REPLAY PROCEDURE

An outside auditor can rerun this.

  1. Pull the published corpus from github.com/mejepa.
  2. Pull the public verification key from mejepa.com/keys.
  3. Run mejepa replay --window N against the audited training window.
  4. The tool recomputes the aggregate and per-cell ρ, verifies every verdict's Ed25519 signature, and walks the SHAKE-256 witness chain.
  5. Output: a signed audit report stating "the published ρ matches the corpus, every verdict is signed by the published key, the chain is unbroken."

No part of this requires trusting Mejepa. The corpus is public, the key is public, the procedure is published. Trust is the chain, not the company.
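Walking a hash chain is simple enough to sketch with the Python standard library. The chain layout here is entirely hypothetical: each link is assumed to be SHAKE-256(previous link ‖ payload) truncated to 32 bytes, starting from an all-zero genesis link. The actual witness format is whatever the replay tool defines. Ed25519 verification is left out because it needs a third-party library.

```python
import hashlib

DIGEST_BYTES = 32              # assumed; SHAKE-256 output length is caller-chosen
GENESIS = bytes(DIGEST_BYTES)  # assumed all-zero starting link


def link(prev, payload):
    """Hash the previous link together with this entry's payload."""
    return hashlib.shake_256(prev + payload).digest(DIGEST_BYTES)


def build_chain(payloads):
    """Return [(payload, link)] where each link commits to all earlier entries."""
    chain, prev = [], GENESIS
    for payload in payloads:
        prev = link(prev, payload)
        chain.append((payload, prev))
    return chain


def first_broken_link(chain):
    """Recompute every link; return the index of the first mismatch, or None."""
    prev = GENESIS
    for index, (payload, recorded) in enumerate(chain):
        if link(prev, payload) != recorded:
            return index
        prev = recorded
    return None
```

Because each link folds in the one before it, altering any verdict payload changes every later link. An auditor who recomputes the chain from the public corpus will therefore see the first tampered entry.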

REFERENCES
  1. Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Princeton NLP, ICLR 2024. arxiv.org/abs/2310.06770
  2. SWE-Bench+: Enhanced Coding Benchmark for LLMs. OpenReview, 2025. 47.93% of "resolved" instances passed weak tests but were not actually correct; resolution rate drops from 42.1% to 21.8% after filtering plausible-but-broken patches.
  3. Vovk, Gammerman, Shafer. Algorithmic Learning in a Random World. Springer, 2005. Foundational text on conformal prediction.
  4. LeCun. A Path Towards Autonomous Machine Intelligence, 2022. The JEPA position paper. openreview.net
  5. Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas. I-JEPA. CVPR 2023. arxiv.org/abs/2301.08243
  6. Anthropic. Model Context Protocol Specification, 2024. modelcontextprotocol.io

Watch the number cross the gate.

Panel A is measured weekly on the 8/30 holdout. Panel B (cross-panel, #405) is the sole p0 blocker. There is no "soon" — the gate fires when the number fires.