Why SWE-bench Lite and not SWE-bench Verified?

SWE-bench Lite has wider failure-mode coverage across mutation categories, a published baseline, and existing community calibration. SWE-bench Verified is narrower and filters for tractability — useful but less representative of the real-world distribution Mejepa is calibrated against.

How is Pearson ρ computed?

For each SWE-bench Lite task, Mejepa emits a predicted oracle pass probability p ∈ [0, 1]. The actual oracle outcome y ∈ {0, 1} is determined by running the patched repo against the official Docker test suite. Pearson ρ is computed across the full 300-task set after each training window.

What does 'four-window stable' mean?

Four consecutive training windows must each independently produce ρ ≥ 0.95. A single high window does not fire the gate — variance across windows would indicate the predictor is overfitting to recent data. Four-window stability is the minimum to declare convergence.
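
As a minimal sketch of the stability rule, assuming the per-window ρ values are kept in a simple chronological list (the function name gate_fires and the list layout are hypothetical, not part of any published Mejepa tooling):

```python
from typing import Sequence

THRESHOLD = 0.95  # rho threshold stated for the gate
WINDOWS = 4       # number of consecutive windows required


def gate_fires(rho_history: Sequence[float],
               threshold: float = THRESHOLD,
               windows: int = WINDOWS) -> bool:
    """Return True only if the most recent `windows` values all meet the threshold."""
    if len(rho_history) < windows:
        return False  # not enough windows yet to declare stability
    return all(rho >= threshold for rho in rho_history[-windows:])


# One high window is not enough; the last four must all qualify.
print(gate_fires([0.99]))
print(gate_fires([0.96, 0.96, 0.94, 0.96]))
print(gate_fires([0.90, 0.96, 0.95, 0.97, 0.99]))
```

A single dip below the threshold inside the last four windows resets the count, which is the behavior the answer above describes.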

What is per-cell stratification?

The ship gate is computed not just on the aggregate ρ but on each (mutation_category × language) cell. There are 11 mutation categories (TypeError, ImportError, off-by-one, hidden state mutation, etc.) and 1 language at ship time (Python). All 11 cells must independently satisfy ρ ≥ 0.95. Each additional language adds a new row of 11 cells.

Methodology · ship-gate math · refreshed 2026-05-20

The math behind the gate.

Mejepa ships when prediction-oracle Pearson correlation ρ ≥ 0.95, stable across four rolling windows, per cell. Current ρ = 0.866667. Here is how the number is computed, what it is measured against, and how an outside auditor can replay it.


THE GATE

One metric. Four windows. Eleven cells.

ρ ≥ 0.95

Pearson correlation threshold

Computed across all 300 SWE-bench Lite tasks. Single training window.

4×

Consecutive rolling windows

Each of four consecutive training cycles must independently meet the threshold. One high window does not fire the gate.

11 × N

Per-cell stratification

11 mutation categories × N supported languages. Every cell must independently pass. Ship with Python; expand language by language.

The computation

For each SWE-bench Lite task i, Mejepa emits a predicted oracle pass probability p_i ∈ [0, 1].
The actual oracle outcome y_i ∈ {0, 1} is determined by running the patched repository against the official Docker test suite — the exact procedure published by Princeton NLP for SWE-bench.
Pearson ρ = Cov(p, y) / (σ_p σ_y) is computed across the full 300-task set.
The same ρ is computed for each (mutation_category × language) cell independently — 11 mutation categories, 1 language at ship time.
The gate fires when the aggregate ρ and every per-cell ρ are all at least 0.95 in each of four consecutive training windows.
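
The steps above can be sketched for a single training window as follows. This is an illustration under stated assumptions, not Mejepa's implementation: the record layout (category, language, p_i, y_i) and the function names pearson and window_passes are hypothetical, and a cell whose predictions or outcomes are constant is treated as failing because ρ is undefined there.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Cell = Tuple[str, str]                  # (mutation_category, language)
Record = Tuple[str, str, float, int]    # (category, language, p_i, y_i); assumed layout


def pearson(p: List[float], y: List[int]) -> Optional[float]:
    """rho = Cov(p, y) / (sigma_p * sigma_y); None when rho is undefined."""
    n = len(p)
    if n < 2 or n != len(y):
        return None
    mean_p = sum(p) / n
    mean_y = sum(y) / n
    # The 1/n factors in Cov and the sigmas cancel, so raw sums suffice.
    cov = sum((a - mean_p) * (b - mean_y) for a, b in zip(p, y))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_p == 0 or var_y == 0:
        return None  # constant predictions or constant outcomes
    return cov / math.sqrt(var_p * var_y)


def window_passes(records: Iterable[Record], threshold: float = 0.95):
    """Return (passes, aggregate_rho, per_cell_rho) for one training window."""
    rows = list(records)
    cells: Dict[Cell, Tuple[List[float], List[int]]] = defaultdict(lambda: ([], []))
    for category, language, p, y in rows:
        cells[(category, language)][0].append(p)
        cells[(category, language)][1].append(y)
    aggregate = pearson([r[2] for r in rows], [r[3] for r in rows])
    per_cell = {cell: pearson(ps, ys) for cell, (ps, ys) in cells.items()}
    rhos = [aggregate, *per_cell.values()]
    passes = all(r is not None and r >= threshold for r in rhos)
    return passes, aggregate, per_cell
```

Running window_passes once per window and requiring four consecutive passing results gives the full gate. Grouping by cell is what prevents a strong aggregate from hiding a weak rare category.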

Mutation-category stratification

Mejepa stratifies SWE-bench Lite tasks into binary-doctrine mutation categories so the predictor cannot pass the gate by being strong on the common categories and weak on the rare ones. Q4 surfaces (performance regressions, reasoning-class, latent-bug subjective grading) are formally retired as wontfix-ambiguity-boundary. The list below is the FSV-bounded subset:

TypeError mutations — wrong type, None handling, type-coercion bugs
ImportError mutations — missing imports, circular imports, renamed modules
Off-by-one — boundary, slicing, range, len() math
Hidden state mutations — class-level state, default-arg mutation, closure captures
Logic errors — wrong condition, inverted boolean, short-circuit
API misuse — wrong call signature, deprecated method, parameter order
Concurrency / async — race conditions, missing await, GIL assumptions
Edge cases — empty input, single-element, max-size, unicode
Security regressions — injection, traversal, deserialization, secret exposure
Spec drift — docstring divergence, schema change, breaking interface

The full canonical mutation taxonomy resolves through the panel-slot ↔ failure-mode mapping in the FSV plan §1.1 + §1.4. The list above is illustrative and names ten of the eleven categories; the binding taxonomy is the registry, not this enumeration.

WHY THIS NUMBER, NOT ANOTHER

SWE-bench Lite, not Verified, not SWE-Bench+.

SWE-bench has three published variants and we picked one deliberately:

SWE-bench full: 2,294 tasks. Too large for tight iteration cycles; coverage is non-uniform across mutation categories.
SWE-bench Verified: 500 tasks, hand-filtered for tractability. Useful but narrower; filters out the failure modes Mejepa most needs to grade.
SWE-bench Lite: 300 tasks. The SWE-Bench+ paper on OpenReview reported that 47.93% of "resolved" instances passed weak tests but were actually broken under stronger ones, which is exactly the failure class Mejepa exists to catch (see the SWE-Bench+ entry under References).

SWE-bench Lite has wider mutation-category coverage, a published community baseline, and a clear failure-mode taxonomy. It is the benchmark a deterministic verifier must beat to claim it is not just another LLM judge.

CONFORMAL CALIBRATION

An honest verdict, not a confident one.

The ship-gate ρ measures point-prediction accuracy. The verdict itself uses a separate calibration step: split conformal prediction.

Reserve 20% of the corpus as a calibration set, stratified per cell.
For each calibration point, compute the residual |p_i − y_i|.
Sort residuals; pick the 95th percentile q_0.95.
At inference time, emit the interval [p − q_0.95, p + q_0.95].
If the interval contains 0.5, the verdict is Abstain — neither Pass nor Fail can be claimed with 95% coverage.
If the patch's embedding sits more than τ standard deviations from every cluster centroid (that is, even its nearest centroid is more than τ away), the verdict is OutOfDistribution regardless of the interval.

Conformal prediction is what gives Abstain a mathematical floor. It is not a heuristic; it is a coverage guarantee, valid when calibration and inference points are exchangeable. Reference: Vovk, Gammerman, Shafer, Algorithmic Learning in a Random World (Springer, 2005).
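
The calibration steps can be sketched as below. Several points are assumptions rather than anything stated above: the quantile uses the standard finite-sample correction from split conformal prediction (the ceil((n + 1) × 0.95)-th smallest residual, not the plain empirical 95th percentile), an interval entirely above 0.5 is mapped to Pass and one entirely below to Fail, and the out-of-distribution check is reduced to a boolean flag because the embedding and clustering details are not described here. All function names are hypothetical.

```python
import math
from typing import Sequence


def conformal_quantile(residuals: Sequence[float], coverage: float = 0.95) -> float:
    """Finite-sample-corrected quantile: the ceil((n + 1) * coverage)-th smallest residual."""
    n = len(residuals)
    if n == 0:
        raise ValueError("calibration set is empty")
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        return float("inf")  # too few calibration points to support this coverage
    return sorted(residuals)[rank - 1]


def calibrate(p_cal: Sequence[float], y_cal: Sequence[int],
              coverage: float = 0.95) -> float:
    """Compute q from calibration residuals |p_i - y_i|."""
    residuals = [abs(p - y) for p, y in zip(p_cal, y_cal)]
    return conformal_quantile(residuals, coverage)


def verdict(p: float, q: float, out_of_distribution: bool = False) -> str:
    """Map a prediction and interval half-width q to a verdict label."""
    if out_of_distribution:
        return "OutOfDistribution"  # takes precedence over the interval
    lo, hi = p - q, p + q
    if lo <= 0.5 <= hi:
        return "Abstain"            # interval straddles the decision boundary
    return "Pass" if lo > 0.5 else "Fail"  # assumed mapping, see lead-in
```

With very small calibration sets the corrected rank exceeds n, q becomes infinite, and every in-distribution verdict is Abstain. That is the conservative behavior a coverage guarantee requires.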

REPLAY PROCEDURE

An outside auditor can rerun this.

Pull the published corpus from github.com/mejepa.
Pull the public verification key from mejepa.com/keys.
Run mejepa replay --window N against the audited training window.
The tool recomputes the aggregate and per-cell ρ, verifies every verdict's Ed25519 signature, and walks the SHAKE-256 witness chain.
Output: a signed audit report stating "the published ρ matches the corpus, every verdict is signed by the published key, the chain is unbroken."

No part of this requires trusting Mejepa. The corpus is public, the key is public, the procedure is published. Trust is the chain, not the company.
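
To illustrate what walking a hash chain involves, here is a minimal sketch using only the Python standard library. The chain format is an assumption made for this example (each link is SHAKE-256 over the previous link concatenated with the record payload, with a 32-byte digest and an all-zero genesis link); the actual Mejepa witness format is not specified above. Ed25519 signature verification is omitted because it needs a third-party library.

```python
import hashlib
from typing import Iterable, List, Tuple

DIGEST_BYTES = 32                  # assumed length; SHAKE-256 output length is caller-chosen
GENESIS = b"\x00" * DIGEST_BYTES   # assumed starting link

ChainRecord = Tuple[bytes, bytes]  # (payload, recorded_link); hypothetical layout


def link(prev: bytes, payload: bytes) -> bytes:
    """Derive the next link from the previous link and this record's payload."""
    return hashlib.shake_256(prev + payload).digest(DIGEST_BYTES)


def build_chain(payloads: Iterable[bytes]) -> List[ChainRecord]:
    """Build a chain so the sketch has something to verify."""
    chain, prev = [], GENESIS
    for payload in payloads:
        prev = link(prev, payload)
        chain.append((payload, prev))
    return chain


def first_broken_index(chain: List[ChainRecord]) -> int:
    """Return the index of the first record whose link does not match, or -1 if unbroken."""
    prev = GENESIS
    for i, (payload, recorded) in enumerate(chain):
        if link(prev, payload) != recorded:
            return i
        prev = recorded
    return -1


chain = build_chain([b"verdict-1", b"verdict-2", b"verdict-3"])
print(first_broken_index(chain))   # unbroken chain
chain[1] = (b"tampered", chain[1][1])
print(first_broken_index(chain))   # tampering detected at the altered record
```

Because each link depends on every earlier payload, altering any record invalidates that record's link, which is why an unbroken walk is meaningful evidence.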

REFERENCES

Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Princeton NLP, ICLR 2024. arxiv.org/abs/2310.06770
SWE-Bench+: Enhanced Coding Benchmark for LLMs. OpenReview, 2025. 47.93% of "resolved" instances passed weak tests but were not actually correct; resolution rate drops from 42.1% to 21.8% after filtering plausible-but-broken patches.
Vovk, Gammerman, Shafer. Algorithmic Learning in a Random World. Springer, 2005. Foundational text on conformal prediction.
LeCun. A Path Towards Autonomous Machine Intelligence. 2022. The JEPA position paper. openreview.net
Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA). CVPR 2023. arxiv.org/abs/2301.08243
Anthropic. Model Context Protocol Specification. 2024. modelcontextprotocol.io