Pre-registration changed how I read my own results

There's a small ritual I keep falling back into. Run an experiment, look at the result, then narrate the experiment in the past tense — as if the result had been the thing I expected all along. The chart looks tidy in retrospect. The hypothesis cleans itself up. The Discussion section, by the time I'm done writing it, is a model of post hoc clarity.

Pre-registration is the thing that broke me of this. Not all the way. Most of the way.

The premise is dumb on its face: write down what you predict before you run the experiment. Save the document. Make it timestamped, ideally in a place you can't quietly edit later. Then run the experiment. After, compare what happened to what you wrote. This sounds like ceremony. It mostly isn't.
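
As an illustration only, here is a minimal Python sketch of what "timestamped, in a place you can't quietly edit" can mean mechanically. The function names and the example prediction string are hypothetical, not taken from any real workflow described here. A locally generated timestamp proves little by itself; the assumption is that the resulting hash gets posted somewhere outside your control, such as a pushed commit or an email to a colleague.

```python
import hashlib
import json
from datetime import datetime, timezone


def seal_preregistration(text: str) -> dict:
    """Bind the exact prediction text to a UTC timestamp via its SHA-256 hash."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "sha256": digest,
        "sealed_at": datetime.now(timezone.utc).isoformat(),
    }


def verify_preregistration(text: str, record: dict) -> bool:
    """Return True only if the text is byte-for-byte what was sealed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest() == record["sha256"]


# Hypothetical prediction text, written before the experiment runs.
prediction = "A: MS MARCO retrieval drops >= 5 pp at epoch 50."
record = seal_preregistration(prediction)
print(json.dumps(record, indent=2))
```

Any later edit to the prediction, even a single character, changes the hash, so `verify_preregistration` fails against the stored record.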

The most useful pre-registration I wrote this year was for a follow-up to a finding I was nervous about. We had observed that fine-tuning E5-base-v2 on Wikipedia with a masked-language-modeling (MLM) LoRA damaged retrieval on MS MARCO by about 10 percentage points. The natural objection was distribution shift: Wikipedia is out-of-distribution (OOD) relative to MS MARCO, so what looks like “MLM damages retrieval” might really just be “the model is adapting to a different topic mix and forgetting the relevant one.”

To rule that out, we needed a corpus control: train the MLM LoRA on the same MS MARCO passages that the contrastive LoRA was trained on, and see if the damage persists. Before running it, I wrote out three hypotheses:

A. Objective-driven. MS MARCO retrieval drops ≥ 5 pp at epoch 50. The loss is doing the damage; the corpus is irrelevant.
B. Corpus-driven. Retrieval stays within the base model's 95% CI. The Wikipedia result was about distribution shift.
C. Mixed. Something in between: a drop of 1–5 pp.

The 5 pp threshold for A wasn't arbitrary, but it also wasn't tuned to the data; I picked it before I had seen a single seed. What actually happened was a 10.2 pp drop, roughly twice the threshold. Hypothesis A held cleanly: the MLM loss damages retrieval regardless of corpus.
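
The value of fixing the thresholds in advance is that the verdict becomes a mechanical lookup rather than a judgment call. A rough Python sketch of that lookup follows. The function name is hypothetical, and the half-width of the base 95% CI is not stated in this post, so the `0.8` used below is a placeholder assumption and not a reported number.

```python
def classify_outcome(drop_pp: float, base_ci_half_width_pp: float) -> str:
    """Map an observed retrieval drop (in percentage points) to a pre-registered hypothesis.

    drop_pp is positive when retrieval got worse relative to the base model.
    base_ci_half_width_pp is the half-width of the base 95% CI (assumed input).
    """
    if drop_pp >= 5.0:
        return "A"  # objective-driven: drop of at least 5 pp
    if abs(drop_pp) <= base_ci_half_width_pp:
        return "B"  # corpus-driven: indistinguishable from base
    if 1.0 <= drop_pp < 5.0:
        return "C"  # mixed: a drop between 1 and 5 pp
    # Outcomes the written hypotheses do not cover are reported as such
    # instead of being forced into the nearest category.
    return "unclassified"


# 10.2 pp is the drop reported above; 0.8 is a placeholder CI half-width.
print(classify_outcome(10.2, base_ci_half_width_pp=0.8))
```

With the reported 10.2 pp drop, the first branch fires no matter what the placeholder CI half-width is, which is the sense in which the result did not depend on any after-the-fact choice.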

Here's the part that mattered. If I had skipped the pre-registration step and just run the experiment, the 10.2 pp number would have felt the same to me. I would have written the same conclusion. But I would have given myself no way to argue, even internally, that I hadn't tuned the threshold downward after seeing the number. The pre-registration was the thing that turned a result I already believed into a result I could actually defend.

The other pre-registration that mattered was the one we failed.

For the BERT cross-family replication, I wrote down a matched-NPS window of [0.28, 0.36]. The contrastive-LoRA BERT condition lived in that band, so any directly comparable MLM-on-BERT condition should too — otherwise it wouldn't be a matched magnitude test.

The realized NPS for BERT MLM came out at 0.836.

That's nowhere near the window. The pre-registration was explicit that an out-of-window result would invalidate the strict matched-magnitude hypothesis test, and it did. Strictly speaking, the headline claim (“geometric drift magnitude doesn't predict functional outcome”) couldn't be tested in this experiment in the form I'd planned.
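
The gate itself is simple enough to write down ahead of time. A small Python sketch, using the window and the realized value from this post; the function name is hypothetical, and treating the window bounds as inclusive is an assumption.

```python
# Pre-registered matched-NPS window, fixed before the BERT runs.
NPS_WINDOW = (0.28, 0.36)


def matched_magnitude_test_is_valid(nps: float, window=NPS_WINDOW) -> bool:
    """Return True if the realized NPS falls inside the pre-registered window.

    Bounds are treated as inclusive (an assumption).
    """
    low, high = window
    return low <= nps <= high


realized_nps = 0.836  # BERT MLM, as reported above
if matched_magnitude_test_is_valid(realized_nps):
    print("In window: run the matched-magnitude test as planned.")
else:
    print("Out of window: the strict matched-magnitude test is invalidated.")
```

The check has no room for "close enough", which is what forces the out-of-window case to be written up as its own result.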

Without the pre-registration, I would have softened that. I would have written something like: “BERT MLM produces low drift and minimal functional change; BERT contrastive produces high drift and large functional change; the relationship is preserved.” That's a true sentence. It is also a different sentence than the one I'd promised to write.

With the pre-registration, I had to explicitly admit something more interesting: the matched-magnitude framework is not well-defined for a model whose pretraining objective is identical to its fine-tuning objective, because the gradient is approximately zero and nothing moves. The NPS-window violation was itself the finding — a boundary condition on the framework I hadn't seen until the experiment forced me to.

That was a better result than the one I would have written without the constraint. It just didn't feel like one until later.

The thing pre-registration changes is not really which experiments you run. The experiments are the same. What changes is the asymmetry between you-before and you-after.

Before the data, you have a hypothesis. After the data, you have a result and an explanation. If you don't separate those two stages with a timestamp, the explanation will, almost without exception, expand to absorb the result. You will read your data as a confirmation of the thing you now believe.

A timestamped document on disk is a small piece of evidence in a future argument with yourself. That's all. It's not enough to make you honest. It is just enough to make you traceable.