The A/B Test That Ran 3 Months Past Statistical Significance
What experimental discipline reveals about attachment to outcomes
Have a question about this? Bring it to Aurelius.
The A/B Test That Ran 3 Months Past Statistical Significance
47 days after their experiment crossed the threshold of statistical significance, a product team kept the test running for another 91 days. They were not gathering more information. They were waiting for the data to change its mind.
This is not a story about methodology. It is a story about what happens when the will to be right overwhelms the discipline to see clearly — and how that failure compounds silently across quarters until it is visible only in the rearview.
The same force that held this team captive lives inside almost every analytics organization I have encountered. It does not announce itself. It arrives dressed as diligence.
The Discipline That Has to Be Decided Before the Data Arrives
Experimental integrity is not something you exercise in the moment of uncomfortable results. It cannot be. By then, the hypothesis you fell in love with is already whispering reasons to wait one more week.
The teams that move cleanly through experimentation share one habit: they determine their stopping rules — sample size, significance threshold, minimum detectable effect — before the test begins, when reason is uncontaminated by preference. The discipline is not about statistics. It is about deciding, in a calm hour, what you will do when a difficult hour arrives.
"Optional stopping" is what statisticians call the opposite practice: ending an experiment only when the numbers finally say what you hoped they would say. The name is clinical. The experience is more familiar than most teams admit. One week of unfavorable results becomes two. Two becomes eight. The test continues not because there is methodological reason to extend it, but because closing it would require accepting that the checkout redesign, the pricing change, the feature the roadmap was built around — did not work.
The product manager who hypothesized a 15% lift and sees a 3% decline in week two is not looking at a preliminary result. She is looking at a result. The test that follows is not research. It is negotiation with reality — and reality does not negotiate.
Organizations that allow experiments to drift this way make poor product decisions at significantly higher rates than those with disciplined stopping procedures. The financial cost is real. But the deeper cost is structural: every extended test is a signal to the team that the rules bend when the results are inconvenient. That signal is heard. It shapes what people propose next, how they frame hypotheses, which results they surface and which they bury. The corruption spreads quietly, long before it shows up in a roadmap review.
The fix is unglamorous. Write the stopping rules into the experiment brief. Have a second person sign off before the test launches. Make "the test ended because we hit our predetermined threshold" a complete sentence — one that requires no apology and no asterisk. Run Experiments That Survive Your CFO's Budget Scrutiny Next Quarter covers how to structure this rigor in environments where pressure to show results is constant and timelines are always shorter than they should be.
What Aurelius Sees in This
In Book IV, 3 of the Meditations, Marcus writes: "Do not indulge in dreams of what you have not, but count up the chief of the blessings you do have." It is easy to read this as consolation. It is not. It is a precise instruction about where to direct the attention of the hegemonikon — the ruling faculty, the seat of judgment inside you — and what happens when you allow that faculty to be hijacked by appetite.
The Stoics drew a hard line between what is up to us and what is not. This is the dichotomy of control, and it is not a comfort. It is a demand. What is up to us: the criteria we set, the moment we declare the test complete, the judgment we render on the evidence. What is not up to us: whether the evidence confirms what we wanted. The product team that ran their test 91 days past significance had quietly, without ever announcing it, moved their energy from the first category to the second. They were no longer governing their judgment. They were waiting on the world to relent.
This reveals something that most conventional advice about experimentation misses entirely: the problem is not that people lack statistical knowledge. Nearly every team that runs optional stopping past significance knows, somewhere, that they are doing it. The problem is that the hegemonikon — the part of you responsible for clear seeing — has been compromised by attachment to a particular outcome. Marcus would call this a failure of the examined life. Not a methodological failure. A philosophical one.
The Stoic principle of premeditatio malorum — the deliberate rehearsal of difficulty before it arrives — is exactly what pre-registered stopping rules are, translated into experimental practice. You imagine, before the test begins, the moment you will see an unfavorable result. You decide in advance what you will do. You commit to that decision while your judgment is still clean. This is not pessimism. It is the preparation of a commander who knows the battle will not go entirely as planned and refuses to be surprised into paralysis when it doesn't.
This means that the discipline of stopping an experiment on schedule is, in Stoic terms, a practice of virtue — specifically the virtue of justice toward your future self and your organization. Every week you run a test past its threshold is a week you are borrowing false certainty from time, and charging the interest to every decision downstream.
What most people miss here — the harder truth — is that the desire to keep the test running does not feel like weakness. It feels responsible. It feels like thoroughness. The person extending the experiment believes, sincerely, that they are being careful. Marcus would recognize this immediately. In Book II, 16, he warns against the mind that mistakes its own agitation for deliberation. The busyness of waiting is not the same as the work of knowing. Flourishing in an analytics organization, as in a life, requires the willingness to close what is finished and move toward what is next — even when, especially when, what is finished did not deliver what you wanted.
Therefore: the question to ask yourself is not "do we have enough data yet?" It is "have we already decided, and are we now pretending we haven't?" Those are different questions. The second one is the one worth sitting with.
The Organizational Pattern Behind the Single Test
One extended experiment is a mistake. A culture of extended experiments is a structure. It is worth asking which one you have.
The signs of structural drift are specific. Experiments routinely run past their original end dates with no documented reason. Post-mortems focus on what went wrong with the test rather than on what was learned. Teams frame neutral or negative results as "inconclusive" rather than closing them cleanly. Hypotheses get revised mid-test rather than retired and replaced. Each of these, individually, can be explained. Together they describe an organization that has built its decision process around the avoidance of uncomfortable conclusions.
The cost compounds in ways that quarterly reviews rarely capture. Engineering time holds in limbo while results "mature." Roadmap decisions defer to experiments that will never actually close. New ideas wait behind old ones. The opportunity cost is invisible on any dashboard, but it is the most expensive item on the real budget.
Two places where this pattern tends to originate deserve attention. The first is leadership that responds to negative results with blame rather than curiosity. When a team learns, implicitly or explicitly, that a failed test is a career event rather than a data point, optional stopping becomes rational self-protection. The second is the absence of any ceremony around clean test closure. Organizations that celebrate shipping features rarely celebrate the disciplined end of an experiment that didn't pan out. Run the Decision Test on Every Metric in Your Current Report is a useful starting point for building that discipline into the reporting layer, where it is hardest to avoid and easiest to institutionalize.
The structural fix requires two things: visible stopping rules that live in the experiment brief and are reviewed by someone outside the team, and explicit recognition when a team closes a test cleanly on a negative result and moves on. Neither of these is complicated. Both are almost universally absent.
What to Do This Week
Before you close this tab, find one experiment currently running in your organization and ask three questions about it.
First: does it have a documented stopping rule — a specific sample size, significance threshold, or calendar date — that was written before the test launched? Not implied. Written.
Second: who, besides the team that owns the hypothesis, has visibility into whether that rule is being followed?
Third: if the test were to end today with the results as they currently stand, what decision would be made? If the answer is "we'd probably keep it running a bit longer," you are already inside the pattern described here.
If you find a test that should be closed, close it. Write one sentence explaining why. Send it to your team. The act of naming what you are doing — "we are closing this test because it met our predetermined threshold and the result is the result" — is the beginning of the culture you are trying to build.
For teams that want to go further, Run Experiments That Survive Your CFO's Budget Scrutiny Next Quarter builds out the full framework for pre-registration and stopping discipline in high-pressure environments. The goal is not more rigorous statistics. It is cleaner decisions made by people who trust their own process.
Explore Further
- Run Experiments That Survive Your CFO's Budget Scrutiny Next Quarter — How to build stopping rules that hold under organizational pressure, with structures your finance and engineering partners will respect.
- Run the Decision Test on Every Metric in Your Current Report — A prompt for auditing whether the metrics you are tracking are actually connected to decisions, or are filling space where decisions should be.
- Diagnose Whether Your Metrics Report Arrives Too Late to Matter — Timing is its own form of experimental integrity. If the report lands after the decision is already made, the discipline applied to the data is largely ceremonial.
Frequently Asked Questions
Why do organizations continue A/B tests past statistical significance?
What are stopping rules and why do they matter?
How can teams build better experimental discipline?
Go deeper with Aurelius
Apply this to your actual situation. Aurelius will meet you where you are.
Start a session