Free tool

A/B Test Significance Calculator

Found a winner in your A/B test, or just noise? Enter visitors and conversions for both variants and we'll run a two-proportion z-test, then report conversion rates, uplift, z-score, p-value, and a plain-English verdict on whether the difference is statistically significant.

Inputs

Variant A (control)
Variant B (variation)

Results

Statistically significant
Variant B wins at 95% confidence

The +30.0% difference is unlikely to be chance (p = 0.0355).

Conversion rate A: 10.00%
Conversion rate B: 13.00%
Relative uplift: +30.0% (change in B vs A)
Absolute difference: +3.00 pts
Z-score: 2.103
P-value (two-tailed): 0.0355 (significant if below 0.05)

What the numbers mean

From raw conversions to a confident decision.

Conversion rate & uplift

Each variant's conversion rate is conversions ÷ visitors. Relative uplift is how much B beat (or lost to) A in percentage terms — the headline number, but meaningless until you know it's real and not noise.
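As a rough illustration of the arithmetic, here is a minimal Python sketch. The function name and the inputs (1,000 visitors per variant, with 100 and 130 conversions) are assumptions chosen because they reproduce the 10.00% and 13.00% rates in the sample results; they are not taken from the calculator itself.

```python
def conversion_summary(visitors_a, conversions_a, visitors_b, conversions_b):
    """Return conversion rates, absolute difference (points) and relative uplift (%)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        # Absolute difference in percentage points: B's rate minus A's rate.
        "absolute_diff_pts": (rate_b - rate_a) * 100,
        # Relative uplift: the same gap expressed as a share of A's rate.
        "relative_uplift_pct": (rate_b - rate_a) / rate_a * 100,
    }


# Assumed example inputs (hypothetical, chosen to match the sample results).
summary = conversion_summary(1000, 100, 1000, 130)
print(f"Rate A: {summary['rate_a']:.2%}, rate B: {summary['rate_b']:.2%}")
print(f"Relative uplift: {summary['relative_uplift_pct']:+.1f}%")
print(f"Absolute difference: {summary['absolute_diff_pts']:+.2f} pts")
```

Note how a 3-point absolute gap becomes a 30% relative uplift because the baseline is only 10%; the lower the baseline, the more dramatic the relative number looks.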

Z-score & p-value

We run a two-proportion z-test. The z-score measures how many standard errors apart the two rates are; the two-tailed p-value is the probability of seeing a gap at least this large, in either direction, if the variants were actually identical.
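For readers who want the formulas, the pooled form of the test can be written as follows. The symbols are our own notation, not something the calculator displays: n_A and n_B are visitors, x_A and x_B are conversions, and Φ is the standard normal cumulative distribution function.

```latex
\hat{p}_A = \frac{x_A}{n_A}, \qquad
\hat{p}_B = \frac{x_B}{n_B}, \qquad
\hat{p} = \frac{x_A + x_B}{n_A + n_B}

\mathrm{SE} = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}

z = \frac{\hat{p}_B - \hat{p}_A}{\mathrm{SE}}, \qquad
p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)
```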

Significance

A result is significant when the p-value falls below your chosen threshold (0.05 at 95% confidence). That means a gap this large would rarely arise by chance alone. Keep in mind that significance is about reliability, not size.

Why “it's winning” isn't enough

Almost every A/B test shows one variant ahead at any given moment; that's just how random variation works. The question that matters is whether the lead would survive if you ran the test again. Significance testing addresses exactly that: it tells you how often a gap at least this large would show up from a lucky run of conversions alone, if the two variants actually performed the same.
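To see how often noise alone produces a "leader", here is a small simulation sketch of A/A tests, where both variants share the same true conversion rate. Everything in it is an illustrative assumption: the function names, the 10% rate, the 500 visitors per variant, and the fixed seed.

```python
import math
import random


def two_tailed_p_value(n_a, x_a, n_b, x_b):
    """Pooled two-proportion z-test; returns the two-tailed p-value."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # No conversions (or all conversions): nothing to compare.
    z = (x_b / n_b - x_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(runs=2000, visitors=500, rate=0.10, seed=7):
    """Run many tests where A and B are truly identical.

    Returns (share of tests with one variant ahead,
             share of tests flagged significant at alpha = 0.05).
    """
    rng = random.Random(seed)
    has_leader = 0
    significant = 0
    for _ in range(runs):
        # Each visitor converts independently with the same true rate.
        x_a = sum(rng.random() < rate for _ in range(visitors))
        x_b = sum(rng.random() < rate for _ in range(visitors))
        if x_a != x_b:
            has_leader += 1
        if two_tailed_p_value(visitors, x_a, visitors, x_b) < 0.05:
            significant += 1
    return has_leader / runs, significant / runs


leader_share, significant_share = simulate_aa_tests()
print(f"Tests with one variant ahead: {leader_share:.1%}")
print(f"Tests flagged significant:    {significant_share:.1%}")
```

Under these assumptions, nearly every simulated test has a variant in the lead, yet only a small share (around the 5% the threshold allows) clears the significance bar. That gap is why a lead on its own says very little.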

Calling a winner too early is one of the most expensive mistakes in experimentation: you ship a change that doesn't actually help, and every decision built on top of it inherits the error. Wait for significance, and when you're ready to act on the winner, size the budget with our ROAS calculator and CAC & LTV calculator.

How AdFlint tests creative for you

AdFlint generates multiple ad variations, runs them across Google, Meta, and LinkedIn, and shifts budget toward the variants that win on real conversion data — so you get the benefit of disciplined testing without manually crunching p-values for every campaign.

Questions

What does statistical significance actually mean?

It means the difference you observed between two variants would be unlikely if random chance were the only thing at work. At 95% confidence, a significant result has a p-value below 0.05, meaning there's less than a 5% probability you'd see a gap at least this large if the two variants truly performed identically. Significance does not tell you the difference is large or important, only that chance alone is an unlikely explanation. A tiny, significant uplift on huge traffic can be less valuable than a big, not-yet-significant uplift you should keep testing.

How does this calculator compute significance?

It uses a two-proportion z-test. It pools the visitors and conversions of both variants into a single combined rate to estimate a shared standard error, computes a z-score from the difference in rates, and converts that to a two-tailed p-value using the normal distribution. The result is significant if the p-value is below your chosen alpha (0.10, 0.05, or 0.01 for 90%, 95%, or 99% confidence). This is a standard approach for comparing two conversion rates and matches what many A/B testing tools report.
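The steps above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the calculator's actual source: the function name is hypothetical, and the inputs (1,000 visitors per variant, 100 and 130 conversions) are assumed values that reproduce the sample z-score and p-value shown in the results.

```python
import math


def two_proportion_z_test(visitors_a, conversions_a, visitors_b, conversions_b):
    """Pooled two-proportion z-test. Returns (z_score, two_tailed_p_value)."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b

    # Pool both variants into one combined rate for the shared standard error.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        raise ValueError("Standard error is zero; the rates cannot be compared.")

    z = (rate_b - rate_a) / se
    # Two-tailed p-value from the normal distribution:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed example inputs (hypothetical, chosen to match the sample results).
z, p = two_proportion_z_test(1000, 100, 1000, 130)
print(f"z = {z:.3f}, p = {p:.4f}")  # z = 2.103, p = 0.0355

# Compare against each alpha the calculator offers.
for alpha in (0.10, 0.05, 0.01):
    verdict = "significant" if p < alpha else "not significant"
    print(f"{1 - alpha:.0%} confidence: {verdict}")
```

With these inputs the result clears the 90% and 95% thresholds but not the 99% one, which shows how much the verdict depends on the alpha you pick before looking at the data.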

How much traffic do I need for a valid A/B test?

There's no fixed number — it depends on your baseline conversion rate and the size of the difference you want to detect. Smaller effects need far more traffic: detecting a 2% relative uplift can take tens of thousands of conversions, while a 30% uplift may be clear in a few hundred. As a rule of thumb, aim for at least a few hundred conversions per variant before trusting a result, and don't stop the test the moment it crosses significance — early peeking inflates false positives.
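To get a feel for the numbers, here is a rough planning sketch based on a common normal-approximation formula for comparing two proportions. The 80% power default, the 10% baseline, and the function name are our assumptions; dedicated sample-size tools may use slightly different formulas and give somewhat different answers.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_rate, relative_uplift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative uplift.

    Uses n = (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Assumed 10% baseline conversion rate, for illustration only.
for uplift in (0.30, 0.02):
    n = visitors_per_variant(0.10, uplift)
    print(f"{uplift:.0%} relative uplift: about {n:,} visitors per variant")
```

Under these assumptions, a 30% uplift needs on the order of a couple of thousand visitors per variant (a few hundred conversions), while a 2% uplift needs several hundred thousand visitors (tens of thousands of conversions), in line with the rule of thumb above.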

Why isn't my result significant even though B is winning?

Because the sample is too small to rule out chance. With low traffic, even a real difference can produce a p-value above your threshold; the data simply can't distinguish a true effect from random variation yet. Keep the test running to gather more visitors and conversions: the standard error shrinks as the sample grows, so the estimate settles toward whichever result reflects reality. If it stays inconclusive after substantial traffic, the true difference between variants is probably too small to matter.

Can I use this for ad creative tests, not just landing pages?

Yes. The math is identical for any two-variant test where you can count exposures and successes: ad creatives (impressions and clicks, or clicks and conversions), email subject lines (sends and opens), or landing pages (visitors and signups). Just map your two metrics to 'visitors' and 'conversions.' For ad creative, comparing clicks against impressions tests CTR; comparing conversions against clicks tests post-click performance.
