Why Your Marketing Split Testing Is Wrong

One of the interview questions I used to ask marketers interested in joining my team was “What’s a p-value?” As far as I can remember, not a single person I interviewed gave a correct answer. In fact, they had never even heard of a p-value. Yet many of those same folks listed A/B testing (also known as split testing) on their resumes. That’s scary, folks!

That’s like saying you know how to do a heart transplant because you know how to turn on a heart-lung machine. That’s why software that supports “easy” testing, such as HubSpot and Optimizely, drives me nuts.

We shouldn’t lean on tools to do something when we don’t understand the underlying logic, because misusing those tools means we could be drawing false conclusions. While experiments such as split tests might be the ultimate truth tellers, they can also lead us astray in surprising ways if we don’t understand what we’re doing.

The conclusions marketers reach can be fantastically sensitive to the way the test is designed. For example, in HubSpot’s A/B testing guide they show how you can isolate your test to a single variable, let’s say a green button vs. a blue button, and then set your A/B emails’ distribution up. You’ll have some help determining the size of your sample group using a slider. It’ll let you do a 50/50 A/B test of any sample size — although all other sample splits require a list of at least 1,000 recipients.

But it won’t remind you that it was just St. Patrick’s Day and so perhaps your customers are more excited about green. And it won’t remind you that perhaps the folks who open your emails earlier than others behave differently than those who open later (which could have to do with their location, or their OCD-ness, or their email provider, or…). And it won’t remind you that you just sent an email yesterday with a green button to the same audience and so have perhaps primed these users.

Beginning to see the problem?

Let’s dive into a few more examples showing why any testing – including marketing split testing – is more complex than you might realize.

Don’t make these common split testing mistakes:

1. Test design, such as time period, can radically alter results.

According to The New Yorker’s review of Experiments on Trial, at the gym chain 24 Hour Fitness a team of behavioral scientists wanted to see how they might persuade people to exercise more. Over 28 days, 52 interventions were tested, such as text reminders and supportive videos. All were successful at helping people increase their attendance.

If the scientists left the study there, it might have appeared as though they’d found a multitude of ways to get us all into shape. But the scientists knew better. They followed up with the participants of the study beyond the initial period, and discovered that, in fact, none of the interventions produced any lasting change.

Did you run your last marketing test around the new year, and was it influenced by seasonality? Did you run your last test for only one email? Did you run your last test for only a week, perhaps the week your members were feeling happy (or sad) because of Valentine’s Day?

When you tweak the question just slightly, or adjust the time frame of the test, your answer might come out differently too.

2. Test participants can radically alter results.

The extreme sensitivity of experiments extends to the selection of participants. Even a slight imbalance can wildly throw off the conclusions. For example, women in a hormone-replacement therapy study also had a higher socioeconomic status than those not provided the therapy, and as a result it could not be determined if the therapy itself was yielding the benefit.

Did you use test participants that have never interacted with your company before? Did you use test participants of the right age range? Did you use test participants that just opened three of your other emails?

When you tweak the audience just slightly, your answer might come out differently too.

3. Test context can radically alter results.

The fact that an intervention has proven to work in one setting doesn’t guarantee that it will work in another. For example, in Tamil Nadu, a state in southern India that had a serious problem with infant mortality, babies were being born malnourished. Mothers were eating less because they were worried about the dangers of giving birth to a large baby. Aid agencies tested a program that offered mothers reassurance about advances in maternity care, and malnutrition went way down. However, when they ran the same program in Bangladesh to solve the same problem, it had no impact, because there the mother-in-law controls the food in the family.

Did you apply your test results to other landing pages that most of the audience reaches a different way? Did you apply your test results to emails with entirely different content? Did you apply your test results to ads on every platform?

When you tweak the context just slightly, your answer might come out differently too.

In short, when test results arrive without a sound theory of what causes them, we can easily overgeneralize or make erroneous inferences.

Angus Deaton, a Nobel laureate in economics, has argued that any experiment that has been constrained enough to be scientifically rigorous might be too narrow to provide useful guidance for large-scale interventions. While it’s tempting to look for laws of people the way we look for the laws of gravity, science is hard, people are complex, and generalizing can be problematic.

For many of the marketing tests I see, marketers would be better off simply operating on assumptions and guesses, informed by previous results, acknowledged best practices, and intuition. It would save the time spent setting up a structured test, and at least we’d be honest with ourselves about what we’re actually using to inform our decisions.

Just switch the buttons to the color that probably makes sense. Just use the subject line you think will perform best. Just use the copy that highlights the benefits.

However, if you insist upon running A/B tests, whether you use software to assist you or not, here are a few considerations to put into place:

1. Up front, always state the exact result you expect to see. Even specify the amount of the lift.

2. Know your p-value so you can judge whether the result is statistically significant.

The p-value is the probability of obtaining results at least as extreme as the observed results of your test, assuming that the null hypothesis (that there is no real difference between your variants) is correct. Put another way, the p-value is the smallest level of significance at which the null hypothesis would be rejected.

A smaller p-value means that there is stronger evidence against the null hypothesis. It is not the probability that the null hypothesis is true. One commonly used significance level is 0.05. So in that case, if you find that the p-value is less than 0.05, the result is considered statistically significant and you reject the null hypothesis.

3. Ensure you’re using testers that are actually similar to your customers or leads.

4. Put an emphasis on verifying claims by always having a second set of eyes, not invested in the results, look at the test.

5. Always reproduce test results multiple times. (Twice at a minimum.)

6. Bring a healthy dose of skepticism, and question what else – make a list of at least three things – could have led to the result you saw.
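
To make considerations 1 and 2 concrete, here is a minimal Python sketch. It assumes the metric is a simple conversion rate (a click or a purchase counted once per recipient) and uses a two-proportion z-test, which is a normal approximation I picked for illustration. It is not what HubSpot, Optimizely, or any other tool runs under the hood. The function names and every number in the example are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, expected, alpha=0.05, power=0.80):
    """Approximate recipients needed in EACH variant to detect the lift
    you stated up front (normal approximation, two-sided test)."""
    if baseline == expected:
        raise ValueError("expected rate must differ from baseline rate")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for significance
    z_power = NormalDist().inv_cdf(power)          # cutoff for desired power
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    n = (z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2
    return math.ceil(n)


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the null hypothesis that variants A and B
    share the same underlying conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one pooled rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the two-sided tail area.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Hypothetical plan: 10% baseline click rate, and we expect 12%.
    print("Recipients needed per variant:", sample_size_per_variant(0.10, 0.12))
    # Hypothetical outcome: 100 of 1,000 clicked A, 130 of 1,000 clicked B.
    p = two_proportion_p_value(100, 1000, 130, 1000)
    print("p-value:", round(p, 4), "| below 0.05:", p < 0.05)
```

Notice how large the required sample gets when the expected lift is small. And a p-value below 0.05 only tells you the difference is unlikely to be pure chance. It says nothing about the time period, audience, or context problems described above.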

But really, and most importantly, don’t A/B test the little stuff. Small changes tend to produce small effects, so you’d actually need to be extra rigorous when designing those tests, and you likely don’t have the right resources to do it.

When you absolutely need to run an A/B test on something, work with the data scientist on your team. Not just software.

Would it be too crazy to ask you to please send a $5 tip to my Venmo tip jar if you learned something new? @megsterr.

Thank you so much! Up next, check out seven brilliant marketing tactics used by Banksy that you can steal for your business.

By Megan Mitzel

I'm the wearer of overalls behind the marketing advice website Marketing Overalls. I'm also a senior marketing director with more than ten years of experience leading acquisition and lifecycle marketing at successful startups. Before that, I got a business degree at UNC-Chapel Hill. Before that, I owned a seashell shop. And that's the tea on me.