Intro to Incremental Measurement: Statistical Significance

Let’s get a little nerdy

Data Based Marketing
8 min read · Jan 4, 2021

I know we started off with the flash and allure that comes with controlled experimentation in a marketing environment, but it's important to understand whether the results of your experiment are likely to be replicated. For that we have to turn to statistics, more specifically a test for statistical significance. Before your eyes glaze over, I'll skip the walkthrough of an exact formula and its calculations; if you want a more technical treatment, I've included some additional information below along with a couple of links. There are a few types of significance tests, and it is entirely possible to pick the wrong one for your situation, but the calculations themselves aren't where I see issues in marketing departments. More often than not the problem is around the edges, either in the interpretation of the results or in the setting of parameters. So instead of a walkthrough on crunching the numbers, let's work on addressing those problems.

Understanding what is actually being done with a significance test will allow proper interpretation of its result. Statistical significance is not a pass/fail for the efficacy of a marketing action, as I've often seen it explained. It's also not a simple thumbs up or down on having "good" results. To better explain this, I'll start with some basic table setting for statistics in general. Statistics is a science: well-established observations have gone into the highly useful tables and equations we reference to make sense of data. That said, it's not as simple to interpret as what I will call "pure mathematics," like algebra or geometry. Those purer branches of mathematics numerically represent truths that can be proven out. Statistical calculations, by contrast, help us make sense of and quantify uncertainty as best we can, based on those historic observations and the formulae and tables derived from them. Statistics can't tell you whether your sample of 10 people properly represents the US population for a given metric without knowledge of that metric for the population. It can help us understand the probability of such a small random sample giving us misrepresentative results.

Statistical significance, as applied to our treat/control marketing experiments, helps us understand how confident we can be that an observed difference in behavior between two groups is not simply due to an unrepresentative sample (a result of how those two groups were selected). Let's consider the plant growth experiment we discussed previously. There are two basic reasons the growth between the plants in our controlled experiment could differ. The first is the plants themselves that ended up in our different groups: not because of inherently bad sampling methods, but because any time we take a sample, it may be unrepresentative of the population it was taken from. The second is our actual treatment difference between the two groups. Significance testing helps us measure how likely it is that the first reason is responsible for the difference in growth measurement.

Let's look at the most common way I see significance tests misinterpreted. I often hear or read something like this:

“The difference between our treated group and the control group isn’t statistically significant, so the treatment wasn’t effective.”

This happens for one of two reasons. One is a pure misunderstanding of how the test works and what its purpose is. The other, and more common, reason is an analyst who has over-reduced their explanation in trying to make it simple enough to understand. Simplification is a poor excuse for making inaccurate statements. Trust me, your non-technical stakeholders are more than smart enough to understand a statistical calculation when it's boiled down to its essential parts and free of jargon. A better explanation would be something like this:

“The difference in behavior between the customers given our marketing treatment and those who didn’t receive it doesn’t show significance at our desired confidence level. This means we can’t be confident the difference between the two groups is because of the marketing treatment and not a result of the sampling.”

If that's too long for your stakeholders, lose the first sentence, but do not pretend your significance test provides a certainty (or even an indication) that it doesn't. It's also important to understand that a statistical significance test doesn't look to prove the negative: a lack of confidence, at a certain threshold, that a treatment increased the rate of a response behavior doesn't mean there is confidence that it didn't.

Two more quick notes before we move on:

  1. The difference between the two measured groups is the true difference between those two measured groups. Assuming proper measurement of that behavior, it can't really be disputed. Lack of statistical significance does not bring that reality into question, only whether you are likely to see the same directional results across the population.
  2. Statistical significance at your desired confidence doesn't confirm that your exact incremental measurement is what would have occurred if you could truly measure both treat and control response rates across the entire population. The best it can do is give you confidence that there is a difference, and in the direction of that difference. Additional calculations can give you a range at a chosen confidence level (a quick sketch of one such calculation follows below).
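For the curious, here's a minimal sketch of one such range calculation: a normal-approximation interval for the lift between the two groups. The function name and the 70% default are mine for illustration, not a standard.

```python
from math import sqrt
from statistics import NormalDist


def lift_range(resp_treat, n_treat, resp_control, n_control, confidence=0.70):
    """Range the true lift (treat rate minus control rate) is likely to fall in
    at the given confidence level, using a normal approximation."""
    p_t = resp_treat / n_treat
    p_c = resp_control / n_control
    # Standard error of the difference between the two observed rates
    std_error = sqrt(p_t * (1 - p_t) / n_treat + p_c * (1 - p_c) / n_control)
    # Critical z value for a two-sided interval at the requested confidence
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lift = p_t - p_c
    return lift - z_crit * std_error, lift + z_crit * std_error
```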

It's easy to understand why the first issue is so common. Significance testing isn't quite as straightforward as we'd like, and as with most subjects, a deeper understanding actually makes things less black and white. Falsely removing uncertainty isn't going to make our decisions any better, though. Good marketers navigate uncertainty to make the best decisions possible with the information they have, or even decide to do nothing and gather more information.

This brings us to the other common issue I see when organizations implement significance testing: setting their confidence thresholds carelessly. Without blinking an eye, most data scientists and analysts I've worked with will set their confidence thresholds far too high, most commonly at 95% and occasionally even at 99%. These are holdovers from scientific study, where p-values of .05 and .01 are common defaults, especially in educational settings. Marketing isn't the application the statisticians had in mind when developing these calculations, and it's not what most professors have in mind when teaching them. We simply aren't used to thinking of significance testing against much lower confidence requirements.

Ask yourself if you really need as much confidence in the efficacy of a marketing campaign as you would in a new pharmaceutical in order to continue its use. The risks in nearly all marketing applications are far less asymmetric than in classical scientific applications; non-transferable marketing campaign results aren't going to cost human lives. How much confidence do you need that what you are doing is creating a positive ROI in order to continue doing it? How confident do you need to be that it's costing you money before you stop? My guess is it's closer to 50/50 than the 90%-plus you may be using. My default suggestion for a marketing campaign is 70%, but every application and organization requires different risk tolerances. There is no perfect answer for where to set your minimum confidence threshold in order to conclude an A/B test or consider a marketing strategy a success and continue its operation, but don't fail to make it a thoughtful decision. Consider your risk scenario before adopting any default thresholds.

Looking at an equation for statistical significance may be intimidating at times, but when you get down to it, it's just another tool for helping quantify uncertainty. Whether you're doing the calculation yourself or you have an analytics team doing it for you, I hope you now have a little better understanding of how to use the results of a significance test. For those of you who want to dive just a little deeper, I've included below some additional notes on what goes into testing for significance with a Two Proportion Z Test. I'll also include some links to resources that may be helpful.

For those of you ready to be done with the statistics talk, I'll be wrapping up this introductory series with some thoughts on turning your incremental measurements into financial metrics like ROI (linked below). If you've gotten some value out of this series, please share it with your friends. If you find marketing analytics, measurement, and optimization useful or interesting, I'll be putting out new posts on a variety of related subjects throughout the year, so be sure to follow.

Two Proportion Z Test

For incremental marketing measurement, and for most more general A/B marketing tests, the appropriate test for significance will be a Two Proportion Z Test. The purpose of this test is to compare two randomly selected samples and identify, within a confidence threshold, whether the difference in these two groups' proportions indicates a difference across the full population. For our marketing measurement, these proportions will be some form of response rate to a marketing treatment vs. the response rate to no action (or to a different action, for an A/B treatment test). The test is performed by calculating a z statistic/score for your data using the formula below, then converting that z value to its corresponding probability and comparing it to your confidence requirement.
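The standard form of that statistic, written out in plain text, is:

z = (p̂_treat − p̂_control) / sqrt( p̂ × (1 − p̂) × (1/n_treat + 1/n_control) )

where p̂_treat and p̂_control are the observed response rates, n_treat and n_control are the sizes of the two groups, and p̂ is the pooled response rate across both groups combined.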

In the calculation, the difference in proportions is combined with the pooled standard sampling error of the two groups, which is derived from the sizes of those samples and the overall proportion. Sampling standard error is based on thorough study of random sampling and essentially describes how much your random sample is likely to be skewed from the population, based on the size of the sample. Larger samples are less likely to misrepresent the behavior of the population.
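If it helps to see it in code, here's a minimal sketch of the calculation. The function and variable names are my own, and it assumes Python 3.8+ for statistics.NormalDist.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(resp_treat, n_treat, resp_control, n_control):
    """Two Proportion Z Test for a treat vs. control response-rate comparison.

    Returns the z-score and the (one-sided) confidence that the treated group's
    higher response rate isn't just sampling error.
    """
    p_treat = resp_treat / n_treat
    p_control = resp_control / n_control
    # Pooled (overall) response rate across both groups combined
    p_pool = (resp_treat + resp_control) / (n_treat + n_control)
    # Combined standard sampling error, driven by group sizes and the pooled rate
    std_error = sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_control))
    z = (p_treat - p_control) / std_error
    # Convert the z-score to its corresponding probability
    confidence = NormalDist().cdf(z)
    return z, confidence
```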

Considering both the difference in proportions and the standard sampling error is why there is no single answer to the question, "How big of a control group do we need to get a significant result?" If your treatment group responds at 90% vs. the control group at 10%, a small volume may be enough to achieve significance at your desired confidence threshold. If there is only a 0.1% difference between the two, it's going to take much more volume to be confident that sampling error is unlikely to be the cause. When trying to decide on the needed control group size, the best you can do is make an estimate based on the differential in response rate you expect, informed by similar past activity.
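To make that concrete with the sketch above (the numbers here are made up purely for illustration):

```python
# Big lift: 90% vs. 10% response with only 20 customers per group
print(two_proportion_z_test(resp_treat=18, n_treat=20, resp_control=2, n_control=20))
# z comes out above 5, significant at virtually any confidence threshold

# Tiny lift: 2.1% vs. 2.0% response with 10,000 customers per group
print(two_proportion_z_test(resp_treat=210, n_treat=10_000, resp_control=200, n_control=10_000))
# z comes out around 0.5, roughly 70% confidence, despite 500x the volume
```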

Useful Links:

Check out the next piece in this series:

Share with friends and colleagues who might enjoy it. Follow me here on LinkedIn so you don’t miss any future posts.
