Do A/B tests – because correlation does not imply causation

Let’s imagine we (hypothetically) collect habits and health data from people. We find that, on average, people who regularly go to the sauna had fewer sick days last year.

(totally hypothetical data)

|          | number of people | avg sick days / person |
|----------|------------------|------------------------|
| sauna    |             1000 |                      5 |
| no-sauna |            90000 |                     10 |
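
If we had the raw data as one record per person, a summary like the table above could be computed roughly like this (just a minimal sketch assuming pandas; the column names and toy values are made up):

```python
# Minimal sketch (assuming pandas): how a summary like the table above
# could be computed from hypothetical per-person records.
import pandas as pd

# One row per person; column names and values are made up.
df = pd.DataFrame({
    "goes_to_sauna": [True, False, False, True, False],
    "sick_days":     [4, 12, 9, 6, 11],
})

summary = (
    df.groupby("goes_to_sauna")["sick_days"]
      .agg(number_of_people="count", avg_sick_days="mean")
)
print(summary)
```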

We might now want to conclude that regularly going to the sauna prevents one from getting sick, and start recommending it. But this would be a big mistake!

We don’t know at all whether the sauna is the cause of the fewer sick days. Some other factor could be making people both use the sauna and be less sick.

                 |====> sauna
other_factor ====|
                 |====> being less sick

This other factor could be, for example, socioeconomic status, which simply means: wealthy people are less sick in general, and poorer people just don’t go to the sauna that much.

The only productive thing we can do with this sauna-less-sick correlation we found is to conduct an experiment, i.e., an A/B test. We need to randomly assign (a bunch of) people to two test groups. One group will be compelled to go to the sauna regularly, and the other will be forbidden to do so. It’s important that people are not allowed to choose their group. We then let them follow their assigned regimen for some time (quite a long time in this case), and then we count sick days again. Possible result:

(data from the hypothetical A/B test)

|          | number of people | avg sick days / person |
|----------|------------------|------------------------|
| sauna    |               50 |                      8 |
| no-sauna |               50 |                      7 |

Oh, so the sauna does not cause fewer sick days at all. On the contrary, the sauna group had, on average, one sick day more per year than the no-sauna group. Sauna might even cause a bit more sickness. (Remember, this is just a contrived example and is not based on any real numbers about the effects of sauna.)
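
Here is a rough sketch of what the random assignment and the evaluation could look like in code (Python; the IDs and sick-day numbers are placeholders, not real data):

```python
# Minimal sketch: randomly assign participants to the two groups and
# compare average sick days afterwards. IDs and numbers are placeholders.
import random
from statistics import mean

participants = [f"person_{i}" for i in range(100)]   # hypothetical IDs

random.shuffle(participants)                  # the random assignment is the key step
sauna_group    = participants[:50]
no_sauna_group = participants[50:]

# ... a year later, sick days have been recorded for every participant ...
sick_days = {p: random.randint(0, 15) for p in participants}   # placeholder data

avg_sauna    = mean(sick_days[p] for p in sauna_group)
avg_no_sauna = mean(sick_days[p] for p in no_sauna_group)
print(f"sauna: {avg_sauna:.1f}   no-sauna: {avg_no_sauna:.1f}")
```

With only 50 people per group, a difference of one sick day per year could easily be noise, so before reading anything into it one would also want a significance test (e.g., a two-sample t-test).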

When doing such an experiment (A/B test), we are not allowed to separate the two groups in time or space, meaning:

  • Both groups have to be measured simultaneously. If we measure the sauna group’s sick days in one year and the no-sauna group’s sick days the next year, other, unrelated factors will have changed in between.
  • The groups must not be separated in space, e.g., by moving one group to Finland but not the other. The individuals have to stay as locally mixed as they were before the random assignment.

If we don’t adhere to one of these rules, we will measure effects that are caused by external factors instead of the suggested cause we want to measure (the sauna).

So suppose we find a positive correlation between two things, X and Y:

X <----> Y

It can mean one causes the other:

X ====> Y

or

X <==== Y

But it does not have to. There might always be another factor involved:

      |====> X
Z ====|
      |====> Y

And we can only establish true causation with randomized controlled trials (A/B tests).
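
We can make this concrete with a tiny simulation (a sketch, Python 3.10+; the numbers and the confounding mechanism are completely made up): a hidden factor Z drives both X and Y, and X and Y come out correlated even though neither causes the other.

```python
# Minimal sketch (Python 3.10+): a hidden factor Z drives both X and Y.
# X and Y come out correlated although neither causes the other.
import random
from statistics import correlation

random.seed(42)
n = 10_000

z = [random.gauss(0, 1) for _ in range(n)]      # hidden common cause
x = [zi + random.gauss(0, 1) for zi in z]       # X depends only on Z (plus noise)
y = [zi + random.gauss(0, 1) for zi in z]       # Y depends only on Z (plus noise)

print(correlation(x, y))    # clearly positive, yet X does not cause Y (nor vice versa)
```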


Examples showing the absurdity:

  1. X might be “Kids watching more TV”, and Y “Kids being more violent”. Watching TV could cause violence or peacefulness. We don’t know. Maybe violent kids just tend to watch more TV.

  2. “stork population” by region strongly correlates with “human birthrate” by region. However, storks don’t deliver babies. People in rural areas (storks don’t live in big cities) just have more kids.

  3. Countries with more ice-cream consumption (X) have more deaths by drowning (Y). But eating ice cream does not make you drown. It’s just that in colder countries, people eat less ice cream and don’t go swimming that much. Because it’s cold, duh.

In the media, one can find an enormous number of instances of this exact causal fallacy.

Regarding the last example, i.e., the one with the ice cream, one might think that one just has to filter the data in a better way, e.g., by considering only the people who actually went swimming, and then compare those who ate ice cream during the 30 minutes before going into the water with those who did not. If the ice-cream eaters did, on average, drown more often, would this mean eating ice cream increases the risk of drowning? No, it would not!

There can still be endlessly many confounding factors. For example, people who take care of their fitness might be less likely to eat ice cream and also be better swimmers. Or it could be that kids eat ice cream more often than adults and that, sadly, they are also more likely to drown. One does not, and cannot, know about (and exclude) all these possibilities. The only realistic chance we have to really find out whether ice cream causes drowning is, again, to conduct a randomized trial, i.e., randomly assign a lot of people to an ice-cream group and a placebo-ice-cream group, throw both into the water, and measure the drowning rate.
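
A tiny simulation sketch of that last point (all probabilities are invented for illustration): drowning here depends only on a hidden fitness factor, never on ice cream, yet among the swimmers the ice-cream eaters still drown more often.

```python
# Minimal sketch: even after filtering to people who actually went swimming,
# a hidden confounder ("fitness") makes ice cream look dangerous.
# All probabilities are invented for illustration.
import random

random.seed(1)
people = []                                     # (ate_ice_cream, went_swimming, drowned)
for _ in range(100_000):
    fit = random.random() < 0.5                 # hidden confounder
    ate_ice_cream = random.random() < (0.2 if fit else 0.6)   # fit people eat less ice cream
    went_swimming = random.random() < 0.5
    drowned = went_swimming and random.random() < (0.001 if fit else 0.004)
    people.append((ate_ice_cream, went_swimming, drowned))

swimmers = [p for p in people if p[1]]

def drowning_rate(group):
    return sum(p[2] for p in group) / len(group)

ice_cream = [p for p in swimmers if p[0]]
no_ice_cream = [p for p in swimmers if not p[0]]
# Ice-cream eaters drown more often, even though ice cream has no effect at all here.
print(drowning_rate(ice_cream), drowning_rate(no_ice_cream))
```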


An example related to web/app development:

Let’s say we are developing a mobile (freemium) game. We might find a positive correlation between “actually playing the tutorial level instead of skipping it” and “buying the full version”.

Playing the tutorial level might cause more purchases, but we cannot possibly know this from these statistics alone. It might be that people who are the “buying type of person” for some reason also skip tutorials less often (on average).

                 |====> using the tutorial
other_factor ====|
                 |====> purchasing

So, to really find out whether we can get more of our users to buy the full version by making the tutorial level mandatory, we need to conduct an A/B test with users randomly assigned to two groups. One group gets the usual version of the app (default group), the other gets the forced-tutorial version (feature group). After collecting enough data by letting the test run long enough, we can make an informed decision on whether we want to fully roll out the feature.
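
A common way to do the assignment is to hash a stable user ID, so every user consistently lands in the same group while the split is still effectively random. A rough sketch (the function and field names here are made up, not from any particular framework):

```python
# Minimal sketch: deterministic A/B assignment via hashing a stable user ID,
# plus a purchase-rate comparison. Names are made up for illustration.
import hashlib

def assign_group(user_id: str) -> str:
    """50/50 split: the same user always ends up in the same group."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "force_tutorial" if digest % 2 == 0 else "default"

def purchase_rates(events):
    """events: iterable of (user_id, purchased) pairs, one per user."""
    counts = {"default": [0, 0], "force_tutorial": [0, 0]}   # [users, purchases]
    for user_id, purchased in events:
        group = counts[assign_group(user_id)]
        group[0] += 1
        group[1] += int(purchased)
    return {name: (bought / users if users else 0.0)
            for name, (users, bought) in counts.items()}

# usage with placeholder events
print(purchase_rates([("user_1", True), ("user_2", False), ("user_3", True)]))
```

A real rollout decision would of course also involve a significance check on the measured purchase rates, not just a comparison of two averages.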
