Understanding Causation in Data and Decision-Making

TL;DR

Correlation tells you that two things move together. Causation tells you why. This article explores the difference between them, introduces classical and modern methods for identifying true causes, and highlights the cognitive biases that lead us astray when we mistake one for the other.

Who Should Read This?

Anyone who works with data, makes decisions based on numbers, or simply wants to think more clearly about the world.

Most Decisions Are Based on the Wrong Question

Every day, businesses make decisions based on patterns they spot in their data. A marketing team sees that users who receive a discount email go on to make a purchase. A product team notices that users who engage with a new feature have higher retention. A health researcher finds that people who drink coffee live longer.

These are all correlations. And correlations are useful. But they are not the full story.

The real question is not "do these two things move together?" The real question is "does one cause the other?" Getting that wrong can lead to wasted budgets, failed product launches, and flawed policies. Understanding causation is one of the most valuable skills a data-driven person can develop.

What Is Correlation, and Why Is It Not Enough?

Correlation means that two variables tend to change together. When one goes up, the other tends to go up (or down). It is a statistical relationship, and it is genuinely useful for spotting patterns and generating hypotheses.

But correlation does not tell you why two things move together. Setting aside pure coincidence, there are three possibilities when you see a correlation:

A causes B
B causes A
A third factor, C, causes both A and B

The third option is far more common than people realise. This is what makes correlation a starting point, not a conclusion.
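To make the third possibility concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the variable names (temperature, ice_cream, sunburn), the coefficients, and the noise levels are assumptions, not real data. Temperature plays the role of C and drives both A and B, which never influence each other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# C: the hidden common cause (a hypothetical daily temperature).
temperature = [rng.gauss(20, 5) for _ in range(5000)]
# A and B each depend on C plus independent noise; neither affects the other.
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn = [0.5 * t + rng.gauss(0, 2) for t in temperature]

r_overall = pearson(ice_cream, sunburn)

# Hold C roughly constant by keeping only days in a narrow temperature band.
band = [(a, b) for t, a, b in zip(temperature, ice_cream, sunburn)
        if 19.5 <= t <= 20.5]
r_within_band = pearson([a for a, _ in band], [b for _, b in band])

print(f"correlation overall: {r_overall:.2f}")
print(f"correlation at roughly constant temperature: {r_within_band:.2f}")
```

In this toy setup the overall correlation is strong, yet it all but disappears once temperature is held roughly fixed, which is what you would expect when a common cause is doing all the work.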

Inspiration

Much of the thinking in this article draws on Data Loom by Stephen Few, a book about how to make sense of data. Few argues that poor causal reasoning is one of the most widespread and damaging errors in data analysis. He identifies specific patterns of causal thinking that trip people up repeatedly, from everyday decisions to high-stakes business choices. The book is a sharp, practical read for anyone who wants to move beyond surface-level data interpretation.

How Science Approaches Causation

Scientists have long grappled with the challenge of proving causation. The gold standard is the randomised controlled trial (RCT), better known in product and business contexts as an A/B test. The idea is simple: randomly assign subjects to two groups, change one thing for one group, and observe the difference. Because the assignment is random, the two groups are alike on average in every other respect, so a difference in outcomes larger than chance alone would produce can be attributed to the change you made.
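The value of random assignment can be sketched with a small simulation. The setup is hypothetical: an "engagement" score stands in for a confounder, TRUE_EFFECT is a value chosen for the simulation, and the two assignment rules (self_selected, randomised) are illustrative names rather than anything from a real system.

```python
import random

TRUE_EFFECT = 1.0  # assumed effect of the change, chosen for this simulation

def outcome(engagement, treated, rng):
    # The outcome depends on the treatment AND on prior engagement (a confounder).
    return 3.0 * engagement + TRUE_EFFECT * treated + rng.gauss(0, 1)

def mean(xs):
    return sum(xs) / len(xs)

def naive_difference(assign, rng, n=20000):
    """Mean outcome of treated users minus mean outcome of untreated users."""
    treated_y, control_y = [], []
    for _ in range(n):
        engagement = rng.random()  # hypothetical engagement score in [0, 1)
        treated = assign(engagement, rng)
        y = outcome(engagement, treated, rng)
        (treated_y if treated else control_y).append(y)
    return mean(treated_y) - mean(control_y)

def self_selected(engagement, rng):
    # Engaged users are more likely to opt in to the change.
    return rng.random() < engagement

def randomised(engagement, rng):
    # A coin flip that ignores everything about the user.
    return rng.random() < 0.5

rng = random.Random(1)
diff_self_selected = naive_difference(self_selected, rng)
diff_randomised = naive_difference(randomised, rng)

print(f"self-selected groups: {diff_self_selected:.2f}")
print(f"randomised groups:    {diff_randomised:.2f}")
```

Under these assumptions the self-selected comparison overstates the effect, because engaged users both opt in more often and do better anyway, while the randomised comparison lands close to the value built into the simulation.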

But running a controlled experiment is not always possible. You cannot randomly assign countries to different economic policies. You often cannot, for legal or reputational reasons, charge a random half of your customers a higher price just to see what happens. This is where other methods come in.

Before we get to modern statistical methods, it is worth going back to the 19th century, when philosopher John Stuart Mill laid out a remarkably clear framework for identifying causes.

Mill's Methods: A Classic Framework for Finding Causes

John Stuart Mill proposed five methods for identifying causal relationships. They are still highly relevant today, both as analytical tools and as a way of thinking.

Method of Agreement: Look at all the cases where the effect occurs and find the single factor they have in common. If everyone who got food poisoning at a dinner party ate the same dish, that dish is your candidate cause.
Method of Difference: Compare one case where the effect occurred with one where it did not. If everything is identical except one factor, that factor is likely the cause. This is essentially the logic behind A/B testing.
Joint Method: Combine the two methods above. The cause is present in every case where the effect occurs, and absent in every case where it does not. This gives you stronger evidence than either method alone.
Method of Concomitant Variations: If the magnitude of the effect changes in proportion to the magnitude of the cause, that is evidence of a causal link. The more someone smokes, the higher their risk of lung cancer. This dose-response relationship is a classic causal signal.
Method of Residues: Once you have accounted for all known causes and their effects, whatever remains unexplained in the effect is attributed to the candidate causes that remain. This method is often used in scientific discovery, where a new phenomenon is identified by subtracting everything already known.
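The first three methods map neatly onto set operations. The sketch below uses an invented dinner-party dataset (the guest names and dishes are hypothetical) to show the Method of Agreement, the Joint Method, and the Method of Difference. It is an illustration of the logic, not a substitute for real investigation.

```python
# Hypothetical records: what each guest ate, and whether they fell ill.
guests = {
    "Ana":   ({"soup", "chicken", "bread"}, True),
    "Ben":   ({"chicken", "rice", "bread"}, True),
    "Chloe": ({"chicken", "salad", "bread"}, True),
    "Dev":   ({"soup", "rice", "bread"}, False),
    "Eli":   ({"salad", "bread"}, False),
}

ill = [dishes for dishes, sick in guests.values() if sick]
well = [dishes for dishes, sick in guests.values() if not sick]

# Method of Agreement: factors shared by every case where the effect occurred.
agreement = set.intersection(*ill)

# Joint Method: of those, keep only factors absent from every case
# where the effect did not occur.
joint = agreement - set.union(*well)

def method_of_difference(case_with_effect, case_without_effect):
    """Factors present only in the case where the effect occurred."""
    return case_with_effect - case_without_effect

print("agreement:", sorted(agreement))
print("joint:", sorted(joint))
print("difference:", method_of_difference({"soup", "chicken", "bread"},
                                          {"soup", "bread"}))
```

Agreement alone leaves two candidates here, because every ill guest ate both chicken and bread. The Joint Method rules out bread, since the guests who stayed well ate it too.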

Modern Causal Inference Methods in Business

Mill's methods are powerful for structured reasoning, but modern data science has developed more sophisticated statistical tools for business settings. Here is a practical overview of the most widely used methods.

A/B Testing (Randomised Controlled Trial): The most reliable method. Users are randomly assigned to a control group or a treatment group. One group experiences the change; the other does not. A difference in outcomes larger than random variation would explain is attributable to the change. This is ideal for new feature rollouts, pricing experiments, and UI changes.
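Even with random assignment, some difference between groups will appear by chance, so the result is usually checked with a significance test. One common choice for conversion rates is a two-proportion z-test, sketched below. The function name and the conversion counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical experiment: 10,000 users per group.
lift, z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                         conv_b=580, n_b=10_000)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.4f}")
```

A small p-value suggests the lift is unlikely to be noise alone. It says nothing about whether the lift is large enough to matter, which is a separate business judgement.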

Difference-in-Differences (DiD): This method compares how outcomes change over time for a treated group versus an untreated group. Because you are comparing changes rather than levels, fixed pre-existing differences between the groups cancel out. The key assumption is that, without the treatment, both groups would have followed parallel trends. This is useful when randomisation is not possible, such as when a policy change is rolled out across specific cities or regions.
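At its simplest, the DiD estimate is just arithmetic on four averages. The sketch below uses invented figures (average weekly orders per store) and a hypothetical helper name, and it only means something if the parallel-trends assumption is plausible for the groups being compared.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change

# Hypothetical average weekly orders per store.
estimate = diff_in_diff(treated_before=100, treated_after=130,
                        control_before=80, control_after=95)
print(estimate)  # 30 - 15 = 15
```

The treated stores rose by 30, but the untreated stores rose by 15 over the same period, so only 15 of the increase is credited to the treatment.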

Propensity Score Matching: This technique estimates each user's probability of receiving the treatment from their observed characteristics, then pairs treated users with untreated users who have a similar probability. By comparing these matched pairs, you reduce the effect of the confounding variables you were able to measure; confounders you did not measure remain a risk. This is useful when you are working with observational data and no experiment was run.
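The following is a simplified, dependency-free sketch of the idea on simulated data. Two things are assumptions of the sketch rather than standard practice: the propensity score is estimated as the treated share within each value of a single covariate (in practice a model such as logistic regression over many covariates is more typical), and each treated user is compared with the average of all controls sharing the closest score. All names and numbers are hypothetical.

```python
import random
from collections import defaultdict

TRUE_EFFECT = 2.0  # assumed effect, chosen for this simulation
rng = random.Random(2)

def mean(xs):
    return sum(xs) / len(xs)

# Simulated observational data: heavier prior buyers are more likely to be
# treated AND spend more anyway, so a naive comparison is confounded.
users = []
for _ in range(20000):
    prior_purchases = rng.randint(0, 4)
    treated = rng.random() < 0.1 + 0.2 * prior_purchases
    spend = 5.0 * prior_purchases + TRUE_EFFECT * treated + rng.gauss(0, 1)
    users.append((prior_purchases, treated, spend))

naive = (mean([y for _, t, y in users if t])
         - mean([y for _, t, y in users if not t]))

# Step 1: estimate the propensity score as the treated share among users
# with the same covariate value.
counts = defaultdict(lambda: [0, 0])  # covariate -> [n_treated, n_total]
for x, t, _ in users:
    counts[x][0] += t
    counts[x][1] += 1
propensity = {x: n_t / n for x, (n_t, n) in counts.items()}

# Step 2: average control outcome at each propensity score.
controls_by_score = defaultdict(list)
for x, t, y in users:
    if not t:
        controls_by_score[propensity[x]].append(y)
control_mean = {s: mean(ys) for s, ys in controls_by_score.items()}

# Step 3: match each treated user to the controls with the closest score.
scores = list(control_mean)
gaps = []
for x, t, y in users:
    if t:
        nearest = min(scores, key=lambda s: abs(s - propensity[x]))
        gaps.append(y - control_mean[nearest])
matched = mean(gaps)

print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}")
```

In this simulation the naive comparison is far from the built-in effect, while the matched comparison comes close, because the only confounder is one the sketch observes.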

Instrumental Variables: This is one of the more technically demanding methods. It involves finding a third variable, called an instrument, that influences whether someone receives the treatment but affects the outcome only through the treatment and is unrelated to the confounders. This allows you to isolate the causal effect of the treatment even when there are unmeasured confounders.
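With a single instrument, the simplest estimator divides the instrument's covariance with the outcome by its covariance with the treatment. The simulation below is a sketch under strong, stated assumptions: the instrument z is generated at random (so it is unrelated to the hidden confounder u) and enters the outcome only through the treatment x. The coefficients are arbitrary.

```python
import random

TRUE_EFFECT = 2.0  # assumed causal effect of x on y in this simulation
rng = random.Random(4)

def cov(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

z, x, y = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)                      # unmeasured confounder
    zi = 1.0 if rng.random() < 0.5 else 0.0  # instrument, e.g. a random nudge
    xi = zi + u + rng.gauss(0, 1)            # treatment depends on z and u
    yi = TRUE_EFFECT * xi + 3.0 * u + rng.gauss(0, 1)  # z enters only via x
    z.append(zi)
    x.append(xi)
    y.append(yi)

# Naive regression slope of y on x: biased, because u drives both.
naive_slope = cov(x, y) / cov(x, x)

# Instrumental-variable estimate: uses only the variation in x caused by z.
iv_estimate = cov(z, y) / cov(z, x)

print(f"naive slope: {naive_slope:.2f}")
print(f"IV estimate: {iv_estimate:.2f}")
```

The hard part in practice is not the arithmetic but defending the claim that the instrument really has no other route to the outcome.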

Synthetic Control: This method constructs a "synthetic" version of a treated unit as a weighted combination of similar untreated units, with the weights chosen so that the combination tracks the treated unit closely before the intervention. It is particularly useful for evaluating the impact of major policy changes or market-level interventions, where you cannot run an experiment and there is no natural comparison group.
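A toy version with two donor units shows the mechanics. The sales figures are invented, and the weight is found by a simple grid search, which is an assumption of this sketch; real applications typically use many donors and a constrained optimiser.

```python
# Hypothetical monthly sales. The treated city gets the intervention
# after month 4; the two donor cities never do.
treated_pre = [17.0, 16.9, 16.8, 16.7]
treated_post = [19.6, 20.5]
donor_a_pre, donor_a_post = [10, 12, 14, 16], [18, 20]
donor_b_pre, donor_b_post = [20, 19, 18, 17], [16, 15]

def blend(w, a, b):
    """Weighted combination of two donor series (weights w and 1 - w)."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def pre_period_error(w):
    """Squared gap between the treated unit and the blend before treatment."""
    synthetic = blend(w, donor_a_pre, donor_b_pre)
    return sum((t - s) ** 2 for t, s in zip(treated_pre, synthetic))

# Grid search over weights between 0 and 1 in steps of 0.01.
candidate_weights = [i / 100 for i in range(101)]
best_w = min(candidate_weights, key=pre_period_error)

# The synthetic unit's post-period path is the counterfactual estimate.
synthetic_post = blend(best_w, donor_a_post, donor_b_post)
effects = [t - s for t, s in zip(treated_post, synthetic_post)]

print(f"weight on donor A: {best_w:.2f}")
print("estimated effects:", [round(e, 2) for e in effects])
```

The gap between the treated unit and its synthetic twin after the intervention is read as the effect, on the assumption that the pre-period fit would have continued to hold.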

The Biases That Mislead Us

Stephen Few identifies several recurring patterns of causal error in Data Loom. Each one is easy to fall into and harder to spot than you would expect.

Post Hoc Ergo Propter Hoc: This Latin phrase means "after this, therefore because of this." It is the assumption that because one event followed another, the first event caused the second. A company launches a new campaign, and sales go up the following month. Did the campaign cause the increase? Maybe. Or maybe it was seasonality. Or a competitor's product recall. Or an unrelated news event. The sequence alone proves nothing.

Spurious Correlation: Two variables can be highly correlated without having any direct relationship. Sometimes both are being driven by a third, unobserved variable; sometimes the match is pure coincidence, which becomes likely once enough pairs of variables are compared. Tyler Vigen, creator of the Spurious Correlations website, has documented dozens of absurd examples: the number of films Nicolas Cage appeared in correlates with the number of people who drowned in swimming pools in the United States. No one seriously argues that one caused the other. But in messier, more plausible-looking data, these spurious correlations are much harder to dismiss.
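It is easy to manufacture a coincidence of this kind. The sketch below generates a set of short, completely unrelated random series and then searches for the most strongly correlated pair. The number of series, their length, and the seed are arbitrary choices for the illustration.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(3)

def random_walk(length):
    """A series that drifts at random, like many yearly business metrics."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

# 40 independent series of 10 "years" each: 780 possible pairs to compare.
series = [random_walk(10) for _ in range(40)]
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in combinations(range(len(series)), 2)
)

print(f"strongest correlation among unrelated series: {best_r:.2f} {best_pair}")
```

None of these series has anything to do with any other, yet searching across hundreds of pairs reliably turns up a striking match. The same thing happens when an analyst scans a large dashboard for "interesting" relationships.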

Unit Bias: This is the tendency to attribute a complex outcome to a single, prominent factor, rather than considering all the contributing variables. When a product fails, it is tempting to blame the most recent change. When a business succeeds, it is tempting to credit the CEO. Reality is almost always more distributed than that. Unit bias is partly a cognitive shortcut; our brains are not designed to hold many interacting variables in mind at once.

Outcome Bias: This is the tendency to judge a decision by its result, rather than by the quality of the reasoning behind it. If a product launch goes well, the team is praised for their brilliant strategy. If it fails, the same strategy is called reckless. The outcome is allowed to retroactively colour our assessment of the process. This makes it very hard to learn from experience, because our evaluations are contaminated by luck.

Ask Why, Not Just What

Data is extraordinarily good at showing you what is happening. It is far less reliable at telling you why. That gap is where bad decisions live.

The tools exist to close that gap. Mill's methods give you a structured way to reason about causes. Modern causal inference methods give you statistical techniques to test those causes rigorously. And awareness of cognitive biases gives you a defence against the patterns of error that undermine even careful analysis.

The next time you see a striking correlation in your data, pause before drawing a causal conclusion. Ask whether a third variable could explain both. Ask whether the timing alone is doing the causal work. Ask whether you are focusing on one factor because it is the most salient, rather than the most important. Ask whether the outcome is shaping how you are interpreting the process.

Correlation tells you where to look. Causation tells you what to do. Getting from one to the other requires method, rigour, and a healthy scepticism towards your own experience and instincts.

References: Data Loom, Stephen Few. A System of Logic, John Stuart Mill.