Causal Inference for the Brave and True

02 - Randomised Experiments: The Golden Standard #

In the previous session, we saw why and how association is different from causation. We also saw what is required to make association be causation.

\( E[Y|T=1] - E[Y|T=0] = \underbrace{E[Y_1 - Y_0|T=1]}_{ATT} + \underbrace{\{ E[Y_0|T=1] - E[Y_0|T=0] \}}_{BIAS} \)

To recap, association becomes causation if there is no bias. There will be no bias if \(E[Y_0|T=0]=E[Y_0|T=1]\) . In words, association will be causation if the treated and control are equal or comparable, except for their treatment. Or, in more technical words, when the outcome of the untreated is equal to the counterfactual outcome of the treated. Remember that this counterfactual outcome is the outcome of the treated group if they had not received the treatment.

I think we did an OK job explaining how to make association equal to causation in math terms. But that was only in theory. Now, we look at the first tool we have to make the bias vanish: Randomised Experiments . Randomised experiments randomly assign individuals in a population to a treatment or to a control group. The proportion that receives the treatment doesn’t have to be 50%. You could have an experiment where only 10% of your samples get the treatment.
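As a quick illustration (my own sketch, not from the original text), here is how one might randomly assign a treatment to only 10% of a sample; the variable names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1_000        # individuals in the experiment
p_treat = 0.1    # only 10% of the sample gets the treatment

# Each individual is independently assigned to treatment with probability 0.1
treatment = rng.binomial(1, p_treat, size=n)

print(treatment.mean())  # should be close to 0.1
```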

Randomisation annihilates bias by making the potential outcomes independent of the treatment.

\( (Y_0, Y_1) \perp\!\!\!\perp T \)

This can be confusing at first (it was for me). But don’t worry, my brave and true fellow, I’ll explain it further. If the outcome is independent of the treatment, doesn’t this also imply that the treatment has no effect? Well, yes! But notice I’m not talking about the outcomes. Instead, I’m talking about the potential outcomes. The potential outcome is how the outcome would have been under treatment ( \(Y_1\) ) or under control ( \(Y_0\) ). In randomized trials, we don’t want the outcome to be independent of the treatment since we think the treatment causes the outcome. But we want the potential outcomes to be independent of the treatment.


Saying that the potential outcomes are independent of the treatment is saying that they would be, in expectation, the same in the treatment or the control group. In simpler terms, it means that treatment and control groups are comparable. Or that knowing the treatment assignment doesn’t give me any information on how the outcome was previous to the treatment. Consequently, \((Y_0, Y_1)\perp T\) means that the treatment is the only thing generating a difference between the outcome in the treated and in the control group. To see this, notice that independence implies precisely that

\( E[Y_0|T=0]=E[Y_0|T=1]=E[Y_0] \)

Which, as we’ve seen, makes it so that

\( E[Y|T=1] - E[Y|T=0] = E[Y_1 - Y_0]=ATE \)

So, randomization gives us a way to use a simple difference in means between treatment and control and call that the treatment effect.
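Here is a small simulation sketch (my own illustration, with made-up numbers) of why this works: we generate potential outcomes with a known effect, randomize the treatment, and check that the simple difference in means recovers the true ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes: every unit's effect is 2, so the true ATE is 2
y0 = rng.normal(10, 3, size=n)
y1 = y0 + 2

# Random assignment makes (Y0, Y1) independent of T
t = rng.binomial(1, 0.5, size=n)

# We only ever observe Y1 for the treated and Y0 for the control
y = np.where(t == 1, y1, y0)

ate_hat = y[t == 1].mean() - y[t == 0].mean()
print(ate_hat)  # close to the true ATE of 2
```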

In a School Far, Far Away #

In 2020, the Coronavirus Pandemic forced businesses to adapt to social distancing. Delivery services became widespread, and big corporations shifted to a remote work strategy. Schools were no different: many started their own online repositories of classes.

Four months into the crisis, many wondered if the changes that had been introduced could be maintained. There is no question that online learning has its benefits. It is cheaper, since it can save on real estate and transportation. It can also be more digital, leveraging world-class content from around the globe, not just from a fixed set of teachers. Despite all of that, we still need to answer whether online learning has a negative or positive impact on students' academic performance.

One way to answer this is to take students from schools that give mostly online classes and compare them with students from schools that provide lectures in traditional classrooms. As we know by now, this is not the best approach. It could be that online schools attract only the well-disciplined students that do better than average even if the class were presential. In this case, we would have a positive bias, where the treated are academically better than the untreated: \(E[Y_0|T=1] > E[Y_0|T=0]\) .

Or on the flip side, it could be that online classes are cheaper and are composed chiefly of less wealthy students, who might have to work besides studying. In this case, these students would do worse than those from the presential schools even if they took presential classes. If this was the case, we would have a bias in the other direction, where the treated are academically worse than the untreated: \(E[Y_0|T=1] < E[Y_0|T=0]\) .

So, although we could make simple comparisons, it wouldn’t be compelling. One way or another, we could never be sure if there wasn’t any bias lurking around and masking our causal effect.


To solve that, we need to make the treated and untreated comparable: \(E[Y_0|T=1] = E[Y_0|T=0]\). One way to force this is by randomly assigning the online and presential classes to students. If we managed to do that, the treated and untreated would be, on average, the same, except for the treatment they receive.

Fortunately, some economists have done that for us. They’ve randomized classes so that some students were assigned to have face-to-face lectures, others to have only online lessons, and a third group to have a blended format of both online and face-to-face classes. They collected data on a standard exam at the end of the semester.

Here is what the data looks like:

gender asian black hawaiian hispanic unknown white format_ol format_blended falsexam
0 0 0.0 0.0 0.0 0.0 0.0 1.0 0 0.0 63.29997
1 1 0.0 0.0 0.0 0.0 0.0 1.0 0 0.0 79.96000
2 1 0.0 0.0 0.0 0.0 0.0 1.0 0 1.0 83.37000
3 1 0.0 0.0 0.0 0.0 0.0 1.0 0 1.0 90.01994
4 1 0.0 0.0 0.0 0.0 0.0 1.0 1 0.0 83.30000

We can see that we have 323 samples. It’s not exactly big data, but something we can work with. To estimate the causal effect, we can simply compute the mean score for each of the treatment groups.
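A sketch of how that computation might look in pandas (the file name is assumed, and class_format is derived here from the two format indicator columns):

```python
import numpy as np
import pandas as pd

data = pd.read_csv("online_classroom.csv")  # file name assumed for illustration

# Label each student by class format using the indicator columns
data = data.assign(
    class_format=np.select(
        [data["format_ol"].astype(bool), data["format_blended"].astype(bool)],
        ["online", "blended"],
        default="face_to_face",
    )
)

# Mean of every column (including the exam score) by class format
print(data.groupby("class_format").mean())
```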

gender asian black hawaiian hispanic unknown white format_ol format_blended falsexam
class_format
blended 0.550459 0.217949 0.102564 0.025641 0.012821 0.012821 0.628205 0.0 1.0 77.093731
face_to_face 0.633333 0.202020 0.070707 0.000000 0.010101 0.000000 0.717172 0.0 0.0 78.547485
online 0.542553 0.228571 0.028571 0.014286 0.028571 0.000000 0.700000 1.0 0.0 73.635263

Yup. It’s that simple. We can see that face-to-face classes yield a 78.54 average score, while online courses yield a 73.63 average score. Not so good news for the proponents of online learning. The \(ATE\) for an online class is thus -4.91. This means that online classes cause students to perform about 5 points lower , on average. That’s it. You don’t need to worry that online courses might have poorer students that can’t afford face-to-face classes or, for that matter, you don’t have to worry that the students from the different treatments are different in any way other than the treatment they received. By design, the random experiment is made to wipe out those differences.

For this reason, a good sanity check to see if the randomisation was done right (or if you are looking at the correct data) is to check if the treated are equal to the untreated in pre-treatment variables. Our data has information on gender and ethnicity, so we can check whether these are similar across groups. They look pretty similar for the gender , asian , hispanic , and white variables. The black variable, however, seems a little bit different. This draws attention to what can happen with a small dataset. Even under randomisation, it could be that, by chance, one group is different from another. In large samples, this difference tends to disappear.

The Ideal Experiment #

Randomised experiments, or Randomised Controlled Trials (RCTs), are the most reliable way to get causal effects. It's a straightforward technique and absurdly convincing. It is so powerful that most countries require it for showing the effectiveness of new medicines. To make a terrible analogy, you can think of an RCT as Aang, from Avatar: The Last Airbender, while other techniques are more like Sokka. Sokka is cool and can pull some neat tricks here and there, but Aang can bend the four elements and connect with the spiritual world. Think of it this way: if we could, RCTs would be all we would ever do to uncover causality. A well-designed RCT is the dream of any scientist.


Unfortunately, they tend to be either very expensive or just plain unethical. Sometimes, we simply can’t control the assignment mechanism. Imagine yourself as a physician trying to estimate the effect of smoking during pregnancy on baby weight at birth. You can’t simply force a random portion of moms to smoke during pregnancy. Or say you work for a big bank, and you need to estimate the impact of the credit line on customer churn. It would be too expensive to give random credit lines to your customers. Or that you want to understand the impact of increasing the minimum wage on unemployment. You can’t simply assign countries to have one or another minimum wage. You get the point.

We will later see how to lower the cost of randomisation by using conditional randomisation, but there is nothing we can do about unethical or unfeasible experiments. Still, whenever we deal with causal questions, it is worth thinking about the ideal experiment . Always ask yourself: if you could, what would be the perfect experiment you would run to uncover this causal effect? This tends to shed some light on how we can discover the causal effect even without the ideal experiment.

The Assignment Mechanism #

In a randomised experiment, the mechanism that assigns units to one treatment or the other is, well, random. As we will see later, all causal inference techniques somehow try to identify the assignment mechanism of the treatment. When we know for sure how this mechanism behaves, our causal conclusions will be much more credible, even if the assignment mechanism isn't random.

Unfortunately, the assignment mechanism can't be discovered by simply looking at the data. For example, if you have a dataset where higher education correlates with wealth, you can't know for sure which one caused which by just looking at the data. You will have to use your knowledge about how the world works to argue in favor of a plausible assignment mechanism: is it the case that schools educate people, making them more productive and leading them to higher-paying jobs? Or, if you are pessimistic about education, you could say that schools do nothing to increase productivity, and this is just a spurious correlation because only wealthy families can afford to have a kid get a higher degree.

In causal questions, we usually can argue in both ways: that X causes Y, or that it is a third variable Z that causes both X and Y, and hence the X and Y correlation is just spurious. For this reason, knowing the assignment mechanism leads to a much more convincing causal answer. This is also what makes causal inference so exciting. While applied ML is usually just pressing some buttons in the proper order, applied causal inference requires you to seriously think about the mechanism generating that data.

Key Ideas #

We looked at how randomised experiments are the simplest and most effective way to uncover causal impact. It does this by making the treatment and control groups comparable. Unfortunately, we can’t do randomised experiments all the time, but it is still helpful to think about what is the ideal experiment we would do if we could.

Someone familiar with statistics might be protesting right now that I didn't look at the variance of my causal effect estimate. How can I know that a 4.91-point decrease is not due to chance? In other words, how can I know if the difference is statistically significant? And they would be right. Don't worry. I intend to review some statistical concepts next.

References #

I like to think of this entire book as a tribute to Joshua Angrist, Alberto Abadie and Christopher Walters for their amazing Econometrics class. Most of the ideas here are taken from their classes at the American Economic Association. Watching them is what is keeping me sane during this tough year of 2020.

Cross-Section Econometrics

Mastering Mostly Harmless Econometrics

I’d also like to reference the amazing books from Angrist. They have shown me that Econometrics, or ‘Metrics as they call it, is not only extremely useful but also profoundly fun.

Mostly Harmless Econometrics

Mastering ‘Metrics

My final reference is Miguel Hernan and Jamie Robins’ book. It has been my trustworthy companion in the most thorny causal questions I had to answer.

Causal Inference Book

The data used here is from a study of Alpert, William T., Kenneth A. Couch, and Oskar R. Harmon. 2016. “A Randomized Assessment of Online Learning” . American Economic Review, 106 (5): 378-82.


Contribute #

Causal Inference for the Brave and True is an open-source material on causal inference, the statistics of science. Its goal is to be accessible monetarily and intellectually. It uses only free software based on Python. If you found this book valuable and want to support it, please go to Patreon . If you are not ready to contribute financially, you can also help by fixing typos, suggesting edits, or giving feedback on passages you didn’t understand. Go to the book’s repository and open an issue . Finally, if you liked this content, please share it with others who might find it helpful and give it a star on GitHub .

Introduction to Statistics and Data Science


Chapter 7 Randomization and Causality

In this chapter we kick off the third segment of this book: statistical theory. Up until this point, we have focused only on descriptive statistics and exploring the data we have in hand. Very often the data available to us is observational data – data that is collected via a survey in which nothing is manipulated or via a log of data (e.g., scraped from the web). As a result, any relationship we observe is limited to our specific sample of data, and the relationships are considered associational . In this chapter we introduce the idea of making inferences through a discussion of causality and randomization .

Needed Packages

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 1.3 for information on how to install and load R packages.

7.1 Causal Questions

What if we wanted to understand not just if X is associated with Y, but if X causes Y? Examples of causal questions include:

  • Does smoking cause cancer ?
  • Do after school programs improve student test scores ?
  • Does exercise make people happier ?
  • Does exposure to abstinence only education lead to lower pregnancy rates ?
  • Does breastfeeding increase baby IQs ?

Importantly, note that while these are all causal questions, they do not all directly use the word cause . Other words that imply causality include:

  • Increase / decrease

In general, the tell-tale sign that a question is causal is if the analysis is used to make an argument for changing a procedure, policy, or practice.

7.2 Randomized experiments

The gold standard for understanding causality is the randomized experiment . For the sake of this chapter, we will focus on experiments in which people are randomized to one of two conditions: treatment or control. Note, however, that this is just one scenario; for example, schools, offices, countries, states, households, animals, cars, etc. can all be randomized as well, and can be randomized to more than two conditions.

What do we mean by random? Be careful here, as the word “random” is used colloquially differently than it is statistically. When we use the word random in this context, we mean:

  • Every person (or unit) has some chance (i.e., a non-zero probability) of being selected into the treatment or control group.
  • The selection is based upon a random process (e.g., names out of a hat, a random number generator, rolls of dice, etc.)

In practice, a randomized experiment involves several steps.

  • Half of the sample of people is randomly assigned to the treatment group (T), and the other half is assigned to the control group (C).
  • Those in the treatment group receive a treatment (e.g., a drug) and those in the control group receive something else (e.g., business as usual, a placebo).
  • Outcomes (Y) in the two groups are observed for all people.
  • The effect of the treatment is calculated using a simple regression model, \[\hat{y} = b_0 + b_1T \] where \(T\) equals 1 when the individual is in the treatment group and 0 when they are in the control group. Note that using the notation introduced in Section 5.2.2 , this would be the same as writing \(\hat{y} = b_0 + b_1\mathbb{1}_{\mbox{Trt}}(x)\) . We will stick with the \(T\) notation for now, because this is more common in randomized experiments in practice.

For this simple regression model, \(b_1 = \bar{y}_T - \bar{y}_C\) is the observed “treatment effect”, where \(\bar{y}_T\) is the average of the outcomes in the treatment group and \(\bar{y}_C\) is the average in the control group. This means that the “treatment effect” is simply the difference between the treatment and control group averages.
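As a sanity check of this identity, here is a minimal sketch in Python (the chapter itself uses R); the data are simulated, and the point is only that the OLS slope on a binary treatment indicator equals the difference in group means.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
T = rng.binomial(1, 0.5, size=n)
y = 50 + 10 * T + rng.normal(0, 5, size=n)  # true treatment effect of 10
df = pd.DataFrame({"y": y, "T": T})

b1 = smf.ols("y ~ T", data=df).fit().params["T"]
diff_in_means = df.loc[df["T"] == 1, "y"].mean() - df.loc[df["T"] == 0, "y"].mean()
print(b1, diff_in_means)  # the two numbers coincide
```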

7.2.1 Random processes in R

There are several functions in R that mimic random processes. You have already seen one example in Chapters 5 and 6 when we used sample_n to randomly select a specified number of rows from a dataset. The function rbernoulli() is another example, which allows us to mimic the results of a series of random coin flips. The first argument in the rbernoulli() function, n , specifies the number of trials (in this case, coin flips), and the argument p specifies the probability of “success” for each trial. In our coin flip example, we can define “success” to be when the coin lands on heads. If we’re using a fair coin then the probability it lands on heads is 50%, so p = 0.5 .

Sometimes a random process can give results that don’t look random. For example, even though any given coin flip has a 50% chance of landing on heads, it’s possible to observe many tails in a row, just due to chance. In the example below, 10 coin flips resulted in only 3 heads, and the first 6 flips were tails. Note that TRUE corresponds to the notion of “success”, so here TRUE = heads and FALSE = tails.

Importantly, just because the results don’t look random does not mean that the results aren’t random. If we were to repeat this random process, we would get a different set of random results.

Random processes can appear unstable, particularly if they are done only a small number of times (e.g. only 10 coin flips), but if we were to conduct the coin flip procedure thousands of times, we would expect the results to stabilize and see on average 50% heads.

Oftentimes when running a randomized experiment in practice, you want to ensure that exactly half of your participants end up in the treatment group. In this case, you don’t want to flip a coin for each participant, because just by chance, you could end up with 63% of people in the treatment group, for example. Instead, you can imagine each participant having an ID number, which is then randomly sorted or shuffled. You could then assign the first half of the randomly sorted ID numbers to the treatment group, for example. R has many ways of mimicking this type of random assignment process as well, such as the randomizr package.
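A rough sketch of that shuffling idea (in Python here; the chapter points to the randomizr R package for the same purpose):

```python
import numpy as np

rng = np.random.default_rng(7)

ids = np.arange(100)             # participant ID numbers
shuffled = rng.permutation(ids)  # randomly shuffle the IDs

treatment_ids = shuffled[:50]    # first half -> treatment group
control_ids = shuffled[50:]      # second half -> control group
```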

7.3 Omitted variables

In a randomized experiment, we showed in Section 7.2 that we can calculate the estimated causal effect ( \(b_1\) ) of a treatment using a simple regression model.

Why can’t we use the same model to determine causality with observational data? Recall our discussion from Section 5.3.2 . We have to be very careful not to make unwarranted causal claims from observational data, because there may be an omitted variable (Z), also known as a confounder :


Here are some examples:

  • There is a positive relationship between sales of ice cream (X) from street vendors and crime (Y). Does this mean that eating ice cream causes increased crime? No. The omitted variable is the season and weather (Z). That is, there is a positive relationship between warm weather (Z) and ice cream consumption (X) and between warm weather (Z) and crime (Y).
  • Students that play an instrument (X) have higher grades (Y) than those that do not. Does this mean that playing an instrument causes improved academic outcomes? No. Some omitted variables here could be family socio-economic status and student motivation. That is, there is a positive relationship between student motivation (and a family with resources) (Z) and likelihood of playing an instrument (X) and between motivation / resources and student grades (Y).
  • Countries that eat a lot of chocolate (X) also win the most Nobel Prizes (Y). Does this mean that higher chocolate consumption leads to more Nobel Prizes? No. The omitted variable here is a country’s wealth (Z). Wealthier countries win more Nobel Prizes and also consume more chocolate.

Examples of associations that are misinterpreted as causal relationships abound. To see more examples, check out this website: https://www.tylervigen.com/spurious-correlations .

7.4 The magic of randomization

If omitted variables / confounders are such a threat to determining causality in observational data, why aren’t they also a threat in randomized experiments?

The answer is simple: randomization . Because people are randomized to treatment and control groups, on average there is no difference between these two groups on any characteristics other than their treatment .

This means that before the treatment is given, on average the two groups (T and C) are equivalent to one another on every observed and unobserved variable. For example, the two groups should be similar in all pre-treatment variables: age, gender, motivation levels, heart disease, math ability, etc. Thus, when the treatment is assigned and implemented, any differences between outcomes can be attributed to the treatment .

7.4.1 Randomization Example

Let’s see the magic of randomization in action. Imagine that we have a promising new curriculum for teaching math to Kindergarteners, and we want to know whether or not the curriculum is effective. Let’s explore how a randomized experiment would help us test this. First, we’ll load in a dataset called ed_data . This data originally came from the Early Childhood Longitudinal Study (ECLS) program but has been adapted for this example. Let’s take a look at the data.

It includes information on 335 Kindergarten students: indicator variables for whether they are female or minority students, information on their parents’ highest level of education, a continuous measure of their socio-economic status (SES), and their reading and math scores. For our purposes, we will assume that these are all pre-treatment variables that are measured on students at the beginning of the year, before we conduct our (hypothetical) randomized experiment. We also have included two variables Trt_rand and Trt_non_rand for demonstration purposes, which we will describe below.

In order to conduct our randomized experiment, we could randomly assign half of the Kindergarteners to the treatment group to receive the new curriculum, and the other half of the students to the control group to receive the “business as usual” curriculum. Trt_rand is the result of this random assignment , and is an indicator variable for whether the student is in the treatment group ( Trt_rand == 1 ) or the control group ( Trt_rand == 0 ). By inspecting this variable, we can see that 167 students were assigned to treatment and 168 were assigned to control.

Remember that because this treatment assignment was random , we don’t expect a student’s treatment status to be correlated with any of their other pre-treatment characteristics. In other words, students in the treatment and control groups should look approximately the same on average . Looking at the means of all the numeric variables by treatment group, we can see that this is true in our example. Note how the summarise_if function is working here; if a variable in the dataset is numeric, then it is summarized by calculating its mean .

Both the treatment and control groups appear to be approximately the same on average on the observed characteristics of gender, minority status, SES, and pre-treatment reading and math scores. Note that since FEMALE is coded as 0 - 1, the “mean” is simply the proportion of students in the dataset that are female. The same is true for MINORITY .

In our hypothetical randomized experiment, after randomizing students into the treatment and control groups, we would then implement the appropriate (new or business as usual) math curriculum throughout the school year. We would then measure student math scores again at the end of the year, and if we observed that the treatment group was scoring higher (or lower) on average than the control group, we could attribute that difference entirely to the new curriculum. We would not have to worry about other omitted variables being the cause of the difference in test scores, because randomization ensured that the two groups were equivalent on average on all pre-treatment characteristics, both observed and unobserved.

In comparison, in an observational study, the two groups are not equivalent on these pre-treatment variables. Continuing the same example, let us imagine that, instead of being randomly assigned, students with lower SES are assigned to the new specialized curriculum ( Trt_non_rand = 1 ), and those with higher SES are assigned to the business as usual curriculum ( Trt_non_rand = 0 ). The indicator variable Trt_non_rand is the result of this non-random treatment group assignment process.

In this case, the table of comparisons between the two groups looks quite different:

There are somewhat large differences between the treatment and control group on several pre-treatment variables in addition to SES (e.g. % minority, and reading and math scores). Notice that the two groups still appear to be balanced in terms of gender. This is because gender is in general not associated with SES. However, minority status and test scores are both correlated with SES, so assigning treatment based on SES (instead of via a random process) results in an imbalance on those other pre-treatment variables. Therefore, if we observed differences in test scores at the end of the year, it would be difficult to disambiguate whether the differences were caused by the intervention or due to some of these other pre-treatment differences.

7.4.2 Estimating the treatment effect

Imagine that the truth about this new curriculum is that it raises student math scores by 10 points, on average. We can use R to mimic this process and randomly generate post-test scores that raise the treatment group’s math scores by 10 points on average, but leave the control group math scores largely unchanged. Note that we will never know the true treatment effect in real life - the treatment effect is what we’re trying to estimate; this is for demonstration purposes only.

We use another random process function in R, rnorm() to generate these random post-test scores. Don’t worry about understanding exactly how the code below works, just note that in both the Trt_rand and Trt_non_rand case, we are creating post-treatment math scores that increase a student’s score by 10 points on average, if they received the new curriculum.
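A rough Python analogue of the idea might look like the sketch below (the actual chapter uses R's rnorm(); the data frame, seed, and noise level here are stand-ins for demonstration, not the real ed_data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)
n = 335

# Toy stand-in for ed_data, with column names borrowed from the text
ed_data = pd.DataFrame({
    "MATH_pre": rng.normal(30, 8, size=n),
    "Trt_rand": rng.permutation(np.r_[np.ones(167), np.zeros(168)]).astype(int),
})

# Post-treatment score = pre-treatment score + 10 points if treated + noise
ed_data["MATH_post_trt_rand"] = (
    ed_data["MATH_pre"] + 10 * ed_data["Trt_rand"] + rng.normal(0, 1, size=n)
)
```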

By looking at the first 10 rows of this data in Table 7.1 , we can convince ourselves that both the MATH_post_trt_rand and MATH_post_trt_non_rand scores reflect this truth that the treatment raises test scores by 10 points, on average. For example, we see that for student 1, they were assigned to the treatment group in both scenarios and their test scores increased from about 18 to about 28. Student 2, however, was only assigned to treatment in the second scenario, and their test scores increased from about 31 to 41, but in the first scenario since they did not receive the treatment, their score stayed at about 31. Remember that here we are showing two hypothetical scenarios that could have occurred for these students - one if they were part of a randomized experiment and one where they were part of an observational study - but in real life, the study would only be conducted one way on the students and not both.

TABLE 7.1: Math scores for first 10 students, under random and non-random treatment assignment scenarios
ID MATH_pre Trt_rand MATH_post_trt_rand Trt_non_rand MATH_post_trt_non_rand
1 18.7 1 28.4 1 28.9
2 30.6 0 31.2 1 40.8
3 31.6 0 32.2 0 31.3
4 31.4 0 32.0 1 41.6
5 24.2 1 34.0 1 34.4
6 49.8 1 59.5 0 49.5
7 27.1 0 27.7 1 37.3
8 27.4 0 27.9 1 37.5
9 25.1 1 34.9 1 35.3
10 41.9 1 51.6 0 41.6

Let’s examine how students in each group performed on the post-treatment math assessment on average in the first scenario where they were randomly assigned (i.e. using Trt_rand and MATH_post_trt_rand ).

Remember that in a randomized experiment, we calculate the treatment effect by simply taking the difference in the group averages (i.e. \(\bar{y}_T - \bar{y}_C\) ), so here our estimated treatment effect is \(49.6 - 39.6 = 10.0\) . Recall that we said this could be estimated using the simple linear regression model \(\hat{y} = b_0 + b_1T\) . We can fit this model in R to verify that our estimated treatment effect is \(b_1 = 10.0\) .

Let’s also look at the post-treatment test scores by group for the non-randomized experiment case.

Note that even though the treatment raised student scores by 10 points on average, in the observational case we estimate the treatment effect is much smaller. This is because treatment was confounded with SES and other pre-treatment variables, so we could not obtain an accurate estimate of the treatment effect.

7.5 If you know Z, what about multiple regression?

In the previous sections, we made clear that you cannot calculate the causal effect of a treatment using a simple linear regression model unless you have random assignment. What about a multiple regression model?

The answer here is more complicated. We’ll give you an overview, but note that this is a tiny sliver of an introduction and that there is an entire field of methods devoted to this problem. The field is called causal inference methods and focuses on the conditions under which, and the methods with which, you can calculate causal effects in observational studies.

Recall, we said before that in an observational study, the reason you can’t attribute causality between X and Y is because the relationship is confounded by an omitted variable Z. What if we included Z in the model (making it no longer omitted), as in:

\[\hat{y} = b_0 + b_1T + b_2Z\] As we learned in Chapter 6 , we can now interpret the coefficient \(b_1\) as the estimated effect of the treatment on outcomes, holding constant (or adjusting for) Z .

Importantly, the relationship between T and Y, adjusting for Z, can be similar to or different from the relationship between T and Y alone. You simply cannot know in advance which it will be.

Let’s again look at our model fit1_non_rand that looked at the relationship between treatment and math scores, and compare it to a model that adjusts for the confounding variable SES.

The two models give quite different indications of how effective the treatment is. In the first model, the estimate of the treatment effect is 2.176, but in the second model once we control for SES, the estimate is 8.931. Again, this is because in our non-random assignment scenario, treatment status was confounded with SES.
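A small simulation sketch of this phenomenon (illustrative only; the numbers will not match the chapter's ed_data): when assignment depends on a confounder Z, the unadjusted estimate is biased, and adding Z to the regression recovers something close to the true effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5_000

Z = rng.normal(0, 1, size=n)                         # confounder (think SES)
T = rng.binomial(1, 1 / (1 + np.exp(Z)), size=n)     # lower Z -> more likely treated
y = 40 + 10 * T + 8 * Z + rng.normal(0, 5, size=n)   # true treatment effect is 10
df = pd.DataFrame({"y": y, "T": T, "Z": Z})

print(smf.ols("y ~ T", data=df).fit().params["T"])      # biased (well below 10)
print(smf.ols("y ~ T + Z", data=df).fit().params["T"])  # close to 10
```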

Importantly, in the randomized experiment case, controlling for confounders using a multiple regression model is not necessary - again, because of the randomization. Let’s look at the same two models using the data from the experimental case (i.e. using Trt_rand and MATH_post_trt_rand ).

We can see that both models give estimates of the treatment effect that are roughly the same (10.044 and 9.891), regardless of whether or not we control for SES. This is because randomization ensured that the treatment and control group were balanced on all pre-treatment characteristics - including SES, so there is no need to control for them in a multiple regression model.

7.6 What if you don’t know Z?

In the observational case, if you know the process through which people are assigned to or select treatment then the above multiple regression approach can get you pretty close to the causal effect of the treatment on the outcomes. This is what happened in our fit2_non_rand model above where we knew treatment was determined by SES, and so we controlled for it in our model.

But this is rarely the case. In most studies, selection of treatment is not based on a single variable . That is, before treatment occurs, those that will ultimately receive the treatment and those that do not might differ in a myriad of ways. For example, students that play instruments may not only come from families with more resources and have higher motivation, but may also play fewer sports, already be great readers, have a natural proclivity for music, or come from a musical family. As an analyst, it is typically very difficult – if not impossible – to know how and why some people selected a treatment and others did not.

Without randomization, here is the best approach:

  • Remember: your goal is to approximate a random experiment. You want the two groups to be similar on any and all variables that are related to uptake of the treatment and the outcome.
  • Think about the treatment selection process. Why would people choose to play an instrument (or not)? Attend an after-school program (or not)? Be part of a sorority or fraternity (or not)?
  • Look for variables in your data that you can use in a multiple regression to control for these other possible confounders. Pay attention to how your estimate of the treatment impact changes as you add these into your model (often it will decrease).
  • State very clearly the assumptions you are making, the variables you have controlled for, and the possible other variables you were unable to control for. Be tentative in your conclusions and make clear their limitations – that this work is suggestive and that future research – a randomized experiment – would be more definitive.

7.7 Conclusion

In this chapter we’ve focused on the role of randomization in our ability to make inferences – here about causation. As you will see in the next few chapters, randomization is also important for making inferences from outcomes observed in a sample to their values in a population. But the importance of randomization goes even deeper than this – one could say that randomization is at the core of inferential statistics .

In situations in which treatment is randomly assigned or a sample is randomly selected from a population, as a result of knowing this mechanism , we are able to imagine and explore alternative realities – what we will call counter-factual thinking (Chapter 9 ) – and form ways of understanding when “effects” are likely (or unlikely) to be found simply by chance – what we will call proof by stochastic contradiction (Chapter 11 ).

Finally, we would be remiss to end this chapter without including this XKCD comic, which every statistician loves:



Causal inference: basic concepts and randomized experiments.

Hyunseung Kang

April 3, 2024

Concepts Covered Today

  • Association versus causation
  • Defining causal quantities with counterfactual/potential outcomes
  • Connection to missing data
  • Identification of the average treatment effect in a completely randomized experiment
  • Covariate balance

Does daily smoking cause a decrease in lung function?

Data: 2009-2010 National Health and Nutrition Examination Survey (NHANES) .

  • Treatment ( \(A\) ): Daily smoker ( \(A = 1\) ) vs. never smoker ( \(A = 0\) )
  • Outcome ( \(Y\) ): ratio of forced expiratory volume in one second over forced vital capacity. \(Y \geq\) 0.8 is good lung function!
  • Sample size is \(n=\) 2360.
A Subset of the Observed Data
Lung Function (Y) Smoking Status (A)
0.940 Never
0.918 Never
0.808 Daily
0.838 Never

Association of Smoking and Lung Function


  • \(\overline{Y}_{\rm daily (A = 1) }=\) 0.75 and \(\overline{Y}_{\rm never (A = 0)}=\) 0.81.
  • \(t\) -stat \(=\) -11.8, two-sided p value: \(\ll 10^{-16}\)

Daily smoking is strongly associated with a 0.06 reduction in lung function.
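As an illustration of how such numbers might be computed (the data below are synthetic, chosen only to roughly mimic the reported group means; they are not the NHANES extract):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2_360

# Synthetic stand-ins: A = daily smoker indicator, Y = FEV1/FVC ratio
a = rng.binomial(1, 0.3, size=n)
y = np.where(a == 1, rng.normal(0.75, 0.06, size=n), rng.normal(0.81, 0.06, size=n))

print(y[a == 1].mean() - y[a == 0].mean())                     # difference in means
print(stats.ttest_ind(y[a == 1], y[a == 0], equal_var=False))  # Welch t-test
```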

But, is the strong association evidence for causality ?

Definition of Association

Association : \(A\) is associated with \(Y\) if \(A\) is informative about \(Y\)

  • If you smoke daily \((A = 1)\) , then it’s likely that your lungs aren’t functioning well ( \(Y\) ).
  • If smoking status doesn’t provide any information about lung function, \(A\) is not associated with \(Y\) .

Formally, \(A\) is associated with \(Y\) if \(\mathbb{P}(Y | A) \neq \mathbb{P}(Y)\) .

Some parameters that measure association:

  • Population difference in means: \(\mathbb{E}[Y | A=1] - \mathbb{E}[Y | A=0]\)
  • Population covariance: \({\rm cov}(A,Y) = \mathbb{E}[ (A - \mathbb{E}[A])(Y - \mathbb{E}[Y])]\)

Estimators/tests that measure association:

  • Sample difference in means, regression, etc.
  • Two-sample t-tests, Wilcoxon signed-rank test, etc.

Defining Causation: Parallel Universe Analogy


Suppose John’s lung functions are different between the two universes.

  • The difference in lung functions can only be attributed to the difference in smoking status.
  • Why? All variables (except smoking status) are the same between the two parallel universes.

Key Point : comparing outcomes between parallel universes enables us to say that any difference in the outcome must be due to a difference in the treatment status.

This provides a basis for defining a causal effect of \(A\) on \(Y\) .

Counterfactual/Potential Outcomes

Notation for outcomes in parallel universes:

  • \(Y(1)\) : counterfactual/potential lung function if you smoked (i.e. parallel world where you smoked)
  • \(Y(0)\) : counterfactual/potential lung function if you didn’t smoke (i.e. parallel world where you didn’t smoke)

Similar to the observed data table, we can create counterfactual/potential outcomes data table.

\(Y(1)\) \(Y(0)\)
John 0.5 0.9
Sally 0.8 0.8
Kate 0.9 0.6
Jason 0.6 0.9

For pedagogy, we’ll assume that all data tables are an i.i.d. sample from some population (i.e.  \(Y_i(1), Y_i(0) \overset{\text{i.i.d.}}{\sim} \mathbb{P}\{Y(1),Y(0)\}\) ).

Similar to the observed data \((Y,A)\) , you can think of the counterfactual data table as an i.i.d. sample from some population distribution of \(Y(1),Y(0)\) (i.e.  \(Y_i(1), Y_i(0) \overset{\text{i.i.d.}}{\sim} \mathbb{P}\{Y(1),Y(0)\}\) )

  • This is often referred to as the super-population framework.
  • Expectations are defined with respect to the population distribution (i.e.  \(\mathbb{E}[Y(1)] = \int y \mathbb{P}(Y(1) = y)dy\) )
  • The population distribution is fixed and the sampling generates the source of randomness (i.e. i.i.d. draws from \(\mathbb{P}\{Y(1),Y(0)\}\) , perhaps \(\mathbb{P}\{Y(1),Y(0)\}\) is jointly Normal?)
  • For asymptotic analysis, \(\mathbb{P}\{Y(1),Y(0)\}\) is usually fixed (i.e.  \(\mathbb{P}\{Y(1),Y(0)\}\) does not vary with sample size \(n\) ). In high dimensional regimes, \(\mathbb{P}\{Y(1),Y(0)\}\) will vary with \(n\) .

Or, you can think of \(n=4\) as the entire population.

  • This is often referred to as the finite population / randomization inference or design-based framework.
  • Expectations are defined with respect to the table above (i.e.  \(\mathbb{E}[Y(1)] = (0.5+0.8+0.9+0.6)/4 =0.7\) )
  • The counterfactual data table is the population and the treatment assignment (i.e. which counterfactual universe you get to see; see below) generates the randomness and the observed sample.
  • For asymptotic analysis, both the population (i.e. the counterfactual data table) and the sample changes with \(n\) . In some vague sense, asymptotic analysis under the finite sample framework is inherently high dimensional.

Finally, you can think of data above as a simple random sample of size \(n\) from a finite population of size \(0 < n < N < \infty\) .

The latter two frameworks are uncommon in typical statistics courses, especially the second one. However, it’s very popular among some circles of causal inference folks (e.g. Rubin, Rosenbaum and their students). The appendix of Erich Leo Lehmann ( 2006 ) , Rosenbaum ( 2002b ) , and Li and Ding ( 2017 ) provide a list of technical tools to conduct this type of inference.

There has been a long debate about which is the “right” framework for inference. My understanding is that it’s now (i.e. Apr. 2024) a matter of personal taste. Also, as Paul Rosenbaum puts it:

In most cases, their disagreement is entirely without technical consequence: the same procedures are used, and the same conclusions are reached…Whatever Fisher and Neyman may have thought, in Lehmann’s text they work together. (Page 40, Rosenbaum ( 2002b ) )

The textbook that Paul is referring to is (now) Erich L. Lehmann and Romano ( 2006 ) . Note that this quote touches on another debate in the literature in finite-sample inference, which is what is the correct null hypothesis to test. In general, it’s good to be aware of the differences between the frameworks and, as Lehmann did (see full quote), use the strengths of each different frameworks. For some interesting discussions on this topic, see Robins ( 2002 ) , Rosenbaum ( 2002a ) , Chapter 2.4.5 of Rosenbaum ( 2002b ) , and Abadie et al. ( 2020 ) . For other papers in this area, see Splawa-Neyman, Dabrowska, and Speed ( 1990 ) , Freedman and Lane ( 1983 ) , Freedman ( 2008 ) , and Lin ( 2013 ) .

Causal Estimands

Some quantities/parameters from the counterfactual outcomes:

  • \(Y_{\rm John}(1) - Y_{\rm John}(0) = -0.4\) : Causal effect of John smoking versus not smoking (i.e.  individual treatment effect )
  • \(\mathbb{E}[Y(1)]\) : Average of counterfactual outcomes when everyone is a daily smoker.
  • \(\mathbb{E}[Y(1) - Y(0)]\) : Difference in the average counterfactual outcomes when everyone is smoking versus when everyone is not smoking (i.e.  average treatment effect, ATE )

A causal estimand/parameter is a function of the counterfactual outcomes.

Counterfactual Data Versus Observed Data

Table 1: Comparison of tables.

(a) Counterfactual table
\(Y(1)\) \(Y(0)\)
John 0.5 0.9
Sally 0.8 0.8
Kate 0.9 0.6
Jason 0.6 0.9
(b) Observed table
\(Y\) \(A\)
John 0.9 0
Sally 0.8 1
Kate 0.6 0
Jason 0.6 1

For both, we can define parameters (i.e.  \(\mathbb{E}[Y]\) or \(\mathbb{E}[Y(1)]\) ) and take i.i.d. samples from their respective populations to learn them.

  • \(Y_i(1), Y_i(0) \overset{\text{i.i.d.}}{\sim} \mathbb{P}\{Y(1), Y(0)\}\) and \(\mathbb{P}\) is Uniform, etc.

If we can observe the counterfactual table, we can run your favorite statistical methods and estimate/test causal estimands.

The Main Problem of Causal Inference

If we can observe all counterfactual outcomes, causal inference reduces to doing usual statistical analysis with \(Y(0),Y(1)\) .

But, in many cases, we don’t get to observe all counterfactual outcomes.

A key goal in causal inference is to learn about the counterfactual outcomes \(Y(1), Y(0)\) from the observed data \((Y,A)\) .

  • How do we learn about causal parameters (e.g.  \(\mathbb{E}[Y(1)]\) ) from the observed data \((Y,A)\)
  • What causal parameters are impossible to learn from the observed data?

Addressing this type of question is referred to as causal identification .

Causal Identification: SUTVA or Causal Consistency

First, let’s make the following assumption known as stable unit treatment value assumption (SUTVA) or causal consistency ( Rubin ( 1980 ) , page 4 of Hernán and Robins ( 2020 ) ).

\[Y = AY(1) + (1-A) Y(0)\]

Equivalently,

\[Y = Y(A) \text{ or if } A=a, Y = Y(a)\]

The assumption states that the observed outcome is one realization of the counterfactual outcomes.

  • It also states that there are no multiple versions of treatment.
  • It also states that there is no interference , a term coined by Cox ( 1958 ) .

No Multiple Versions of Treatment

Daily smoking (i.e.  \(A=1\) ) can include different types of smokers

  • Daily smoker who smokes one pack of cigarettes per day
  • Daily smoker who smokes one cigarette per day
  • Daily smoker who vapes per day

The current \(Y(1)\) does not distinguish outcomes between different types of smokers.

We can define counterfactual outcomes for all kinds of daily smokers, say \(Y(k)\) for \(k=1,\ldots,K\) type of daily smokers. But, if \(A=1\) , which counterfactual outcome should this correspond to?

SUTVA eliminates these variations in the counterfactuals. Or, if the \(Y(k)\) exist, it assumes that these variations coincide: \(Y(1) = Y(2) = \ldots = Y(K)\) .

Implicitly, SUTVA forces you to define meaningful \(Y(a)\) . Some authors restrict counterfactual outcomes to be based on well-defined interventions or “no causation without manipulation” ( Holland ( 1986 ) , Hernán and Taubman ( 2008 ) , Cole and Frangakis ( 2009 ) , VanderWeele ( 2009 ) ).

A healthy majority of people in causal inference argue that the counterfactual outcomes of race and gender are ill-defined. For example, suppose we’re interested in whether being a female causes lower income. We could define the counterfactual outcomes as

  • \(Y(1)\) : Jamie’s income when Jamie is female
  • \(Y(0)\) : Jamie’s income when Jamie is not female

Similarly, we are interested in whether being a black person causes lower income, we could define the counterfactual outcomes as

  • \(Y(1)\) : Jamie’s income when Jamie is black
  • \(Y(0)\) : Jamie’s income when Jamie is not black

But, if Jamie is a female, can there be a parallel universe where Jamie is a male? That is, is there a universe where everything else is the same (i.e. Jamie’s whole life experience up to 2024, education, environment, maybe Jamie gave birth to kids), but Jamie is now a male instead of a female?

Note that we can still measure the association of gender on income, for instance with a linear regression of income (i.e \(Y\) ) on gender (i.e.  \(A\) ). This is a well-defined quantity.

There is an interesting set of papers on this topic: VanderWeele and Robinson ( 2014 ) , Vandenbroucke, Broadbent, and Pearce ( 2016 ) , Krieger and Davey Smith ( 2016 ) , VanderWeele ( 2016 ) . See Volume 45, Issue 6, 2016 issue of the International Journal of Epidemiology.

Some even take this example further and argue whether counterfactual outcomes are well-defined in the first place; see Dawid ( 2000 ) and a counterpoint in Sections 1.1, 2 and 3 of Robins and Greenland ( 2000 ) .

No Interference

Suppose we want to study the causal effect of getting the measles vaccine on getting the measles. Let’s define the following counterfactual outcomes:

  • \(Y(0)\) : Jamie’s counterfactual measles status when Jamie is not vaccinated
  • \(Y(1)\) : Jamie’s counterfactual measles status when Jamie is vaccinated

Suppose Jamie has a sibling Alex and let’s entertain the possible values of Jamie’s \(Y(0)\) based on Alex’s vaccination status.

  • Jamie’s counterfactual measles status when Alex is vaccinated.
  • Jamie’s counterfactual measles status when Alex is not vaccinated.

The current \(Y(0)\) does not distinguish between the two counterfactual outcomes.

We can again define counterfactual outcomes to incorporate this scenario, say \(Y(a,b)\) where \(a\) refers to Jamie’s vaccination status and \(b\) refers to Alex’s vaccination status.

SUTVA states that Jamie’s outcome only depends on Jamie’s vaccination status, not Alex’s vaccination status. Or, more precisely \(Y(a,b) = Y(a,b')\) for all \(a,b,b'\) .

In some studies, the no interference assumption is not plausible (e.g. vaccine studies, peer effects in classrooms/neighborhoods, air pollutions). Rosenbaum ( 2007 ) has a nice set of examples of when the no interference assumption is not plausible.

There is a lot of ongoing work on this topic ( Rosenbaum ( 2007 ) , Hudgens and Halloran ( 2008 ) , Tchetgen and VanderWeele ( 2012 ) ). I am interested in this area as well; let me know if you want to learn more.

Causal Identification and Missing Data

Once we assume SUTVA (i.e.  \(Y= AY(1) + (1-A)Y(0)\) ), causal identification can be seen as a problem in missing data.

\(Y(1)\) \(Y(0)\) \(Y\) \(A\)
John NA 0.9 0.9 0
Sally 0.8 NA 0.8 1
Kate NA 0.6 0.6 0
Jason 0.6 NA 0.6 1

Under SUTVA, we only see one of the two counterfactual outcomes based on \(A\) .

  • \(A\) serves as the “missingness” indicator where \(A=1\) implies \(Y(1)\) is observed and \(A=0\) implies \(Y(0)\) is observed.
  • \(Y\) is the “observed” value.
  • Being able to only observe one counterfactual outcome in the observed data is known as the “fundamental problem of causal inference” (page 476 of Holland ( 1988 ) ).
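A tiny sketch of this bookkeeping, using the four counterfactual values from the table above (the construction of \(Y\) is just SUTVA):

```python
import numpy as np
import pandas as pd

tbl = pd.DataFrame(
    {"Y1": [0.5, 0.8, 0.9, 0.6], "Y0": [0.9, 0.8, 0.6, 0.9], "A": [0, 1, 0, 1]},
    index=["John", "Sally", "Kate", "Jason"],
)

# SUTVA: Y = A*Y(1) + (1-A)*Y(0), the counterfactual under the treatment received
tbl["Y"] = np.where(tbl["A"] == 1, tbl["Y1"], tbl["Y0"])

# The counterfactual that was not received is missing (NA) in the observed data
tbl.loc[tbl["A"] == 0, "Y1"] = np.nan
tbl.loc[tbl["A"] == 1, "Y0"] = np.nan
print(tbl)
```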

Assumption on Missingness Pattern

\(Y(1)\) \(Y(0)\) \(Y\) \(A\)
John NA 0.9 0.9 0
Sally 0.8 NA 0.8 1
Kate NA 0.6 0.6 0
Jason 0.6 NA 0.6 1

Suppose we are interested in learning the causal estimand \(\mathbb{E}[Y(1)]\) (i.e. the mean of the first column).

One approach would be to take the average of the “complete cases” (i.e. Sally’s 0.8 and Jason’s 0.6).

  • Formally, we would use \(\mathbb{E}[Y | A=1]\) , the mean of the observed outcome \(Y\) among \(A=1\) .
  • This approach is valid if the entries of the first column are missing completely at random (MCAR)
  • In other words, the missingness indicator \(A\) flips a random coin per each individual and decides whether its \(Y(1)\) is missing or not.

This is essentially akin to a randomized experiment.

Formal Statement of MCAR

Formally, MCAR can be stated as \[A \perp Y(1) \text{ and } 0 < \mathbb{P}(A=1)\]

  • Missingness occurs completely at random in the rows of the first column, say by a flip of a random coin.
  • Missingness doesn’t occur more frequently for higher values of \(Y(1)\) ; this would violate \(A \perp Y(1)\) .
  • If \(\mathbb{P}(A=1) =0\) , then all entries of the column \(Y(1)\) are missing and we can’t learn anything about its column mean.

Formal Proof of Causal Identification of \(\mathbb{E}[Y(1)]\)

Suppose SUTVA and MCAR hold:

  • (A1): \(Y = A Y(1) + (1-A) Y(0)\)
  • (A2): \(A \perp Y(1)\)
  • (A3): \(0 < \mathbb{P}(A=1)\)

Then, we can identify the causal estimand \(\mathbb{E}[Y(1)]\) by writing it as the following function of the observed data \(\mathbb{E}[Y | A=1]\) : \[\begin{align*} \mathbb{E}[Y | A=1] &= \mathbb{E}[AY(1) + (1-A)Y(0) | A=1] && \text{(A1)} \\ &= \mathbb{E}[Y(1)|A=1] && \text{Definition of conditional expectation} \\ &= \mathbb{E}[Y(1)] && \text{(A2)} \end{align*}\] (A3) is used to ensure that \(\mathbb{E}[Y | A=1]\) is a well-defined quantity.

Technically speaking, to establish \(\mathbb{E}[Y(1)] = \mathbb{E}[Y | A=1]\) , we only need \(\mathbb{E}[Y(1) | A=1] = \mathbb{E}[Y(1)]\) and \(0 < \mathbb{P}(A=1)\) instead of \(A \perp Y(1)\) and \(0 < \mathbb{P}(A=1)\) ; note that \(A \perp Y(1)\) is equivalent to \(\mathbb{P}(Y(1) | A=1) = \mathbb{P}(Y(1))\) . In words, we only need \(A\) to be unrelated to \(Y(1)\) in expectation , not necessarily in the entire distribution.

Causal Identification of the ATE

In a similar vein, to identify the ATE \(\mathbb{E}[Y(1)-Y(0)]\) , a natural approach would be to use \(\mathbb{E}[Y | A=1] - \mathbb{E}[Y | A=0]\) .

This approach would be valid under the following variation of the MCAR assumption: \[A \perp Y(0),Y(1), \quad{} 0 < \mathbb{P}(A=1) < 1\]

  • The first part states that the treatment \(A\) is independent of \(Y(1), Y(0)\) . This is called exchangeability or ignorability in causal inference.
  • \(0 < \mathbb{P}(A=1) <1\) states that there is a non-zero probability of observing some entries from the column \(Y(1)\) and from the column \(Y(0)\) . This is called positivity or overlap in causal inference.

Note that \(A \perp Y(1)\) (i.e. missingness indicator \(A\) for the \(Y(1)\) column is completely random) is not equivalent to \(A \perp Y\) (i.e.  \(A\) is not associated with \(Y\) ), with or without SUTVA.

  • Without SUTVA, \(Y\) and \(Y(1)\) are completely different variables and thus, the two statements are generally not equivalent to each other. In other words, \(A \perp Y(1)\) makes an assumption about the counterfactual outcome whereas \(A \perp Y\) makes an assumption about the observed outcome.
  • With SUTVA, \(Y = Y(1)\) only if \(A =1\) and thus, \(A \perp Y(1)\) does not necessarily imply that \(Y = AY(1) + (1-A)Y(0)\) is independent of \(A\) . To put it differently, \(A \perp Y(1)\) only tells about the lack of relationship between the column of \(Y(1)\) and the column of \(A\) . In contrast, \(A \perp Y\) tells me about the lack of relationship between the column \(Y\) , which is a mix of \(Y(1)\) and \(Y(0)\) under SUTVA, and the column of \(A\) .

Formal Proof of Causal Identification of the ATE

  • (A2): \(A \perp Y(1), Y(0)\)
  • (A3): \(0 < \mathbb{P}(A=1) < 1\)

Then, we can identify the ATE from the observed data via: \[\begin{align*} &\mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0] \\ =& \mathbb{E}[AY(1) + (1-A)Y(0) | A=1] \\ & \quad{} - \mathbb{E}[AY(1) + (1-A)Y(0) | A=0] && \text{(A1)} \\ =& \mathbb{E}[Y(1)|A=1] - \mathbb{E}[Y(0) | A=0] && \text{Definition of conditional expectation} \\ =& \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] && \text{(A2)} \end{align*}\]

(A3) ensures that the conditioning events in \(\mathbb{E}[\cdot |A=0]\) and \(\mathbb{E}[\cdot |A=1]\) are well-defined.

Suppose there is no association between \(A\) and \(Y\) , i.e., \(A \perp Y\) , and suppose (A3) holds. Then, \(\mathbb{E}[Y |A=1] = \mathbb{E}[Y|A=0] = \mathbb{E}[Y]\) . If we further assume SUTVA (A1), this implies that the average treatment effect (ATE) is zero \(\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = 0\) .

Notice that SUTVA is required to claim that the ATE is zero if there is no association between \(A\) and \(Y\) . In general, without SUTVA, we can’t make any claims about \(Y(1)\) and \(Y(0)\) from any analysis done with the observed data \(Y,A\) , since SUTVA is what links the counterfactual outcomes to the observed data.

Why Randomized Experiments Identify Causal Effects

Consider an ideal, completely randomized experiment (RCT):

  • Treatment & control are well-defined (e.g. take new drug or placebo)
  • Counterfactual outcomes do not depend on others’ treatment (e.g. taking the drug/placebo only impacts my own outcome)
  • Assignment to treatment or control is completely randomized
  • There is a non-zero probability of receiving treatment and control (e.g. some get drug while others get placebo)

Assumptions (A1)-(A3) are satisfied because

  • From 1 and 2, SUTVA holds.
  • From 3, treatment assignment \(A\) is completely random, i.e.  \(A \perp Y(1), Y(0)\)
  • From 4, \(0 < P(A=1) <1\)

This is why RCTs are considered the gold standard for identifying causal effects as all assumptions for causal identification are satisfied by the experimental design.

RCTs with Covariates

In addition to \(Y\) and \(A\) , we often collect pre-treatment covariates \(X\) .

\(Y(1)\) \(Y(0)\) \(Y\) \(A\) \(X\) (Age)
John NA 0.9 0.9 0 38
Sally 0.8 NA 0.8 1 30
Kate NA 0.6 0.6 0 23
Jason 0.6 NA 0.6 1 26

If the treatment \(A\) is completely randomized (as in an RCT), we would also have \(A \perp X\) .

Note that we can then combine this into the existing (A2) as (A2): \[A \perp Y(1), Y(0), X\] Other assumptions, (A1) and (A3), remain the same.

Causal Identification of The ATE with Covariates

Even with the change in (A2), the proof to identify the ATE in an RCT remains the same as before.

  • (A2): \(A \perp Y(1), Y(0),X\)

Then, we can identify the ATE from the observed data via:

\[ \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0] \] However, we can also identify the ATE via \[ \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}[\mathbb{E}[Y | X,A=1]|A=1] - \mathbb{E}[\mathbb{E}[Y | X,A=0]|A=0] \]

The new equality simply uses the law of total expectation, i.e.  \(\mathbb{E}[Y|A=1] = \mathbb{E}[\mathbb{E}[Y|X,A=1]|A=1]\) . However, this new equality requires modeling \(\mathbb{E}[Y | X,A=a]\) correctly. We’ll discuss more about this in later lectures.
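As a sketch of that second strategy (sometimes called regression adjustment; the data here are simulated and a linear model for \(\mathbb{E}[Y | X,A=a]\) is assumed), fit the outcome model within each arm, average the fitted values within that arm, and take the difference:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 10_000

X = rng.normal(0, 1, size=n)                          # pre-treatment covariate
A = rng.binomial(1, 0.5, size=n)                      # completely randomized treatment
Y = 1 + 0.5 * A + 0.8 * X + rng.normal(0, 1, size=n)  # true ATE = 0.5
df = pd.DataFrame({"Y": Y, "A": A, "X": X})

# Model E[Y | X, A=a] within each arm, then average fitted values within the arm
m1 = smf.ols("Y ~ X", data=df[df["A"] == 1]).fit()
m0 = smf.ols("Y ~ X", data=df[df["A"] == 0]).fit()
ate_adjusted = m1.fittedvalues.mean() - m0.fittedvalues.mean()

ate_unadjusted = df.loc[df["A"] == 1, "Y"].mean() - df.loc[df["A"] == 0, "Y"].mean()
print(ate_adjusted, ate_unadjusted)  # both close to the true ATE of 0.5
```

In this sketch the two estimates agree, which is just the law of total expectation at work; the adjusted form matters once \(\mathbb{E}[Y | X,A=a]\) has to be modeled, as discussed in later lectures.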

Covariate Balance

An important, conceptual implication of complete randomization of the treatment (i.e.  \(A \perp X\) ) is that \[\mathbb{P}(X |A=1) = \mathbb{P}(X | A=0)\] This concept is known as covariate balance where the distribution of covariates are balanced between treated units and control units.

Often in RCTs (and non-RCTs), we check for covariate balance by comparing the means of \(X\) s among treated and control units (e.g. two-sample t-test of the mean of \(X\) ). This is to ensure that randomization was actually carried out properly.

In Chapter 9.1 of Rosenbaum ( 2020 ), Rosenbaum recommends using the pooled variance when computing the difference in means of a covariate between the treated group and the control group. Specifically, let \({\rm SD}(X)_{A=1}\) be the standard deviation of the covariate in the treated group and \({\rm SD}(X)_{A=0}\) be the standard deviation of the covariate in the control group. Then, Rosenbaum suggests the statistic

\[ \text{Standardized difference in means} = \frac{\bar{X}_{A=1}-\bar{X}_{A=0}}{\sqrt{ ({\rm SD}(X)_{A=1}^2 + {\rm SD}(X)_{A=0}^2)/2}} \]
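A minimal implementation sketch of this balance check (the function name, simulated ages, and group sizes are my own choices), together with the two-sample t-test mentioned above:

```python
import numpy as np
from scipy import stats

def standardized_diff(x_treat, x_ctrl):
    """Difference in covariate means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_treat.std(ddof=1) ** 2 + x_ctrl.std(ddof=1) ** 2) / 2)
    return (x_treat.mean() - x_ctrl.mean()) / pooled_sd

rng = np.random.default_rng(2)
age_treat = rng.normal(30, 8, 500)   # covariate among treated units (simulated)
age_ctrl  = rng.normal(30, 8, 500)   # covariate among control units (simulated)

print(standardized_diff(age_treat, age_ctrl))   # near 0 when randomization worked
print(stats.ttest_ind(age_treat, age_ctrl))     # two-sample t-test for the means
```

A common (if informal) convention is to flag absolute standardized differences above roughly 0.1 as a sign of imbalance, but this is a rule of thumb rather than a formal test.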

RCT Balances Measured and Unmeasured Covariates

Critically, the above equality would hold even if some characteristics of the person are unmeasured (e.g. everyone’s precise health status).

  • Formally, let \(U\) be unmeasured variables and \(X\) be measured variables.
  • Because \(A\) is completely randomized in an RCT, we have \(A \perp X, U\) and \[\mathbb{P}(X,U |A=1) = \mathbb{P}(X,U | A=0)\]

Complete randomization ensures that the distribution of both measured and unmeasured characteristics of individuals are the same between the treated and control groups.
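A small simulation sketch of this point (the covariates and their distributions are invented for illustration): \(U\) never enters the assignment or the analysis, yet its distribution is still essentially the same in both arms.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

x = rng.normal(40, 10, n)     # measured covariate, e.g. age
u = rng.binomial(1, 0.3, n)   # unmeasured covariate, e.g. an unrecorded condition
a = rng.binomial(1, 0.5, n)   # assignment ignores both X and U

print(x[a == 1].mean(), x[a == 0].mean())   # means of X in each arm: nearly equal
print(u[a == 1].mean(), u[a == 0].mean())   # means of U in each arm: nearly equal
```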

Randomization Creates Comparable Groups

Roughly speaking, complete randomization creates two synthetic, parallel universes where, on average, the characteristics of universe \(A=0\) and universe \(A=1\) are identical.

Thus, in an RCT, any difference in \(Y\) can only be attributed to a difference in the group label (i.e.  \(A\) ) since all measured and unmeasured characteristics between the two universes are distributionally identical.

This was essentially the "big" idea from Fisher in 1935, who used randomization as the "reasoned basis" for causal inference from RCTs. Paul Rosenbaum explains this more beautifully than I do in Chapter 2.3 of Rosenbaum ( 2020 ) .

Note About Pre-treatment Covariates

We briefly mentioned that covariates \(X\) must precede treatment assignment, i.e.

  • We collect \(X\) (i.e. baseline covariates)
  • We assign treatment/control \(A\)
  • We observe outcome \(Y\)

If they are post-treatment covariates, then the treatment can have a causal effect on both the outcome \(Y\) and the covariates \(X\) .

In this case, it is unclear whether the effect on \(Y\) is a direct effect of the treatment or arises through the treatment's effect on \(X\) . Studying this type of question is called causal mediation analysis.

In general, we don’t want to condition on post-treatment covariates \(X\) when the goal is to estimate the average treatment effect of \(A\) on \(Y\) .
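To see why, here is a hedged simulation sketch (the data-generating process and coefficients are invented): \(M\) is a post-treatment variable affected by the randomized treatment \(A\), and the total effect of \(A\) on \(Y\) is \(2 + 1 \times 1 = 3\). Regressing \(Y\) on \(A\) alone recovers that total effect, while additionally conditioning on \(M\) does not.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 100_000

a = rng.binomial(1, 0.5, n)                   # randomized treatment
m = 1.0 * a + rng.normal(0, 1, n)             # post-treatment variable: affected by A
y = 2.0 * a + 1.0 * m + rng.normal(0, 1, n)   # total effect of A on Y is 3.0
df = pd.DataFrame({"y": y, "a": a, "m": m})

print(smf.ols("y ~ a", data=df).fit().params["a"])       # close to 3.0, the ATE
print(smf.ols("y ~ a + m", data=df).fit().params["a"])   # close to 2.0, not the ATE
```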

The role of causal criteria in causal inferences: Bradford Hill's "aspects of association"

Andrew C. Ward. Epidemiologic Perspectives & Innovations, volume 6, Article number 2 (2009). Open access, published 17 June 2009.
As noted by Wesley Salmon and many others, causal concepts are ubiquitous in every branch of theoretical science, in the practical disciplines and in everyday life. In the theoretical and practical sciences especially, people often base claims about causal relations on applications of statistical methods to data. However, the source and type of data place important constraints on the choice of statistical methods as well as on the warrant attributed to the causal claims based on the use of such methods. For example, much of the data used by people interested in making causal claims come from non-experimental, observational studies in which random allocations to treatment and control groups are not present. Thus, one of the most important problems in the social and health sciences concerns making justified causal inferences using non-experimental, observational data. In this paper, I examine one method of justifying such inferences that is especially widespread in epidemiology and the health sciences generally – the use of causal criteria. I argue that while the use of causal criteria is not appropriate for either deductive or inductive inferences, they do have an important role to play in inferences to the best explanation. As such, causal criteria, exemplified by what Bradford Hill referred to as "aspects of [statistical] associations", have an indispensible part to play in the goal of making justified causal claims.

Introduction

As noted by Salmon [ 1 ] and others [ 2 , 3 ], causal concepts are ubiquitous in every branch of theoretical science, in the practical disciplines and in everyday life. In the case of the social sciences, Marini and Singer write that "the identification of genuine causes is accorded a high priority because it is viewed as the basis for understanding social phenomena and building an explanatory science" [ 4 ]. Although health services research is not so interested in "building an explanatory science", it too, like the social sciences with which it often overlaps, sets a premium on identifying genuine causes [ 5 ]. Establishing "an argument of causation is an important research activity," write van Reekum et al., "because it influences the delivery of good medical care" [ 6 ]. Moreover, given the keen public and political attention given recently to issues of health care insurance and health care delivery, a "key question" for federal, state and local policy makers that falls squarely within the province of health services research is how much an effect different kinds of health insurance interventions have on people's health, "and at what cost" [ 7 ].

This focus on causality and causal concepts is also pervasive in epidemiology [ 8 – 14 ], with Morabia suggesting that a name "more closely reflecting" the subject matter of epidemiology is "'population health etiology', etiology meaning 'science of causation"' [ 15 ]. For example, Swaen and Amelsvoort write that one "of the main objectives of epidemiological research is to identify causes of diseases" [ 16 ], while Botti, et al. write that a "central issue in environmental epidemiology is the evaluation of the causal nature of reported associations between exposure to defined environmental agents and the occurrence of disease. [ 17 ]" Gori writes that epidemiologists "have long pressed the claim that their study belongs to the natural sciences ... [and seek] to develop theoretical models and to identify experimentally the causal relationships that may confirm, extend, or negate such models" [ 18 ], and Oswald even goes so far as to claim that epidemiologists are "obsessed with cause and effect. [ 19 ]" Of course, it is true that some writers [ 20 ] are a bit more cautious when describing how considerations of causality fit into the goals of epidemiology. Weed writes that the "purpose of epidemiology is not to prove cause-effect relationships ... [but rather] to acquire knowledge about the determinants and distributions of disease and to apply that knowledge to improve public health. [ 21 ]" Even here, though, what seems implicit is that establishing cause-and-effect relationships is still the ideal goal of epidemiology, and as Weed himself writes in a later publication, finding "a cause, removing it, and reducing the incidence and mortality of subsequent disease in populations are hallmarks of public health and practice" [ 22 ].

Often people base claims about the existence and strength of causal relations on applications of statistical methods to data. However, the source and type of data place important constraints on the choice of statistical methods as well as on the warrant attributed to the causal claims based on the use of such methods [ 23 ]. In this context, Urbach writes that an "ever-present danger in ... investigations is attributing the outcome of an experiment to the treatment one is interested in when, in reality, it was caused by some extraneous variation in the experimental conditions" [ 24 ]. Expressed in a counterfactual framework, the danger is that while the causal contrast we want to measure is that between a target population under one exposure and, counterfactually, that same population under a different exposure, the observable substitute we use for the target population under the counterfactual condition may be an imperfect substitute [ 25 , 26 ]. When the observable substitute is an imperfect substitute for the target population under the counterfactual condition, the result is confounding, and the measure of the causal contrast is confounded. In order to address this "ever-present danger", many users of statistical methods, especially those of the Neyman-Pearson or Fisher type [ 27 , 28 ], claim that randomization is necessary.

Ideally, what randomization (random allocation to treatment and control or comparison groups) does is two-fold. First, following Greenland, the average of many hypothetical repetitions of a randomized control trial (RCT) will make "our estimate of the true risk difference statistically unbiased, in that the statistical expectation (average) of the estimate over the possible results equals the true value" [ 29 ]. In other words, randomization addresses the problem of statistical bias. However, as pointed out by Greenland [ 29 ], without some additional qualification, an ideally performed RCT does not "prevent the epidemiologic bias known as confounding" [ 29 ]. To reduce the probability of confounding, idealized random allocation must be used to create sufficiently large comparison groups. As Greenland notes, by using "randomization, one can make the probability of severe confounding as small as one likes by increasing the size of the treatment cohorts" [ 29 ]. For example, using the example in Greenland, Robins and Pearl, suppose that "our objective is to determine the effect of applying a treatment or exposure \(x_1\) on a parameter \(\mu\) of the distribution of the outcome \(y\) in population A, relative to applying treatment or exposure \(x_0\)" [ 30 ]. Further, let us suppose that "\(\mu\) will equal \(\mu_{A1}\) if \(x_1\) is applied to population A, and will equal \(\mu_{A0}\) if \(x_0\) is applied to that population" [ 30 ]. In this case, we can measure the causal effect of \(x_1\) relative to \(x_0\) by \(\mu_{A1}-\mu_{A0}\). However, we cannot apply both \(x_1\) and \(x_0\) to the same population. Thus, if A is the target population, what we need is some population B for which \(\mu_{B1}\) is known to equal (has a high likelihood of equaling) \(\mu_{A1}\), and some population C for which \(\mu_{C0}\) is known to equal (has a high likelihood of equaling) \(\mu_{A0}\). To create these two groups, we randomly sample from A. If the randomization is ideal and the treatment cohorts (B and C) are sufficiently large, then we can expect, in probability, that the outcome in B would be the outcome if everyone in A were exposed to \(x_1\), while the outcome in C would be the outcome if everyone in A were exposed to \(x_0\). Thus, what idealized randomization does, when the treatment cohorts created by random selection from the target population are sufficiently large, is to create two sample populations that are exchangeable with A under their respective treatments (\(x_1\) and \(x_0\)). In this way, a sufficiently large, perfectly conducted RCT controls for confounding, in probability, because the randomized allocation into B and C is, in effect, random sampling from the target population A to create reference populations B and C that are exchangeable with A. As Hernán notes, in "ideal randomized experiments, association is causation" [ 31 ].

Hernán's claim that in idealized randomized experiments, "association is causation", is a contemporary restatement of a view presented earlier by the English statistician and geneticist R. A. Fisher. According to Fisher, "to justify the conclusions of the theory of estimation, and the tests of significance as applied to counts or measures arising in the real world, it is logically necessary that they too must be the results of a random process" [ 32 ]. It is this contention, captured succinctly by Hernán, that is the centerpiece of the widely held belief that randomized clinical trials (RCTs) are, and ought to be, the "gold standard" of evaluating the causal efficacy of interventions (treatments) [ 33 – 36 ]. Thus, Machin writes that it is likely that "the single most important contribution to the science of comparative clinical trials was the recognition more than 50 years ago that patients should be allocated the options under consideration at random [ 37 ]. Similarly, while she believes that the value of RCTs depends crucially on the subject matter and the assumptions one is willing to make [ 38 ], Cartwright notes that many evidence-based policies call for scientific evidence of efficacy before being agreed to, and that government and other agencies typically claim that the best evidence for efficacy comes from RCTs [ 39 ].

Although generally considered the gold standard of research whose goal is to make justified causal inferences, it should come as no surprise that there is a variety of limitations associated with the use of RCTs. Some of these limitations are practical. For example, not only are RCTs typically expensive and time-consuming, there are important ethical questions raised when needed resources, that are otherwise limited or scarce, are randomly allocated. Similarly, it seems reasonable to worry about the ethical permissibility of an RCT when its use requires withholding a potentially beneficial treatment from people who might otherwise benefit from being recipients of the treatment. In addition to these practical concerns, there is also a variety of methodological limitations. Even if an idealized RCT is internally valid, generalizations from it to a wider population may be very limited. As noted by Silverman, a "review of epidemiological data and inclusion and exclusion criteria for trials of antipsychotic treatments revealed that only 632 of an estimated 36,000 individuals with schizophrenia would meet basic eligibility requirements for participation in a randomized controlled experiment" [ 40 ]. In such cases, even if there are no problems with differential attrition, the exportation of a finding from the experimental population to a target population may well go beyond what is justified by the use of RCTs. Even more generally, there is no guarantee either that the observable substitute for the target population under the counterfactual condition is a "good" substitute, or that a single RCT will result in a division in which possible confounders of the measured outcome are randomly distributed. Regarding the latter point, Worrall remarks that even for an impeccably designed and carried out RCT, "all agree that in any particular case this may produce a division which is, once we think about it, imbalanced with respect to some factor that plays a significant role in the outcome being measured" [ 41 ]. While it may be possible to reduce the probability of such baseline imbalances by multiple repetitions of the RCT, these repetitions, whose function is to give the limiting average effect [ 42 ], may not be practically feasible. Moreover, at least when the repetitions are "real life" repetitions and not computer simulations, there is no reason to believe that each of the repetitions will be "ideal", and more reasons to believe that they will not all be ideal. For this reason, multiple (real life) repetitions of the RCT are more likely to increase the likelihood of other kinds of bias, such as differential attrition, not controlled for by use of an RCT.

Of course there are a variety of approaches that one can take in attempting to meet these, and other limitations of RCTs. While not intending to downplay the importance of RCTs and the attempts to address the limitations associated with their use, much of the data used by people interested in making causal claims do not come from experiments that use random allocation to control and treatment or comparison groups. Indeed, as Herbert Smith writes, few "pieces of social research outside of social psychology are based on experiments" [ 43 ]. Thus, one of the most important problems in the social and health sciences, as well as in epidemiology, concerns whether it is possible to make warranted causal claims using non-experimental, observational data. The focus on observational data, as opposed to experimental data, leads us away from RCTs and towards an examination of what Weed has called the "most familiar component of the epidemiologist's approach to causal inference", viz., "causal criteria" [ 44 ]. In the context suggested by the quotation from Weed, the argument presented in this paper has three parts. First, I argue that, properly understood, causal inferences that make use of causal criteria, exemplified by the Bradford Hill "criteria", are neither deductive nor inductive in character. Instead, such inferences are best understood as instances of what philosophers call "inference to the best explanation". Second, I argue that even understood as components of an inference to the best explanation (the causal claim being the best explanation), causal criteria have many problems, and that the inferences their use sanctions are, at best, very weak. Finally, I conclude that while the inferential power of causal criteria is weak, they still have a pragmatic value; they are tools, in the toolkits of people interested in making causal claims, for preliminary assessments of statistical associations. To vary a remark by Mazlack about "association rules", while satisfactions of causal criteria (such as the Bradford Hill criteria with which this paper principally deals) do not warrant causal claims, their judicious application is important and, perhaps in many cases, indispensible for identifying interesting statistical relationships that can then be subjected to a further, more analytically rigorous statistical examination [ 45 ].

Relative to RCTs, the absence of random allocation to treatment and control or comparison groups is what leads to one of, though not all of, the most important methodological issues observational, non-experimental studies face. In the absence of randomized allocations from a sufficiently large population to treatment and control or comparison groups, we no longer have a probabilistic guarantee that there is no statistical bias and that we have minimized the probability of confounding. Thus, because there is no random allocation in an observational study, and because, as noted by Little and Rubin, without "a model for how treatments are assigned to units, formal causal inference, at least using probabilistic statements, is impossible" [ 46 ], some other method of allocation (and set of assumptions) is needed for observational studies. One possibility, according to Little and Rubin, is that researchers may statistically control for "recorded confounders" and then assume, either explicitly or implicitly, that the non-randomized "treatment assignment corresponds to that of an unconfounded randomized experiment" [ 46 ]. A problem with this method is that the assumption is not testable, and frequently made without any good theoretical support. Nevertheless, while observational studies may take a variety of different forms, they do all share an important characteristic with RCTs; viz., all those non-statistical aspects of RCTs, apart from their use of randomized allocation, that go towards making them well-designed experiments and contribute to causal inferences, are also important in well-designed observational, non-experimental studies from which causal inferences are drawn. Put a bit more precisely, any non-statistical characteristic whose presence is, in the case of RCTs, necessary for a well-founded causal inference to a causal claim (e.g. compliance to assigned treatments by subjects, any missing data having the same distribution as observed data) is also necessary for a well-founded causal inference in the case of observational, non-experimental studies. Thus, as William Cochran who, according to Rosenbaum, was one of the first to present observational studies "as a topic defined by principles and methods of statistics" [ 47 ] remarks, "to a large extent, workers in observational research have tried to copy devices that have proved effective in controlled experiments" [ 48 ].

However, suppose that one is not willing to assume that the non-randomized treatment cohort in an observational study "corresponds" to the treatment cohort in an unconfounded randomized experiment using the same sample (study) population. In this case, assurances that the non-statistical characteristics of a well-designed and executed RCT are also present in the observational study are not sufficient to make well-founded causal inferences from the observational data. Something more is needed. It is at this point that people interested in making well-founded causal inferences based on observational data differ in their methodological approaches. One approach is to use one or more appropriately chosen statistical methods to model observational data in such a way that the RCT interventionist method of random allocations into treatment and control or comparison groups is, in one way or another, captured by the characteristics of the model. This is the idea behind Rubin's claim that an "observational study should be conceptualized as a broken randomized experiment" that we use statistical methods to fix as best we can [ 49 ], and Freedman's similar remark that "one objective of statistical modeling is to create an analogy, perhaps forced, between an observational study and an experiment" [ 50 ]. For example, a method widely used in epidemiology, the social sciences and health services research to capture observed imbalances in covariate patterns among groups, and so justify inferences that changes in one or more independent variables cause changes in a dependent variable, is to use regression models [ 51 , 52 ]. According to Clogg and Haritou, one of the central underlying assumptions in what they refer to as the "regression method of causal inference" is that "experimental manipulation or control through randomization can be replaced by statistical control or partialing with a regression model, along with a few assumptions that seem benign to most researchers" [ 53 ]. Whether those "few assumptions" (e.g. assumptions about functional form, what variables to include or exclude from the regression equation [ 54 ] and random allocation of treatment within strata for the controlled variables) are genuinely plausible and "benign" in most real-world situations is a matter of some debate [ 51 ].

More recently, propensity score estimation (using regression as part of the process, but with no attempt to interpret regression coefficients causally) and matching has emerged as a method to warrant claims about average causal effects and average causal effects on the treated [ 49 , 55 , 56 ]. Introduced by Paul Rosenbaum and David Rubin in 1983, the propensity score is the conditional property of a subject/unit in a sample (study) population being exposed or treated, given a set of observed covariates that one believes predicts the exposure or treatment [ 57 ]. The idea, roughly, is that once we have the estimated propensity score, we can match "subjects in exposed and unexposed conditions on their propensity scores" [ 58 ]. On the assumption that the matched samples are balanced with respect to the set of observed covariates, and on the further assumption, questioned by some, that "if both subjects have the same [estimated] probability of exposure, it is random which one was in fact exposed and which was not", we have simulated random allocation [ 59 ]. After this, it is a relatively straightforward exercise to estimate a causal effect of exposure or treatment [ 59 ]. Like the use of regression models to estimate causal effects, the use of propensity scores and matching (or some other methods such as stratification or weighting on the propensity score) to estimate causal effects makes a number of assumptions. For example, as suggested above, using propensity scores to address problems of statistical bias and confounding requires assuming that population members with similar estimated probabilities of exposure are exchangeable with respect to disease (outcome) frequency. Depending on the observational study, these assumptions either may be implausible or may place serious limitations what causal inferences one can justifiably make [ 58 , 59 ].

There is, to be sure, much to value in approaching questions of causality in terms of fitting statistically well-defined models to the available data. In this connection, Heckman writes that a "major contribution of twentieth century econometrics was the recognition that causality and causal parameters are most fruitfully defined within formal economic models and that comparative statistics within these models ... most clearly define causal parameters [ 60 ]. Similarly, while acknowledging "statistical associations do not logically imply causation," Pearl claims that under "the assumptions of model minimality (and/or stability), there are patterns of dependencies that should be sufficient to uncover genuine causal relationships" [ 61 ]. However, at least in the case of observational studies, not everyone is sanguine about the use of statistically well-defined models to answer questions about the presence and relative strength of cause-and-effect relationships. Part of the reticence to embracing statistically based causal inferences is the worry that these kinds of inferences presuppose that the statistically modeled data are the products of randomized allocation, while part of the worry is that statistical modeling, by itself, cannot justify making causal inferences without the addition of non-statistically based assumptions. For example, Pearl writes that in those studies in which there is no random allocation (what he refers to as "imperfect experiments") "reasonable assumptions about the salient relationships in the domain" must be used to determine bounds of the causal effect of an exposure or treatment [ 61 ]. Freedman's criticism of Spirtes, Glymour and Scheines' attempt [ 62 ] to discover causal relationships by the use of directed graphs to represent statistical independence and dependence relationships between variables used in the graph makes an analogous point. According to Freedman, while the use of directed graphs and the associated algorithms by Sprites, Glymour and Scheines has "some technical interest", they will justify drawing causal inferences "only when causation is assumed in the first place" [ 63 ]. Put a bit more charitably, unless there are independent reasons for believing that statistical associations are causal relations, there is no justification, using only these kinds of statistical models, to infer that the statistical associations are causal relations.

For these reasons (and there is no implication intended here that these exhaust the reasons), a second approach for justifying causal inferences, and so warranting the causal claims based on those inferences, has developed. This approach, often adopted independently of the statistically based approach to justifying causal inferences, focuses on identifying and describing the conditions that must be satisfied in order for the belief, that a statistical association between two events is a causal relationship, to be a justified (warranted) belief. Although Susser refers to this as a strategy in which "making inferences about causes" depends on the "subjective judgment" of the person making the judgment [ 64 ], this is not an altogether fair characterization. As has already been noted, approaching the problems posed by causal inference using statistical models and estimating causal parameters within those models requires making a variety of assumptions and so inevitably involves "subjective judgment". Subjective judgments are ubiquitous in any account of causal inference, and so is not a characteristic that permits distinguishing formal, statistically based causal inferences from causal inferences based on some other approach. Instead, what distinguishes the conditions-based approach is precisely the idea that a statistical association is a causal relation just in case that association satisfies some set of criteria that is neither reducible to, nor eliminable in favor of the specification of some set of formal statistical models of the statistical association. Thus, Greenland characterizes this approach as one not based "on a formal causal model", and refers to it as the "canonical approach" since it "usually leaves terms like 'cause' and 'effect' as primitives ... around which ... self-evident canons [criteria] are built, much like axioms are built around the primitives of 'set' and 'is an element of' in mathematics" [ 65 ]. Historically, the "canonical approach" is evidenced in the 1964 Surgeon General's report on the dangers of smoking. According to the Report:

Statistical methods cannot establish proof of a causal relationship in an association. The causal significance of the association is a matter of judgment which goes beyond any statement of statistical probability [ 66 ].

In effect, the Report is stating that no formal statistical modeling of the data can, without additional, non-statistical assumptions, justify drawing a causal inference (and so drawing a warranted causal claim) from any statistical associations that are present. Because of this limitation of statistical modeling, the Report goes on to state that to "judge or evaluate the causal significance of the association between the attribute or agent and disease, or effect upon health, a number of criteria must be utilized, no one of which is an all-sufficient basis for judgment" [ 66 ]. The criteria used in the Report were the consistency, strength, specificity, temporal relationship, and coherence of the association.

Following the publication of the Surgeon General's Report, Austin Bradford Hill, in his 1965 Presidential Address to the Section of Occupational Medicine of the Royal Society of Medicine, asked under what circumstances we can justifiably pass from "an observed association to a verdict of causation " [ 67 ]. In answer to this question, Bradford Hill recommended the use of the five criteria present in the Surgeon General's Report, and added four others, viz., biological gradient, plausibility, experiment and analogy [ 67 ]. Although he described the circumstances whose presence permitted passing from an observed observation to a verdict of causation as "aspects of [a statistical] association" we should "consider before deciding that the most likely interpretation of it is causation" [ 67 ], the resulting nine criteria are now typically referred to as the "Bradford Hill Criteria" for causal inferences. It is true that writers such as Phillips and Goodman object to calling Bradford Hill's aspects of association "criteria", preferring instead the locution "causal considerations" [ 68 ], but they also concede that what Bradford Hill proposed is "frequently taught to students in epidemiology and referred to in the literature as 'causal criteria"' [ 69 ]. Moreover, while commonly used in epidemiology and the health sciences since 1965 as a "central tool for the epidemiological community in grappling with the broader issues of causal reasoning" [ 70 ], the "basic outline of the modern set of criteria has," according to Kaufman and Poole, "evolved little" since their formulation by the Surgeon General's Advisory Committee and Bradford Hill [ 70 ].

There are many examples of studies that use the Bradford Hill criteria (or some subset of the criteria) in an attempt to justify causal inferences. One clear and publicly accessible example of their use is on the Website of the SV40 Cancer Foundation. There, Horwin applies "what was published in the peer-reviewed medical literature to the nine Bradford Hill criteria in respect to medulloblastoma and other brain cancers" to demonstrate the causal efficacy of SV40 [ 71 ]. In addition, the Environmental Protection Agency's 2005 "Guidelines for Carcinogen Risk Assessment", also publicly accessible, explicitly recommends the use of the Bradford Hill criteria to assess whether an observed statistical association is causal rather than spurious [ 72 ]. There are many more examples of applications of the Bradford Hill criteria that appear in academic journals covering a range of disciplines. These examples include, but are not limited to, determining whether chrysotile asbestos causes mesothelioma [ 73 ], determining whether second generation antipsychotic drugs cause diabetes [ 74 ], evaluating the effects of "environmental carcinogens" [ 75 ], evaluating whether abuse experienced as a child or as an adolescent/adult is causally related to urologic symptoms [ 76 ], and evaluating causal associations in pharmacovigilance as well as pharmacoepidemiology [ 77 , 78 ]. The Bradford Hill criteria have even been applied to studies in molecular epidemiology [ 79 ], as well as to when searching "for the true effectiveness" of dental health care services in facilitating "recovery from an oral health-related decrement in quality of life called 'oral disadvantage due to disease and tissue damage'. [ 80 ]" Overall, regardless of the specific discipline in which the study occurred, the most common use of the Bradford Hill criteria when investigating whether a statistical association is a causal relationship (e.g. the statistical association between genital ulcer disease and the transmission of human immunodeficiency virus [ 81 ]) is to apply them to evidence presented in reviewed literature [ 73 , 74 , 81 – 87 ].

Based on their widespread use, it is not surprising that some form of Bradford Hill's causal criteria are, according to Weed, "arguably the most commonly-used method of interpreting scientific evidence in public health" [ 88 ], and that, according to Parascandola, the Bradford Hill criteria are "routinely cited as authoritative statements of the proper method for assessing a body of etiological evidence" [ 89 ]. Indeed, Shakir and Layton even go so far as to write that Bradford Hill's Presidential Address, in which the nine criteria ("aspects of association") were identified and described, was one "of the most important papers published in the 20th century with thoughts on the epidemiological basis of disease causation" [ 77 ]. Still, just as the popular consent to a belief does not make that belief true, so too, the widespread acceptance and use of Bradford Hill criteria does not entail that their use truly justifies causal inferences. Thus, we need to examine, carefully and critically, the Bradford Hill criteria to determine precisely what their function is, if any, in justifying causal inferences.

The first thing to keep in mind is that 'inference' has at least two meanings that it is important not to conflate. The first meaning of 'inference' is the psychological activity of accepting a conclusion based on one or more other beliefs held to be true. For example, when consumer psychologists study under what circumstances consumers generalize from specific information to general conclusions, or construe specific conclusions from general principles or assumptions [ 90 ], they are studying inference as a psychological activity. It is this sense of inference that is important when characterizing rationality [ 91 ]. The second meaning of 'inference' is about logical permissibility; it refers to whether one is logically permitted to assert that a particular claim is true because of its evidential relationship to one or more other claims (hypothetically) accepted as true. Here the focus is not on the psychology of people engaged in reasoning, but on the relationship between evidence (claims held true) and a claim asserted to be true. When applying Bradford Hill criteria to causal inferences (inferences having a causal claim as a conclusion), it is the second meaning of 'inference' that is relevant, not the first. In other words, inference, in the context of applications of Bradford Hill criteria, does not refer to the psychological activity of "transitioning" (reasoning) from a set of beliefs to another belief, but instead refers to the kind of evidential relationship that exists between a claim (e.g. a causal claim such as "X causes Y") and the evidence for that claim.

Typically, evidential relationships between evidence held true (the premises) and a claim asserted to be true (the conclusion) because of the evidence are characterized as either deductive or inductive. In the first case, if the deductive relationship is a valid one, then the truth of the evidence guarantees that the asserted claim, the conclusion, is true. Again, it is important to emphasize here that this is a claim about logical implication, not about reasoning. As noted by Harman, it is "an interesting and nontrivial problem to say just how deductions are relevant to reasoning," but it is an interesting and nontrivial problem just because deductive relationships are not instances of reasoning [ 92 ]. In the second case, if the inductive relationship is a strong one, then, following Skyrms, "it is improbable, given that the premises [evidence presented in the form of statements] are true, that the conclusion is false" [ 93 ]. Thus, in the case of the inductive relationship, the evidence presented by the premises underdetermines the truth-value of the conclusion. Once again, though, this is a claim about the character and limits of logical inference, not reasoning.

To the extent that we are willing to model evidential claims and claims that constitute the conclusions of deductive implications in formal logical systems, it is possible to give system-relative, precise syntactic and semantic characterizations of the concept of deductive validity. For example, suppose that A 1 ...A n-1 , A n is a sequence of well-formulated formulae in a formalized logical language L, where A 1 ...A n-1 are the premises and A n is the conclusion. We can then say that A 1 ...A n-1 , A n is (syntactically) valid in L "just in case A n is derivable from A 1 ...A n-1 , and the axioms of L, if any, by the rules of inference of L" [ 94 ]. Analogously, we can say that A 1 ...A n-1 , A n is (semantically) valid in L "just in case A n is true in all interpretations [models] in which A 1 ...A n-1 are true" [ 94 ]. Of course, this kind of technical sophistication raises an immediate problem if one believes that satisfactions of Bradford Hill criteria are deductively related to a causal claim. The instances of criteria satisfaction, as well as the causal claim functioning as the conclusion, must be "appropriate" instantiations of well-formed formulae in a formalized logical language L. However, except for small, artificially regimented fragments of natural languages, the project of modeling complex natural languages into an underlying formalized logical language (a problem in logic, not linguistics [ 95 ]) has met with mixed success and no consensus. The point, then, is that if one holds that satisfactions of the Bradford Hill criteria (validly) deductively support a causal claim, it seems unlikely that it is this highly formalized conception of deductive validity that is at work.

Still, perhaps one could try to use a more informal characterization of a valid deductive inference and say that as long as all the Bradford Hill criteria were satisfied in some acceptable way, they would guarantee the truth of the causal claim. However, by giving up the formalized conception of deductive validity, we have also given up the utility of this more loosely characterized sense of deductive validity. To see why, suppose we let B 1 ...B 9 represent each of the Bradford Hill criteria, and suppose that C represents the causal claim. On the more informally characterized sense of deductive validity, we want to say that on a non-formal construal of the criteria that permits us to determine whether each of the nine criteria are satisfied, and so true, if each of B 1 ...B 9 is true, then C must be true as well. It is not enough to say simply that C is true (as opposed to must be true), since C could be true for reasons that have nothing to do with each of or all of B 1 ...B 9 . However, what is it about each of or all of B 1 ...B 9 being true that necessitates C being true? It cannot be because of the syntactic characteristics of well-formed formulae in a formalized logical language since we have already given up this characterization of deductive validity. Importantly, it also cannot be because every model in which each of the B's in B 1 ...B 9 is true, is also a model in which C is true, since the specification of models requires adopting the formalized conception of validity [ 96 ] that we have given up. Thus, there is no useful sense in which the truth of a causal claim can be "clinched", deductively, by the satisfaction of the Bradford Hill criteria.

Now, implicit in the discussion to this point is the assumption that the relationship between the Bradford Hill criteria and a causal claim is that if the criteria are all satisfied, then the causal claim is true. This is an argument structure known as affirming the antecedent ( modus ponens ), and captures the idea that the satisfaction of the Bradford Hill criteria confirms the truth of the causal claim. However, instead of using this argument structure, we could adopt a broadly Popperian perspective and, instead, use the argument structure of denying the consequent ( modus tollens ) [ 97 ]. If we do this, we have moved from a deductivist account of confirmation to a deductivist account of falsification. By doing this, we could say that what matters is not whether the Bradford Hill criteria are satisfied, but whether the criteria are not satisfied. In other words, our argument now has the form that if a particular causal claim, C, is true, then the Bradford Hill criteria, B 1 ...B 9 , are satisfied, and if it is not the case that B 1 ...B 9 are satisfied, then C is false. Rather than finding out what causal claims are true, by falsifying the Bradford Hill criteria (i.e., by finding that it is not the case that the Bradford Hill criteria are satisfied), we discover which causal claims are false.

However, right away there are problems. First, the expression "it is not the case that B 1 ...B 9 are satisfied" is ambiguous. It could mean either that none of the B 1 ...B 9 are satisfied, or that at least one of the B 1 ...B 9 is not satisfied. The former seems an unlikely interpretation since one of the Bradford Hill criteria is that in a cause-effect relationship, the cause temporally precedes the effect. Arguably, for almost all cases of cause-and-effect relationships in epidemiology, health services research and the social sciences, this will be true [ 22 , 88 ]. Thus, for all but the most extraordinary cases, at least one of the B 1 ...B 9 is satisfied, thereby undermining the deductive inference that the causal claim, C, is false. Second, recall that one of the Bradford Hill criteria is strength of analogy. Analogies are inductive arguments, and so vary along a continuum in terms of their strength [ 98 ]. It follows from this that the B in B 1 ...B 9 that corresponds to the Bradford Hill criteria of analogy will never be entirely satisfied (unless the analogy is actually an identity) and never entirely dissatisfied (unless there are absolutely no shared properties or characteristics). If we count any degree of satisfaction as sufficient for purposes of claiming that the criterion is satisfied, then we have a problem analogous to that posed by the criterion of the cause preceding the effect. If we try to set some threshold limit for satisfaction, then the assessment of whether the criterion is satisfied seems ad hoc .

All this would seem to lead to saying that "it is not the case that B 1 ...B 9 are satisfied" means that there is some proper subset of B 1 ...B 9 none of whose members is satisfied. However, this leads to the possibility of very different assessments of the same causal claim. For example, suppose that the causal claim in question is C, and one person claims that the relevant conditional in the falsificationist inference is "If C then B 1 ...B 3 ", while another person claims that the relevant conditional is "If C then B 4 ...B 9 ". Further, suppose that each of B 1 ...B 3 is false while none of B 4 ...B 9 is false. In this case, the first person concludes that the causal claim, C, is false, while the second person claims that there is no justifiable reason to hold that the causal claim is false (and may, in fact, hold the causal claim to be true because it has not been falsified). Although not strictly inconsistent with one another (the failure to falsify a claim does not entail that the claim is true), the two claims are quite different and, at least in a public health context, could lead to the adoption of very different policies. One obvious way to resolve the dispute would be to provide some kind of justification that supports the use of one of the proper subsets of Bradford Hill criteria but not the other. This tact, though, raises its own problems. First, the problem is not simply that we have to choose between two contenders. What we must do is to choose amongst all possible contenders (e.g. there is also the contender of B 3 ...B 5 ). Second, what kind of justification would suffice for choosing one proper subset of Bradford Hill criteria instead of another? The aim of the Bradford Hill criteria, on the falsificationist deductivist account, was to permit us to exclude causal claims as false. Now, though, it appears that we need criteria for the criteria, and that we need to specify the relationship (possibly deductive, though this seems to raise the same problems all over again) of those new criteria to the Bradford Hill criteria that we want to retain. Thus, treating "it is not the case that B 1 ...B 9 are satisfied" as meaning that there is some proper subset of B 1 ...B 9 , none of whose members is satisfied, seems to be no resolution to the problems associated with treating the relationship between the Bradford Hill criteria and a causal claim as one of deductive entailment.

Finally, and more broadly, regardless of the interpretation given to the expression "it is not the case that B 1 ...B 9 are satisfied", there seem to be problems associated with interpreting the criteria themselves since, as Rothman et al. claim, there are ambiguities, fallacies and vagaries in each of the Bradford Hill criteria [ 99 , 100 ]. For example, regarding the criterion of analogy, Rothman et al. write that whatever "insight might be derived from analogy is handicapped by the inventive imagination of scientists who can find analogies everywhere. At best, analogy provides a source of more elaborate hypotheses about the associations under study; absence of such analogies only reflects lack of imagination or experience, not falsity of the hypothesis" [ 100 ]. They conclude, based on similar kinds of analyses of the other eight Bradford Hill criteria, that "the standards of epidemiologic evidence offered by Hill are saddled with reservations and exceptions" [ 100 ]. When considered in toto , these sorts of problems with treating the relationship between a causal claim and satisfactions of the Bradford Hill criteria as either a confirmationist or a falsificationist deductive relationship support the view that we need to find a different account of the relationship.

As noted earlier, the typical division of logical inferences is into deductive and inductive inferences. Thus, because there are good reasons to reject the view that the relationship between satisfactions of the Bradford Hill criteria and the causal claims they purport to justify is a deductive relationship, the obvious conclusion to draw is that the relationship must be inductive. Since strong inductive inferences, in contrast to valid deductive inferences, make it improbable, but not impossible, that the conclusion of an inductive argument is false given that the premises (evidential statements) are true, then understanding the relationship between satisfactions of Bradford Hill criteria and a causal claim seems consonant with what Bradford Hill claimed about the criteria ("aspects of association"). For example, Bradford Hill writes:

What I do not believe – and this has been suggested – is that we can usefully lay down some hard-and-fast rules of evidence that must be obeyed before we accept cause and effect. None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non [ 67 ].

As suggested by this quotation, Bradford Hill did not conceive of the satisfaction of the "aspects of statistical association" as sufficient conditions (singularly or jointly) for justifying a claim that a specific association was a causal relation [ 101 , 102 ]. Moreover, with the possible exception of the temporal priority of a cause to its effect, he did not conceive of the satisfaction of the "aspects of statistical association" as necessary conditions (singularly or jointly) for a specific statistical association being a causal relation. Based on this, it seems reasonable to conclude that Bradford Hill's own understanding of the criteria is consistent with the view that the support their satisfaction offers to a causal claim is something less than that their satisfaction deductively entails the truth of a causal claim. This is certainly consonant with many writers who advocate, use or discuss the Bradford Hill criteria. For example, Russo and Williamson write that "while these criteria were intended as a guide in assessing causality, they do not ensure causality with certainty" [ 103 ], while Kundi writes that satisfaction of the Bradford Hill criteria are non-conclusively supportive of a causal claim "but cannot be used to dismiss the assumption of a causal claim" [ 104 ]. Similarly, in studies that use Bradford Hill criteria, at least some qualify their conclusions by claiming that the statistical associations are "likely to be causal" [ 45 ], that the evidence provided by the criteria's satisfaction underdetermines the truth of the causal conclusion [ 105 ], or that satisfaction of the criteria only decreases the likelihood that statistical association is not causal [ 106 ]. To sum up, there are good reasons for understanding the satisfaction of Bradford Hill criteria as inductively justifying a causal claim, which amounts to claiming that the criteria, to use Cartwright's useful expression, merely vouch for the truth of a causal claim without offering any assurance of its truth [ 37 , 107 ].

Before assessing the use of satisfactions of Bradford Hill criteria as evidence in an inductive inference, we need to be clearer about what it means to call an inference an inductive inference. As Bird notes [ 108 ], there are two distinct senses of what it means to be an inductive inference that are often confused. Although both agree that inductive inferences, unlike valid deductive inferences, are ampliative, they differ in their specificity and precision. On the one hand, inductive inferences are those kinds of ampliative inferences in which the premises are specific (usually empirical) statements, and the conclusion is a general statement. For example, although, regrettably, conflating the logical and psychological conceptions of induction, Rothman writes that the "method of induction starts with observations on nature. To the extent that the observations fall into a pattern, the observations are said to induce in the mind of the observer a suggestion of a more general statement about nature" [ 109 ]. The classic example of this kind of inductive inference is enumerative induction, which has the general form that from the fact that all observed A's are B's, we may infer that it is not probable that all A's (or some percentage of A's larger than the percentage observed) are B's is false. On the other hand, there is a broader meaning of inductive inference. According to this broader meaning, an inductive inference is any logical inference that is not deductively valid inference where, if the inference is a strong one, "it is improbable, given that the premises are true, that the conclusion is false" [ 93 ]. There are at least two reasons for preferring the latter to the former meaning of inductive inference. First, not all traditionally acknowledged examples of inductive inference fit the model exemplified by enumerative induction. For example, the inference from the sun having risen every morning in recorded history to the conclusion that the sun will rise tomorrow is an inductive inference from a general premise to a particular conclusion [ 110 ]. Second, the broader meaning of inductive inference permits us to separate more clearly the logical sense of inference from the psychological sense of inference. While assertions about inductive inferences express the speaker's beliefs, they are not, as noted, by Maher, " about the speaker's beliefs" [ 111 ]. Moreover, the broader meaning of inductive inference includes, when attention is restricted to the logical sense of inference, the narrower meaning of inductive inference as an inference from particular premises to a general conclusion. For these reasons, the following analyses use the second, broader meaning of inductive inference.

Let us suppose that B 1 ...B 9 represent the nine Bradford Hill criteria and that C represents a causal conclusion. On the assumption that each of B 1 ...B 9 is satisfied and so true, then B 1 ...B 9 strongly inductively supports C just in case it is improbable that C is false. However, the natural question to ask at this point is whether it is, in fact, true that if each of B 1 ...B 9 is satisfied, and so true, then it is improbable that C is false. This is a form of what is sometimes known as the "problem of induction." More generally, the problem, as has been long recognized, is to state precisely what it is about a set of conditions that guarantees that when those conditions are satisfied, this satisfaction makes it improbable that the associated conclusion is false. If we cannot identify what it is about the conditions that guarantee this result, then there will be no way to distinguish strong inductive inferences from weak inductive inferences. Indeed, it was Hume's inability to identify what it is about what he called the "experimental method" that guaranteed the improbability of inferred conclusions being false that led him to treat the problem of inductive inference as a problem of human psychology. For Hume, there is no logical sense of inductive inference; inductive inferences are all psychological inferences [ 112 , 113 ].

The works of Rudolf Carnap illustrate one approach to making sense of the logical conception of inductive inference. Because, according to Carnap, "the fundamental concept of inductive logic is probability [ 114 ]", he begins by drawing a distinction between what he calls the logical sense of probability, understood as "degree of confirmation", and the empirical concept of probability (statistical probability), understood as "the relative frequency in the long run of one property with respect to another" [ 115 ]. Based on this distinction, Carnap writes that the goal of inductive logic is to "measure the support which the given evidence supplies for the tentatively assumed hypothesis" [ 115 ], where the support is formalized in terms of "degree of confirmation", and so, logical probability. In the case of the Bradford Hill criteria, this means that, from the Carnapian point of view, what inductive logic should do is the measure the support that satisfactions of the criteria provide for the causal claim hypothesized as a possibility based on an already identified statistical association. Since the relevant conception of probability is logical probability, to accomplish this task, Carnap believed that it is necessary to characterize inductive logic "like deductive logic ... [as] a branch of semantics. [ 115 ]" This understanding of inductive logic raises at least three different problems. First, it requires a precise, "rational reconstruction" of the satisfactions of the Bradford Hill criteria, and the causal conclusion, as appropriate instantiations of well-formed formulae within a logical system where the rules of inductive logical inference are defined. This mirrors the requirement, considered earlier, for treating the relationship between satisfactions of the Bradford Hill criteria and a causal claim as a valid, deductive relationship. Making certain that (claims about) the applications of the Bradford Hill criteria are "appropriate" instantiations of well-formed formulae in the theory of inductive inference is a necessary condition for validating, within the inductive theory, the claim that satisfactions of the Bradford Hill criteria inductively support the inferred conclusion [ 114 ]. As such, the same kinds of problems associated with the identification and translation of natural language sentences into well-formed formulae in the case of treating the relationship between satisfactions of the Bradford Hill criteria and a causal claim as a deductive relationship occur here as well.

Second, even assuming that there is an acceptable solution to the problem of providing the appropriate rational reconstructions, there is still the problem of validating the inductive inference rules that constitute the system of inductive logic into which the satisfactions of the Bradford Hill criteria and conclusion have been translated. This is the problem of the justification of induction. Although there are many formulations of the problem, one way to formulate it is to take advantage of Carnap's claim that inductive logic, like deductive logic, is a branch of semantics. Thus, if A1...An-1, An is a sequence of well-formed formulae in a formalized logical language L, where A1...An-1 are the premises and An is the conclusion, then A1...An-1, An is (semantically) inductively strong in L just in case it is improbable that An is false in all interpretations [models] in which A1...An-1 are true. The problem, then, is whether there are any inductive inference rules whose adoption is consistent with the semantic conception of an inductively strong argument [ 116 ]. It is true that one obvious kind of response to this would be to say that if an inference rule, R, is, in all observed instances of application, consistent with the semantic conception of an inductively strong argument, then we are justified in using the inference rule. However, as should be obvious from this formulation, this response is tantamount to using a kind of inductive inference to justify the inference rule R. In this case, though, the problem re-emerges when we are asked to justify this additional inference rule, and an infinite explanatory regress threatens the entire account. Although there are other approaches to justifying induction (e.g. the pragmatic justification originated by Reichenbach [ 117 ] and the analytic justification suggested by Harré [ 118 ]), "none has received widespread acceptance. [ 119 ]"
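For concreteness, the semantic characterization just given is often glossed probabilistically; the gloss below is a shorthand of my own, not Carnap's formal apparatus. On this reading, an argument from premises A1...An-1 to conclusion An is inductively strong just in case

\( P(A_n \mid A_1 \wedge \cdots \wedge A_{n-1}) \geq 1 - \epsilon \)

for some suitably small \( \epsilon > 0 \), so that the truth of the premises makes the falsity of the conclusion improbable. The question pressed in the text is what, if anything, justifies adopting inference rules that can be shown to have this property.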

The third problem is one that, following a suggestion by Hempel, we might call "the problem of desiderata" [ 120 ]. This is the problem that in any inductive determination of the degree of confirmation conferred on a conclusion from premises assumed to be true, it is not enough to take into account only the information provided by the premises. Hempel frames the problem by asking the following question:

"On the basis of different sets of statements that we consider as true, a given hypothesis h ... can be assigned quite different probabilities; which of these, if any, is to count as a guide in forming our beliefs concerning the truth of h and in making decisions whose outcome depend on whether h is true? [ 121 ]"

According to both Hempel and Carnap, to answer this question requires the adoption of a principle known as "the requirement of total evidence." As noted by Carnap, the requirement of total evidence says that in any inductive inference, "we have to take as evidence ... the total evidence available to a person in question at the time in question, that is to say, his total knowledge of the results of his observations" [ 122 ]. The requirement of total evidence is not a requirement of the formal inductive system of logic but is, instead, "a maxim for the application of inductive logic" [ 123 ]. While it may seem simple enough to incorporate this requirement, its adoption (even ignoring the problems of formalization already faced by treating applications of Bradford Hill criteria to support a causal claim as inductive inferences) has at least two unwelcome consequences. First, it means that all inductive inferences are relative to the knowledge possessed by the person making the inferences. Thus, all assessments of inductive inferential strength require a full accounting of the relevant background information, and consequently entail that we need some means of assessing amounts and kinds of information. Second, and more worrisome, the requirement seems to lead to the "new riddle of induction" identified and described by Nelson Goodman [ 124 ]. The problem, put briefly, is that once the need for such information is conceded, whatever additional information is provided will, together with the evidence provided by the other statements assumed true, underdetermine what conclusion it is permissible to draw. The threat, then, is that any set of inductive inferential rules strong enough to justify claiming that a statistical association is a causal relation will permit too much. There is no principled way to say that the application of a set of inductive inference rules, together with an assumption that a set of premises (e.g. applications of Bradford Hill criteria) are true and a specification of the "total evidence" available, will justify inductively inferring a single conclusion as opposed to a myriad of other conclusions [ 125 ].

Still, perhaps we can successfully accomplish in the case of inductive inferences what we could not in the case of deductive inferences. In particular, maybe we can weaken (make less formal) the characterization of what a strong inductive inference is in a way that permits us to use satisfactions of the Bradford Hill criteria to justify, in some looser inductive sense, a causal claim. One possibility along these lines is to say that although they are not rigid criteria whose satisfaction is required for making a justified causal inference, applications of the criteria "still give positive support to inferences about causality" [ 126 ], and one can compare the results of commensurate applications of the criteria to one another. There are two key ideas at work here. The first is that while no satisfactions of any of the criteria are, singularly or jointly, necessary or sufficient for justifying the claim that a statistical association is a causal claim, the satisfactions of one or more of the criteria provide at least some informal inductive support to the claim that a statistical association is a causal relation. The second key idea is that there is no specific requirement for "rational reconstruction" of the satisfactions of the Bradford Hill criteria or the causal conclusion into a formalized language within which precise characterizations of the inductive inferences exist. Instead, there is a much looser idea at work. Regardless of how we assess whether, and to what degree, the Bradford Hill criteria are satisfied, as long as there are consistent assessments of applications of the Bradford Hill criteria we can create ordinal rankings of sets of assessments. For example, on the assumption that the strength of a dose-response is an indicator of the presence and strength of a biological gradient, if there are two statistical associations with the same event, the statistical association having the stronger dose-response provides the greater positive support to the claim that it is a causal relation [ 74 , 87 , 126 – 129 ].

While this avoids some of the problems associated with a more formal characterization of inductive inferences and inductive inferential rules, there are at least three problems with this interpretation of the inductive support provided by satisfactions of the Bradford Hill criteria. First, to the degree that Rothman et al. are correct that "the standards of epidemiologic evidence offered by Hill are saddled with reservations and exceptions" [ 100 ], it will be, at best, difficult to quantify the satisfactions of the criteria to assess degrees of confirmation. Without the ability to quantify the satisfactions of the criteria, the only reference cases against which it seems possible to measure the degree of confirmation are the null case, where no criteria are satisfied, or the singleton case where the one possible sine qua non criterion, temporal priority, is satisfied. Although some writers believe that "it is relatively straightforward to describe the conditions" under which the criteria are "clearly not satisfied" [ 22 ], using the null case to make comparisons permits too much. The comparison would lead to claiming that any satisfaction of one or more of the criteria is evidence of a causal connection, without permitting any comparison among the cases in which one or more of the same criteria are satisfied. For example, suppose that there are three statistical associations, where commensurate applications of the Bradford Hill criteria to all three result in saying that the first two associations satisfy the same five criteria while the third satisfies only four of the five criteria satisfied by the first two. What can we conclude? If there is no way to quantify the satisfactions of the Bradford Hill criteria, all we can conclude is that the inferences that the first two statistical associations are causal relations are stronger than the inference that the third statistical association is a causal relation. There is, though, no way to make any comparative assessment of the first two statistical associations. The ordinal ranking of satisfactions of the Bradford Hill criteria, in this case, seems too coarse-grained to be of much practical value.
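The coarseness of this ordinal ranking can be made concrete with a small sketch. The criterion names and satisfaction sets below are invented for illustration; the point is only that when satisfaction is recorded as a bare yes/no, nothing finer than a count is available, and the count cannot separate the first two associations.

```python
# Hypothetical, purely illustrative yes/no satisfactions of Bradford Hill
# criteria for three statistical associations (no real data involved).
satisfied = {
    "association_1": {"temporality", "strength", "consistency",
                      "biological_gradient", "plausibility"},
    "association_2": {"temporality", "strength", "consistency",
                      "biological_gradient", "plausibility"},
    "association_3": {"temporality", "strength", "consistency", "plausibility"},
}

# Without quantified degrees of satisfaction, the only ordinal ranking
# available is by the number of criteria satisfied.
ranking = sorted(satisfied, key=lambda name: len(satisfied[name]), reverse=True)
for name in ranking:
    print(name, len(satisfied[name]))

# association_1 and association_2 tie at five satisfied criteria, so the
# ranking cannot distinguish them -- the coarseness noted in the text.
```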

The second problem is that even if we can assess degrees of confirmation in a manner that permits a more fine-grained ordinal ranking (and so avoiding the first problem), all causal claims will be relative to other causal claims for which one has good reasons for believing that they have less confirmation. Causal claims are never claims simpliciter, but rather are always claims relative to one or more other possible contenders. Using causal criteria to assess whether a statistical association is a causal relationship is, to vary a remark by Rosenberg, "always a comparative affair" [ 130 ]. It only makes sense to say that a particular causal claim, C, "is more or less well confirmed by the evidence" relative to the criteria than is causal claim C*, not that C is confirmed, relative to the causal criteria, "in any absolute sense". Thus, imagine that one wonders whether a particular factor (or event), X, that is statistically associated with another factor (or event), Y, is a cause of Y. On this interpretation of the Bradford Hill criteria, the answer is never "yes" or "no", but only "yes" or "no" relative to other possible causes of Y. For example, suppose that the probability of X being a cause of Y, given some measure of the satisfaction of the Bradford Hill criteria, is greater than the probability of some other X* being a cause of Y, given some commensurate measure of the satisfaction of the criteria. It follows from this that we can say that, compared to X*, we are justified in asserting that X is the cause of Y. However, it is important to recognize the limits of this kind of claim. While it may seem that we are led to say that X is the cause of Y while X* is not, that is not correct. Instead, on the assumption that the probability assessment can be made, the most that we can assert is that the causal influence of X on Y is greater than the causal influence of X* on Y. On this account, we can rule out X* having a causal influence on Y only if X* satisfies none of the causal criteria we use to make the causal claim. Thus, except for the limiting case in which none of the criteria is satisfied, the conclusion appears to be that all statistical associations that satisfy Bradford Hill criteria are, to a greater or lesser extent, causal relationships. From the worry of not being able to identify any causal relations, we have slipped to the other extreme of finding too many causal relations; all statistical associations are causal relations, though of varying degree.

The third problem is an extension of the second problem. Suppose that B1...B9 refer to each of the nine Bradford Hill criteria. Moreover, suppose that we have a statistical association between X and Y, and so wonder whether the claim that X causes Y is justified. To take a simple example, suppose that we know that smoking is statistically associated with cancer, and we wonder whether smoking causes cancer. On the present proposal, what we would do, presumably, is to examine whether the relationship between smoking and cancer satisfies the Bradford Hill criteria. Thus, we could examine how plausible it is to suppose that there is a biological relationship between smoking and the cancer in question, we could examine whether the relationship between smoking and cancer has been "repeatedly observed by different persons, in different places, circumstances and times" [ 67 ], and so forth. As Weed notes, in cancer epidemiology, the most likely choices of Bradford Hill criteria to use are "consistency, strength, dose response and biological plausibility, leaving behind coherence, specificity, analogy and (interestingly) temporality" [ 131 ]. Of course, even by examining all these satisfactions of the Bradford Hill criteria, nothing immediately follows. Because, on this interpretation of the inductive support that satisfactions of Bradford Hill criteria give to a causal claim, assessments of whether a statistical association is a causal relationship are always relative to alternative assessments, we need additional possible causal claims against which to assess the current application of the criteria. What other possible claims should we consider?

One possibility is to say that we should compare the current causal claim against the claim that no causal relationship between smoking and cancer is present. Recall, though, that we make applications of the Bradford Hill criteria only to existing (recognized) statistical associations. Therefore, since the claim that no causal relationship between smoking and cancer is present is, in the limiting case, the claim that there is no statistical association between smoking and cancer, it follows that the limiting case is, de facto, ruled out by the presence of the statistical association. This means that we still need another statistical association involving cancer as a "cause" to which we can apply the Bradford Hill criteria and compare the results of those applications to the application of the criteria to the statistical association of smoking and cancer. Since smoking is an activity associated with many other activities of life, the obvious choice is to examine whether there is a statistical association between one or more of those other activities of life and cancer. If so, then we can apply the Bradford Hill criteria to those other associations and thus be in a position to make the kind of comparative assessment required by this understanding of the role of the Bradford Hill criteria. It is precisely here that the problem occurs. There is going to be a very large number of statistical associations that we could subject to evaluation by use of the Bradford Hill criteria. Some, such as drinking coffee or consuming alcoholic beverages, present themselves as obvious candidates, while others, such as waking up in the morning, seem to be rather silly. Curiously, it is the silly possibilities that pose the problem. The statistical association between waking up in the morning and cancer may make it a silly candidate for applications of the Bradford Hill criteria to form the appropriate contrasts, but what makes it silly? One might say that what makes it silly is the strength of the statistical association, but of course, this is itself one of the Bradford Hill criteria, and so it follows that this method of demarcation is using one of the Bradford Hill criteria to rule out applications of the other criteria.

The question now shifts to what it is that justifies this use of the Bradford Hill criterion (the criterion of statistical strength) as opposed to some different criterion or set of criteria. The problem is analogous to the "problem of induction" raised earlier. Either we have some other criteria through whose use we justify applying the full range of Bradford Hill criteria to a statistical association, or we do not. If we do, then we have the problem of justifying the application of these new criteria, which seems to threaten the same kind of explanatory regress considered earlier. If we lack some other criteria, then either the choice to take only some and not all statistical associations seriously is ad hoc, or else, to be consistent, we need to evaluate all the statistical associations. In the former case, there is no basis for resolving disagreements between choices of which statistical associations to subject to evaluation by applications of the Bradford Hill criteria. You choose one set of statistical associations and I choose another, and (apart from a way of adjudicating different theories incorporating different causal claims) that is the end of the matter. Although this state of affairs appears to reflect Susser's observation that in the case of judgments about causality, "there are no absolute rules, and different workers often come to conflicting conclusions" [ 64 ], it is difficult to understand why, even if true in practice, one would embrace this as a welcome entailment of a theory of causal inference. In the latter case, the requirement to test all the statistical associations is, except for very narrowly defined and artificial cases, practically impossible.

Suppose, though, that we somehow agree (and that our agreement is, in some sense or another, "justified") on a set of alternative statistical associations to which we will apply the Bradford Hill criteria. To keep matters simple, imagine that we have agreed that there are only two statistical associations to assess, and that X-Y is the first statistical association while X*-Y is the second statistical association. Since we have agreed to assess both, we apply the Bradford Hill criteria to the two associations (where the applications are commensurate to one another) and report the results. In the first case, by applying the criteria we discover that we have measures for six of the nine criteria, while in the second case, we have measures for only five of the nine criteria. In addition, we discover that there is information on an application of at least one criterion in each of the two sets for which information in the other set does not exist. Using B1...B9 to represent the nine criteria (with no correspondence to the order in Bradford Hill's presentation intended), we have information on the satisfaction of B1...B6 in the first case, while in the second case we have information on the satisfaction of B3...B7. The problem is that because different sets of Bradford Hill criteria are satisfied in the two cases, any ordinal comparison of the two applications can only be on the overlapping criteria. That may not seem so problematic in this case, but suppose that we have a third statistical association, X**-Y, to which we can apply (for whatever reason) only one of the Bradford Hill criteria. In this case, using the ordinal metric presupposed by the interpretation of the inductive character of the Bradford Hill criteria we are examining requires that we compare the three statistical relationships based only on the application of the single Bradford Hill criterion. Notice that while some "weight of evidence" methodologies suggest otherwise [ 132 ], it will not do to say that the inability to apply a Bradford Hill criterion is the same as saying that the Bradford Hill criterion is not satisfied. After all, counterfactually, it might be true that if the criterion had been applied in one case (say the case of X-Y) it would have had a higher degree of satisfaction than the degree of satisfaction in the case in which it was, in fact, applied (say X*-Y). This means that when assessing applications of Bradford Hill criteria to (alternative) statistical associations, we have two options. Either we must use only those criteria commensurably applied to all the statistical associations, or we need some way to make assessments about the relative importance of the criteria so that having information about the satisfaction of some counts for more than lack of information about others. In the first case, we could imagine that although forced to use only one criterion, the statistical association actually strongly satisfied the other criteria, but that this was not information we could justifiably use in making the comparative assessment of statistical associations. In the second case, what Weed refers to as the problem of the "selection and prioritization of the criteria" [ 133 ], we are back to the problem of needing some additional criteria to assess the relative value of the various Bradford Hill criteria used in making an assessment about a causal claim. For reasons adduced earlier, this seems to lead once again to an explanatory regress.
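A short sketch makes the commensurability problem vivid. The labels B1...B9 follow the text; the numerical "degrees of satisfaction" are invented placeholders, since the text's whole point is that we have no agreed way of producing such numbers.

```python
# Hypothetical degrees of satisfaction for the criteria actually applied
# to each association (invented numbers, for illustration only).
applied = {
    "X-Y":   {"B1": 0.9, "B2": 0.7, "B3": 0.8, "B4": 0.6, "B5": 0.7, "B6": 0.5},
    "X*-Y":  {"B3": 0.4, "B4": 0.9, "B5": 0.6, "B6": 0.8, "B7": 0.7},
    "X**-Y": {"B3": 0.95},
}

# A commensurate comparison can use only criteria applied to every association.
comparable = set.intersection(*(set(scores) for scores in applied.values()))
print("comparable criteria:", comparable)  # only {'B3'} once X**-Y is included

# The comparison collapses onto the single shared criterion; whatever the
# other applications showed is discarded, exactly as described in the text.
for name, scores in applied.items():
    print(name, {criterion: scores[criterion] for criterion in comparable})
```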

At this point, we seem led to the conclusion that because there are so many difficulties associated with the use of Bradford Hill criteria, we are justified in expunging their use entirely when assessing whether there is sufficient justification to claim that a statistical association is a causal relation. Regardless of whether the causal inferences based on satisfactions of Bradford Hill criteria are deductive or inductive inferences, there are problems that undermine their use in justifying the claim that a statistical association is a causal relation. However, for the supporters and advocates of the Bradford Hill criteria, the situation is not so bleak as is suggested by the foregoing analyses. Recall that Bradford Hill never referred to the "causal criteria" as "criteria" but, instead, referred to them as "aspects of association", "features of consideration" and "viewpoints" [ 67 ]. Moreover, as noted earlier, writers such as Phillips and Goodman [ 68 , 69 ] go to some pains to point out that the "aspects of association" that we have been referring to as causal criteria "clearly do not meet usual definitions of criteria" [ 68 ]. According to Bradford Hill, the value of the "criteria" is that their satisfaction can, "with greater or lesser strength ... help us make up our minds on the fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more likely than cause and effect? [ 67 ]" One way to interpret this claim that significantly weakens the "testing" role of the criteria is that while satisfactions of the criteria are neither necessary nor sufficient conditions for justifying claims that statistical associations are causal relations, they are, nevertheless, good "guidelines" or "rules of thumb" for how we should exercise caution when making causal claims. When inferring a causal relation from a statistical association, we should always keep the Bradford Hill criteria in mind and be conservative in the inferences we accept. On this interpretation, the role of the criteria is not to justify causal inferences, but, instead, to provide some "aids to thought", as Doll puts it [ 127 ], to follow whenever we use some other (still undecided) method or methods for justifying causal inferences.

The obvious problem that this interpretation seems to face is that if satisfactions of the criteria are neither necessary nor sufficient for justifiably claiming that a statistical association is a causal relation, then they are neither necessary nor sufficient as recommendations for how one should be cautious when making causal inferences. To take a simple example, suppose that someone decides to investigate whether a statistical association is a causal relation and, knowing the Bradford Hill criteria, we caution the person about to conduct the investigation to keep the criterion of consistency in mind when making any causal inferences from the statistical association. The person about to conduct the investigation might very well be puzzled by this and ask both how he or she should take consistency into account when considering the causal inference, and, even more generally, why consistency should be taken into account. In answering the first question, perhaps we should remember the concerns and criticisms of Rothman et al. about the Bradford Hill criteria being "saddled with reservations and exceptions" [ 100 ]. If those criticisms are correct, then there is no simple, unequivocal answer to this question. Other than suggesting that the person look for instances of the statistical association in a variety of different conditions, it is not clear what can be said. While this may be helpful in some very general way, this kind of general caution is certainly not unique to the Bradford Hill criteria. The problem posed by the second question is even more severe. Since consistency is not a necessary condition for a statistical association to be a causal relation, its absence, by itself, cannot undermine the person's causal claim. Moreover, since consistency is not a sufficient condition for a statistical association to be a causal relation, its presence, by itself, is no guarantee that the statistical association is a causal relation. However, the problem goes beyond this. Presumably, the idea behind treating the Bradford Hill criteria as "aids to thought" or "useful guidelines" is that their use will somehow contribute to an increased likelihood that a causal inference is a justified causal inference. The question, though, is how we are to understand this if the applications of the criteria are not themselves part of the inferential justification. One might suggest that satisfaction of the criteria results in a greater likelihood that one will correctly apply whatever method one chooses to use to justify causal inferences. Unfortunately, this does not seem like a plausible interpretation. On the one hand, the criteria do not seem to be about the use of methods, but rather about statistical associations. On the other hand, even if they are "aids to thought" whose usefulness comes from constraints they place on applications of some chosen method for making causal inferences, why suppose that the method for which the Bradford Hill criteria are constraints is the (or at least a) proper method? If the method for which the Bradford Hill criteria are constraints is the "correct" method because the Bradford Hill criteria guide that inferential method "in the right way" in identifying causal relations, then, in reality, the Bradford Hill criteria are themselves criteria for making justified inferences, even though they are not the "final" criteria.
Here, though, we are back to trying to make some sense of how they can serve this function in light of all the problems associated with linking them to either deductive or inductive inferences. If there is no independent reason for thinking that the method for which the Bradford Hill criteria provide constraints is the appropriate method for identifying which statistical associations are causal relations, then the Bradford Hill criteria have no utility in the project of justified causal inferences. If, though, there are independent reasons for accepting the method for which the Bradford Hill criteria provide constraints, then it is not clear what kind of constraints the Bradford Hill criteria provide. It would seem that applications of the Bradford Hill criteria are, in this case, independent of the chosen method for justifying causal inferences, and so provide no real constraints at all. Thus, either the criteria have very little or no use as meta-methodological criteria, or their use presupposes that they really are, in some way or another, criteria whose use will provide some kind of justification for causal inferences.

At this point, let us backtrack a bit. Suppose that we do concede that even as aids to thought, satisfactions of the Bradford Hill criteria do, in some sense, justify causal inferences and the causal conclusions of those inferences. The objection to this was that the foregoing analyses have demonstrated that there are many difficulties associated with using the criteria, regardless of whether we look at their possible role in deductive or inductive inferences. However, what is important to bring out is an implicit assumption at work in this objection. The implicit assumption is that all logical inferences are either deductive or inductive (or some combination), and that this dichotomy is an exhaustive one. It is certainly true, as remarked earlier, that this is a traditional and widely held view about the nature and character of logical inferences. As it happens, though, the assumption does not appear to be true. There is a third kind of logical inference, having its roots in C.S. Peirce's account of abduction (or what he later called retroduction), that, since the mid-1960s, has played "an enormous role in many philosophical arguments and, according to its defenders, an essential role in scientific and common-sense reasoning" [ 134 ]. This third kind of logical inference is called "inference to the best explanation" [ 135 , 136 ], and it is here, I believe, that we can find a defensible role for the Bradford Hill criteria.

As noted by Thagard, in "his writings before 1890, Peirce classified arguments into three types: deduction, induction, and hypothesis" [ 137 ]. However, by the early years of the twentieth century, Peirce had substituted "abduction" for "hypothesis", and would later substitute "retroduction" for "abduction". For example, in an April 1903 lecture delivered at Harvard University, Peirce said that there are three different kinds of reasoning – "Abduction, Induction, and Deduction" [ 138 ]. For Peirce, deductive reasoning "is the only necessary reasoning" [ 138 ] and "proves that something must be" [ 139 ], and inductive reasoning "is the experimental testing of a theory" [ 138 ] that "consists in starting from a theory, deducing from it predictions of phenomena, and observing those phenomena in order to see how nearly they agree with the theory" [ 139 ]. In contrast to both deduction and induction, abduction "consists in studying facts and devising a theory to explain them, [ 138 ]" and in this way, "is the process of forming an explanatory hypothesis" [ 139 ]. Thus, for Peirce abductive reasoning is a kind of logical inference that begins with the available facts "without, at the outset, having any particular theory in view, though it is motivated by the feeling that a theory is needed to explain" the facts [ 140 ], and discovers a conjecture (hypothesis) "that furnishes a possible Explanation" [ 141 ].

In 1965, Gilbert Harman introduced the expression "inference to the best explanation" and wrote that "'The inference to the best explanation' corresponds to what others have called 'abduction'" [ 135 ]. According to Harman, in making an inference to the best explanation, "one infers, from the fact that a certain hypothesis would explain the evidence, to the truth of that hypothesis" [ 135 ]. Of course, it is likely that there will be a number of hypotheses that, to one degree or another, "explain" the evidence. What inference to the best explanation provides is a method wherein by "starting out with a set of data", we are justified in inferring what hypothesis to take seriously as a starting point for further investigations on the grounds that the hypothesis is the best (in some, to this point, undefined sense of "best") hypothesis that explains the data [ 142 ]. Sometimes, the method of inference to the best explanation is expressed counterfactually. For example, Lipton writes that we should understand inference to the best explanation as an inference in which given "our data and our background beliefs, we infer what would, if true, provide the best of competing explanations we can generate of those data" [ 136 ]. The importance of the counterfactual formulation of inference to the best explanation is that it presents the hypothetical character of the conclusion of the inference. In inference to the best explanation, what we get is a hypothetical truth rather than a conclusion guaranteed to be true or shown to be improbably false. This concurs with Peirce's claim that abduction "does not afford security" [ 141 ] and that its purpose is to create a hypothesis, explaining the data, which we must then test by the appropriate deductive and inductive inferences.

Although there is debate about whether contemporary characterizations of inference to the best explanation (IBE) fully and accurately capture the view of abduction (retroduction) to which Peirce finally came [ 143 , 144 ], there are three important characteristics of most contemporary formulations of IBE that are largely shared with various remarks in Peirce's writing. First, while the traditional characterizations of deductive and inductive inferences take place independently of characterizations of what constitutes an explanation, there is a combination of inference and explanation in IBE. As Lipton writes, far "from explanation only coming on the scene after the inferential work is done, the core idea of Inference to the Best Explanation is that explanatory considerations are a guide to inference" [ 136 ]. In a similar vein, Douven writes that advocates of IBE "all share the conviction that explanatory considerations have confirmation-theoretical import" [ 145 ]. The second characteristic of IBE shared with Peirce's conception of abduction/retroduction is that IBE is a logical inference. In the context of examining the role of the Bradford Hill criteria, this is an especially important point. The dilemma presented by the earlier analysis was that either we understand applications of Bradford Hill criteria in their role as premises in deductive or inductive causal inferences, or we understand applications of Bradford Hill criteria as having no direct role in causal inferences. Both horns of the dilemma seem to lead to unacceptable problems, but in linking applications of the Bradford Hill criteria to IBE, we grasp the dilemma by the first horn, and attempt to defuse the dilemma by identifying a role for applications of the Bradford Hill criteria in a different kind of causal inference. The third characteristic, related to the tie between inference and explanation in IBE, is that IBE is not a "logic of proof" in the sense that deductive and inductive inferences are logics of proof, but is instead a "logic of discovery" [ 146 – 149 ]. What this means is that the explanatory character of IBE entails that the inference does not simply restate information already present in the data from which it starts (as in deduction) or try to use information already present in the data to confirm the low probability that a conclusion is false (as in induction). Instead, in IBE the data provides the context for making a logical, albeit non-deductive and non-inductive, inference to a hypothesis that (best) explains the facts. In this sense, IBE "discovers" the hypothesis that best explains the data. Thus, IBE rejects Popper's claim that "conceiving or inventing a theory" does not call for "a logical analysis" and that there "is no such thing as a logical method of having new ideas, or a logical reconstruction of this process" [ 146 ]. Using a distinction drawn by Hanson, we can make the point by saying that whereas both inductive and deductive inferences provide justification for a hypothesis, IBE provides good reasons for "suggesting" a hypothesis, whose justification (in the former sense of deductive or inductive inferential inquiry) we ought to undertake [ 147 , 149 ]. Admittedly, there is some tension between advocates of IBE who insist that IBE provides reason for believing that the hypotheses resulting from applications of IBE to data are true [ 134 , 135 , 142 ] and those who believe that while the hypotheses have explanatory virtues we should refrain from calling them "true" [ 144 ]. 
However, the counterfactual formulation, that inference to the best explanation results in a hypothesis that, if true, would provide the best explanation, is the "middle" position capturing the important elements of both sides in the debate. Moreover, this interpretation of IBE seems best suited to clearly distinguish IBE, as a logical inference, from both deductive and inductive inferences, where the (necessary or probable) truth or falsity of the conclusion is an important characteristic of the inference. Consequently, in the discussions and analyses that follow, the form of IBE used is one that incorporates the counterfactual truth-value characterization of the conclusion of the inference.

Before fleshing out some of the details, it is worth noticing that understanding the role of satisfactions of the Bradford Hill criteria in this way – as the data used in IBE – seems to sit well with at least some accounts of the role of the Bradford Hill criteria in epidemiology and health services research. For example, Kaufman and Poole write that lists of causal criteria, such as the Bradford Hill criteria, have emerged "as informal test of whether alternative explanations (e.g. confounding) are likely to exist for the hypothesis of causality" [ 70 ]. Put into the language of IBE, applications of the Bradford Hill criteria to data lead to the discovery of the most plausible (hypothetical) explanation of an observed statistical association. In a similar vein, Phillips and Goodman suggest that the Bradford Hill criteria (which they insist are not criteria at all) function informally to introduce "common sense" into the search for what causal claims to accept [ 68 ]. If "common sense" is understood as a kind of process of discovering possibilities and weeding them out, a view of common sense that, as noted by Höfler, is consistent with the philosophical tradition [ 150 ], then this view is, in important respects, similar to the view in which satisfactions of the Bradford Hill criteria play a role in IBE. In his discussion of the precautionary principle and public health, Weed makes a comment that seems to suggest that he too might be amenable to linking satisfactions of the Bradford Hill criteria to IBE. Weed writes that causal criteria are "the most commonly-used method of interpreting scientific evidence in public health", and that the criteria "are 'applied' to the available evidence after it has been collected and summarized in a systematic narrative review" [ 88 ]. If we focus on the ideas of interpretation and applications to available data, then this view, in its broad outlines, seems consonant with the idea that, in IBE, the inference is an instance of both a logic of justification (proof) and a logic of discovery. Finally, even Bradford Hill seems to have had something like the IBE role of the criteria in mind when writing about them in his Presidential Address. What Bradford Hill claimed in that address is that the satisfactions of the criteria can help us in making up our minds about the "fundamental question – is there any other way of explaining the set of facts before us, is there any other answer equally, or more, likely than cause and effect? [ 67 ]" Here, what Bradford Hill has done is to link explicitly the kind of inference supported by satisfactions of the criteria with "explaining the set of facts before us", which is precisely the kind of link IBE makes.

What, then, does it mean to place satisfactions of the Bradford Hill criteria in the framework of IBE? There are at least three important consequences of such a placement. First, and foremost, it means that satisfactions of the Bradford Hill criteria do not "justify" causal claims in the traditional sense of "justify"; satisfactions of the Bradford Hill criteria neither guarantee the truth of a causal conclusion nor make it improbable that a causal conclusion is false. It follows that studies claiming to apply "the criteria proposed by Bradford-Hill to establish causality between associated phenomena" [ 151 ] or that satisfactions of the Bradford Hill criteria "operationally" justify the existence of a causal relation [ 152 ], have seriously misunderstood the role that satisfactions of the Bradford Hill criteria play relative to causal claims. Within an IBE framework, satisfactions of Bradford Hill criteria do not justify asserting that a causal claim is true. Satisfactions of the Bradford Hill criteria do not provide "a useful tool for the assessment of biomedical causation" [ 153 ], and they do not confirm the causal efficacy of an agent (such as cancer) in the emergence of one or more symptoms [ 86 ]. Put more generally, causal criteria, within an IBE framework, are not, as Susser suggests, criteria in the "pragmatic inductive/deductive approach" whose function is to "guide the evaluation of evidence about cause" [ 154 ]. The mistake here, from the point of view of IBE, is that these claims are attempting to place satisfactions of Bradford Hill criteria in deductive or, more likely, inductive inferences. When used in IBE, applications of Bradford Hill criteria lead to the discovery of explanatory hypotheses whose explanatory power, if true, is what justifies their role as hypotheses from which further (deductive and inductive) investigations should proceed.

Even more cautious claims about the role of Bradford Hill criteria, such as that their satisfaction permits determining whether statistical associations between exposures and outcomes "are likely to be causal" [ 45 ], or that the use of the criteria is useful in reviewing the evidence in support of a causal claim [ 6 ], are likely inconsistent with the IBE understanding. Although not explicitly stated, such studies seem to make one of two (sometimes both) underlying assumptions. The first assumption is that satisfaction, to some degree, of one or more of the Bradford Hill criteria confirms the claim that a statistical association is a causal relation, while the second assumption is that the failure of those criteria to be falsified gives some reason for accepting that a statistical association is a causal relation. This contrasts with the IBE framework in which satisfactions of the Bradford Hill criteria both identify a hypothesis about a statistical association, and justify claiming that the hypothesis that the statistical association is a causal relation is, if true, the hypothesis that best explains the available data. Steinberg and Goodwin appear to come close to this view of the Bradford Hill criteria. They write that their study about alcohol and breast cancer reviewed "the available evidence regarding the association of alcohol with breast cancer" and then applied the Bradford Hill criteria to the data "to examine the existence and nature of the association of alcohol with breast cancer risk" [ 87 ]. If we replace 'examine' with 'discover', and equate discovering the nature of an association with discovering whether treating a statistical association as a causal relation is the best explanation of the statistical association, then we have something reasonably close to the idea of applying the Bradford Hill criteria in an IBE framework.

A second implication of placing the Bradford Hill criteria in an IBE framework is that the relevant inference, with the conclusion that the best explanation for a statistical association is that it is a causal relation, must begin with a body of facts (data) [ 142 ]. This is at least superficially consistent with Weed's claims that the "practice of causal inference requires a body of evidence" [ 155 ], and, with some possible qualification depending on what Weed means by "collected and summarized", that the criteria "are 'applied' to the available evidence after it has been collected and summarized in a systematic narrative review" [ 88 ]. Moreover, it seems to accord well with Susser's claim that judgments about the presence (or absence) of causal relations are "reached by weighing the available evidence" [ 64 ], and with studies that apply Bradford Hill criteria to collected evidence presented in reviewed literature [ 73 , 74 , 81 – 87 ]. The important point here is that the causal claim that is the conclusion of IBE is neither a deductive nor an inductive inference from this data, but is rather an inference in the sense that it is an explanatory claim that, if true, makes the greatest sense of the data. Put a bit differently, the hypothesis generated by IBE is "justified precisely to the extent that it is shown to have explanatory power" [ 156 ], and that explanatory power is what is revealed by the satisfactions of the Bradford Hill criteria when applied to the available data.

To reiterate, though, one cannot conclude that a causal claim inferentially supported by satisfactions within an IBE framework is a true causal claim or that it is improbable that the conclusion is false. What IBE permits is only the conclusion that the hypothesis that the statistical association is a causal relation is the best possible explanation, given the satisfactions of the Bradford Hill criteria by the data. What the satisfactions of the Bradford Hill criteria do is not make the causal claim true, but instead, justify the claim that the causal claim is the one that would, if true, be the most explanatory in light of the data to which the criteria were applied and the satisfactions of the criteria [ 136 ]. IBE, like Peircean abduction from which it comes, is, in the case of causal inference, the process of adopting a causal claim "on probation". As noted by Curd, this adoption "does not mean accepting the hypothesis [causal claim] as true, or even as inductively probable, but regarding the hypothesis as a workable conjecture, a hopeful suggestion which is worth taking seriously enough to submit to a detailed exploration and testing. [ 157 ]" Contrary to Potischman and Weed, this means that even if all the Bradford Hill criteria were applied to the data and all the criteria were, to a greater or lesser degree, satisfied, nothing would follow about whether we would be in a "strong position to make a public health recommendation, as long as other (e.g. ethical) considerations were also met" [ 105 ]. This sort of claim conflates the function of IBE with induction. Unlike Harman's view of IBE according to which all inductive inferences are subsumable under the umbrella of IBE [ 135 ], the view I am presenting in this paper is that IBE is distinct from inductive inferences. On the other side of the inductive-deductive dichotomy, it is also a mistake to claim that "causal criteria can be used to critically test – through refutation and prediction – causal hypotheses of necessary causes" [ 44 ]. This conflates the function of IBE with deduction. The only logically permitted conclusion, within the IBE framework, is that we have good reason for taking seriously the hypothesis that the statistical association is a causal relation. This does not make the conclusion true, likely, or improbably false; it only means that it is a hypothesis that we now need to investigate further to determine whether the statistical association really is a causal relation and what causal effect, if any, there is.

The third important consequence of placing the Bradford Hill criteria in an IBE framework is that the relation of satisfactions of the criteria to the hypothesized causal claim is not a formal one. In contrast with deductive inferences and the ideals of inductive inferences, there are no formal rules of IBE. As Hanson notes, for Peirce, one of the forerunners to Hanson's "logic of discovery" and IBE, there is no "manual", no formalized set of rules, to "help scientists make discoveries" about the hypotheses that best explained the data [ 147 ]. Instead, the rules of IBE are best thought of as strategies [ 158 ] to accomplish a particular goal, viz., the goal of making explanatory sense out of the data in question, where the "explanatory sense" in question means explanations within a cause-and-effect framework. In this respect, the inferences in IBE are somewhat different from the way that Hanson characterized inferences in his "logic of discovery". As Gutting notes [ 159 ], one of the principal objections to Hanson's "logic of discovery", as well as the reason why, for Gutting, Hanson's "analysis remains unfruitful", is that he conceived of its inferences as having a logical form in the same sense that deductive and inductive inferences have a logical form. By characterizing the rules of IBE (instantiated by the Bradford Hill criteria) as strategies (regulative principles), one avoids the problems associated with treating them as formal, logical rules of inference, while, using language from Simon, retaining their "logical" status as "normative standards for judging the process used to discover" the best explanatory hypothesis [ 148 ].

It is here that one's assumptions about the "nature" of causes impact the kinds of acceptable inferences to the best explanation. If one, pace Cartwright, believes that "there is an untold variety of causal relations" [ 107 ], then there will not be a single answer to what the "best" causal explanation is. The answer will vary with the kind of cause (or causes) in which one is interested. This fits well with a claim already attributed to Weed that, in cancer epidemiology, the most likely choices of Bradford Hill criteria to use are "consistency, strength, dose response and biological plausibility, leaving behind coherence, specificity, analogy and (interestingly) temporality" [ 131 ]. Moreover, this view gives substance to Susser's claim about the intimate connection between the use of causal criteria and the development of a "grammar for a pragmatic epidemiology" [ 154 ]. At the same time, this does not entail that the "inference" in IBE is nothing but a psychological inference. Acknowledging that IBE occurs within the context of inquiries about cause-and-effect relations whose goals and practices are broadly delimited by psychological, sociological and historical characteristics is not the same as saying that the inferences have no logical character. IBE still falls on the logical side of the logical/psychological dichotomy of inferences discussed earlier in the context of deductive and inductive inferences.

Of course, this still leaves a methodological issue unresolved and in need of further investigation. Even with a particular kind or sense of cause set as part of the background framework for our inquiry, how do we "know" whether applications of a set of criteria (such as the Bradford Hill criteria) to the available (and relevant) data really result in the best explanation? After all, if we had started out with different criteria, then it is possible that the explanation on which we settled would be a different one. Peirce's answer to this question in the case of abduction was that the end/goal of abduction is, "through subjection to the test of experiment, to lead to the avoidance of all surprise" and to the establishment of a productive way of interacting with the world [ 160 ]. We can tell an analogous story about the use of Bradford Hill criteria in IBE. What supports the use of the Bradford Hill criteria (or some weighted subset of the criteria) in IBE is two-fold. First, the hypotheses discovered by satisfactions of the criteria in IBE are testable (by use of deductive and inductive inferences, where the concept of "test" is appropriate). If the hypotheses were not testable, this would give good reasons for selecting another set of criteria or differently weighting the criteria we had been using. Second, if true, the hypotheses discovered by satisfactions of the criteria in IBE successfully resolve outstanding problems we have that were the source of our inquiries into causes. Thus, the "justification", if one wants to use that word, of using Bradford Hill criteria in IBE is fallibilist and pragmatic. It is not likely that this will satisfy people who want some formal justification for using the criteria, but this kind of pragmatic justification seems entirely appropriate and sensitive to the different purposes that motivate our inquiries into causes. After all, within the IBE framework, various weightings of the Bradford Hill criteria function as "causal values", in Poole's nicely captured sense, reflecting differing (though more or less shared) interests in making causal claims, differing (though more or less shared) concepts of cause, and differing (though more or less shared) standards of what counts as a causal measure [ 161 ].

Research in epidemiology and the health sciences continues to make use of criteria such as the Bradford Hill "aspects of association" in making causal inferences based on observational data. The idea of much of this research is that using satisfactions of Bradford Hill criteria justifies the causal claims that are the conclusions of such inferences. This research ranges from clinical research in pediatric nephrology [ 162 ], to the relationship between "the parenchymal pattern of the breast seen on mammographic examination and risk of breast cancer" [ 163 ], to pharmacovigilance [ 164 ]. However, as argued above, such research is ill served by the use of the Bradford Hill criteria when the inferences in which they are used are either deductive or inductive causal inferences. If correct, then what options are available for researchers wanting to make justified causal claims? One possibility is to accept a variation of Russell's 1912 claim in his presidential address to the Aristotelian Society and say that the word 'cause' is so "inextricably bound up with misleading associations" as to make its complete extrusion from the scientific vocabulary desirable [ 165 ]. A second possibility is to say that if we want truly causal claims, then we should restrict our attention to data from properly conducted randomized controlled experimental studies. However, each of these two conclusions is, in its own way, too Draconian.

Regarding the first possibility, following Cartwright, it seems that we need causal concepts to distinguish between effective and ineffective strategies [ 166 ]. To use an example by Field, although there is a high statistical correlation between smoking and lung cancer, taking an anti-cancer drug is not an effective strategy for quitting smoking, which suggests that the concept of cause plays a crucial role in distinguishing effective from ineffective strategies [ 167 ]. Thus, the cost of expunging "causal talk" from the sciences would be to undermine the practical goals of science, as well as the hope of using the results of scientific inquiry to create beneficial policies and help in making sound legal decisions. Regarding the second possibility, not only would this restrict causal claims to a very narrow range of data (excluding, for example, studies that use survey data), it also assumes that properly conducted RCTs really do justify causal claims. However, as discussed previously, this assumption is subject to a variety of practical and methodological difficulties [ 30 , 41 , 42 ], not the least of which is that, as Cartwright writes, the method of randomized controlled experiments may tell us something about causal relations in the very specific circumstances of the experiment, but "tells us nothing about what the cause does elsewhere" [ 107 ].

Rather than accepting either of the possible Draconian conclusions, I have argued in this paper that there is an alternative account of the role of the Bradford Hill criteria (and of causal criteria more generally). The problems associated with the use of causal criteria are due to supposing that their satisfactions play a role in either deductive or inductive causal inferences. Given the long tradition of dichotomizing logical inferences into deductive and inductive inferences, and supposing that the dichotomy is an exhaustive one, this is a natural supposition. However, by acknowledging and understanding a kind of logical inference, crucial in the "logic of discovery", that is neither deductive nor inductive, and by placing applications of the Bradford Hill criteria in this framework, the framework of inference to the best explanation, we find a new and important role for the criteria. Applications of the criteria, with a recognition that the criteria may change in content or in the emphasis placed on individual criteria depending on the conception of cause which motivates the inquiry about causal relations, play a crucial role in the discovery and justification of what hypothetical causal claims merit further, detailed study. What kind of further study is that? Part of the value of the role of causal criteria presented in this paper is that this question remains an open one, and that the use of causal criteria complements many possible approaches that one may take to the task of justifying the claim that it is true (or false) that a statistical association is a causal relation. Satisfactions of the Bradford Hill criteria, in the IBE framework described in this paper, do not permit inferring that a statistical association is a causal relation. Instead, such satisfactions only justify claiming that, if true, the hypothetical identification of a statistical association as a causal relation is the best explanation supported by the data [ 136 , 168 ]. Thus, satisfactions of the Bradford Hill criteria in the IBE framework provide a propaedeutic to further, statistical analyses of causal claims. As an example, for those interested in using Bayesian methods [ 169 , 170 ], the information provided by satisfactions of the Bradford Hill criteria in an IBE framework may contribute to the specification of the needed prior probabilities [ 136 , 142 , 171 ]. Once applications of causal criteria in an IBE framework present us with causal hypotheses that merit further study, only careful and reflective analyses using the appropriate methodological safeguards and statistical tools will lead to justified claims about the truth or falsity of those hypotheses.
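To make the Bayesian possibility mentioned above slightly more concrete, the following sketch shows one hypothetical way an informal, criteria-based score might be mapped onto a prior probability that an association is causal and then updated against a subsequent study result. The mapping, the satisfaction scores, and the likelihoods are all invented assumptions for illustration; nothing in the Bradford Hill criteria or the cited literature dictates them.

```python
# A hedged, hypothetical sketch: criteria satisfactions inform a prior,
# which later evidence then updates. All numbers are invented.

def prior_from_criteria(scores, floor=0.05, ceiling=0.75):
    """Map an average degree-of-satisfaction in [0, 1] to a prior P(causal)."""
    average = sum(scores.values()) / len(scores)
    return floor + (ceiling - floor) * average

# Invented degrees of satisfaction for one statistical association.
scores = {"temporality": 1.0, "strength": 0.6, "consistency": 0.7,
          "biological_gradient": 0.5, "plausibility": 0.4}
prior = prior_from_criteria(scores)

# Assumed likelihoods of observing a follow-up study's result under each
# hypothesis, followed by a single Bayes update.
p_data_if_causal = 0.8
p_data_if_not_causal = 0.3
posterior = (p_data_if_causal * prior) / (
    p_data_if_causal * prior + p_data_if_not_causal * (1 - prior)
)
print(f"prior={prior:.2f}, posterior={posterior:.2f}")
```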

Salmon WC: Causality and Explanation New York, NY: Oxford University Press 1998.

Arjas E: Causal Analysis and Statistics: A Social Sciences Perspective. Eur Sociol Rev 2001, 17: 59–64.

Woodward J: Making Things Happen: A Theory of Causal Explanation New York, NY: Oxford University Press 2003.

Marini M, Singer B: Causality in the Social Sciences. Sociol Methodol 1988, 18: 347–409.

Pearl J: Causal Inference in the Health Sciences: A Conceptual Introduction. Health Serv Outcomes Res Methodol 2001, 2: 189–220.

Reekum R, Streiner DL, Conn DK: Applying Bradford Hill's Criteria to Neuropsychiatry: Challenges and Opportunities. J Neuropsychiatry Clin Neurosci 2001, 13: 318–325.

Levy H, Meltzer D: The Impact of Health Insurance on Health. Annu Rev Public Health 2008, 29: 399–409.

De Vreese L: Epidemiology and Causation. Med Health Care and Philos 2009.

Mawson AR: On Not Taking the World As You Find It – Epidemiology In Its Place. J Clin Epidemiol 2002, 55: 1–4.

Rockhill B: Theorizing About Causes at the Individual Level While Estimating Effects at the Population Level. Epidemiology 2005, 16: 124–129.

Renton A: Epidemiology and Causation: A Realist View. J Epidemiol Community Health 1994, 48: 79–85.

Parascandola M, Weed DL: Causation in Epidemiology. J Epidemiol Community Health 2001, 55: 905–912.

Gordis L: Epidemiology second Edition Philadelphia, PA: W.B. Saunders Company 2000.

Fletcher RH, Fletcher SW: Clinical Epidemiology: The Essentials fourth Edition Philadelphia, PA: Lippincott Williams and Williams 2005.

Morabia A: Epidemiology: An Epistemological Perspective. A History of Epidemiologic Methods and Concepts (Edited by: Morabia A). Basel, Switzerland: Birkhäuser Verlag 2004, 3–124.

Swaen G, van Amelsvoort L: A Weight of Evidence Approach to Causal Inference. J Clin Epidemiol 2009, 62: 270–277.

Botti C, Comba P, Forastiere F, Settimi L: Causal Inference in Environmental Epidemiology: The Role of Implicit Values. The Science of the Total Environment 1996, 184: 97–101.

Gori GB: Considerations on Guidelines of Epidemiologic Practice. Ann Epidemiol 2002, 12: 73–78.

Oswald A: Commentary: Human Well-being and Causality in Social Epidemiology. Int J Epidemiol 2007, 36: 1253–1254.

Lipton R, Ødegaard T: Causal Thinking and Causal Language in Epidemiology: It's All in the Details. Epidemiol Perspect Innov 2005, 2.

Weed DL: Environmental Epidemiology Basics and Proof of Cause-Effect. Toxicology 2002, 181–182: 399–403.

Weed DL: Methodologic Implications of the Precautionary Principle: Causal Criteria. Int J Occup Med Environ Health. 2004, 17 (1) : 77–81.

Rosenbaum PR: Safety in Caution. Journal of Educational Statistics 1989, 14: 169–173.

Urbach P: Randomization and the Design of Experiments. Philos Sci 1985, 52: 256–273.

Maldonado G, Greenland S: Estimating Causal Effects. Int J Epidemiol 2003, 31: 422–429.

Greenland S, Rothman KJ, Lash TL: Measures of Effect and Measures of Association. Modern Epidemiology third Edition (Edited by: Rothman KJ, Greenland S, Lash TL). Philadelphia, PA: Lippincott Williams and Wilkins 2008, 51–70.

Reiter J: Using Statistics to Determine Causal Relationships. Am Math Mon 2000, 107: 24–32.

Suppes P: Arguments for Randomizing. PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 1982, 2: 464–475.

Greenland S: Randomization, Statistics, and Causal Inference. Epidemiology 1990, 1: 421–429.

Greenland S, Robins JM, Pearl J: Confounding and Collapsibility in Causal Inference. Stat Sci 1999, 14: 29–46.

Hernán MA: A Definition of Causal Effect for Epidemiological Research. J Epidemiol Community Health 2004, 58: 265–271.

Fisher RA: Development of the Theory of Experimental Design. Proceedings of the International Statistical Conferences 1947, 3: 434–439.

Cobb GW, Moore DS: Mathematics, Statistics, and Teaching. Am Math Mon 1997, 104: 801–823.

DeMets DL: Clinical Trials in the New Millennium. Stat Med 2002, 21: 2779–2787.

D'Agostino RB Jr, D'Agostino RB Sr: Estimating Treatment Effects Using Observational Data. JAMA 2007, 297: 314–316.

U.S. Department of Education: Identifying and Implementing Educational Practices Supported by Rigorous Evidence: A User Friendly Guide Washington, DC: Institute for Educational Services 2003, 1–4.

Machin D: On the Evolution of Statistical Methods as Applied to Clinical Trials. J Intern Med 2004, 255: 521–528.

Cartwright N: Are RCTs the Gold Standard? BioSocieties 2007, 2: 11–20.

Cartwright N: Causal Powers: What Are They? Why Do We Need Them? What Can Be Done with Them and What Cannot? London: Center for Philosophy of Natural and Social Science 2007.

Silverman SL: From Randomized Controlled Trials to Observational Studies. Am J Med 2009, 122: 114–120.

Worrall J: Why There's No Cause to Randomize. Brit J Phil Sci 2007, 58: 451–458.

Papineau D: The Virtues of Randomization. Brit J Phil Sci 1994, 45: 437–450.

Smith HL: Specification Problems in Experimental and Nonexperimental Social Research. Sociol Methodol 1990, 20: 59–91.

Weed DL: On the Use of Causal Criteria. Int J Epidemiol 1997, 26: 1137–1141.

Mazlack L: Discovery of Causality Possibilities. Intern J Pattern Recognit Artif Intell 2004, 18: 63–73.

Little RJ, Rubin DB: Causal Effects in Clinical and Epidemiological Studies Via Potential Outcomes: Concepts and Analytical Approaches. Annu Rev Public Health 2000, 21: 121–145.

Rosenbaum PR: Observational Studies second Edition New York, NY: Springer 2002.

Cochran WG: The Planning of Observational Studies of Human Populations. J R Stat Soc [Ser A] 1965, 128: 234–255.

Rubin DB: The Design versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials. Statist Med 2007, 26: 30–36.

Freedman DA: Statistical Models for Causation: What Inferential Leverage Do They Provide? Evaluation Review 2006, 30: 691–713.

Morgan SL, Winship C: Counterfactuals and Causal Inference: Methods and Principles for Social Research Cambridge: Cambridge University Press 2007.

Freedman DA: Statistical Models and Shoe Leather. Sociol Methodol 1991, 21: 291–313.

Clogg CC, Haritou A: The Regression Model of Causal Inference and a Dilemma Confronting this Method. Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences (Edited by: McKim V, Turner S). Notre Dame, IL: University of Notre Dame Press 1997, 83–112.

Greenland S: Modeling and Variable Selection in Epidemiologic Analysis. Am J Public Health 1989, 79: 340–349.

Rubin DB: Estimating Causal Effects from Large Data Sets Using Propensity Scores. Ann Intern Med 1997, 127: 757–763.

Rubin DB: Using Propensity Scores to Help Design Observational Studies: Application to the Tobacco Litigation. Health Serv Outcomes Res Methodol 2001, 2: 169–188.

Rosenbaum PR, Rubin DB: The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 1983, 70: 41–55.

Ward A, Johnson PJ: Addressing Confounding Errors When Using Non-Experimental, Observational Data to Make Causal Claims. Synthese 2008, 163: 419–432.

Oakes JM, Johnson PJ: Propensity Score Matching for Social Epidemiology. Methods in Social Epidemiology (Edited by: Oakes JM Oakes, Kaufman JS). San Francisco, CA: Jossey-Bass 2006, 370–392.

Heckman JL: Causal Parameters and Policy Analysis in Economics: A Twentieth Century Retrospective. NBER Working Paper No. 7333 1999.

Pearl J: Causality: Models, Reasoning and Inference Cambridge: Cambridge University Press 2001.

Spirtes P, Glymour C, Scheines R: Causation, Prediction, and Search second Edition Cambridge, MA: MIT Press 2000.

Freedman DA: From Association to Causation via Regression. Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences (Edited by: McKim V, Turner S). Notre Dame, IL: University of Notre Dame Press 1997, 113–161.

Susser M: Causal Thinking in the Health Sciences: Concepts and Strategies of Epidemiology New York, NY: Oxford University Press 1973.

Greenland S: An Overview of Methods for Causal Inference from Observational Studies. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives (Edited by: Gelman A, Meng X-L). New York, NY: John Wiley and Sons, Ltd 2004, 3–13.

U.S. Department of Health, Education and Welfare: Smoking and Health: Report of the Advisory Committee to the Surgeon General of the Public Health Service Washington, DC: United States Public Health Service 1964.

Hill A: The Environment and Disease: Association or Causation? Proc R Soc Med 1965, 58: 295–300.

Phillips CV, Goodman KJ: Causal Criteria and Counterfactuals; Nothing More (or Less) than Scientific Common Sense. Emerg Themes Epidemiol 2006, 3: 5.

Phillips CV, Goodman KJ: The Missed Lessons of Sir Austin Bradford Hill. Epidemiol Perspect Innov. 2004, 1 (1) : 3.

Kaufman JS, Poole C: Looking Back on "Causal Thinking in the Health Sciences". Annu Rev Public Health 2000, 21: 101–119.

Horwin M: Sir Austin Bradford Hill. [ http://www.sv40foundation.org/Bradford-Hill.html ]

Environmental Protection Agency: Guidelines for Carcinogen Risk Assessment – EPA/630/P-03/001f Washington, DC: Risk Assessment Forum, U.S. Environmental Protection Agency 2005.

Lemen RA: Chrysotile Asbestos as a Cause of Mesothelioma: Application of the Hill Causation Model. Int J Occup Environ Health 2004, 10: 233–239.

Holt RIG, Peveler RC: Antipsychotic Drugs and Diabetes – An Application of the Austin Bradford Hill Criteria. Diabetologia 2006, 49: 1467–1476.

Franco FL, Correa P, Santella RM, Wu X, Goodman SN, Petersen GM: Role and Limitations of Epidemiology in Establishing a Causal Association. Semin Cancer Biol 2004, 14: 413–426.

Link CL, Lutfey KE, Steers WD, McKinlay JB: Is Abuse Causally Related to Urologic Symptoms? Results from the Boston Area Community Health (BACH) Survey. Eur Urol 2007, 52: 397–406.

Shakir SAW, Layton D: Causal Association in Pharmacovigilance and Pharmacoepidemiology: Thoughts on the Application of the Austin Bradford-Hill Criteria. Drug Saf 2002, 25: 467–471.

Ashby D: Establishing Causality in the Assessment of Safety of Medicine for Children. Acta Paediatr 2008.

Bennett WP, Hussain SP, Vahakangas KH, Khan MA, Shields PG, Harris CC: Molecular Epidemiology of Human Cancer Risk: Gene-Environment Interactions and p53 Mutation Spectrum in Human Lung Cancer. J Pathol 1999, 187: 8–18.

Fisher MA, Gilbert GH, Shelton BJ: Effectiveness of Dental Services in Facilitating Recovery from Oral Disadvantage. Qual Life Res 2005, 14: 197–206.

Dickerson MC, Johnston J, Delea TE, White A, Andrews E: The Causal Role for Genital Ulcer Disease as a Risk Factor for Transmission of Human Immunodeficiency Virus: An Application of the Bradford Hill Criteria. Sex Transm Dis 1996, 23: 429–440.

Köhler TS, McVary KT: The Evolving Relationship of Erectile Dysfunction and Lower Urinary Tract Symptoms. Current Sexual Health Reports 2008, 5: 9–16.

Viganò P, Somigliana E, Parazzini F, Vercellini P: Bias versus Causality: Interpreting Recent Evidence of Association Between Endometriosis and Ovarian Cancer. Fertil Steril 2007, 88: 588–593.

McCann JC, Hudes M, Ames BN: An Overview of Evidence for a Causal Relationship Between Dietary Availability of Choline During Development and Cognitive Function in Offspring. Neurosci Biobehav Rev 2006, 30: 696–712.

Whitrow MJ, Smith BJ, Pilotto LS, Pisaniello D, Nitschke M: Environmental Exposure to Carcinogens Causing Lung Cancer: Epidemiological Evidence From the Medical Literature. Respirology 2003, 8: 513–521.

Naschitz JE, Kovaleva J, Shaviv N, Rennert G, Yeshurun D: Vascular Disorders Preceding Diagnosis of Cancer: Distinguishing the Causal Relationships based on Bradford-Hill Guidelines. Angiology 2003, 54: 11–17.

Steinberg J, Goodwin PJ: Alcohol and Breast Cancer Risk – Putting the Current Controversy into Perspective. Breast Cancer Res Treat 1991, 19: 221–231.

Weed DL: Precaution, Prevention, and Public Health Ethics. J Med Philos 2004, 29: 313–332.

Parascandola M: Two Approaches to Etiology: The Debate Over Smoking and Lung Cancer in the 1950s. Endeavour 2004, 28: 81–86.

Kardes FR, Posavac SS, Cronley ML: Consumer Inference: A Review of Processes, Bases, and Judgment Contexts. J Consum Psychol 2004, 14: 230–256.

Goldman AI: Epistemology and Cognition Cambridge MA: Harvard University Press 1986.

Harman G: Reasoning, Meaning and Mind Oxford: Clarendon Press 1999.

Skyrms B: Choice and Chance: An Introduction to Inductive Logic Belmont, CA: Dickenson Publishing Company, Inc 1966.

Haack S: Philosophy of Logics Cambridge: Cambridge University Press 1978.

Stich SP: Logical Form and Natural Language. Philos Stud 1975, 28: 397–418.

Quine WVO: Philosophy of Logic Englewood Cliffs, NJ: Prentice-Hall, Inc 1970.

Greenland S: Induction versus Popper: Substance versus Semantics. Int J Epidemiol 1998, 27: 543–548.

Niiniluoto I: Analogy and Inductive Logic. Erkenntnis 1981, 16: 1–34.

Rothman KJ, Greenland S: Causation and Causal Inference in Epidemiology. Am J Public Health 2005, 95: S144-S150.

Rothman KJ, Greenland S, Poole C, Lash TL: Causation and Causal Inference. Modern Epidemiology third Edition (Edited by: Rothman KJ, Greenland S, Lash TL). Philadelphia, PA: Lippincott, Williams and Wilkins 2008, 6–31.

Doll R: Fisher and Bradford Hill: Their Personal Impact. Int J Epidemiol 2003, 32: 929–931.

Legator MS, Morris DL: What Did Sir Bradford Hill Really Say? Arch Environ Health 2003, 58: 718–720.

Russo F, Williamson J: Interpreting Causality in the Health Sciences. International Studies in the Philosophy of Science 2007, 21: 157–170.

Kundi M: Causality and the Interpretation of Epidemiologic Evidence. Environ Health Perspect 2006, 114: 969–974.

Potischman N, Weed DL: Causal Criteria in Nutritional Epidemiology. Am J Clin Nutr 1999, 69 (suppl) : 1309S-14S.

Dammann O, Leviton A: Inflammatory Brain Damage in Preterm Newborns – Dry Numbers, Wet Lab, and Causal Inference. Early Hum Dev 2004, 79: 1–15.

Cartwright N: Hunting Causes and Using Them: Approaches in Philosophy and Economics Cambridge: Cambridge University Press 2007.

Bird A: Philosophy of Science Montreal: McGill-Queen's University Press 1998.

Rothman KJ: Epidemiology: An Introduction New York, NY: Oxford University Press 2002.

Russell B: The Problems of Philosophy New York, NY: Oxford University Press 1959.

Maher P: A Conception of Inductive Logic. Philos Sci 2006, 78: 513–523.

Owen D: Hume's Reason New York, NY: Oxford University Press 1999.

Penelhum T: Hume New York, NY: St. Martin's Press 1975.

Carnap R: Inductive Logic and Science. Proceedings of the American Academy of Arts and Sciences 1953, 80: 189–197.

Carnap R: On Inductive Logic. Philos Sci 1945, 12: 72–97.

Haack S: The Justification of Deduction. Mind 1976, 85: 112–119.

Reichenbach H: On the Justification of Deduction. J Philos 1940, 37: 97–103.

Harré R: Dissolving the "Problem" of Induction. Philosophy 1957, 32: 58–64.

Friedman KS: Another Shot at the Canons of Induction. Mind 1975, 84: 177–191.

Hempel C: On the Cognitive Status and Rationale of Scientific Methodology. Carl G. Hempel: Selected Philosophical Essays (Edited by: Jeffrey R). Cambridge: Cambridge University Press 2000, 199–228.

Hempel C: Inductive Inconsistencies. Aspects of Scientific Explanation, and Other Essays in the Philosophy of Science New York, NY: The Free Press 1965, 53–79.

Carnap R: On the Application of Inductive Logic. Philos Phenomenol Res 1947, 8: 133–148.

Hempel C: Aspects of Scientific Explanation. Aspects of Scientific Explanation, and Other Essays in the Philosophy of Science New York, NY: The Free Press 1965, 331–496.

Goodman N: Fact, Fiction and Forecast fourth Edition Cambridge MA: Harvard University Press 1983.

Godfrey-Smith P: Goodman's Problem and Scientific Methodology. J Philos 2003, 100: 573–590.

Thygesen LC, Andersen GS, Andersen H: A Philosophical Analysis of the Hill Criteria. J Epidemiol Community Health 2005, 59: 512–516.

Doll R: Proof of Causality: Deduction from Epidemiological Observation. Perspect Biol Med 2002, 45: 499–515.

Macdonald S, Cherpitel CJ, Borges G, DeSouza A, Giesbrecht N, Stockwell T: The Criteria for Causation of Alcohol in Violent Injuries Based on Emergency Room Data from Six Countries. Addict Behav 2005, 30: 103–113.

Madjid M, Aboshady I, Awan I, Litovsky S, Casscells SW: Influenza and Cardiovascular Disease: Is There a Causal Relationship? Tex Heart Inst J 2004, 31: 4–13.

Rosenberg A: Philosophy of Science: A Contemporary Introduction New York, NY: Routledge 2000.

Weed DL: Causation: An Epidemiologic Perspective (In Five Parts). Journal of Law and Policy 2003, 12: 43–53.

Weed DL: Weight of Evidence: A Review of Concepts and Methods. Risk Anal 2005, 25: 1545–1557.

Weed DL: Evidence Synthesis and General Causation: Key Methods and an Assessment of Reliability. Drake L Rev 2005, 54: 639–650.

Day T, Kincaid H: Putting Inference to the Best Explanation in Its Place. Synthese 1994, 98: 271–295.

Harman G: The Inference to the Best Explanation. Philos Rev 1965, 74: 88–95.

Lipton P: Inference to the Best Explanation second Edition New York, NY: Routledge 2004.

Thagard PR: The Unity of Peirce's Theory of Hypothesis. Transactions of the Charles S. Peirce Society 1977, 13: 112–121.

Peirce CS: The Three Normative Sciences. The Essential Peirce: Selected Philosophical Writings, (1893 – 1913). Edited by the Peirce Edition Project Bloomington, IN: Indiana University Press 1998, 2: 196–207.

Peirce CS: The Nature of Meaning. The Essential Peirce: Selected Philosophical Writings, (1893 – 1913). Edited by the Peirce Edition Project Bloomington, IN: Indiana University Press 1998, 2: 208–225.

Peirce CS: On the Logic of Drawing History from Ancient Documents, Especially from Testimonies. The Essential Peirce: Selected Philosophical Writings, (1893 – 1913). Edited by the Peirce Edition Project Bloomington, IN: Indiana University Press 1998, 2: 75–114.

Peirce CS: The Neglected Argument for the Reality of God. The Essential Peirce: Selected Philosophical Writings, (1893 – 1913). Edited by the Peirce Edition Project Bloomington, IN: Indiana University Press 1998, 2: 434–450.

Okasha A: Van Fraassen's Critique of Inference to the Best Explanation. Stud Hist Phil Sci 2000, 31: 691–710.

Minnameier G: Peirce-Suit of Truth: Why Inference to the Best Explanation and Abduction Ought Not to Be Confused. Erkenntnis 2004, 60: 75–105.

Paavola S: Hansonian and Harmanian Abduction as Models of Discovery. International Studies in the Philosophy of Science 2006, 20: 93–108.

Douven I: Inference to the Best Explanation Made Coherent. Philos Sci 1999, 66 (Proceedings) : S424-S435.

Popper K: The Logic of Scientific Discovery New York, NY: Routledge Classics 2002.

Hanson NR: The Logic of Discovery. J Philos 1958, 55: 1073–1089.

Simon H: Does Scientific Discovery Have a Logic? Philos Sci 1973, 40: 471–480.

Hanson NR: More on "The Logic of Discovery". J Philos 1960, 57: 182–188.

Höfler M: Getting Causal Considerations Back on the Right Track. Emerg Themes Epidemiol. 2006, 3: 8.

Anzola GP, Mazzucco A: The Patent Foramen Ovale-Migraine Connection: A New Perspective to Demonstrate a Causal Relation. Neurol Sci 2008, 29: S15-S18.

Terasaki PI, Cai J: Human Leukocyte Antigen Antibodies and Chronic Rejection: From Association to Causation. Transplantation 2008, 86: 377–383.

Perrio M, Voss S, Shakir SAW: Application of the Bradford Hill Criteria to Assess the Causality of Cisapride-Induced Arrhythmia. Drug Saf 2007, 30: 333–346.

Susser M: What is a Cause and How Do We Know One? A Grammar for Pragmatic Epidemiology. Am J Epidemiol 1991, 133: 635–648.

Weed DL: Epidemiologic Evidence and Causal Inference. Hematol Oncol Clin North Am 2000, 14: 797–807.

McMullin E: Structural Explanation. Am Philos Q 1978, 15: 139–147.

Curd MV: The Logic of Discovery: An Analysis of Three Approaches. Scientific Discovery, Logic, and Rationality (Edited by: Nickles T). Dordrecht: D. Reidel Publishing Company 1980, 201–219.

Paavola S: Abduction as a Logic and Methodology of Discovery: The Importance of Strategies. Foundation of Science 2004, 9: 267–283.

Gutting G: The Logic of Invention. Scientific Discovery, Logic, and Rationality (Edited by: Nickles T). Dordrecht: D. Reidel Publishing Company 1980, 221–234.

Peirce CS: Pragmatism as the Logic of Abduction. The Essential Peirce: Selected Philosophical Writings, (1893 – 1913). Edited by the Peirce Edition Project Bloomington, IN: Indiana University Press 1998, 2: 226–241.

Poole C: Causal Values. Epidemiology 2001, 12: 139–141.

Gorman GH, Furth SL: Clinical Research in Pediatric Nephrology: State of the Art. Pediatr Nephrol 2005, 20: 1382–1387.

Goodwin PJ, Boyd NF: Mammographic Parenchymal Pattern and Breast Cancer Risk: A Critical Appraisal of the Evidence. AM J Epidemiol 1988, 127: 1097–1108.

Shakir SAW: Causality and Correlation in Pharmacovigilance. Stephens' Detection of New Adverse Drug Reactions (Edited by: Talbot J, Waller P). Chichester: John Wiley and Sons, Ltd 2004, 329–343.

Russell B: On the Notion of Cause. Mysticism and Logic and Other Essays London: George Allen and Unwin, Ltd 1949, 180–208.

Cartwright N: How the Laws of Physics Lie Oxford: Clarendon Press 1983.

Field H: Causation in a Physical World. The Oxford Handbook of Metaphysics (Edited by: Loux MJ, Zimmerman DW). Oxford: Oxford University Press 2005, 435–460.

Haig BD: An Abductive Perspective on Theory Construction. J Clin Psychol. 2008, 64 (9) : 1046–1068.

Greenland S: Bayesian Perspectives for Epidemiological Research: I. Foundations and Basic Methods. Int J Epidemiol 2006, 35: 765–775.

Greenland S: Bayesian Perspectives for Epidemiological Research. II. Regression Analysis. Int J Epidemiol 2006, 36: 195–202.

Greenland S: Probability Logic and Probabilistic Induction. Epidemiology 1998, 9: 322–332.

Acknowledgements

I acknowledge and thank Professor George Maldonado and three anonymous referees for their useful comments on earlier versions of the manuscript.

Author information

Authors and Affiliations

Minnesota Population Center, University of Minnesota, 50 Willey Hall, 225 – 19th Avenue South, Minneapolis, MN, 55455, USA

Andrew C Ward

Corresponding author

Correspondence to Andrew C Ward .

Additional information

Competing interests

The author declares that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article.

Ward, A.C. The role of causal criteria in causal inferences: Bradford Hill's "aspects of association". Epidemiol Perspect Innov 6 , 2 (2009). https://doi.org/10.1186/1742-5573-6-2

Received : 11 August 2008

Accepted : 17 June 2009

Published : 17 June 2009

DOI : https://doi.org/10.1186/1742-5573-6-2


Randomised controlled trials—the gold standard for effectiveness research

Eduardo Hariton

1 Department of Obstetrics, Gynecology, and Reproductive Biology, Brigham and Women’s Hospital, 75 Francis Street, Boston, MA, 02116, USA

Joseph J. Locascio

2 Department of Neurology, Massachusetts General Hospital, 15 Parkman Street, Boston, Massachusetts 02114

Randomized controlled trials (RCTs) are prospective studies that measure the effectiveness of a new intervention or treatment. Although no study is likely on its own to prove causality, randomization reduces bias and provides a rigorous tool to examine cause-effect relationships between an intervention and an outcome. This is because the act of randomization balances participant characteristics (both observed and unobserved) between the groups, allowing attribution of any differences in outcome to the study intervention. This is not possible with any other study design.

In designing an RCT, researchers must carefully select the population, the interventions to be compared and the outcomes of interest. Once these are defined, the number of participants needed to reliably determine if such a relationship exists is calculated (power calculation). Participants are then recruited and randomly assigned to either the intervention or the comparator group. 1 It is important to ensure that at the time of recruitment there is no knowledge of which group the participant will be allocated to; this is known as concealment. This is often ensured by using automated randomization systems (e.g. computer generated). RCTs are often blinded so that participants and doctors, nurses or researchers do not know what treatment each participant is receiving, further minimizing bias.
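
The two design steps just mentioned can be sketched in a few lines of Python. The effect size, alpha, power and random seed below are invented for illustration (they are not from the article), and the normal-approximation sample-size formula is only a rough stand-in for the calculations a trial statistician would actually use.

```python
import numpy as np
from scipy.stats import norm

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate participants per arm for a two-sided test comparing two
    means, using the normal approximation and a standardized effect size."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2))

n = n_per_arm(effect_size=0.5)  # assumed effect of 0.5 SD -- illustrative only
print("participants needed per arm:", n)

# Allocation list generated up front by computer, so that no one recruiting
# a participant can foresee the next assignment (concealment).
rng = np.random.default_rng(seed=42)
allocation = rng.permutation(["intervention"] * n + ["comparator"] * n)
print(allocation[:8])
```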

RCTs can be analyzed by intention-to-treat analysis (ITT; subjects analyzed in the groups to which they were randomized), per protocol (only participants who completed the treatment originally allocated are analyzed), or other variations, with ITT often regarded as the least biased. All RCTs should have pre-specified primary outcomes, should be registered with a clinical trials database and should have appropriate ethical approvals.
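
A toy, made-up dataset makes the contrast between the two analysis sets concrete: here the non-adherent patients in the treatment arm happen to be those who do worst, so the per-protocol analysis flatters the treatment relative to ITT. The numbers and column names are invented for the example.

```python
import pandas as pd

# Made-up trial: the sickest patients in the treatment arm stop the drug,
# so excluding them (per protocol) flatters the treatment.
df = pd.DataFrame({
    "assigned":  ["treatment"] * 6 + ["control"] * 6,
    "adhered":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    "recovered": [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# Intention-to-treat: everyone analysed in the arm they were randomized to.
itt = df.groupby("assigned")["recovered"].mean()

# Per protocol: only participants who completed the allocated treatment.
pp = df[df["adhered"] == 1].groupby("assigned")["recovered"].mean()

print("ITT recovery rates:")
print(itt)   # treatment 0.50 vs control 0.33
print("Per-protocol recovery rates:")
print(pp)    # treatment 0.75 vs control 0.33 -- a rosier picture
```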

RCTs do have drawbacks, including their high cost in time and money, problems with generalisability (participants who volunteer might not be representative of the population being studied) and loss to follow-up.

USEFUL RESOURCES

  • CONSORT Statement: CONsolidated Standards of Reporting Trials guidelines designed to improve the reporting of parallel-group randomized controlled trials - http://www.consort-statement.org/consort-2010
  • Link to A Randomized, Controlled Trial of Magnesium Sulfate for the Prevention of Cerebral Palsy in the New England Journal of Medicine – A well-designed RCT that had a significant impact on practice patterns. http://www.nejm.org/doi/full/10.1056/NEJMoa0801187#t=abstract

LEARNING POINTS

While expensive and time-consuming, RCTs are the gold standard for studying causal relationships, as randomization eliminates much of the bias inherent in other study designs.

To provide a true assessment of causality, RCTs need to be conducted appropriately (i.e. with concealment of allocation, ITT analysis and blinding when appropriate).

Disclosures: The authors have no financial interests to disclose

What are randomised controlled trials good for?

  • Open access
  • Published: 01 October 2009
  • Volume 147, pages 59–70 (2010)


  • Nancy Cartwright


Randomized controlled trials (RCTs) are widely taken as the gold standard for establishing causal conclusions. Ideally conducted they ensure that the treatment ‘causes’ the outcome—in the experiment. But where else? This is the venerable question of external validity. I point out that the question comes in two importantly different forms: Is the specific causal conclusion warranted by the experiment true in a target situation? What will be the result of implementing the treatment there? This paper explains how the probabilistic theory of causality implies that RCTs can establish causal conclusions and thereby provides an account of what exactly that causal conclusion is. Clarifying the exact form of the conclusion shows just what is necessary for it to hold in a new setting and also how much more is needed to see what the actual outcome would be there were the treatment implemented.


1 Introduction

Randomized controlled trials (RCTs) are now the gold standard for causal inference in medicine and are fast becoming the gold standard in social science as well, especially in social policy. But what exactly does an RCT establish? To answer this question I turn to work from long ago by Suppes ( 1970 ), Skyrms ( 1980 ), myself (Cartwright 1983 , 1989 ) and others on the probabilistic theory of causality. Footnote 1 Given this theory plus a suitable definition of an ideal RCT, it is possible to prove trivially that from positive results in an RCT a causal conclusion can be deduced.

In the social sciences it is usual to talk about experiments or studies in terms of their internal versus external validity . A study that is internally valid is one that confers a high probability of truth on the result of the study. For instance RCTs are designed to establish causal conclusions and in the ideal, the design itself ensures that a positive result in the experiment confers a high probability on the causal conclusion. External validity has to do with whether the result that is established in the study will be true elsewhere.

I believe that the language of external validity obscures some important distinctions, distinctions that matter significantly when RCT results are offered in evidence that the cause established in the experiment will produce the desired effect in some target situation. As I see it there are two distinct issues conflated into one: 1. Do the RCT results travel as RCT results to the target situation? 2. What relevance has an RCT result in the target situation for predicting what will happen there when the cause is implemented, as it will be, without the experimental constraints and in an environment where many other causes may be at work as well? Seeing the roots of RCTs in the probabilistic theory of causality will help make these distinctions clear and reveal the strong assumptions that must be defended if external validity is to be warranted.

2 The probabilistic theory of causality

There is considerable contention about exactly how to formulate the probabilistic theory of causality. The fundamental idea is that probabilistic dependencies must have causal explanations. Take proper account of all the reasons deriving from an underlying causal structure that C and E might be probabilistically dependent. Then C and E are related as cause and effect just in case they are probabilistically dependent. The probabilistic theory takes account of the other possible causal reasons for C and E to be dependent by conditioning on some set of specially selected factors, which in my case was supposed to be a full set of causes of E (simultaneous with C or earlier than C) other than C itself. In the case of dichotomous variables, which is all I shall consider here for simplicity, this leads to the following formula: for an event-type C temporally earlier than event-type E,

C causes E iff P(E/C&K_i) > P(E/¬C&K_i)

Here K_i is a state description Footnote 2 over the specially selected factors. You will notice that K_i is dangling on the right-hand-side of this formula, making it ill formed. I shall return to this below.

The idea behind the use of the partial conditional probability is that any dependencies between C and E not due to a direct causal link between them must instead be due to a correlation between both C and E and some further factor, often called a confounding factor . Conditioning on these confounding factors will break the correlation between them and anything else; any remaining dependencies between C and E must then be due to a direct causal link between them. This is a standard procedure in the social sciences in testing for causality from observational data—by ‘stratifying’ before looking for dependencies. Formally this depends on what is called Simpson’s paradox : A probabilistic dependency (independency) between two factors in a population may turn into a probabilistic independency (dependency) within each subpopulation partitioned along the values of a factor that is probabilistically dependent on the two original. Footnote 3
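
A small numerical illustration of that reversal, with invented counts: within each stratum of the confounding factor, C raises the probability of E, yet in the pooled data the dependency runs the other way because C is concentrated in the low-risk stratum.

```python
# Invented counts over (stratum K, exposure C, outcome E): counts[k][c] = (E=1, E=0)
counts = {
    "K=0": {"C=1": (18, 72), "C=0": (1, 9)},    # P(E|C)=0.20 > P(E|not C)=0.10
    "K=1": {"C=1": (8, 2),   "C=0": (63, 27)},  # P(E|C)=0.80 > P(E|not C)=0.70
}

def p_e_given_c(c, strata):
    """P(E=1 | C=c) pooled over the given strata."""
    e1 = sum(counts[k][c][0] for k in strata)
    n = sum(sum(counts[k][c]) for k in strata)
    return e1 / n

for k in counts:  # within each stratum, C raises the probability of E ...
    print(k, p_e_given_c("C=1", [k]), ">", p_e_given_c("C=0", [k]))

# ... but pooled over strata the dependency reverses, because C is
# concentrated in the low-risk stratum K=0:
print("pooled:", p_e_given_c("C=1", counts), "<", p_e_given_c("C=0", counts))
```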

Skyrms’s proposal is the most directly responsive to this idea. He argued that the set of selected factors to condition on should include all and only factors with a temporal index prior to or simultaneous with C that are probabilistically dependent on E (Skyrms 1980 ). I maintained that Skyrms’s proposal would not catch enough factors (Cartwright 1983 ). Wesley Salmon had argued that a cause can decrease the probability of its effect using an example in which a strong cause and a weak cause were anticorrelated: Whenever the weak cause was present the strong cause was absent so that the probability of the effect went down whenever the weak cause was present (Salmon et al. 1971 ). One need only adjust the numbers to construct a case in which the probability of the effect is the same with the weak cause as with the strong cause. In cases like this the effect will be probabilistically dependent on neither the weak cause nor the strong cause. So neither will appear in the list of selected factors to condition on before looking for a dependency between the other and the effect and thus neither will get counted a cause under Skyrms’s proposal. Footnote 4

The only solution I have ever been able to see to this problem is to require that the selected factors for conditioning on before looking for dependencies between C and E be a full set of causal factors for E other than C, where what constitutes ‘a full set’ is ticklish to define. Footnote 5

My proposal of course is far less satisfactory than Skyrms’s. First, it uses the notion of causality on the right-hand-side in the characterisation and hence the characterisation cannot provide a reductive definition for causation. Second, a direct application of the formula seems to require a huge amount of antecedent causal knowledge before probabilistic information about dependencies between C and E can be used to determine if there is a causal link between them. The RCT is designed specifically to finesse our lack of information about what other causes can affect E. Before turning to that, however, we need some further consideration of this formula.

What is to be done about the dangling K_i? There are two obvious alternatives. The first is to put a universal quantifier in front: for all i. This means that we will not say that C causes E unless C raises the probability of E in every arrangement of confounding factors. This makes sense just in case the cause exhibits what John Dupré called contextual unanimity: The cause either raises, lowers or leaves the same the probability of the effect in every arrangement of confounding factors (Dupré 1984). Where contextual unanimity fails, it is more reasonable to adopt the second alternative: relativize the left-hand-side causal claim to K_i:

Probabilistic causality: C causes E in K_i iff P(E/C&K_i) > P(E/¬C&K_i); and for any population A, C causes E in A iff C causes E in some K_i that is a subset of A. Footnote 6

This allows us to make more specific causal judgements. It also allows us to say that C may both cause E and prevent E (say, cause ¬E) in one and the same population, as one might wish to say about certain anti-depressants that can both heighten and diminish depression in teenagers. It is especially important when it comes to RCTs where the outcomes average over different arrangements of confounding factors so that the cause may increase the probability of the effect in some of these arrangements and decrease it in others and still produce an increase in the average.
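
As a concrete (and entirely invented) illustration of this definition, the sketch below reads the verdict off a hypothetical joint distribution over C, E and two confounder arrangements K0 and K1: C causes E in K0, prevents it in K1, and therefore counts as causing E in the whole population under the definition above.

```python
# Hypothetical joint distribution P(K, C, E) over two confounder arrangements.
# Keys are (k, c, e) with c, e in {0, 1}; the numbers are invented and sum to 1.
P = {
    ("K0", 1, 1): 0.06, ("K0", 1, 0): 0.14,
    ("K0", 0, 1): 0.03, ("K0", 0, 0): 0.27,
    ("K1", 1, 1): 0.05, ("K1", 1, 0): 0.20,
    ("K1", 0, 1): 0.10, ("K1", 0, 0): 0.15,
}
assert abs(sum(P.values()) - 1.0) < 1e-9

def p_e_given(c, k):
    """P(E=1 | C=c, K=k) read off the joint distribution."""
    return P[(k, c, 1)] / (P[(k, c, 1)] + P[(k, c, 0)])

def causes_in(k):
    """'C causes E in K_i' under the probabilistic theory."""
    return p_e_given(1, k) > p_e_given(0, k)

def prevents_in(k):
    """'C prevents E in K_i' (i.e. causes not-E)."""
    return p_e_given(1, k) < p_e_given(0, k)

def causes_in_population(strata):
    """'C causes E in A' iff C causes E in some K_i contained in A."""
    return any(causes_in(k) for k in strata)

print(causes_in("K0"), prevents_in("K1"))   # True True
print(causes_in_population(["K0", "K1"]))   # True: C causes E in the population
```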

Over the years I, along with others, have noticed a number of other problems with this formula:

When a confounding factor D can be produced by C in the process of C’s producing E but can also occur for independent reasons, D should be conditioned on just in the cases where D is not part of the causal process by which C produces E (Cartwright 1989 ).

When a probabilistic cause produces two effects in tandem, the effects will be dependent on each other even once the joint cause has been conditioned on. In this case the conditioning factors for deciding if C causes E need to include a dummy variable that takes value 1 just in case C has operated to produce the paired effect and the value 0 otherwise (Cartwright 1989 , 2007 ).

If a common effect of two separate causes is ‘over-represented’ in the population, the two causes of that effect will typically be probabilistically dependent. This means that the selected factors for conditioning on must not include common effects like this—so we must not condition on too much.

Sometimes quantities are probabilistically dependent with no causal explanation. The one widely recognized case of this is when two quantities both change monotonically in time. Say they both increase. Then high values of one will be probabilistically dependent on high values of the other. Vice versa if they both decrease. And if one increases and the other decreases, high values of one will be dependent on low values of the other.

A standard solution to this problem in practice is to detrend the data. This involves defining two quantities whose values at any time are essentially the values of the original quantities minus the change due to trend. This does not rescue the formula for probabilistic causality, however, unless we want further elaboration: If there is a dependence between C and E due to trend, then C causes E iff P(E′/C′&K_i) > P(E′/¬C′&K_i), where E′, C′ are new quantities defined by detrending C and E. The trick of course is to know when to detrend and when not, since a correlation in time between two monotonically changing quantities can always be due to one causing the other.

One and the same factor may both cause and prevent a given effect by two different paths. If the effect is equally strong along both paths, the effect will not be probabilistically dependent on the cause. A standard solution in practice in this case is to condition on some factor in each of the other paths in testing for a remaining path. Again, a direct application of this strategy requires a great deal of background knowledge.

Given these kinds of problems, how should the formula be amended? I think the only way is by recognizing that at this very general level of discussion we need to revert to a very general formulation. We may still formulate the probabilistic theory in the same way, but now we must let K_i designate a population in which all other reasons that account for dependencies or independencies between C and E have been properly taken into account.

Nor should we be dispirited that this seems hopelessly vague. It is not vague but general. Once a specific kind of causal structure has been specified, it is possible to be more specific about exactly what features of that causal structure can produce dependencies and independencies. Footnote 7

3 RCTs and the probabilistic theory of causality

RCTs have two wings—a treatment group of which every member is given the cause under test and a control group, where any occurrences of the cause arise ‘naturally’ and which may receive a placebo. In the design of real RCTs three features loom large:

Blindings of all sorts. The subjects should not know if they are receiving the cause or not; the attendant physicians should not know; those identifying whether the effect occurs or not in an individual should not know; nor should anyone involved in recording or analyzing the data. This helps ensure that no differences slip in between treatment and control wings due to differences in attitudes, expectations or hopes of anyone involved in the process.

Random assignment of subjects to the treatment or control wings. This is in aid of ensuring that other possible reasons for dependencies and independencies between cause and effect under test will be distributed identically in the treatment and control wings; this helps deal not only with ‘other’ causal factors of E but also with the other specific problems I mentioned for formulating the probabilistic theory at the end of Sect.  2 except for the last.

Careful choice of a placebo to be given to the control, where a placebo is an item indiscernible for those associated with the experiment from the cause except for being causally inert with respect to the targeted effect. This is supposed to ensure that any ‘psychological’ effects produced by the recognition that a subject is receiving the treatment will be the same in both wings.

These are all in aid of bringing the real RCT as close as possible to an ideal RCT . Roughly, an RCT is ideal iff all factors that can produce or eliminate a probabilistic dependence between C and E are the same in both wings except for C, which each subject in the treatment group is given and no-one in the control wing is given, and except for factors that C produces in the course of producing E, whose distribution differs between the two groups only due to the action of C in the treatment wing. An outcome in an RCT is positive if P(E) in the treatment wing >P(E) in the control wing.

As before, designate state descriptions over factors in the experimental population that produce or eliminate dependencies between C and E by K_i. In an ideal RCT each K_i will appear in both wings with the same probability, w_i. Then P(E) in the treatment wing = ∑ w_i P(E/K_i) in the treatment wing, and P(E) in the control wing = ∑ w_i P(E/K_i) in the control wing. So a positive outcome occurs only if P(E/K_i) in the treatment wing > P(E/K_i) in the control wing for some K_i. This in turn can only happen if P(E/C&K_i) > P(E/¬C&K_i) for some i. So a positive outcome in an ideal RCT occurs only if C causes E in some K_i by the probabilistic theory of causality. Footnote 8
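
The averaging argument can be checked numerically. In the sketch below the stratum weights w_i and the within-stratum probabilities are invented; the only point is that the wing-level probabilities are the weighted averages just described, so a positive wing-level difference is possible only if the difference is positive within at least one stratum.

```python
import numpy as np

# Invented ingredients of an ideal RCT: the stratum weights w_i are the same
# in both wings (that is what randomization buys); only exposure differs.
w         = np.array([0.5, 0.3, 0.2])     # P(K_i), identical in both wings
p_e_treat = np.array([0.30, 0.20, 0.60])  # P(E | C, K_i)
p_e_ctrl  = np.array([0.10, 0.40, 0.60])  # P(E | not-C, K_i)

p_treatment_wing = w @ p_e_treat  # sum_i w_i * P(E|K_i) in the treatment wing
p_control_wing   = w @ p_e_ctrl   # sum_i w_i * P(E|K_i) in the control wing
print(round(p_treatment_wing, 2), round(p_control_wing, 2))  # 0.33 vs 0.29

# The wing-level difference is a weighted average of stratum-level differences,
# so it can only be positive if the difference is positive in some stratum:
print(round(w @ (p_e_treat - p_e_ctrl), 2), (p_e_treat > p_e_ctrl).any())  # 0.04 True
```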

The RCT is neat because it allows us to learn causal conclusions without knowing what the possible confounding factors actually are. By definition of an ideal RCT, these are distributed equally in both the treatment and control wing, so that when a difference in probability of the effect between treatment and control wings appears, we can infer that there is an arrangement of confounding factors in which C and E are probabilistically dependent and hence in that arrangement C causes E because no alternative explanation is left. It is of course not clear how closely any real RCT approximates the ideal. I will not go into these issues here, however, despite their importance. Footnote 9

Notice that a positive outcome does not preclude that C causes E in some subpopulation of the experimental population and also prevents E in some other. Again, certain anti-depressants are a good example here. They have positive RCT results and yet are believed to be helpful for some teenagers and harmful for others. Footnote 10

4 Causal principles: from experimental to target populations

There are two immediate models for exporting the causal conclusion of an RCT to a new situation involving a new population. Reconstructing from the suppositions made and the surrounding discussion across a number of cases in different fields, I would say both fit common practice, which more often than not seems a mish-mash of the two. One is the physics model that I have developed in my work on capacities (Cartwright 1989): Suppose the cause has some (relatively) invariant capacity; that is, the cause always makes some fixed contribution that affects the final outcome in a systematic way. For example, the gravitational attraction on a mass m associated with a second mass M always contributes an acceleration to m in the direction of M of size GM/r², and this always adds vectorially with the contribution of other sources of acceleration acting on m.

I have been resurrecting my older work on capacities recently because something like this model often seems to be assumed, albeit implicitly, in the use of RCTs as evidence for predicting policy outcomes. RCTs are treated much like what I call ‘Galilean experiments’. These strip—or calculate—away all ‘interfering’ factors. The idea is that what the cause produces on its own, without interference is what it will contribute elsewhere. The experiment measures the contribution assuming there is a contribution to measure; that is, assuming there is a stable result that contributes in the same systematic way across broad ranges of circumstances. To know that takes a huge amount of further experience, experiment, and theorizing. RCTs can measure the (average) contribution a treatment makes—if there is a stable contribution to be measured. But they are often treated as if the results can be exported in the way that Galileo’s results could, without the centuries of surrounding work to ensure there is any stable contribution to be measured. Since I have written in detail about RCTs and capacities elsewhere (Cartwright, forthcoming a , b ), I will not pursue the topic here. I bring it up only because the logic of capacities may be implicated in the second model as well, as I shall explain in Sect. 5 .

The second model exports the causal principle established in the experimental population A directly to the new population A’ in the new situation. Under what conditions is this inference warranted? We shall need a major modification later but right now it seems reasonable to propose:

Preliminary rule for exporting causal conclusions from RCTs. If one of the K_i that is a subset of A such that C causes E in K_i is a subset of A’, then C causes E in A implies C causes E in A’ under the probabilistic theory of causality.

But how do we know when the antecedent of this rule obtains? Recall, the beauty of the RCT is that it finesses our lack of knowledge of what exactly the confounding factors are that go into the descriptions K_1, …, K_m. So in general we do not even know how to characterise the various K_i, let alone know how to identify which K_i are the ones where C is causally positive, let alone know how to figure out whether that description fits some subpopulation of the target population A’.

In some situations matters are not so bad:

If the experimental population is a genuinely representative sample of the target, then all the weights w_i will be the same in the target and the experimental population. Any uncertainty about whether it is truly representative will transfer to the antecedent of our rule unless there is otherwise good reason to think the specific K_i’s in which the causal principle holds are present in the target.

If the cause is contextually unanimous, the antecedent clearly applies. But since many causes are not contextually unanimous, this cannot be assumed without argument.

There is another worry at a more basic level that I have so far been suppressing. Two populations may both satisfy a specific description K_i but not be governed by the same causal principles. When we export a causal principle from one population satisfying a description K_i to another that satisfies exactly the same description, we need to be sure that the two populations share the same causal structure. What is a causal structure? I have argued that there are a variety of different kinds of causal structures at work in the world around us; different causal structures have different formal characterizations (Cartwright 1999, 2007). Given the probabilistic theory of causality, clearly what matters is this:

Two populations share the ‘same causal structure with respect to the causal principle “C causes E”’ from the point of view of the probabilistic theory of causality iff the two populations share the same reasons for dependencies to appear or disappear between C and E (i.e. the same choice of factors from which to form the state descriptions K_i) and the same conditional probabilities of E given C in each. Footnote 11

Sometimes this seems easy. I have heard one famous advocate of RCTs insist that for the most part Frenchmen are like Englishmen—we do not need to duplicate RCTs in France once we have done them in England. He of course realizes that this may not be true across all treatments and all effects. And we know that superficial similarities can be extremely misleading. I carry my pound coins from Britain to the US and put them into vending machines that look identical to those in Britain, but they never produce a packet of crisps for me in the US.

Clearly the rule of export needs amendment:

Rule for exporting the causal conclusion C causes E from an RCT. If populations A and A’ have the same causal structure relative to “C causes E” and if one of the K_i that is a subset of A such that C causes E in K_i is a subset of A’, then C causes E in A implies C causes E in A’ under the probabilistic theory of causality.

The lesson to be learned is that although (ideal) RCTs are excellent at securing causal principles, there is a very great deal more that must be assumed—and defended—if the causal principles are to be exported from the experimental population to some target population. Advice on this front tends to be very poor indeed however. For instance the US Department of Education website teaches that two successful well-conducted RCTs in ‘typical’ schools or classrooms ‘like yours’ are ‘strong’ evidence that a programme will work in your school/classroom (U.S. Department of Education 2003 ). The great advantage of a formal treatment is that it can give content to this uselessly vague advice. From the point of view of the probabilistic theory of causality, ‘like yours’ must mean

Has the same causal structure

Shares at least one K_i subpopulation in which the programme is successful.

Unfortunately, these two conditions are so abstract that they do not give much purchase on how to decide whether they obtain or not. Nevertheless, any bet that a causal principle does export to your population is a bet on just these two assumptions.

5 From causal principles to policy predictions

Consider a target population A for which we are reasonably confident that the causal principle ‘C causes E in A’ obtains. How do we assess the probability that E would result if C were introduced? Let’s take a nice case first. Suppose we have tested for ‘C causes E’ in a very good RCT where the experimental population was collected just so as to make it likely that it was representative of the target. Then we can assume that P(E) if C were introduced will equal ∑ w_i P(E/K_i) in the treatment wing = P(E) in the treatment wing. Or can we? Not in general. We can if all three of the following assumptions are met:

C and C alone (plus anything C causes in the process of causing E) is changed under the policy. Footnote 12

C is introduced as in the experiment—the C’s introduced by policy are not correlated with any other reasons for probabilistic dependencies and independencies between C and E to appear or disappear in the target.

The introduction of C’s leaves the causal structure unchanged.

These are heavy demands.

If the RCT population is not a representative sample of the target matters are more difficult. Besides the three assumptions above we need to worry about whether the target contains the subpopulations in which C is causally positive before it will be true at all that C causes E there. That however does not ensure that introducing C, even as described in our three assumptions, will increase the probability of E since the target may also contain subpopulations where C is causally negative, and these may outweigh the positive ones. If C is contextually unanimous with respect to E this concern disappears. But, to repeat the earlier warning, contextual unanimity is not universal and a lot of evidence and argument are required to support it.
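
A toy calculation, again with invented numbers, of the worry about re-weighting: identical within-stratum effects can average out to a benefit in the experimental population and to a harm in a target population that mixes the strata differently.

```python
import numpy as np

# Invented within-stratum risk differences for C: positive in one stratum,
# negative in another, null in a third.
effect_by_stratum = np.array([0.20, -0.20, 0.00])  # P(E|C,K_i) - P(E|not-C,K_i)

w_experiment = np.array([0.5, 0.3, 0.2])  # stratum mixture in the RCT population
w_target     = np.array([0.1, 0.7, 0.2])  # stratum mixture in the policy target

print("average effect in the experiment:", round(w_experiment @ effect_by_stratum, 2))  # 0.04
print("average effect in the target:    ", round(w_target @ effect_by_stratum, 2))      # -0.12
# The within-stratum causal facts are identical; only the mixture of strata
# differs, and the strata where C is causally negative now outweigh the rest.
```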

Finally, it is usual in policy settings to violate all three of the above conditions. Implementations usually change more than just the designated causal factor [e.g. in California, when class sizes were reduced, teacher quality also went down because there were not enough qualified teachers for all the new classes (Blatchford 2003 )]; the changes themselves are often correlated with other factors [e.g. people who take up job training programmes may tend to be those already more prone to benefit from them (Heckman 1991 )]; introducing the cause may well undermine the very causal principle that predicts E will result (e.g. the Chicago School of economics maintained that this is typical in economic policy Footnote 13 ).

At least with respect to the first two of these worries I notice that there is a tendency to begin to think in terms of my physics model of capacities. Even if more than C is changed at least we can rely on the influence of C itself to be positive. In this case we need to pay close attention only to changes that might introduce strongly negative factors (as in the California class-size reduction programme) or undermine the causal structure itself (like banging on the vending machine with a sledge hammer).

The assumption here is a strong version of contextual unanimity: Whatever context C is in, it always contributes positively. It is stronger even than the cases of capacities I have studied, where the assumption is that whatever context C is in, it always makes the same contribution that affects the results in the same systematic way . Consider the vector addition of accelerations contributed by various different causes, described in Sect.  3 . We are so familiar with vector addition that we sometimes forget that it is not the same as the simple linear addition supposed in saying that if C causes E in some K i then C will always contribute positively. After all, a magnet pulling up will decrease the acceleration of a falling body not increase it. In any case, the point is that the assumption that at least C itself contributes positively is a strong one, and like all the rest, needs good arguments to back it up.

RCTs establish causal claims. They are very good at this. Indeed, given the probabilistic theory of causality, it follows formally that positive results in an ideal RCT with treatment C and outcome E deductively imply ‘C causes E in the experimental population’. Though the move from the RCT to a policy prediction that C will cause E when implemented in a new population often goes under the single label, the external validity of the RCT result, this label hides a host of assumptions that we can begin to be far clearer and more explicit about. Footnote 14 To do so, I think it is useful to break the move conceptually into two steps.

First is the inference that ‘C causes E in the target population’ or in some subpopulation of the target. The probabilistic theory of causality makes clear, albeit at a highly abstract level, just what assumptions are required to support this move. Unfortunately, they are very strong assumptions so one must make this move with caution.

Given the interpretation of the causal claim supplied by the probabilistic theory, this first step is essentially a move to predict what would happen in a new RCT in the target population, or a subpopulation of it, from what happens in an RCT in a different population. If we follow standard usage and describe RCT results as efficacy results, the inference here is roughly from efficacy in the experimental population to efficacy in the target. That is a far cry from what is often described as an effectiveness result for the target: a claim that C will actually result in E when implemented there.

This comprises the second step: the move from ‘C causes E in the target population’ to ‘C will result in E if implemented in this or that way’. Julian Reiss and I have argued, both jointly and separately (Reiss 2007 ; Cartwright 1999 ), that the best way to evaluate effectiveness claims is by the construction of a causal model, where information about the behaviour of C in an RCT is only one small part of the information needed to construct the model. We need as well a great deal of information about the target, especially about the other causally relevant factors at work there, how they interact with each other and with C, and how the existing causal structure might be shifted during policy implementation. I have not rehearsed this argument here but rather, in keeping with my starting question of what RCTs can do for us vis-à-vis policy predictions, I have laid out some assumptions that would allow a more direct inference from ‘C causes E in the experimental population’ to ‘C will result in E if implemented in this or that way’. Again, these are very strong and should be accepted only with caution.

Throughout I have used the probabilistic theory of causality and at that, only a formulation of the theory for dichotomous variables. Footnote 15 This is not the only theory of causality making the rounds by a long shot. But it is a theory with enough of the right kind of content to show just why RCTs secure internal validity and to make clear various assumptions that would support external validity. Something similar can be reconstructed for the counterfactual theory and for Judea Pearl’s account that models causal laws as linear functional laws, with direction, adding on Bayes-nets axioms (Pearl 2000 ). A good project would be to lay out the assumptions for various ways of inferring policy predictions from RCTs on all three accounts, side-by-side, so that for any given case one could study the assumptions to see which, if any, the case at hand might satisfy—remembering always that if causal conclusions are to be drawn, it is important to stick with the interpretation of the conclusion supplied by the account of causality that is underwriting that conclusion!

My overall point, whether one uses the probabilistic theory or some other, is that securing the internal validity of the RCT is not enough. That goes only a very short way indeed towards predicting what the cause studied in the RCT will do when implemented in a different population. Of course all advocates of RCTs recognize that internal validity is not external validity. But the gap is far bigger than most let on.

There are other routes to RCTs from other accounts of causality. The counterfactual account is noteworthy here (see especially Cartwright ( 1989 ) and Rubin ( 1974 ) and discussions thereof), but the link there requires more heroic assumptions than the link from the probabilistic theory.

A state description over factors \(A_1, \ldots, A_n\) is a conjunction of n conjuncts, one for each \(A_i\), with each conjunct either \(A_i\) or \(\neg A_i\).

See discussion in Cartwright (1983).

The assumption that causes and effects are always probabilistically dependent is sometimes called ‘faithfulness’ in the causal Bayes-nets literature. Some authors argue that violations are rare. Others—like Kevin Hoover and me—argue the contrary: many systems are designed or evolved to ensure causes cancel. I shall not enter this debate here, however. For further discussion and references, see Cartwright (2007).

See the definitions in Cartwright (1989, p. 112) and in Cartwright (2007, 1976, p. 64, footnote 8). The task is relatively easy if one can take the notion of a causal path as primitive. In that case a full set of causes of E contains exactly one factor from each path into E.

It need not be a proper subset.

For examples, see Pearl’s linear causal structures (Pearl 2000) or my representations for structures in which causes act irreducibly probabilistically (Cartwright 1989).

As above, this argument sketch can be filled in more precisely once a particular kind of underlying causal structure is supposed. (The argument also supposes that the causal structure is the same in both treatment and control wings.)

But see for instance Altman (1996) and Worrall (2002).

See for instance the U.S. Food and Drug Administration medication guide at www.fda.gov/cder/drug/antidepressants/SSRIMedicationGuide.htm.

Or at least the same facts about whether the probability of E is greater conditioned on C as opposed to ¬C.

This is a special case of the next requirement; I write it separately to highlight it.

Cf. the famous Lucas critique of policy reasoning (Lucas 1976).

For further discussion see also Cartwright (1976) and Cartwright and Efstathiou (2007).

I personally am not happy with any extant general formulations for multivalued or continuous variables. That is connected with my view that a proper formulation must be relativised to a specific kind of system of causal laws; for different systems, different formulations will be appropriate.

Altman, D. G. (1996). Editorials: Better reporting of randomised controlled trials: The CONSORT statement. British Medical Journal, 313, 570–571.

Blatchford, P. (2003). The class size debate: Is smaller better? Philadelphia: Open University Press.

Cartwright, N. (1983). How the laws of physics lie. New York: Oxford University Press.

Cartwright, N. (1989). Nature’s capacities and their measurement. Oxford: Oxford University Press.

Cartwright, N. (1999). The dappled world: A study of the boundaries of science. Cambridge: Cambridge University Press.

Cartwright, N. (2007). Hunting causes and using them: Approaches in philosophy and economics. New York: Cambridge University Press.

Cartwright, N. (forthcoming ‘a’). What is this thing called ‘efficacy’? In C. Mantzavinos (Ed.), Philosophy of the social sciences: Philosophical theory and scientific practice. Cambridge University Press.

Cartwright, N. (forthcoming ‘b’). Evidence-based policy: What’s to be done about relevance? Talk given at Oberlin College Colloquium, April 2008.

Cartwright, N., & Efstathiou, S. (2007, August). Hunting causes and using them: Is there no bridge from here to there? Paper presented at the First Biennial Conference of the Philosophy of Science in Practice, Twente University.

Dupré, J. (1984). Probabilistic causality emancipated. Midwest Studies in Philosophy, 9, 169–175.

Heckman, J. J. (1991). Randomization and social policy evaluation. NBER Working Paper No. T0107.

Lucas, R. E. (1976). Econometric policy evaluation: A critique. Carnegie Rochester Conference Series on Public Policy, 1, 19–46.

Pearl, J. (2000). Causality: Models, reasoning and inference. Cambridge: Cambridge University Press.

Reiss, J. (2007). Error in economics: The methodology of evidence-based economics. London: Routledge.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701.

Salmon, W., Jeffrey, R., & Greeno, J. (Eds.). (1971). Statistical explanation and statistical relevance. Pittsburgh: Pittsburgh University Press.

Skyrms, B. (1980). Causal necessity. New Haven, USA: Yale University Press.

Suppes, P. (1970). A probabilistic theory of causality. Amsterdam: North-Holland Publishing Company.

U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance. (2003). Identifying and implementing educational practices supported by rigorous evidence: A user friendly guide. http://www.ed.gov/rschstat/research/pubs/rigorousevid/rigorousevid.pdf. Accessed 29 August 2008.

Worrall, J. (2002). What evidence in evidence-based medicine? Philosophy of Science, 69(3), Supplement: Proceedings of the 2000 Biennial Meeting of the Philosophy of Science Association, Part II: Symposia papers, S316–S330.

Acknowledgements

I would like to thank Chris Thompson for his help, the editors and referees for useful suggestions, and the Spencer Foundation and the UK Arts and Humanities Research Council for support.

Author information

Nancy Cartwright

Centre for the Philosophy of Natural and Social Sciences, London School of Economics, Houghton Street, WC2A 2AE, London, UK

Department of Philosophy, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093-0119, USA

Correspondence to Nancy Cartwright.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cartwright, N. What are randomised controlled trials good for? Philosophical Studies, 147, 59–70 (2010). https://doi.org/10.1007/s11098-009-9450-2

  • Randomized controlled trials (RCTs)
  • External validity
  • Probabilistic theory of causality
  • Causal inference
  • Contributions


Causation in Statistics: Hill’s Criteria

By Jim Frost

Causation indicates that an event affects an outcome. Do fatty diets cause heart problems? If you study for a test, does it cause you to get a higher score?

In statistics , causation is a bit tricky. As you’ve no doubt heard, correlation doesn’t necessarily imply causation. An association or correlation between variables simply indicates that the values vary together. It does not necessarily suggest that changes in one variable cause changes in the other variable. Proving causality can be difficult.

If correlation does not prove causation, what statistical test do you use to assess causality? That’s a trick question because no statistical analysis can make that determination. In this post, learn about why you want to determine causation and how to do that.

Relationships and Correlation vs. Causation

The expression is, “correlation does not imply causation.” Consequently, you might think that it applies to things like Pearson’s correlation coefficient . And, it does apply to that statistic. However, we’re really talking about relationships between variables in a broader context. Pearson’s is for two continuous variables . However, a relationship can involve different types of variables such as categorical variables , counts, binary data, and so on.

For example, in a medical experiment, you might have a categorical variable that defines which treatment group subjects belong to—control group, placebo group, and several different treatment groups. If the health outcome is a continuous variable, you can assess the differences between group means. If the means differ by group, then you can say that mean health outcomes depend on the treatment group. There’s a correlation, or relationship, between the type of treatment and health outcome. Or, maybe we have the treatment groups and the outcome is binary, say infected and not infected. In that case, we’d compare group proportions of the infected/not infected between groups to determine whether treatment correlates with infection rates.

Through this post, I’ll refer to correlation and relationships in this broader sense—not just literal correlation coefficients . But relationships between variables, such as differences between group means and proportions, regression coefficients , associations between pairs of categorical variables , and so on.

Why Determining Causality Is Important

If you’re only predicting events, not trying to understand why they happen, and do not want to alter the outcomes, correlation can be perfectly fine. For example, ice cream sales correlate with shark attacks. If you just need to predict the number of shark attacks, ice cream sales might be a good thing to measure even though they aren’t causing the shark attacks.

However, if you want to reduce the number of attacks, you’ll need to find something that genuinely causes a change in the attacks. As far as I know, sharks don’t like ice cream!

There are many occasions where you want to affect the outcome. For example, you might want to do the following:

  • Improve health by using medicine, exercising, or getting flu vaccinations.
  • Reduce the risk of adverse outcomes, such as by using procedures for reducing manufacturing defects.
  • Improve outcomes, such as by studying for a test.

For intentional changes in one variable to affect the outcome variable, there must be a causal relationship between the variables. After all, if studying does not cause an increase in test scores, there’s no point in studying. If the medicine doesn’t cause an improvement in your health or ward off disease, there’s no reason to take it.

Before you can state that some course of action will improve your outcomes, you must be sure that a causal relationship exists between your variables.

Confounding Variables and Their Role in Causation

How does it come to be that variables are correlated but do not have a causal relationship? A common reason is a confounding variable that creates a spurious correlation. A confounding variable correlates with both of your variables of interest. It’s possible that the confounding variable might be the real causal factor! Let’s go through the ice cream and shark attack example.

In this example, the number of people at the beach is a confounding variable. A confounding variable correlates with both variables of interest—ice cream and shark attacks in our example.

Imagine that as the number of people at the beach increases, ice cream sales also tend to increase. In turn, more people at the beach cause shark attacks to increase. This structure creates an apparent, or spurious, correlation between ice cream sales and shark attacks, but it isn’t causation.

Confounders are common reasons for associations between variables that are not causally connected.
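To make that structure concrete, here is a small simulated sketch (all numbers are invented for illustration): beach crowds drive both ice cream sales and shark attacks, so the two outcomes correlate even though neither causes the other, and the association largely disappears once the confounder is included in a regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 1_000

# Confounder: number of people at the beach
beach_crowd = rng.normal(1000, 200, n)

# Both outcomes are driven by crowd size, not by each other
ice_cream_sales = 0.5 * beach_crowd + rng.normal(0, 50, n)
shark_attacks = 0.002 * beach_crowd + rng.normal(0, 0.5, n)

# Spurious correlation between the two outcomes
print(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])  # clearly positive

# Including the confounder in a regression removes most of the association
X = sm.add_constant(np.column_stack([ice_cream_sales, beach_crowd]))
print(sm.OLS(shark_attacks, X).fit().params[1])  # coefficient on ice cream sales close to 0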

Related post : Confounding Variables Can Bias Your Results

Causation and Hypothesis Tests

Before moving on to determining whether a relationship is causal, let’s take a moment to reflect on why statistically significant hypothesis test results do not signify causation.

Hypothesis tests are inferential procedures . They allow you to use relatively small samples to draw conclusions about entire populations. For the topic of causation, we need to understand what statistical significance means.

When you see a relationship in sample data, whether it is a correlation coefficient, a difference between group means, or a regression coefficient, hypothesis tests help you determine whether your sample provides sufficient evidence to conclude that the relationship exists in the population . You can see it in your sample, but you need to know whether it exists in the population. It’s possible that random sampling error (i.e., luck of the draw) produced the “relationship” in your sample.

Statistical significance indicates that you have sufficient evidence to conclude that the relationship you observe in the sample also exists in the population.

That’s it. It doesn’t address causality at all.

Related post : Understanding P-values and Statistical Significance

Hill’s Criteria of Causation

Determining whether a causal relationship exists requires far more in-depth subject area knowledge and contextual information than you can include in a hypothesis test. In 1965, Austin Hill, a medical statistician, tackled this question in a paper* that’s become the standard. While he introduced it in the context of epidemiological research, you can apply the ideas to other fields.

Hill describes nine criteria to help establish causal connections. The goal is to satisfy as many criteria as possible. No single criterion is sufficient. However, it’s often impossible to meet all the criteria. These criteria are an exercise in critical thought. They show you how to think about determining causation and highlight essential qualities to consider.

Studies can take steps to increase the strength of their case for a causal relationship, which statisticians call internal validity . To learn more about this, read my post about internal and external validity .

Strength and causation

A strong, statistically significant relationship is more likely to be causal. The idea is that causal relationships are likely to produce statistical significance. If you have significant results, at the very least you have reason to believe that the relationship in your sample also exists in the population—which is a good thing. After all, if the relationship only appears in your sample, you don’t have anything meaningful! Correlation still does not imply causation, but a statistically significant relationship is a good starting point.

However, there are many more criteria to satisfy! There’s a critical caveat for this criterion as well. Confounding variables can mask a correlation that actually exists. They can also create the appearance of correlation where causation doesn’t exist, as shown with the ice cream and shark attack example. A strong relationship is simply a hint.

Consistency and causation

When there is a real, causal connection, the result should be repeatable. Other experimenters in other locations should be able to produce the same results. It’s not one and done. Replication builds up confidence that the relationship is causal. Preferably, the replication efforts use other methods, researchers, and locations.

In my post with five tips for using p-values without being misled , I emphasize the need for replication.

Specificity

It’s easier to determine that a relationship is causal if you can rule out other explanations. I write about ruling out other explanations in my posts about randomized experiments and observational studies. In a more general sense, it’s essential to study the literature, consider other plausible hypotheses, and, hopefully, be able to rule them out or otherwise control for them. You need to be sure that what you’re studying is causing the observed change rather than something else of which you’re unaware.

It’s important to note that you don’t need to prove that your variable of interest is the only factor that affects the outcome. For example, smoking causes lung cancer, but it’s not the only thing that causes it. However, you do need to perform experiments that account for other relevant factors and be able to attribute some causation to your variable of interest specifically.

For example, in regression analysis , you control for other factors by including them in the model .

Temporality and causation

Causes should precede effects. Ensure that what you consider to be the cause occurs before the effect . Sometimes it can be challenging to determine which way causality runs. Hill uses the following example. It’s possible that a particular diet leads to an abdominal disease. However, it’s also possible that the disease leads to specific dietary habits.

The Granger Causality Test assesses potential causality by determining whether earlier values in one time series predict later values in another time series. Analysts say that time series A Granger-causes time series B when significant statistical tests indicate that values in series A predict future values of series B.

Despite being called a “causality test,” it really is only a test of prediction. After all, the increase of Christmas card sales Granger-causes Christmas!
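As a rough illustration (the series and numbers here are invented), statsmodels ships a ready-made Granger test that checks whether adding lags of a second series improves prediction of the first:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 200

# Toy data: "sales" leads "deliveries" by one period purely by construction
sales = rng.normal(size=n).cumsum()
deliveries = np.roll(sales, 1) + rng.normal(scale=0.5, size=n)

data = pd.DataFrame({"deliveries": deliveries, "sales": sales})

# Tests whether lags of the second column help predict the first column
grangercausalitytests(data[["deliveries", "sales"]], maxlag=2)
```

A significant result here says only that one series helps predict the other, which is exactly the caveat above.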

Temporality is just one aspect of causality!

Biological Gradient

Hill worked in medicine, hence the focus on biological questions. He suggests that for a genuinely causal relationship, there should be a dose-response type of relationship. If a little bit of exposure causes a little bit of change, a larger exposure should cause more change. Hill uses cigarette smoking and lung cancer as an example—greater amounts of smoking are linked to a greater risk of lung cancer. You can apply the same type of thinking in other fields. Does more studying lead to even higher scores?

However, be aware that the relationship might not remain linear. As the dose increases beyond a threshold, the response can taper off. You can check for this by modeling curvature in regression analysis .
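One way to check for that kind of tapering is to add a squared term to the regression; here is a hedged sketch with made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
dose = rng.uniform(0, 10, 300)

# Simulated response that rises and then tapers off at higher doses
response = 5 * dose - 0.4 * dose**2 + rng.normal(0, 2, 300)
df = pd.DataFrame({"dose": dose, "response": response})

# A significant negative coefficient on I(dose**2) suggests the
# dose-response relationship flattens out as the dose increases
fit = smf.ols("response ~ dose + I(dose**2)", data=df).fit()
print(fit.params)
```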

Plausibility

If you can find a plausible mechanism that explains the causal nature of the relationship, it supports the notion of a causal relationship. For example, biologists understand how antibiotics inhibit microbes on a biological level. However, Hill points out that you have to be careful because there are limits to scientific knowledge at any given moment. A causal mechanism might not be known at the time of the study even if one exists. Consequently, Hill says, “we should not demand” that a study meets this requirement.

Coherence and causation

The probability that a relationship is causal is higher when it is consistent with related causal relationships that are generally known and accepted as facts. If your results outright disagree with accepted facts, it’s more likely to be correlation. Assess causality in the broader context of related theory and knowledge.

Experiments and causation

Randomized experiments are the best way to identify causal relationships. Experimenters control the treatment (or factors involved), randomly assign the subjects, and help manage other sources of variation. Hill calls satisfying this criterion the strongest support for causation. However, randomized experiments are not always possible as I write about in my post about observational studies. Learn more about Experimental Design: Definition, Types and Examples .

Related posts : Randomized Experiments and Observational Studies

Analogy

If there is an accepted, causal relationship that is similar to a relationship in your research, it supports causation for the current study. Hill writes, “With the effects of thalidomide and rubella before us we would surely be ready to accept slighter but similar evidence with another drug or another viral disease in pregnancy.”

Determining whether a correlation also represents causation requires much deliberation. Properly designing experiments and using statistical procedures can help you make that determination. But there are many other factors to consider.

Use your critical thinking and subject-area expertise to think about the big picture. If there is a causal relationship, you’d expect to see consistent results that have been replicated, other causes have been ruled out, the results fit with established theory and other findings, there is a plausible mechanism, and the cause precedes the effect.

Austin Bradford Hill, “The Environment and Disease: Association or Causation?,” Proceedings of the Royal Society of Medicine , 58 (1965), 295-300.

Reader Interactions

December 2, 2020 at 9:06 pm

I believe there is a logical flaw in the movie “Good Will Hunting”. Specifically, in the scene where psychologist Dr. Sean Maguire (Robin Williams) tells Will (Matt Damon) about the first time he met his wife, there seems to be an implied assumption that if Sean had gone to “the game” (Game 6 of the World Series in 1975), instead of staying at the bar where he had just met his future wife, then the very famous home run hit by Carlton Fisk would still have occurred. I contend that if Sean had gone to the game, the game would have played out completely differently, and the famous home run which actually occurred would not have occurred – that’s not to say that some other famous home run could not have occurred. It seems to be clear that neither characters Sean nor Will understand this – and I contend these two supposedly brilliant people would have known better! It is certainly clear that neither Matt Damon nor Ben Affleck (the writers) understand this. What do you think?

August 24, 2019 at 8:00 pm

Hi Jim Thanks for the great site and content. Being new to statistics I am finding it daunting to understand all of these concepts. I have read most of the articles in the basics section and whilst I am gaining some insights I feel like I need to take a step back in order to move forward. Could you recommend some resources for a rank beginner such as my self? Maybe some books that you read when you where starting out that where useful. I am really keen to jump in and start doing some statistics but I am wondering if it is even possible for someone like me to do so. To clearly define my question where is the best place to start?? I realize this doesn’t really relate to the above article but hopefully this question might be useful to others as well. Thanks.

August 25, 2019 at 2:45 pm

I’m glad that my website has been helpful! I do understand your desire to get the big picture specifically for starting out. In just about a week, September 3rd to be exact, I’m launching a new ebook that does just that. The book is titled Introduction to Statistics: An Intuitive Guide for Analyzing Data and Unlocking Discoveries. My goal is to provide the big picture about the field of statistics. It covers the basics of data analysis up to larger issues such as using experiments and data to make discoveries.

To be sure that you receive the latest about this book, please subscribe to my email list using the form in the right column of every page in my website.

August 16, 2019 at 12:55 am

Jim , I am new to stats and find ur blog very useful. Yet , I am facing an issue of very low R square values , as low as 1 percent, 3 percent… do we still hold these values valid? Any references on research while accepting such low values . request ur valuable inputs please.

August 17, 2019 at 4:11 pm

Low R-squared can be a problem. It depends on several other factors. Are any independent variables significant? Is the F-test of overall significance significant?

I have posts about this topic and answers those questions. Please read: Low R-squared values and F-test of overall significance .

If you have further questions, please post them in the comments section of the relevant post. It helps keep the questions and answers organized for other readers. Thanks!

June 27, 2019 at 11:23 am

Thank you so much for your website. It has helped me tremendously with my stats, particularly regression. I have a question concerning correlation testing. I have a continuous dependent variable, quality of life, and 3 independent variables, which are categorical (education = 4 levels, marital status = 3 levels, stress = 3 levels). How can I test for a relationship among the dependent and independent variables? Thank you Jim.

June 27, 2019 at 1:30 pm

You can use either ANOVA or OLS regression to assess the relationship between categorical IVs to a continuous DV.

I write about this in my ebook, Regression Analysis: An Intuitive Guide . I recommend you get that ebook to learn about how it works with categorical IVs. I discuss that in detail in the ebook. Unfortunately, I don’t have a blog post to point you towards.

Best of luck with your analysis!

June 25, 2019 at 3:24 pm

great post, Jim. Thanks!

June 25, 2019 at 11:32 am

Useful post

June 24, 2019 at 4:51 am

Very nice and interesting post. And very educational. Many thanks for your efforts!

June 24, 2019 at 10:13 am

Thank you very much! I appreciate the kind words!

How to Prove Causation

When you can’t run an actual experiment, introduce pseudo-randomness.

Tony Yiu

Correlation is a really useful measure. It tells you that two variables tend to move together. It’s also one of the easiest things to calculate in statistics and data science. All you need is literally one line of code (or a simple formula in Excel) to calculate the correlation.

Machine learning models, both the predictive kind and the explanatory kind, are built with correlations as their foundations. For example, the usefulness of a forecasting model is based heavily on your ability to find and engineer some feature variables that are highly correlated with whatever it is you are trying to predict.

But correlation is not causation — I bet you’ve heard this before. A lot of times this doesn’t matter, but sometimes it matters a lot. It depends on what question you are trying to answer. If all you care about is prediction (like what will the stock market do next month?), then we don’t care much about the distinction between correlation and causation.

But if we’re trying to decide between several policy options to invest in, and we want the chosen policy to affect some sort of positive outcome, then we better be sure that it really will. In this case, we care greatly about causality. If we’re wrong and mistake correlation for true causation, we could end up wasting millions of dollars and years of effort.

Say, for instance, we observed a high correlation between hair loss and wealth. We would really regret it if we ripped all our hair out expecting money to start pouring into our bank accounts. This is an example of non-causal correlation — the issue is that there’s some other missing variable that’s the true driver that causes the correlation we observe. In our case, it might be because there were a lot of really stressed out entrepreneurs in our sample, and these people worked night and day, ultimately trading their hair and some of their health for a big payout.

So how can we measure causation?

The Ideal Way: Random Experiments

The purest way to establish causation is through a randomized controlled experiment (like an A/B test) where you have two groups — one gets the treatment, one doesn’t. The critical assumption is that the two groups are homogenous — meaning that there are no systematic differences between the two groups (besides one getting the treatment and the other not) that can bias the result.

If the group that gets the treatment reacts positively, then we know there is causation between the treatment and the positive effect that we observe. We know this because the experiment was carefully designed in a way that controls for all other explanatory factors besides the thing we are testing. So any observed difference (that’s statistically significant) between the two groups must be attributable to the treatment.
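A minimal sketch of that comparison, using simulated outcomes rather than real experimental data: because assignment is random, a simple difference in means (plus a significance test) estimates the treatment effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated outcomes: the treatment shifts the mean up by about 2 units
control = rng.normal(loc=50, scale=10, size=500)
treated = rng.normal(loc=52, scale=10, size=500)

# With randomization, the difference in means estimates the causal effect
effect = treated.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"estimated effect: {effect:.2f}, p-value: {p_value:.4f}")
```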

What if We Can’t Run an Experiment?

The problem is that, in reality, we often can’t run randomized controlled experiments. Most of the time, we only have empirical data to work with. Don’t get me wrong, empirical data is great, but it’s lacking when it comes to proving causation.

With empirical data, we often run into the chicken-and-the-egg problem. For example, at a previous company, I was tasked with a project to prove that our investment advisory service helped increase users’ savings rates. There was a strong correlation between signing up for our service and increased savings rates — people who signed up for our services were much more likely to increase savings than those who didn’t.

But correlation is not enough. Another plausible explanation is that those people who want to save more are the ones that sign up for our service. In that case, it’s not that our service helped them save more, but rather that signing up for our service was a byproduct of wanting to save more (the chicken-and-the-egg problem). So if this were true, then if a company paid for subscriptions to our advisory service for its employees, it would not see an increase in their savings rates (because it’s not causation).

Testing for Causation Using Pseudo-Randomness

So how do we get around this when there’s no way to run an actual experiment? We need to look for events that introduce pseudo-randomness.

Recall that the critical assumption that allows us to prove causation with an A/B test is that the two groups are homogenous. Thus, all differences in outcomes can be attributed to the treatment. So when we can’t run an experiment, we need to look for sub-periods or sub-portions of our data that happen to produce two homogenous groups that we can compare.

Let’s see what I mean by that through our earlier investment advisory service example.

In my case, fortunately there were a handful of companies that opted all their new employees into our service starting in 2014. I could compare how these new employees’ savings rates evolved over time relative to new employees in the years prior to 2014. The big assumption I’d be making here is that the pre- and post-2014 new employee cohorts at these companies were pretty similar across all the characteristics that mattered (such as age, education and salary).

Here, the employee’s job start date (whether it was before 2014 or not) is known as an instrumental variable. An instrumental variable is one that varies the probability of the treatment without altering the outcome. In other words, it successfully isolates the impact of the treatment across the two groups and creates a reasonable approximation of a randomized controlled experiment.

The new employees pre-2014 were indeed reasonably similar to the new employees that joined these companies in 2014 and after (yes, I checked). But the ones after 2014 were defaulted into our advisory service — this allowed me to compare two reasonably homogenous groups where the only major difference between groups was in whether they were defaulted into our service or not.
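A stripped-down sketch of that comparison (the column names and numbers below are invented; the real analysis would involve far more careful cohort construction): the difference in savings-rate changes between the pre- and post-2014 hire cohorts approximates the effect of being defaulted into the service.

```python
import pandas as pd
from scipy import stats

# Hypothetical employee-level data: cohort flag and change in savings rate
df = pd.DataFrame({
    "hired_2014_or_later": [0, 0, 0, 1, 1, 1, 0, 1, 1, 0],
    "savings_rate_change": [0.5, 1.0, 0.2, 2.1, 1.8, 2.5, 0.8, 1.9, 2.2, 0.4],
})

pre = df.loc[df["hired_2014_or_later"] == 0, "savings_rate_change"]
post = df.loc[df["hired_2014_or_later"] == 1, "savings_rate_change"]

# Start date shifts exposure to the service without (we assume) changing
# the kind of employee who joins, so the cohort gap isolates the default
print(post.mean() - pre.mean())
print(stats.ttest_ind(post, pre))
```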

It turned out that the post-2014 new employees did increase their savings at a higher rate than the pre-2014 cohort. As expected, the increase was less than that observed across the entire population. 

So yes, our service was definitely more attractive to those who already wanted to save more (there was a decent amount of non-causal correlation). But even after removing this effect through an instrumental variable, I still found a statistically significant difference in the savings rate increases of the two groups. And yes, my company’s service did help users increase their savings rates.

The Value of Determining Causality

Causation is never easy to prove. I got lucky that there was a feasible instrumental variable to use. But generally, good instrumental variables will not be easy to find — you will have to think creatively and really know your data well to uncover them.

But it can be worth it. When you are thinking of investing significant amounts of resources and time, it’s not enough to know that something is correlated to the effect you are after. You need to be reasonably certain that there’s a real causal relationship between the action you are thinking about taking and the effect that you desire.

Open access | Published: 21 August 2024

A randomized controlled experiment testing the use of virtual reality to trigger cigarette craving in people who smoke

  • Aitor Rovira 1 , 2 ,
  • Sinéad Lambe 1 , 2 ,
  • Helen Beckwith 1 , 2 ,
  • Memoona Ahmed 1 ,
  • Felicity Hudson 1 ,
  • Phoebe Haynes 1 ,
  • Chun-Jou Yu 3 ,
  • Kira Williams 1 ,
  • Simone Saidel 1 ,
  • Ellen Iredale 1 ,
  • Sapphira McBride 1 ,
  • Felicity Waite 1 , 2 ,
  • Xueni Pan 3 &
  • Daniel Freeman 1 , 2  

Scientific Reports, volume 14, Article number: 19445 (2024)

Automated delivery of therapy in virtual reality (VR) has the potential to be used for smoking cessation. Most obviously, it could be used to practise and establish alternative reactions to smoking cues. The first step in treatment development is to show that VR environments can trigger sufficient cravings in smokers. We evaluated a new VR public house outdoor scenario with 100 individuals who smoked daily. Participants were randomly assigned to the VR scenario with smoking cues or a neutral experience in VR. The VR experiences were presented in a standalone VR headset. Before and after VR, we collected self-reported craving scores for cigarettes and alcohol using the Tobacco Craving Questionnaire (TCQ) and visual analogue scales (VAS). Physiological data were also collected. Compared to the neutral condition, exposure to the smoking cues led to a large increase in craving for a cigarette (TCQ β = 11.44, p < 0.0001, Cohen’s d = 1.10) and also a moderate increase in craving for alcohol (β = 0.7, p = 0.017, d = 0.50). There were no significant physiological differences between the two conditions. These results provide good evidence that VR experiences can elicit strong craving for cigarettes. The programming can be part of developing a new VR cognitive therapy to help people reduce smoking.

Introduction

People often smoke in response to specific cues such as seeing a cigarette, ashtray, or matches 1 , 2 . Hence exposure to smoking cues is an important step in therapies designed to build resilience to craving 3 . Presentation of smoking cues within virtual reality (VR) has been shown to elicit cigarette craving 4 . There are two key advantages of use of VR. First, multiple different smoking cues and scenarios, graduated in difficulty, can be easily presented and there are no actual cigarettes present to smoke and reinforce the established response. Second, it is now possible to automate delivery of therapy within VR 5 . We have successfully piloted with thirteen smokers a VR smoking environment delivered in the new generation of standalone VR headsets 6 . In this paper, we report a definitive test of this environment as the first step in developing a new VR therapy for smoking cessation.

One important reason that prevents people who smoke from successfully quitting is the difficulty of not responding to everyday smoking cues 1 . Smoking cues can be pervasive in daily lives and exposure to these cues is a predictor of smoking 7 . These cues may be specific items related to smoking, such as ashtrays and cigarette butts, and general environments where people usually smoke, such as a bar, and can include time-related events such as a morning coffee routine 8 . Exposure to smoking cues elicits craving, and craving has been identified as the mediator that leads to smoking 9 , 10 , 11 . Drinking alcohol is another well-recognised cue for smoking 12 and people who drink alcohol are more likely to smoke too 13 .

Exposure to smoking cues triggering craving for a cigarette is a well-replicated phenomenon 1 , 14 . The results are consistent across different means of presentation, including pictures 15 , video 16 , feature films 17 , and VR 18 . VR clearly provides a higher degree of experimental control compared to studying occurrences of smoking in a natural setup and also provides a higher degree of immersion and interaction than 2D technologies, in which a typical setup offers a reduced field of view and participants are simply spectators 19 . Furthermore, VR allows the placement of people in a surrounding virtual environment that may be associated with smoking, thus triggering craving from a broad contextual cue 20 .

In a systematic review of 18 studies that involved 541 smokers, it has been shown that VR presentation of cues can produce a large triggering of craving (Cohen’s d = 1.0) 21 . In the largest study to date we wanted to show that similar effects can be produced from delivery of VR scenes in the new generation of standalone headsets. This could then form the basis of the development of a new cognitive intervention for smoking cessation. There is extensive evidence in the literature that exposure to smoking cues in VR triggers craving 18 , 22 , 23 . However, only a few papers have reported the results of randomised controlled tests of the use of immersive VR technologies so far. A number of studies used VR to expose smokers to cues 24 or used VR to try to improve the results of an approach bias modification approach 25 . Other studies used more limited technologies such as 360-degree videos 26 and Second Life 27 . These studies have taken different approaches in experimental design or the experimental setup, making it difficult to compare their results. None have tested long-term effects.

Subjective measurements are the most common way to assess cravings. A small number of studies have also supplemented self-report with objective measurements such as physiological data. For example, it has been suggested that craving can be a predictor of physiological arousal 28 , and skin conductance has been shown to increase after exposure to smoking cues 29 . Therefore we also included physiological measures in our test.

In our pilot study, simple pre and post testing with 13 smokers indicated that our new VR environment may increase smoking craving 6 . In this paper, we present the results of a randomised controlled study with 100 participants using the same VR smoking cue scenario. We collected self-reported measurements through questionnaires and physiological data related to heart rate and skin conductance.

Experimental design

The study was a between-subject experiment in which we carried out a between group comparison of smoking craving scores after going through a VR experience, either a neutral environment or a scenario depicting potential smoking cues. Participants were randomly allocated to an experimental group using the online tool Sealed Envelope ( https://www.sealedenvelope.com/ ). Ethics approval was granted by the University of Oxford Central University Research Ethics Committee (reference R81586/RE001). All research was performed in accordance with relevant guidelines/regulations, and written informed consent was provided by all participants.
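The allocation itself was done with the Sealed Envelope web tool; purely to illustrate the idea of random assignment (this is not that tool’s actual procedure), a simple allocation could be generated like this:

```python
import numpy as np

rng = np.random.default_rng(2024)
n_participants = 100

# Each participant is independently assigned to one of the two conditions
allocation = rng.choice(["smoking_cues", "neutral"], size=n_participants)
print(allocation[:10])
print((allocation == "smoking_cues").sum(), "assigned to the smoking-cues scenario")
```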

Participants

Participants were recruited through advertisements on social media and local radio stations. The inclusion criteria were: over 18 years old and smoke a minimum of 10 cigarettes per day. The exclusion criteria were: photosensitive epilepsy; significant visual, auditory, or balance impairment; insufficient comprehension of English to complete study measures; using nicotine replacements; primary diagnosis of another substance dependency; or medication that reduces nicotine cravings (e.g. Bupropion).

The main outcome of the study was the score from the Tobacco Craving Questionnaire 30 . It is a 12-item short version of the long 47-item questionnaire. The long version is validated and reported as being reliable for research 31 and the short version has similar internal consistency to the original version. It assesses craving for a cigarette at the time of filling it out, with answers on a 1 (strongly disagree) to 7 (strongly agree) Likert scale. Participants filled out this questionnaire before and after the VR experience. Scores can range from 12 to 84. Higher scores indicate greater craving for a cigarette.

As additional outcomes, participants provided subjective scores of their current cravings of cigarettes and alcohol using two visual analogue scales (VAS), with answers from 0 (Do not want one at all) to 10 (Extremely want one). Participants provided their current cravings before and after the VR experience. Higher scores indicate greater craving. Although internal consistency cannot be checked due to being just a single item, the use of VAS has been an increasingly popular method to quantify subjective experiences (e.g. pain 32 ), and has been found to be a valid way to measure intensity of cigarette craving 33 .

At baseline participants completed the Heaviness of Smoking Index (HSI) 34 , which is a 2-item questionnaire to assess how much a person smokes, including the questions “How many cigarettes do you typically smoke per day?” and “How soon after you wake up do you have your first cigarette?” Answers were given as categorical numbers. These two measurements have been shown to be fairly reliable when used either separately or together 35 .

At baseline participants completed the AUDIT-C 36 , which is a three-item questionnaire to assess average alcohol use. Multiple studies have validated this questionnaire 37 .

Electrodermal activity and heart rate were recorded during baseline and the VR experience with the use of an Empatica E4 wristband ( https://www.empatica.com/en-gb/research/e4/ ). Data included two pairs of inter beat interval (IBI) and electrodermal activity (EDA) files, one pair for baseline and the other recorded during the VR experience.

VR scenarios

We used the Meta Quest 2 VR headset in standalone mode for all the VR sessions. That means we used the VR headset without using a computer to run the simulation. There were two scenarios, one for the experimental group and one for the control group. Each lasted three minutes. In both scenarios participants sat down for the entire duration of the experience.

In the experimental group, participants were placed in an environment that resembled a British pub outdoor space 6 (see Fig.  1 ). There were several people sitting around as on a typical warm sunny day. There were several items related to smoking cues in the scenario—pint glasses, ashtrays with cigarette butts, one of them half extinguished and still releasing a trail of smoke. On the bench next to the participant, two virtual characters were chatting. Over time their discussion turned towards cigarettes and how hard it is to quit smoking. At the end of the scenario, one of the characters turned their head towards the participant and asked them directly if they wanted a cigarette.

Figure 1. Screenshots of the VR scenarios taken from the initial perspective of the participant: (1) the beer garden; (2) the room in the neutral environment. Images created in Unity 2020.3.3f1 (https://unity.com/releases/editor/whats-new/2022.3.3).

In the control group, participants visited a neutral environment in a modern house with wide windows, similar to the welcome room described in 38 (see Fig.  1 ). The landscape outside included different types of vegetation, a water stream, and clear blue sky. The environment included quiet background music. This environment did not contain any smoking cues.

No other hardware was required besides the VR headset, the Empatica E4 wristband, and a smartphone to record the physiological data.

Experimental procedures

Participants were asked to refrain from smoking 30 min before coming to the VR lab. Upon arrival, they were met at the reception area by a researcher who guided them to the VR lab. Once in the lab, they were asked to confirm that they had read the information sheet at least 24 h prior to the VR session and if willing to participate, to sign a consent form. After agreeing to participate, they filled out the AUDIT-C, the Tobacco Craving Questionnaire, and the visual analogue scales to obtain baseline measurements for the initial craving scores.

When they completed the questionnaires, they were randomised and allocated to the experimental condition, and they were instructed to remain seated for the rest of the session. A researcher helped them put the Empatica E4 wristband on their dominant arm and recorded two minutes of physiological data as a baseline. After that, participants put the VR headset on, the researcher made sure that vision was clear and the headset had been adjusted to the participant’s comfort, the Empatica E4 started recording data once again, and the VR experience started.

After the VR experience ended, the researcher helped them to take the VR headset off and they were asked to fill out the Tobacco Craving Questionnaire and the visual analogue scales again. Participants were compensated twenty pounds for their time.

Data analysis

Analyses were conducted in R version 4.3.0 39 . The main outcome was the craving score on the TCQ after the VR experience. We carried out a linear regression analysis with experimental group (StudyCondition) as the independent variable, controlling for initial craving scores.

The scores reported by participants on the two VASs (cigarettes and alcohol) were also analysed using linear regression. Similarly, we used group as the independent variable and controlled for initial craving scores. Linear regression analyses were carried out in R using the lm function. Effect sizes were summarised as Cohen’s d values calculated using the cohen.d function in R.
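The analysis itself was run in R with lm and cohen.d; a rough Python equivalent of the model described (post-VR TCQ score regressed on group while adjusting for the baseline score; the file and column names here are assumptions, not the authors’) might look like this:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per participant with columns
#   tcq_post, tcq_baseline, StudyCondition ("smoking_cues" or "neutral")
df = pd.read_csv("tcq_scores.csv")  # hypothetical file

# Post-VR craving on group, adjusting for baseline craving
model = smf.ols("tcq_post ~ StudyCondition + tcq_baseline", data=df).fit()
print(model.summary())

# A simple (unadjusted) Cohen's d for the between-group difference
g = df.groupby("StudyCondition")["tcq_post"]
m, s, n = g.mean(), g.std(), g.size()
pooled_sd = (((n - 1) * s**2).sum() / (n.sum() - 2)) ** 0.5
print((m["smoking_cues"] - m["neutral"]) / pooled_sd)
```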

We tested scores for how much participants smoked (HSI) and drank (AUDIT-C) on average as possible moderators of the main outcome (a sketch of such a model is given below).
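A hedged sketch of a moderation model consistent with that description, not the paper’s exact equation (column names are assumed, as above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed per-participant columns: tcq_post, tcq_baseline, StudyCondition, HSI, AUDITC
df = pd.read_csv("tcq_scores.csv")  # hypothetical file

# Interaction terms let HSI and AUDIT-C moderate the group effect;
# this is an assumed reconstruction, not the authors' exact specification
moderation = smf.ols(
    "tcq_post ~ StudyCondition * HSI + StudyCondition * AUDITC + tcq_baseline",
    data=df,
).fit()
print(moderation.summary())
```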

Electrodermal activity (EDA) data pre-processing and initial visual analysis were carried out using the Matlab-based tool Ledalab 40 . We carried out a visual inspection on each dataset to detect anomalies in the data. We discarded the data from participants if more than 50% of their data were zero on either the baseline or the VR experience dataset. We also discarded the data showing sudden jumps that were too abrupt to be attributed to a change in skin conductance and did not recover to the original level after a few seconds.

Data cleaning 41 included data trimming, smoothing, and correction of artifacts originating from bad readings. We trimmed a few seconds at the beginning and at the end of the dataset, as it was common for there to be a few faulty readings at both ends. Trimming was done manually, keeping the data from the moment the function looked stable. We smoothed out the data to remove high frequency noise using a filter with a Gauss window of size 8. We also removed any isolated spike due to bad readings and we reconstructed the signal with either a linear or a spline interpolation, depending on what was more suitable in each case. The sample rate was kept at the recording rate of 4 Hz.
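A minimal sketch of that kind of cleaning in Python, assuming the raw 4 Hz EDA trace is available as a plain array (the trim length, smoothing width, and spike threshold below are illustrative guesses, not the authors’ exact settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

fs = 4  # Empatica E4 EDA sampling rate, in Hz
eda = np.loadtxt("eda_raw.txt")  # hypothetical raw trace, one value per sample

# Trim a few seconds at both ends, where readings are often unstable
trim = 5 * fs
eda = eda[trim:-trim]

# Smooth out high-frequency noise (a rough stand-in for the Gauss window used)
eda_smooth = gaussian_filter1d(eda, sigma=2.0)

# Replace isolated spikes with linear interpolation (crude artifact rule)
idx = np.arange(len(eda_smooth))
spikes = np.abs(np.diff(eda_smooth, prepend=eda_smooth[0])) > 0.5  # threshold is a guess
eda_clean = eda_smooth.copy()
eda_clean[spikes] = np.interp(idx[spikes], idx[~spikes], eda_smooth[~spikes])
```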

We split the signal between the tonic and the phasic components using continuous decomposition analysis 40 . The tonic component provides an overall background level and tendency over time of the signal, while the phasic component contains the information about sudden peaks and changes. We then analysed both components separately. We looked at the mean and standard deviation in the tonic component during the VR experience relative to the baseline. For this, we divided both the mean and standard deviation obtained in the VR experience by the values obtained in the baseline. We also looked at the skin conductance level (SCL) as the gradient of the tonic component. In the phasic component, we studied the mean and standard deviation relative to baseline the same way we calculated it in the tonic component. We then compared these extracted features between experimental groups.

Regarding heart rate data, we were interested in the heart rate variability (HRV), calculated from the IBI data. These data are processed only when the two beats are detected, thus the number of samples varied between participants. The mean and standard deviation of the HRV used in the statistical analysis were also relative to the baseline values.
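A hedged sketch of the heart-rate-variability features described, assuming the inter-beat intervals for the baseline and VR periods are available as arrays (file names and the exact HRV summary are assumptions):

```python
import numpy as np

# Hypothetical inter-beat-interval recordings, in seconds
ibi_baseline = np.loadtxt("ibi_baseline.txt")
ibi_vr = np.loadtxt("ibi_vr.txt")

def ibi_summary(ibi):
    """Mean and standard deviation of the inter-beat intervals (a simple HRV summary)."""
    return np.mean(ibi), np.std(ibi)

mean_base, sd_base = ibi_summary(ibi_baseline)
mean_vr, sd_vr = ibi_summary(ibi_vr)

# Features expressed relative to the baseline recording, as described above
print(mean_vr / mean_base, sd_vr / sd_base)
```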

We analysed all the features extracted from both EDA and IBI signals in a linear regression with the experimental group as the sole independent variable.

30 male and 22 female participants were allocated to the experimental group. The average age was 39.12 (SD = 15.12). In the control group there were 27 male and 21 female participants, with an average age of 37.77 (SD = 14.49). No participants selected their gender as either ‘non-binary’ or ‘preferred not to say’. In the experimental group, the average HSI score was 3.33 (SD = 1.20) and the AUDIT-C score was 5 (SD = 3.01), and in the control group, the average HSI was 3.25 (SD = 0.84) and the average AUDIT-C score was 6.25 (SD = 2.97).

Table 1 shows the scores of the three questionnaire outcomes. All three scores obtained after the VR experience were statistically different between experimental groups. Compared to the neutral condition, the experimental group had large effect size increases in cigarette craving and a moderate increase in craving for alcohol. The Heaviness of Smoking Index reported during screening predicted cigarette craving after the VR session across both experimental groups (HSI, p < 0.0001) and alcohol use did not (AUDIT-C, p = 0.85). Looking within groups (i.e. pre to post changes), the experimental group increased in scores on the TCQ (p < 0.01), the VAS for cigarettes (p < 0.0001), and the VAS for alcohol (p < 0.001). The control group decreased in scores on the TCQ (p < 0.0001) and the VAS for cigarettes (p = 0.02) but did not significantly alter in alcohol craving (p = 0.18).

There were missing data from the physiological recordings. We had electrodermal activity (EDA) data from 71 participants (42 in the experimental group and 29 in the control group). Table 2 shows the mean, the standard deviation, and the results from the regression analyses of the different features extracted from the physiological data. The results did not show any statistically significant difference between experimental groups on any of the features extracted.

We conducted the largest experimental test of whether VR simulations can produce craving for cigarettes in people who smoke regularly. Importantly the test used the latest standalone headset without use of an external computer. The VR public house scene produced a large increase in cigarette craving compared to a neutral VR scene. The Cohen’s d for cigarette craving was 1.1, which is similar to the effect size reported in a meta-analysis of studies focused on cue-induced craving in VR 21 . The VR pub experience led to significantly increased levels of craving from before to after immersion (i.e., there was a within group effect), but it should be noted when considering the magnitude of the between groups effect that there was also a significant reduction in cigarette craving in the neutral experience. This reduction may perhaps be explained by the use of VR technology being interesting for the participant and hence distracting from cravings. Furthermore, it is possible that the virtual environment, which had windows with views to a natural landscape, might have been found calming like a relaxation nature scene 42 . The VR pub scene also led to an increase in the smokers in craving for alcohol. The results once again show how VR can induce similar responses to real-world environments. Our VR pub scene could form the basis for the development of a smoking cessation therapy.

We tested whether level of smoking and alcohol consumption affected responses. Neither had a differential effect by the type of VR scene. However, people who reported smoking a greater number of cigarettes had higher levels of cigarette craving in both the VR scene and the neutral scene. In contrast, level of alcohol use did not predict level of cigarette craving in VR. This further validates the use of VR, since it shows that, as would be expected, cravings elicited in VR are affected by a person’s severity of smoking (but not alcohol use).

Regarding the physiological information collected, we explored different features from electrodermal and heart rate data that could be related to craving. The tonic component in the electrodermal data could reveal an overall increase in anxiety. We analysed the mean and standard deviation of this signal relative to the data recorded during baseline. We also looked at skin conductance by calculating the gradient of the linear regression. Electrodermal values naturally change over time, so we predicted that in the experimental group there would be a significantly higher number of participants with a positive slope compared to the control group. We did not find evidence that any feature was statistically different between the randomised groups. Table 2 shows that the values are in the decimals, so we speculate that the signal-to-noise ratio was possibly close to zero dB. Skin conductance values were low. That means that the overall tonic values did not change to any great degree for any participant. Analysing the tonic driver might be more meaningful in longer experiences than three minutes.

The phasic signal is a marker of how participants respond to specific events during the experience. The VR pub scenario contained several smoking cues, and we were interested in whether these cues could trigger craving at very specific timestamps, showing up as spikes in the data. We analysed the mean and standard deviation of the phasic driver, but the results did not show any difference between the two groups. Data from an accelerometer could indicate whether a change in the electrodermal signal comes from limb movement, and, since electrodermal values can also change when people talk, a voice detection algorithm could be helpful. Finally, we studied the inter-beat interval to look for statistical differences in heart rate variability, again using the mean and standard deviation relative to baseline for each participant. The results did not show any statistical differences in this case either. Given the strong findings for subjective craving, it is plausible that we did not assess the physiological information most useful for detecting it.
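
The sketch below illustrates the two remaining feature families discussed above: counting skin-conductance responses (peaks) in the phasic signal and computing inter-beat-interval statistics relative to baseline. The amplitude threshold and the function names are assumptions, not the values used in the study.

```python
import numpy as np
from scipy.signal import find_peaks

def phasic_peak_count(phasic, fs=4, min_amp=0.01):
    """Count skin-conductance responses: local maxima in the phasic signal
    above a small amplitude threshold (0.01 uS here, an arbitrary choice)."""
    peaks, _ = find_peaks(phasic, height=min_amp, distance=fs)  # at least 1 s apart at 4 Hz
    return len(peaks)

def ibi_features(ibi_session, ibi_baseline):
    """Mean and SD of inter-beat intervals (seconds) relative to baseline."""
    return {
        "mean_rel": np.mean(ibi_session) - np.mean(ibi_baseline),
        "sd_rel": np.std(ibi_session, ddof=1) - np.std(ibi_baseline, ddof=1),
    }
```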

The Empatica E4 device has been validated (refs. 43,44). However, there were missing data. Researchers need procedures in place for setting the device up correctly and should expect the signal to change if the wristband moves. Recordings were three minutes long. Longer sessions would have provided better estimates of the tonic driver and of the overall skin conductance level over time. On the other hand, peak detection in the phasic signal should not have been greatly affected by the length of the recording. However, it should be kept in mind that changes in the phasic driver induced by a stimulus appear in the signal a few seconds later, between one and five seconds (ref. 40). Our scenario with smoking cues ended with one of the characters looking directly at the participant and offering a cigarette. Recording was stopped at that moment, whereas it should have continued for longer to capture the response. For the phasic driver, it is important to estimate the signal-to-noise ratio (SNR) to facilitate discriminating peaks from noise. We applied a smoothing function when preparing the signal before the decomposition, but the phasic signal was not completely noise-free.
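
One simple way to estimate the SNR mentioned above is to treat the smoothed phasic signal as "signal" and the residual after smoothing as "noise", reporting the power ratio in decibels. The helper below is a hypothetical illustration of that idea, not the procedure used in the study.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def phasic_snr_db(phasic, fs=4, win_seconds=2.0):
    """Rough SNR estimate: smooth the phasic signal with a moving average,
    treat the residual as noise, and return 10*log10(signal power / noise power)."""
    win = max(1, int(win_seconds * fs))
    smooth = uniform_filter1d(np.asarray(phasic, dtype=float), size=win)
    noise = phasic - smooth
    return 10 * np.log10(np.mean(smooth ** 2) / np.mean(noise ** 2))
```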

There are practical considerations when using the Empatica E4. If the band is too loose, the readings will vary and the data become unreliable. If the band is too tight, it can create discomfort and interfere with the VR experience. Ideally, the acquired signal should be as clean as possible to minimise the amount of post-processing. Another consideration is that the wristband has a button that must be pressed to start and stop recording. When the button is pressed, the sensors are pushed against the wearer's skin, which is clearly visible in the data as short spikes and oscillations in the first seconds of the recording, as well as at the end. The data needed to be trimmed; ideally, the wristband should be operated remotely via the API provided by the manufacturer.
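
Trimming those button-press artefacts amounts to dropping the first and last few seconds of each recording, as in the hypothetical helper below; the five-second trim lengths are arbitrary, not the values used in the study.

```python
def trim_button_artefacts(signal, fs, head_seconds=5.0, tail_seconds=5.0):
    """Drop the first and last few seconds of a recording, where pressing the
    E4's button pushes the sensors against the skin and produces spikes.
    The 5-second trims are illustrative only."""
    start = int(head_seconds * fs)
    stop = len(signal) - int(tail_seconds * fs)
    return signal[start:stop]
```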

Developing VR therapies on the basis of rigorous experiments is more likely to lead to clinically successful outcomes. This study not only confirms the basic potential of VR to help people smoke less but also shows that this scenario can form part of the content of a VR therapy.

Data availability

Deidentified data are available from the corresponding authors on reasonable request and subject to a contract with the university.

References

1. Droungas, A., Ehrman, R. N., Childress, A. R. & O’Brien, C. P. Effect of smoking cues and cigarette availability on craving and smoking behavior. Addict. Behav. 20(5), 657–673. https://doi.org/10.1016/0306-4603(95)00029-C (1995).

2. Niaura, R. S., Monti, P. M. & Pedraza, M. Relevance of cue reactivity to understanding alcohol and smoking relapse. J. Abnorm. Psychol. 97(2), 133 (1988).

3. Martin, T., LaRowe, S. D. & Malcolm, R. Progress in cue exposure therapy for the treatment of addictive disorders: A review update. Open Addict. J. 3(1), 92–101 (2010).

4. Bordnick, P. S. et al. Utilizing virtual reality to standardize nicotine craving research: A pilot study. Addict. Behav. 29(9), 1889–1894. https://doi.org/10.1016/j.addbeh.2004.06.008 (2004).

5. Freeman, D. et al. Automated psychological therapy using immersive virtual reality for treatment of fear of heights: A single-blind, parallel-group, randomised controlled trial. Lancet Psychiatry 5(8), 625–632. https://doi.org/10.1016/S2215-0366(18)30226-8 (2018).

6. Yu, C.-J., Rovira, A., Pan, X. & Freeman, D. A validation study to trigger nicotine craving in virtual reality. In 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW) 868–869. https://doi.org/10.1109/VRW55335.2022.00285 (2022).

7. Herman, C. P. External and internal cues as determinants of the smoking behavior of light and heavy smokers. J. Pers. Soc. Psychol. 30(5), 664–672. https://doi.org/10.1037/h0037440 (1974).

8. Conklin, C. A., Robin, N., Perkins, K. A., Salkeld, R. P. & McClernon, F. J. Proximal versus distal cues to smoke: The effects of environments on smokers’ cue-reactivity. Exp. Clin. Psychopharmacol. 16(3), 207–214. https://doi.org/10.1037/1064-1297.16.3.207 (2008).

9. Allen, S. S., Bade, T., Hatsukami, D. & Center, B. Craving, withdrawal, and smoking urges on days immediately prior to smoking relapse. Nicotine Tob. Res. 10(1), 35–45. https://doi.org/10.1080/14622200701705076 (2008).

10. Conklin, C. A. et al. Examining the relationship between cue-induced craving and actual smoking. Exp. Clin. Psychopharmacol. 23(2), 90–96. https://doi.org/10.1037/a0038826 (2015).

11. Wray, J. M., Gass, J. C. & Tiffany, S. T. A systematic review of the relationships between craving and smoking cessation. Nicotine Tob. Res. 15(7), 1167–1182. https://doi.org/10.1093/ntr/nts268 (2013).

12. Mintz, J., Boyd, G., Rose, J. E., Charuvastra, V. C. & Jarvik, M. E. Alcohol increases cigarette smoking: A laboratory demonstration. Addict. Behav. 10(3), 203–207. https://doi.org/10.1016/0306-4603(85)90001-2 (1985).

13. Bien, T. H. & Burge, R. Smoking and drinking: A review of the literature. Int. J. Addict. 25(12), 1429–1454. https://doi.org/10.3109/10826089009056229 (1990).

14. Robbins, S. J. & Ehrman, R. N. Designing studies of drug conditioning in humans. Psychopharmacology (Berl.) 106(2), 143–153. https://doi.org/10.1007/BF02801965 (1992).

15. Lochbuehler, K., Engels, R. C. M. E. & Scholte, R. H. J. Influence of smoking cues in movies on craving among smokers. Addiction 104(12), 2102–2109. https://doi.org/10.1111/j.1360-0443.2009.02712.x (2009).

16. Upadhyaya, H. P., Drobes, D. J. & Thomas, S. E. Reactivity to smoking cues in adolescent cigarette smokers. Addict. Behav. 29(5), 849–856. https://doi.org/10.1016/j.addbeh.2004.02.040 (2004).

17. Hines, D., Saris, R. N. & Throckmorton-Belzer, L. Cigarette smoking in popular films: Does it increase viewers’ likelihood to smoke? J. Appl. Soc. Psychol. 30(11), 2246–2269. https://doi.org/10.1111/j.1559-1816.2000.tb02435.x (2000).

18. Bordnick, P. S., Graap, K. M., Copp, H. L., Brooks, J. & Ferrer, M. Virtual reality cue reactivity assessment in cigarette smokers. Cyberpsychol. Behav. 8(5), 487–492. https://doi.org/10.1089/cpb.2005.8.487 (2005).

19. Sanchez-Vives, M. V. & Slater, M. From presence to consciousness through virtual reality. Nat. Rev. Neurosci. 6(4), 332–339. https://doi.org/10.1038/nrn1651 (2005).

20. Traylor, A. C., Parrish, D. E., Copp, H. L. & Bordnick, P. S. Using virtual reality to investigate complex and contextual cue reactivity in nicotine dependent problem drinkers. Addict. Behav. 36(11), 1068–1075. https://doi.org/10.1016/j.addbeh.2011.06.014 (2011).

21. Pericot-Valverde, I., Germeroth, L. J. & Tiffany, S. T. The use of virtual reality in the production of cue-specific craving for cigarettes: A meta-analysis. Nicotine Tob. Res. 18(5), 538–546. https://doi.org/10.1093/ntr/ntv216 (2016).

22. Pericot-Valverde, I., Secades-Villa, R., Gutiérrez-Maldonado, J. & García-Rodríguez, O. Effects of systematic cue exposure through virtual reality on cigarette craving. Nicotine Tob. Res. 16(11), 1470–1477. https://doi.org/10.1093/ntr/ntu104 (2014).

23. Zandonai, T. et al. A virtual reality study on postretrieval extinction of smoking memory reconsolidation in smokers. J. Subst. Abuse Treat. 125, 108317. https://doi.org/10.1016/j.jsat.2021.108317 (2021).

24. Pericot-Valverde, I., Secades-Villa, R. & Gutiérrez-Maldonado, J. A randomized clinical trial of cue exposure treatment through virtual reality for smoking cessation. J. Subst. Abuse Treat. 96, 26–32. https://doi.org/10.1016/j.jsat.2018.10.003 (2019).

25. Machulska, A. et al. Approach bias retraining through virtual reality in smokers willing to quit smoking: A randomized-controlled study. Behav. Res. Ther. 141, 103858. https://doi.org/10.1016/j.brat.2021.103858 (2021).

26. Goldenhersch, E. et al. Virtual reality smartphone-based intervention for smoking cessation: Pilot randomized controlled trial on initial clinical efficacy and adherence. J. Med. Internet Res. 22(7), e17571. https://doi.org/10.2196/17571 (2020).

27. Culbertson, C. S., Shulenberger, S., De La Garza, R., Newton, T. F. & Brody, A. L. Virtual reality cue exposure therapy for the treatment of tobacco dependence. J. Cyber Ther. Rehabil. 5(1), 57–64 (2012).

28. Jerome, L. W., Jordan, P. J., Rodericks, R. & Fedenczuk, L. Psychophysiological arousal and craving in smokers, deprived smokers, former smokers, and non-smokers. Stud. Health Technol. Inform. 144, 179–183 (2009).

29. LaRowe, S. D., Saladin, M. E., Carpenter, M. J. & Upadhyaya, H. P. Reactivity to nicotine cues over repeated cue reactivity sessions. Addict. Behav. 32(12), 2888–2899. https://doi.org/10.1016/j.addbeh.2007.04.025 (2007).

30. Heishman, S., Singleton, E. & Pickworth, W. Reliability and validity of a short form of the tobacco craving questionnaire. Nicotine Tob. Res. 10(4), 643–651. https://doi.org/10.1080/14622200801908174 (2008).

31. Heishman, S., Singleton, E. & Moolchan, E. Tobacco Craving Questionnaire: Reliability and validity of a new multifactorial instrument. Nicotine Tob. Res. 5(5), 645–654. https://doi.org/10.1080/1462220031000158681 (2003).

32. Heller, G. Z., Manuguerra, M. & Chow, R. How to analyze the Visual Analogue Scale: Myths, truths and clinical relevance. Scand. J. Pain 13(1), 67–75. https://doi.org/10.1016/j.sjpain.2016.06.012 (2016).

33. Wewers, M. E., Rachfal, C. & Ahijevych, K. A psychometric evaluation of a visual analogue scale of craving for cigarettes. West. J. Nurs. Res. 12(5), 672–681 (1990).

34. Heatherton, T. F., Kozlowski, L. T., Frecker, R. C., Rickert, W. & Robinson, J. Measuring the Heaviness of Smoking: Using self-reported time to the first cigarette of the day and number of cigarettes smoked per day. Br. J. Addict. 84(7), 791–800. https://doi.org/10.1111/j.1360-0443.1989.tb03059.x (1989).

35. Borland, R., Yong, H.-H., O’Connor, R. J., Hyland, A. & Thompson, M. E. The reliability and predictive validity of the Heaviness of Smoking Index and its two components: Findings from the International Tobacco Control Four Country study. Nicotine Tob. Res. 12(Suppl. 1), S45–S50. https://doi.org/10.1093/ntr/ntq038 (2010).

36. Bush, K. The AUDIT alcohol consumption questions (AUDIT-C): An effective brief screening test for problem drinking. Arch. Intern. Med. 158(16), 1789. https://doi.org/10.1001/archinte.158.16.1789 (1998).

37. Khadjesari, Z. et al. Validation of the AUDIT-C in adults seeking help with their drinking online. Addict. Sci. Clin. Pract. 12(1), 2. https://doi.org/10.1186/s13722-016-0066-5 (2017).

38. Freeman, D. et al. Automated VR therapy for improving positive self-beliefs and psychological well-being in young patients with psychosis: A proof of concept evaluation of Phoenix VR self-confidence therapy. Behav. Cogn. Psychother. https://doi.org/10.1017/S1352465823000553 (2023).

39. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2023). https://www.R-project.org/

40. Benedek, M. & Kaernbach, C. A continuous measure of phasic electrodermal activity. J. Neurosci. Methods 190(1), 80–91. https://doi.org/10.1016/j.jneumeth.2010.04.028 (2010).

41. Horvers, A., Tombeng, N., Bosse, T., Lazonder, A. W. & Molenaar, I. Detecting emotions through electrodermal activity in learning contexts: A systematic review. Sensors 21(23), 7869. https://doi.org/10.3390/s21237869 (2021).

42. Anderson, A. P. et al. Relaxation with immersive natural scenes presented using virtual reality. Aerosp. Med. Hum. Perform. 88(6), 520–526. https://doi.org/10.3357/AMHP.4747.2017 (2017).

43. McCarthy, C., Pradhan, N., Redpath, C. & Adler, A. Validation of the Empatica E4 wristband. In 2016 IEEE EMBS Int. Stud. Conf. (ISC 2016) Proc. 4–7. https://doi.org/10.1109/EMBSISC.2016.7508621 (2016).

44. Schuurmans, A. A. T. et al. Validity of the Empatica E4 wristband to measure heart rate variability (HRV) parameters: A comparison to electrocardiography (ECG). J. Med. Syst. 44(11), 190. https://doi.org/10.1007/s10916-020-01648-w (2020).

Acknowledgements

This work was funded by a National Institute for Health and Care Research (NIHR) Senior Investigator award to DF (NIHR202385) and the NIHR Oxford Health Biomedical Research Centre (BRC). SL was supported by an NIHR Doctoral Fellowship (NIHR301483). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care. FW is funded by a Wellcome Trust Clinical Doctoral Fellowship (102176/B/13/Z).

Author information

Authors and Affiliations

Department of Experimental Psychology, University of Oxford, Oxford, UK

Aitor Rovira, Sinéad Lambe, Helen Beckwith, Memoona Ahmed, Felicity Hudson, Phoebe Haynes, Kira Williams, Simone Saidel, Ellen Iredale, Sapphira McBride, Felicity Waite & Daniel Freeman

Oxford Health NHS Foundation Trust, Oxford, UK

Aitor Rovira, Sinéad Lambe, Helen Beckwith, Felicity Waite & Daniel Freeman

Goldsmiths University, London, UK

Chun-Jou Yu & Xueni Pan

Contributions

DF, SL, and AR conceived the study. DF, SL, AR, KW, SS, EI, SM conducted the experimental design. CYU programmed the VR experience. AR, XP supervised the software development. MA, FH, PH carried out the recruitment and the VR sessions. SL, HB, FW helped supervise the study. AR, DF performed the statistical analysis. AR, DF wrote the manuscript.

Corresponding author

Correspondence to Aitor Rovira .

Ethics declarations

Competing interests

DF is the scientific founder of Oxford VR, a University of Oxford spin-out company. Oxford VR has not been involved in this study. The rest of the authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Rovira, A., Lambe, S., Beckwith, H. et al. A randomized controlled experiment testing the use of virtual reality to trigger cigarette craving in people who smoke. Sci Rep 14 , 19445 (2024). https://doi.org/10.1038/s41598-024-70113-2

Received : 19 April 2024

Accepted : 13 August 2024

Published : 21 August 2024

DOI : https://doi.org/10.1038/s41598-024-70113-2

Keywords

  • Cigarette craving
  • Virtual reality
