
Definition: Experiment

Experiments investigate and attempt to demonstrate the cause-and-effect relationship between two variables. An example can be seen in the test phase of pharmaceutical drugs, i.e., whether drug X effectively combats disease Y.

In experiments, the subjects are usually divided into two groups: one control group and one experimental group. The experimental group actually receives the drug, while the control group proceeds only with the standard treatment. A distinction is made between laboratory experiments (controlled environment) and field experiments (in natural settings). Experiments must satisfy the scientific quality criteria of objectivity, reliability, and validity.

Please note that the definitions in our statistics encyclopedia are simplified explanations of terms. Our goal is to make the definitions accessible for a broad audience; thus it is possible that some definitions do not adhere entirely to scientific standards.



Design of experiments

What is design of experiments?

Design of experiments (DOE) is a systematic, efficient method that enables scientists and engineers to study the relationship between multiple input variables (aka factors) and key output variables (aka responses). It is a structured approach for collecting data and making discoveries.

When to use DOE?

  • To determine whether a factor, or a collection of factors, has an effect on the response.
  • To determine whether factors interact in their effect on the response.
  • To model the behavior of the response as a function of the factors.
  • To optimize the response.

Ronald Fisher first introduced four enduring principles of DOE in 1926: the factorial principle, randomization, replication, and blocking. Generating and analyzing these designs once relied primarily on hand calculation; more recently, practitioners have turned to computer-generated designs for more effective and efficient DOE.

Why use DOE?

DOE is useful:

  • For building knowledge of cause and effect between the factors and the response.
  • For experimenting with all factors at the same time.
  • For running trials that span the potential experimental region for our factors.
  • For understanding the combined effect of the factors.

To illustrate the importance of DOE, let’s look at what happens when DOE is not used.

Experiments are likely to be carried out via the trial-and-error or the one-factor-at-a-time (OFAT) method.

Trial-and-error method

Test different settings of two factors and see what the resulting yield is.

Say we want to determine the optimal temperature and time settings that will maximize yield through experiments.

Here is how the experiment looks using the trial-and-error method:

1. Conduct a trial at starting values for the two variables and record the yield:


2. Adjust one or both values based on our results:


3. Repeat Step 2 until we think we've found the best set of values:


As you can tell, the cons of trial-and-error are:

  • Inefficient, unstructured, and ad hoc (worse still if carried out without subject-matter knowledge).
  • Unlikely to find the optimum set of conditions across two or more factors.

One factor at a time (OFAT) method

Change the value of one factor, measure the response, then repeat the process with another factor.

In the same experiment of searching for the optimal temperature and time to maximize yield, here is how the experiment looks using the OFAT method:

1. Start with temperature: Find the temperature resulting in the highest yield, between 50 and 120 degrees.

    1a. Run a total of eight trials. Each trial increases temperature by 10 degrees (i.e., 50, 60, 70 ... all the way to 120 degrees).

    1b. With time fixed at 20 hours as a controlled variable.

    1c. Measure yield for each batch.


2. Run the second experiment by varying time, to find the optimal value of time (between 4 and 24 hours).

    2a. Run a total of six trials. Each trial increases time by 4 hours (i.e., 4, 8, 12… up to 24 hours).

    2b. With temperature fixed at 90 degrees as a controlled variable.

    2c. Measure yield for each batch.


3. After a total of 14 trials, we’ve identified that the maximum yield (86.7%) occurs when:

  • Temperature is at 90 degrees; Time is at 12 hours.


As you can already tell, OFAT is a more structured approach compared to trial and error.
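To make the OFAT procedure concrete, here is a minimal Python sketch of the two scans. The measure_yield function is a made-up stand-in for the real process response (in practice each call would be a physical trial), so treat this as an illustration of the search logic only:

    # Hypothetical response surface; the interaction term (temp x time) is
    # exactly what OFAT cannot see.
    def measure_yield(temp, time):
        return (90 - 0.01 * (temp - 80) ** 2 - 0.1 * (time - 14) ** 2
                - 0.02 * (temp - 80) * (time - 14))

    # Scan 1: temperature from 50 to 120 in steps of 10, time fixed at 20 hours.
    best_temp = max(range(50, 121, 10), key=lambda t: measure_yield(t, 20))

    # Scan 2: time from 4 to 24 in steps of 4, temperature fixed at the scan-1 winner.
    best_time = max(range(4, 25, 4), key=lambda h: measure_yield(best_temp, h))

    print(best_temp, best_time, round(measure_yield(best_temp, best_time), 1))

Because this hypothetical surface includes an interaction term, the two scans settle near, but not at, the true optimum, which is exactly the problem discussed next.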

But there’s one major problem with OFAT: What if the optimal temperature and time settings look more like this?

[Figure: a response surface whose true optimum lies outside the settings explored by the OFAT trials]

Based on our previous OFAT experiments, we would have missed the optimal temperature and time settings.

Therefore, OFAT’s con is:

  • We’re unlikely to find the optimum set of conditions across two or more factors.

Here is how our trial-and-error and OFAT experiments look:


Notice that none of them has trials conducted at a low temperature and time AND near optimum conditions.

What went wrong in the experiments?

  • We didn't simultaneously change the settings of both factors.
  • We didn't conduct trials throughout the potential experimental region.


The result was a lack of understanding of the combined effect of the two variables on the response. The two factors did interact in their effect on the response!

A more effective and efficient approach to experimentation is to use statistically designed experiments (DOE).

Applying full factorial DOE to the same example

1. Experiment with two factors, each at two levels.


These four trials form the corners of the design space:


2. Run all possible combinations of factor levels, in random order, to average out the effects of lurking variables.

3. (Optional) Replicate the entire design by running each treatment twice to estimate the experimental error:


4. Analyzing the results enables us to build a statistical model that estimates the individual effects (temperature and time) as well as their interaction.

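A sketch of steps 1 through 4 in Python. The eight yield values are made up for illustration (the raw measurements behind this example are not given here), and statsmodels is just one of several libraries that could fit such a model:

    import itertools
    import random

    import pandas as pd
    import statsmodels.formula.api as smf

    # Step 1: two factors, two levels each; the four corners of the design space.
    corners = list(itertools.product([50, 120], [4, 24]))   # (temp, time)

    # Step 3: replicate the whole design so each treatment is run twice.
    runs = corners * 2

    # Step 2: randomize the run order to average out lurking variables.
    run_order = random.sample(range(len(runs)), len(runs))

    # Hypothetical yields, aligned with `runs` (invented numbers).
    df = pd.DataFrame(runs, columns=["temp", "time"])
    df["Yield"] = [55.2, 70.1, 84.3, 90.8, 54.6, 69.4, 83.8, 91.2]

    # Step 4: fit main effects plus the temp:time interaction.
    model = smf.ols("Yield ~ temp * time", data=df).fit()
    print(model.params)   # intercept, temp, time, temp:time

With only two levels per factor, the temp:time coefficient directly estimates the interaction that the OFAT experiments above could not detect.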

The fitted model lets us visualize and explore the interaction between the factors. An illustration of what their interaction looks like at temperature = 120 and time = 4:


You can visualize and explore your model and find the most desirable settings for your factors using the JMP Prediction Profiler.

Summary: DOE vs. OFAT/Trial-and-Error

  • DOE requires fewer trials.
  • DOE is more effective in finding the best settings to maximize yield.
  • DOE enables us to derive a statistical model to predict results as a function of the two factors and their combined effect.

Statistical Design and Analysis of Biological Experiments

Chapter 1: Principles of Experimental Design

1.1 Introduction

The validity of conclusions drawn from a statistical analysis crucially hinges on the manner in which the data are acquired, and even the most sophisticated analysis will not rescue a flawed experiment. Planning an experiment and thinking about the details of data acquisition is so important for a successful analysis that R. A. Fisher—who single-handedly invented many of the experimental design techniques we are about to discuss—famously wrote

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. (Fisher 1938)

(Statistical) design of experiments provides the principles and methods for planning experiments and tailoring the data acquisition to an intended analysis. Design and analysis of an experiment are best considered as two aspects of the same enterprise: the goals of the analysis strongly inform an appropriate design, and the implemented design determines the possible analyses.

The primary aim of designing experiments is to ensure that valid statistical and scientific conclusions can be drawn that withstand the scrutiny of a determined skeptic. Good experimental design also considers that resources are used efficiently, and that estimates are sufficiently precise and hypothesis tests adequately powered. It protects our conclusions by excluding alternative interpretations or rendering them implausible. Three main pillars of experimental design are randomization, replication, and blocking, and we will flesh out their effects on the subsequent analysis as well as their implementation in an experimental design.

An experimental design is always tailored towards predefined (primary) analyses, and an efficient analysis and unambiguous interpretation of the experimental data often follow directly from a good design. This does not prevent us from doing additional analyses of interesting observations after the data are acquired, but these analyses can be subjected to more severe criticisms and conclusions are more tentative.

In this chapter, we provide the wider context for using experiments in a larger research enterprise and informally introduce the main statistical ideas of experimental design. We use a comparison of two samples as our main example to study how design choices affect an analysis, but postpone a formal quantitative analysis to the next chapters.

1.2 A Cautionary Tale

For illustrating some of the issues arising in the interplay of experimental design and analysis, we consider a simple example. We are interested in comparing the enzyme levels measured in processed blood samples from laboratory mice, when the sample processing is done either with a kit from vendor A or a kit from competitor B. For this, we take 20 mice and randomly select 10 of them for sample preparation with kit A, while the blood samples of the remaining 10 mice are prepared with kit B. The experiment is illustrated in Figure 1.1A and the resulting data are given in Table 1.1.

Table 1.1: Measured enzyme levels from samples of twenty mice. Samples of ten mice each were processed using a kit of vendor A and B, respectively.
A 8.96 8.95 11.37 12.63 11.38 8.36 6.87 12.35 10.32 11.99
B 12.68 11.37 12.00 9.81 10.35 11.76 9.01 10.83 8.76 9.99

One option for comparing the two kits is to look at the difference in average enzyme levels, and we find an average level of 10.32 for vendor A and 10.66 for vendor B. We would like to interpret their difference of -0.34 as the difference due to the two preparation kits and conclude whether the two kits give equal results or if measurements based on one kit are systematically different from those based on the other kit.
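The quoted averages can be reproduced directly from Table 1.1; a quick check in Python:

    kit_a = [8.96, 8.95, 11.37, 12.63, 11.38, 8.36, 6.87, 12.35, 10.32, 11.99]
    kit_b = [12.68, 11.37, 12.00, 9.81, 10.35, 11.76, 9.01, 10.83, 8.76, 9.99]

    mean_a = sum(kit_a) / len(kit_a)
    mean_b = sum(kit_b) / len(kit_b)
    # Prints 10.32 10.66 -0.34, matching the values in the text.
    print(round(mean_a, 2), round(mean_b, 2), round(mean_a - mean_b, 2))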

Such interpretation, however, is only valid if the two groups of mice and their measurements are identical in all aspects except the sample preparation kit. If we use one strain of mice for kit A and another strain for kit B, any difference might also be attributed to inherent differences between the strains. Similarly, if the measurements using kit B were conducted much later than those using kit A, any observed difference might be attributed to changes in, e.g., mice selected, batches of chemicals used, device calibration, or any number of other influences. None of these competing explanations for an observed difference can be excluded from the given data alone, but good experimental design allows us to render them (almost) arbitrarily implausible.

A second aspect for our analysis is the inherent uncertainty in our calculated difference: if we repeat the experiment, the observed difference will change each time, and this variation will be more pronounced for a smaller number of mice, among other factors. If we do not use a sufficient number of mice in our experiment, the uncertainty associated with the observed difference might be too large, such that random fluctuations become a plausible explanation for the observed difference. Systematic differences between the two kits, of practically relevant magnitude in either direction, might then be compatible with the data, and we can draw no reliable conclusions from our experiment.

In each case, the statistical analysis—no matter how clever—was doomed before the experiment was even started, while simple ideas from statistical design of experiments would have provided correct and robust results with interpretable conclusions.

1.3 The Language of Experimental Design

By an experiment we understand an investigation where the researcher has full control over selecting and altering the experimental conditions of interest, and we only consider investigations of this type. The selected experimental conditions are called treatments. An experiment is comparative if the responses to several treatments are to be compared or contrasted. The experimental units are the smallest subdivision of the experimental material to which a treatment can be assigned. All experimental units given the same treatment constitute a treatment group. Especially in biology, we often compare treatments to a control group to which some standard experimental conditions are applied; a typical example is using a placebo for the control group, and different drugs for the other treatment groups.

The values observed are called responses and are measured on the response units; these are often identical to the experimental units but need not be. Multiple experimental units are sometimes combined into groupings or blocks, such as mice grouped by litter, or samples grouped by batches of chemicals used for their preparation. More generally, we call any grouping of the experimental material (even with group size one) a unit.

In our example, we selected the mice, used a single sample per mouse, deliberately chose the two specific vendors, and had full control over which kit to assign to which mouse. In other words, the two kits are the treatments and the mice are the experimental units. We took the measured enzyme level of a single sample from a mouse as our response, and samples are therefore the response units. The resulting experiment is comparative, because we contrast the enzyme levels between the two treatment groups.


Figure 1.1: Three designs to determine the difference between two preparation kits A and B based on four mice. A: One sample per mouse. Comparison between averages of samples with same kit. B: Two samples per mouse treated with the same kit. Comparison between averages of mice with same kit requires averaging responses for each mouse first. C: Two samples per mouse each treated with different kit. Comparison between two samples of each mouse, with differences averaged.

In this example, we can coalesce experimental and response units, because we have a single response per mouse and cannot distinguish a sample from a mouse in the analysis, as illustrated in Figure 1.1 A for four mice. Responses from mice with the same kit are averaged, and the kit difference is the difference between these two averages.

By contrast, if we take two samples per mouse and use the same kit for both samples, then the mice are still the experimental units, but each mouse now groups the two response units associated with it. Now, responses from the same mouse are first averaged, and these averages are used to calculate the difference between kits; even though eight measurements are available, this difference is still based on only four mice (Figure 1.1 B).

If we take two samples per mouse, but apply each kit to one of the two samples, then the samples are both the experimental and response units, while the mice are blocks that group the samples. Now, we calculate the difference between kits for each mouse, and then average these differences (Figure 1.1 C).
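The three designs differ only in how the responses are aggregated; a small Python sketch with made-up enzyme levels for four mice (all numbers hypothetical):

    # Design A: one sample per mouse; average per kit, then take the difference.
    kit_a = [10.1, 9.8]                 # mice 1-2, one sample each
    kit_b = [10.9, 10.4]                # mice 3-4, one sample each
    diff_a = sum(kit_a) / 2 - sum(kit_b) / 2

    # Design B: two samples per mouse, same kit; average within each mouse first.
    mouse_means_a = [(10.1 + 10.3) / 2, (9.8 + 9.6) / 2]
    mouse_means_b = [(10.9 + 11.1) / 2, (10.4 + 10.2) / 2]
    diff_b = sum(mouse_means_a) / 2 - sum(mouse_means_b) / 2   # still only 4 mice

    # Design C: both kits on each mouse (mouse = block); difference per mouse,
    # then average the four within-mouse differences.
    per_mouse = [(10.1, 10.8), (9.8, 10.5), (10.4, 11.0), (9.9, 10.6)]  # (A, B)
    diff_c = sum(a - b for a, b in per_mouse) / len(per_mouse)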

If we only use one kit and determine the average enzyme level, then this investigation is still an experiment, but is not comparative.

To summarize, the design of an experiment determines the logical structure of the experiment; it consists of (i) a set of treatments (the two kits); (ii) a specification of the experimental units, such as animals, cell lines, or samples (the mice in Figure 1.1A,B and the samples in Figure 1.1C); (iii) a procedure for assigning treatments to units; and (iv) a specification of the response units and the quantity to be measured as a response (the samples and associated enzyme levels).

1.4 Experiment Validity

Before we embark on the more technical aspects of experimental design, we discuss three components for evaluating an experiment’s validity: construct validity, internal validity, and external validity. These criteria are well-established in areas such as educational and psychological research, and have more recently been discussed for animal research (Würbel 2017), where experiments are increasingly scrutinized for their scientific rationale and their design and intended analyses.

1.4.1 Construct Validity

Construct validity concerns the choice of the experimental system for answering our research question. Is the system even capable of providing a relevant answer to the question?

Studying the mechanisms of a particular disease, for example, might require careful choice of an appropriate animal model that shows a disease phenotype and is accessible to experimental interventions. If the animal model is a proxy for drug development for humans, biological mechanisms must be sufficiently similar between animal and human physiologies.

Another important aspect of the construct is the quantity that we intend to measure (the measurand), and its relation to the quantity or property we are interested in. For example, we might measure the concentration of the same chemical compound once in a blood sample and once in a highly purified sample, and these constitute two different measurands, whose values might not be comparable. Often, the quantity of interest (e.g., liver function) is not directly measurable (or even quantifiable) and we measure a biomarker instead. For example, pre-clinical and clinical investigations may use concentrations of proteins or counts of specific cell types from blood samples, such as the CD4+ cell count used as a biomarker for immune system function.

1.4.2 Internal Validity

The internal validity of an experiment concerns the soundness of the scientific rationale, statistical properties such as precision of estimates, and the measures taken against risk of bias. It refers to the validity of claims within the context of the experiment. Statistical design of experiments plays a prominent role in ensuring internal validity, and we briefly discuss the main ideas before providing the technical details and an application to our example in the subsequent sections.

Scientific Rationale and Research Question

The scientific rationale of a study is (usually) not immediately a statistical question. Translating a scientific question into a quantitative comparison amenable to statistical analysis is no small task and often requires careful consideration. It is a substantial, if non-statistical, benefit of using experimental design that we are forced to formulate a precise-enough research question and decide on the main analyses required for answering it before we conduct the experiment. For example, the question "Is there a difference between placebo and drug?" is insufficiently precise for planning a statistical analysis and determining an adequate experimental design. What exactly is the drug treatment? What should the drug’s concentration be and how is it administered? How do we make sure that the placebo group is comparable to the drug group in all other aspects? What do we measure, and what do we mean by "difference"? A shift in average response, a fold change, a change in response before and after treatment?

The scientific rationale also enters the choice of a potential control group to which we compare responses. The quote

The deep, fundamental question in statistical analysis is ‘Compared to what?’ (Tufte 1997)

highlights the importance of this choice.

There are almost never enough resources to answer all relevant scientific questions. We therefore define a few questions of highest interest, and the main purpose of the experiment is answering these questions in the primary analysis. This intended analysis drives the experimental design to ensure relevant estimates can be calculated and have sufficient precision, and tests are adequately powered. This does not preclude us from conducting additional secondary analyses and exploratory analyses, but we are not willing to enlarge the experiment to ensure that strong conclusions can also be drawn from these analyses.

Risk of Bias

Experimental bias is a systematic difference in response between experimental units in addition to the difference caused by the treatments. The experimental units in the different groups are then not equal in all aspects other than the treatment applied to them. We saw several examples in Section 1.2.

Minimizing the risk of bias is crucial for internal validity and we look at some common measures to eliminate or reduce different types of bias in Section 1.5.

Precision and Effect Size

Another aspect of internal validity is the precision of estimates and the expected effect sizes. Is the experimental setup, in principle, able to detect a difference of relevant magnitude? Experimental design offers several methods for answering this question based on the expected heterogeneity of samples, the measurement error, and other sources of variation: power analysis is a technique for determining the number of samples required to reliably detect a relevant effect size and provide estimates of sufficient precision. More samples yield more precision and more power, but we have to be careful that replication is done at the right level: simply measuring a biological sample multiple times as in Figure 1.1B yields more measured values, but is pseudo-replication for analyses. Replication should also ensure that the statistical uncertainties of estimates can be gauged from the data of the experiment itself, without additional untestable assumptions. Finally, the technique of blocking, shown in Figure 1.1C, can remove a substantial proportion of the variation and thereby increase power and precision if we find a way to apply it.
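As a sketch of such a power analysis in Python (using statsmodels; the effect size of 0.8 standard deviations, the significance level, and the target power are arbitrary example values, not prescriptions from the text):

    from statsmodels.stats.power import TTestIndPower

    # Mice per group needed to detect a difference of 0.8 standard deviations
    # with a two-sample t-test at significance level 0.05 and 80% power.
    n_per_group = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
    print(n_per_group)   # roughly 26 mice per group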

1.4.3 External Validity

The external validity of an experiment concerns its replicability and the generalizability of inferences. An experiment is replicable if its results can be confirmed by an independent new experiment, preferably by a different lab and researcher. Experimental conditions in the replicate experiment usually differ from the original experiment, which provides evidence that the observed effects are robust to such changes. A much weaker condition on an experiment is reproducibility, the property that an independent researcher draws equivalent conclusions based on the data from this particular experiment, using the same analysis techniques. Reproducibility requires publishing the raw data, details on the experimental protocol, and a description of the statistical analyses, preferably with accompanying source code. Many scientific journals subscribe to reporting guidelines to ensure reproducibility and these are also helpful for planning an experiment.

A main threat to replicability and generalizability is experimental conditions that are too tightly controlled, such that inferences only hold for a specific lab under the very specific conditions of the original experiment. Introducing systematic heterogeneity and using multi-center studies effectively broadens the experimental conditions and therefore the inferences for which internal validity is available.

For systematic heterogeneity, experimental conditions are systematically altered in addition to the treatments, and treatment differences are estimated for each condition. For example, we might split the experimental material into several batches and use a different day of analysis, sample preparation, batch of buffer, measurement device, and lab technician for each batch. A more general inference is then possible if effect size, effect direction, and precision are comparable between the batches, indicating that the treatment differences are stable over the different conditions.

In multi-center experiments, the same experiment is conducted in several different labs and the results compared and merged. Multi-center approaches are very common in clinical trials and often necessary to reach the required number of patient enrollments.

Generalizability of randomized controlled trials in medicine and animal studies can suffer from overly restrictive eligibility criteria. In clinical trials, patients are often included or excluded based on co-medications and co-morbidities, and the resulting sample of eligible patients might no longer be representative of the patient population. For example, Travers et al. (2007) used the eligibility criteria of 17 randomized controlled trials of asthma treatments and found that, out of 749 patients, only a median of 6% (45 patients) would be eligible for an asthma-related randomized controlled trial. This puts a question mark on the relevance of the trials’ findings for asthma patients in general.

1.5 Reducing the Risk of Bias

1.5.1 Randomization of Treatment Allocation

If systematic differences other than the treatment exist between our treatment groups, then the effect of the treatment is confounded with these other differences and our estimates of treatment effects might be biased.

We remove such unwanted systematic differences from our treatment comparisons by randomizing the allocation of treatments to experimental units. In a completely randomized design, each experimental unit has the same chance of being subjected to any of the treatments, and any differences between the experimental units other than the treatments are distributed over the treatment groups. Importantly, randomization is the only method that also protects our experiment against unknown sources of bias: we do not need to know all or even any of the potential differences and yet their impact is eliminated from the treatment comparisons by random treatment allocation.

Randomization has two effects: (i) differences unrelated to treatment become part of the ‘statistical noise’ rendering the treatment groups more similar; and (ii) the systematic differences are thereby eliminated as sources of bias from the treatment comparison.

Randomization transforms systematic variation into random variation.

In our example, a proper randomization would select 10 out of our 20 mice fully at random, such that each mouse has the same chance of being selected. These ten mice are then assigned to kit A, and the remaining mice to kit B. This allocation is entirely independent of the treatments and of any properties of the mice.

To ensure random treatment allocation, some kind of random process needs to be employed. This can be as simple as shuffling a pack of 10 red and 10 black cards or using a software-based random number generator. Randomization is slightly more difficult if the number of experimental units is not known at the start of the experiment, such as when patients are recruited for an ongoing clinical trial (sometimes called rolling recruitment), and we want to have reasonable balance between the treatment groups at each stage of the trial.
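A software-based version of the allocation for our example is only a few lines; a sketch using Python’s standard library:

    import random

    mice = list(range(1, 21))                    # IDs of the 20 mice
    kit_a = sorted(random.sample(mice, 10))      # 10 mice drawn fully at random for kit A
    kit_b = [m for m in mice if m not in kit_a]  # the remaining 10 get kit B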

Seemingly random assignments “by hand” are usually no less complicated than fully random assignments, but are always inferior. If surprising results ensue from the experiment, such assignments are subject to unanswerable criticism and suspicion of unwanted bias. Even worse are systematic allocations; they can only remove bias from known causes, and immediately raise red flags under the slightest scrutiny.

The Problem of Undesired Assignments

Even with a fully random treatment allocation procedure, we might end up with an undesirable allocation. For our example, the treatment group of kit A might—just by chance—contain mice that are all bigger or more active than those in the other treatment group. Statistical orthodoxy recommends using the design nevertheless, because only full randomization guarantees valid estimates of residual variance and unbiased estimates of effects. This argument, however, concerns the long-run properties of the procedure and seems of little help in this specific situation. Why should we care if the randomization yields correct estimates under replication of the experiment, if the particular experiment is jeopardized?

Another solution is to create a list of all possible allocations that we would accept and randomly choose one of these allocations for our experiment. The analysis should then reflect this restriction in the possible randomizations, which often renders this approach difficult to implement.

The most pragmatic method is to reject highly undesirable designs and compute a new randomization (Cox 1958). Undesirable allocations are unlikely to arise for large sample sizes, and we might accept a small bias in estimation for small sample sizes, when uncertainty in the estimated treatment effect is already high. In this approach, whenever we reject a particular outcome, we must also be willing to reject the outcome if we permute the treatment level labels. If we reject eight big and two small mice for kit A, then we must also reject two big and eight small mice. We must also be transparent and report a rejected allocation, so that critics may come to their own conclusions about potential biases and their remedies.
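A sketch of this re-randomization idea in Python. The body weights and the acceptance threshold are hypothetical; note that the criterion is symmetric in the group labels, as the text requires:

    import random

    # Hypothetical body weights (grams) for the 20 mice.
    weights = [22, 25, 21, 28, 24, 23, 27, 22, 26, 25,
               24, 23, 28, 21, 25, 26, 22, 27, 24, 23]

    def randomize():
        group_a = random.sample(range(20), 10)
        return group_a, [m for m in range(20) if m not in group_a]

    def acceptable(group_a, group_b):
        # Reject allocations whose groups differ too much in average weight;
        # abs() makes the criterion symmetric under swapping the labels.
        mean = lambda g: sum(weights[m] for m in g) / len(g)
        return abs(mean(group_a) - mean(group_b)) < 1.5   # threshold is arbitrary

    group_a, group_b = randomize()
    while not acceptable(group_a, group_b):   # re-randomize undesirable draws
        group_a, group_b = randomize()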

1.5.2 Blinding

Bias in treatment comparisons is also introduced if treatment allocation is random, but responses cannot be measured entirely objectively, or if knowledge of the assigned treatment affects the response. In clinical trials, for example, patients might react differently when they know they are on a placebo treatment, an effect known as cognitive bias. In animal experiments, caretakers might report more abnormal behavior for animals on a more severe treatment. Cognitive bias can be eliminated by concealing the treatment allocation from technicians or participants of a clinical trial, a technique called single-blinding.

If response measures are partially based on professional judgement (such as a clinical scale), the patient or physician might unconsciously report lower scores for a placebo treatment, a phenomenon known as observer bias. Its removal requires double blinding, where treatment allocations are additionally concealed from the experimentalist.

Blinding requires randomized treatment allocation to begin with and substantial effort might be needed to implement it. Drug companies, for example, have to go to great lengths to ensure that a placebo looks, tastes, and feels similar enough to the actual drug. Additionally, blinding is often done by coding the treatment conditions and samples, and effect sizes and statistical significance are calculated before the code is revealed.

In clinical trials, double-blinding creates a conflict of interest. The attending physicians do not know which patient received which treatment, and thus accumulation of side-effects cannot be linked to any treatment. For this reason, clinical trials have a data monitoring committee not involved in the final analysis, that performs intermediate analyses of efficacy and safety at predefined intervals. If severe problems are detected, the committee might recommend altering or aborting the trial. The same might happen if one treatment already shows overwhelming evidence of superiority, such that it becomes unethical to withhold this treatment from the other patients.

1.5.3 Analysis Plan and Registration

An often overlooked source of bias has been termed the researcher degrees of freedom or garden of forking paths in the data analysis. For any set of data, there are many different options for its analysis: some results might be considered outliers and discarded, assumptions are made on error distributions and appropriate test statistics, different covariates might be included into a regression model. Often, multiple hypotheses are investigated and tested, and analyses are done separately on various (overlapping) subgroups. Hypotheses formed after looking at the data require additional care in their interpretation; almost never will p-values for these ad hoc or post hoc hypotheses be statistically justifiable. Many different measured response variables invite fishing expeditions, where patterns in the data are sought without an underlying hypothesis. Only reporting those sub-analyses that gave ‘interesting’ findings invariably leads to biased conclusions and is called cherry-picking or p-hacking (or much less flattering names).

The statistical analysis is always part of a larger scientific argument and we should consider the necessary computations in relation to building our scientific argument about the interpretation of the data. In addition to the statistical calculations, this interpretation requires substantial subject-matter knowledge and includes (many) non-statistical arguments. Two quotes highlight that experiment and analysis are a means to an end and not the end in itself.

There is a boundary in data interpretation beyond which formulas and quantitative decision procedures do not go, where judgment and style enter. (Abelson 1995)

Often, perfectly reasonable people come to perfectly reasonable decisions or conclusions based on nonstatistical evidence. Statistical analysis is a tool with which we support reasoning. It is not a goal in itself. (Bailar III 1981)

There is often a grey area between exploiting researcher degrees of freedom to arrive at a desired conclusion, and creative yet informed analyses of data. One way to navigate this area is to distinguish between exploratory studies and confirmatory studies. The former have no clearly stated scientific question, but are used to generate interesting hypotheses by identifying potential associations or effects that are then further investigated. Conclusions from these studies are very tentative and must be reported honestly as such. In contrast, standards are much higher for confirmatory studies, which investigate a specific predefined scientific question. Analysis plans and pre-registration of an experiment are accepted means for demonstrating lack of bias due to researcher degrees of freedom, and separating primary from secondary analyses allows emphasizing the main goals of the study.

Analysis Plan

The analysis plan is written before conducting the experiment and details the measurands and estimands, the hypotheses to be tested together with a power and sample size calculation, a discussion of relevant effect sizes, detection and handling of outliers and missing data, as well as steps for data normalization such as transformations and baseline corrections. If a regression model is required, its factors and covariates are outlined. Particularly in biology, handling measurements below the limit of quantification and saturation effects require careful consideration.

In the context of clinical trials, the problem of estimands has become a recent focus of attention. An estimand is the target of a statistical estimation procedure, for example the true average difference in enzyme levels between the two preparation kits. A main problem in many studies is post-randomization events that can change the estimand, even if the estimation procedure remains the same. For example, if kit B fails to produce usable samples for measurement in five out of ten cases because the enzyme level was too low, while kit A could handle these enzyme levels perfectly fine, then this might severely exaggerate the observed difference between the two kits. Similar problems arise in drug trials, when some patients stop taking one of the drugs due to side-effects or other complications.

Registration

Registration of experiments is an even more severe measure used in conjunction with an analysis plan and is becoming standard in clinical trials. Here, information about the trial, including the analysis plan, the procedure to recruit patients, and the stopping criteria, is registered in a public database. Publications based on the trial then refer to this registration, such that reviewers and readers can compare what the researchers intended to do and what they actually did. Similar portals for pre-clinical and translational research are also available.

1.6 Notes and Summary

The problem of measurements and measurands is further discussed for statistics in Hand (1996) and specifically for biological experiments in Coxon, Longstaff, and Burns (2019). A general review of methods for handling missing data is Dong and Peng (2013). The different roles of randomization are emphasized in Cox (2009).

Two well-known reporting guidelines are the ARRIVE guidelines for animal research (Kilkenny et al. 2010) and the CONSORT guidelines for clinical trials (Moher et al. 2010). Guidelines describing the minimal information required for reproducing experimental results have been developed for many types of experimental techniques, including microarray (MIAME), RNA sequencing (MINSEQE), metabolomics (MSI), and proteomics (MIAPE) experiments; the FAIRSHARE initiative provides a more comprehensive collection (Sansone et al. 2019).

The problems of experimental design in animal experiments and particularly translational research are discussed in Couzin-Frankel (2013). Multi-center studies are now considered for these investigations, and using a second laboratory already increases reproducibility substantially (Richter et al. 2010; Richter 2017; Voelkl et al. 2018; Karp 2018) and allows standardizing the treatment effects (Kafkafi et al. 2017). First attempts are reported of using designs similar to clinical trials (Llovera and Liesz 2016). Exploratory-confirmatory research and external validity for animal studies are discussed in Kimmelman, Mogil, and Dirnagl (2014) and Pound and Ritskes-Hoitinga (2018). Further information on pilot studies is found in Moore et al. (2011), Sim (2019), and Thabane et al. (2010).

The deliberate use of statistical analyses and their interpretation for supporting a larger argument was called statistics as principled argument (Abelson 1995). Employing useless statistical analysis without reference to the actual scientific question is surrogate science (Gigerenzer and Marewski 2014), and adaptive thinking is integral to meaningful statistical analysis (Gigerenzer 2002).

In an experiment, the investigator has full control over the experimental conditions applied to the experimental material. The experimental design gives the logical structure of an experiment: the units describing the organization of the experimental material, the treatments and their allocation to units, and the response. Statistical design of experiments includes techniques to ensure internal validity of an experiment, and methods to make inference from experimental data efficient.


Theory of Statistical Experiments

© 1982

Mathematisches Institut, Universität Tübingen, Tübingen 1, West Germany


Part of the book series: Springer Series in Statistics (SSS)



Keywords:

  • Markov kernel
  • Mathematica
  • Random variable
  • probability
  • probability theory
  • theory of statistics

Table of contents (10 chapters)

  • Front Matter
  • Games and Statistical Decisions
  • Sufficient σ-Algebras and Statistics
  • Sufficiency Under Additional Assumptions
  • Testing Experiments
  • Testing Experiments Admitting an Isotone Likelihood Quotient
  • Estimation Experiments
  • Information and Sufficiency
  • Invariance and the Comparison of Experiments
  • Comparison of Finite Experiments
  • Comparison with Extremely Informative Experiments
  • Back Matter

Bibliographic information

Book Title: Theory of Statistical Experiments

Authors: H. Heyer

Series Title: Springer Series in Statistics

DOI: https://doi.org/10.1007/978-1-4613-8218-8

Publisher: Springer New York, NY

eBook Packages: Springer Book Archive

Copyright Information: Springer-Verlag New York Inc. 1982

Softcover ISBN: 978-1-4613-8220-1 (published 12 October 2011)

eBook ISBN: 978-1-4613-8218-8 (published 06 December 2012)

Series ISSN: 0172-7397

Series E-ISSN: 2197-568X

Edition Number: 1

Number of Pages: X, 289

Additional Information: Original German edition published in the series: Hochschultext

Topics: Applications of Mathematics



Experimental design


Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments. The methods of experimental design are widely used in the fields of agriculture, medicine, biology, marketing research, and industrial production.


In an experimental study, variables of interest are identified. One or more of these variables, referred to as the factors of the study, are controlled so that data may be obtained about how the factors influence another variable referred to as the response variable, or simply the response. As a case in point, consider an experiment designed to determine the effect of three different exercise programs on the cholesterol level of patients with elevated cholesterol. Each patient is referred to as an experimental unit, the response variable is the cholesterol level of the patient at the completion of the program, and the exercise program is the factor whose effect on cholesterol level is being investigated. Each of the three exercise programs is referred to as a treatment.

Three of the more widely used experimental designs are the completely randomized design, the randomized block design, and the factorial design. In a completely randomized experimental design, the treatments are randomly assigned to the experimental units. For instance, applying this design method to the cholesterol-level study, the three types of exercise program (treatment) would be randomly assigned to the experimental units (patients).

The use of a completely randomized design will yield less precise results when factors not accounted for by the experimenter affect the response variable. Consider, for example, an experiment designed to study the effect of two different gasoline additives on the fuel efficiency, measured in miles per gallon (mpg), of full-size automobiles produced by three manufacturers. Suppose that 30 automobiles, 10 from each manufacturer, were available for the experiment. In a completely randomized design the two gasoline additives (treatments) would be randomly assigned to the 30 automobiles, with each additive being assigned to 15 different cars. Suppose that manufacturer 1 has developed an engine that gives its full-size cars a higher fuel efficiency than those produced by manufacturers 2 and 3. A completely randomized design could, by chance, assign gasoline additive 1 to a larger proportion of cars from manufacturer 1. In such a case, gasoline additive 1 might be judged to be more fuel efficient when in fact the difference observed is actually due to the better engine design of automobiles produced by manufacturer 1. To prevent this from occurring, a statistician could design an experiment in which both gasoline additives are tested using five cars produced by each manufacturer; in this way, any effects due to the manufacturer would not affect the test for significant differences due to gasoline additive. In this revised experiment, each of the manufacturers is referred to as a block, and the experiment is called a randomized block design. In general, blocking is used in order to enable comparisons among the treatments to be made within blocks of homogeneous experimental units.
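A sketch of the blocked allocation just described, in Python; the car IDs are hypothetical, and the only constraint implemented is that each manufacturer’s ten cars split five-and-five between the additives:

    import random

    blocks = {                          # hypothetical car IDs, 10 per manufacturer
        "manufacturer_1": list(range(0, 10)),
        "manufacturer_2": list(range(10, 20)),
        "manufacturer_3": list(range(20, 30)),
    }

    assignment = {}
    for name, cars in blocks.items():
        additive_1 = set(random.sample(cars, 5))   # 5 cars per block get additive 1
        for car in cars:
            assignment[car] = 1 if car in additive_1 else 2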

Factorial experiments are designed to draw conclusions about more than one factor, or variable. The term factorial is used to indicate that all possible combinations of the factors are considered. For instance, if there are two factors with a levels for factor 1 and b levels for factor 2, the experiment will involve collecting data on a × b treatment combinations. The factorial design can be extended to experiments involving more than two factors and experiments involving partial factorial designs.

A computational procedure frequently used to analyze the data from an experimental study employs a statistical procedure known as the analysis of variance. For a single-factor experiment, this procedure uses a hypothesis test concerning equality of treatment means to determine if the factor has a statistically significant effect on the response variable. For experimental designs involving multiple factors, a test for the significance of each individual factor as well as interaction effects caused by one or more factors acting jointly can be made. Further discussion of the analysis of variance procedure is contained in the subsequent section.
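For a single-factor experiment such as the cholesterol study, the analysis-of-variance F-test for equality of treatment means is a one-liner in SciPy; the cholesterol readings below are invented for illustration:

    from scipy.stats import f_oneway

    # Hypothetical end-of-program cholesterol levels for the three exercise programs.
    program_1 = [195, 210, 188, 202, 199]
    program_2 = [180, 175, 192, 185, 178]
    program_3 = [205, 215, 198, 210, 208]

    f_stat, p_value = f_oneway(program_1, program_2, program_3)
    print(f_stat, p_value)   # a small p-value suggests the treatment means differ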

Regression and correlation analysis

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β₀ + β₁x + ε. β₀ and β₁ are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x. If the error term were not present, the model would be deterministic; in that case, knowledge of the value of x would be sufficient to determine the value of y.

In multiple regression analysis, the model for simple linear regression is extended to account for the relationship between the dependent variable y and p independent variables x₁, x₂, . . ., xₚ. The general form of the multiple regression model is y = β₀ + β₁x₁ + β₂x₂ + . . . + βₚxₚ + ε. The parameters of the model are β₀, β₁, . . ., βₚ, and ε is the error term.

Either a simple or multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters. For simple linear regression, the least squares estimates of the model parameters β₀ and β₁ are denoted b₀ and b₁. Using these estimates, an estimated regression equation is constructed: ŷ = b₀ + b₁x. The graph of the estimated regression equation for simple linear regression is a straight-line approximation to the relationship between y and x.


As an illustration of regression analysis and the least squares method, suppose a university medical centre is investigating the relationship between stress and blood pressure. Assume that both a stress test score and a blood pressure reading have been recorded for a sample of 20 patients. The data are shown graphically in Figure 4, called a scatter diagram. Values of the independent variable, stress test score, are given on the horizontal axis, and values of the dependent variable, blood pressure, are shown on the vertical axis. The line passing through the data points is the graph of the estimated regression equation: ŷ = 42.3 + 0.49x. The parameter estimates, b₀ = 42.3 and b₁ = 0.49, were obtained using the least squares method.

A primary use of the estimated regression equation is to predict the value of the dependent variable when values for the independent variables are given. For instance, given a patient with a stress test score of 60, the predicted blood pressure is 42.3 + 0.49(60) = 71.7. The values predicted by the estimated regression equation are the points on the line in Figure 4, and the actual blood pressure readings are represented by the points scattered about the line. The difference between the observed value of y and the value of y predicted by the estimated regression equation is called a residual. The least squares method chooses the parameter estimates such that the sum of the squared residuals is minimized.
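Because the raw data behind Figure 4 are not reproduced here, the sketch below uses hypothetical stress/blood-pressure pairs; it shows the mechanics of obtaining b₀ and b₁ by least squares with numpy and predicting at a stress score of 60:

    import numpy as np

    # Hypothetical (stress score, blood pressure) pairs standing in for Figure 4.
    x = np.array([30, 40, 45, 55, 60, 70, 80, 90])
    y = np.array([58, 61, 65, 68, 72, 76, 81, 86])

    b1, b0 = np.polyfit(x, y, deg=1)   # least squares slope and intercept
    y_hat = b0 + b1 * x                # fitted values on the regression line
    residuals = y - y_hat              # observed minus predicted

    print(b0 + b1 * 60)                # predicted blood pressure at score 60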

A commonly used measure of the goodness of fit provided by the estimated regression equation is the coefficient of determination. Computation of this coefficient is based on the analysis of variance procedure that partitions the total variation in the dependent variable, denoted SST, into two parts: the part explained by the estimated regression equation, denoted SSR, and the part that remains unexplained, denoted SSE.

The measure of total variation, SST, is the sum of the squared deviations of the dependent variable about its mean: Σ(y − ȳ)². This quantity is known as the total sum of squares. The measure of unexplained variation, SSE, is referred to as the residual sum of squares. For the data in Figure 4, SSE is the sum of the squared distances from each point in the scatter diagram to the estimated regression line: Σ(y − ŷ)². SSE is also commonly referred to as the error sum of squares. A key result in the analysis of variance is that SSR + SSE = SST.

The ratio r² = SSR/SST is called the coefficient of determination. If the data points are clustered closely about the estimated regression line, the value of SSE will be small and SSR/SST will be close to 1. Using r², whose values lie between 0 and 1, provides a measure of goodness of fit; values closer to 1 imply a better fit. A value of r² = 0 implies that there is no linear relationship between the dependent and independent variables.

When expressed as a percentage, the coefficient of determination can be interpreted as the percentage of the total sum of squares that can be explained using the estimated regression equation. For the stress-level research study, the value of r² is 0.583; thus, 58.3% of the total sum of squares can be explained by the estimated regression equation ŷ = 42.3 + 0.49x. For typical data found in the social sciences, values of r² as low as 0.25 are often considered useful. For data in the physical sciences, r² values of 0.60 or greater are frequently found.
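Continuing the numpy sketch above (reusing its y and y_hat arrays), the sums of squares and r² follow directly from their definitions:

    sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
    sse = np.sum((y - y_hat) ** 2)      # residual (error) sum of squares
    ssr = sst - sse                     # part explained by the regression
    r_squared = ssr / sst               # coefficient of determination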

In a regression study, hypothesis tests are usually conducted to assess the statistical significance of the overall relationship represented by the regression model and to test for the statistical significance of the individual parameters. The statistical tests used are based on the following assumptions concerning the error term: (1) ε is a random variable with an expected value of 0, (2) the variance of ε is the same for all values of x, (3) the values of ε are independent, and (4) ε is a normally distributed random variable.

The mean square due to regression, denoted MSR, is computed by dividing SSR by a number referred to as its degrees of freedom; in a similar manner, the mean square due to error, MSE, is computed by dividing SSE by its degrees of freedom. An F-test based on the ratio MSR/MSE can be used to test the statistical significance of the overall relationship between the dependent variable and the set of independent variables. In general, large values of F = MSR/MSE support the conclusion that the overall relationship is statistically significant. If the overall model is deemed statistically significant, statisticians will usually conduct hypothesis tests on the individual parameters to determine if each independent variable makes a significant contribution to the model.
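The F statistic is equally direct to compute; this continues the same sketch (one independent variable, so MSR has 1 degree of freedom and MSE has n − 2):

    from scipy.stats import f

    n, p = len(y), 1                      # sample size, number of predictors
    msr = ssr / p                         # mean square due to regression
    mse = sse / (n - p - 1)               # mean square due to error
    f_stat = msr / mse
    p_value = f.sf(f_stat, p, n - p - 1)  # upper-tail probability of the F value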




1.4 Designed Experiments

Observational studies vs. experiments

Ignoring anecdotal evidence, there are two primary types of data collection: observational studies and controlled (designed) experiments. Remember, we typically cannot make claims of causality from observational studies because of the potential presence of confounding factors. However, making causal conclusions based on experiments is often reasonable, because those factors can be controlled for. Consider the following example:

Suppose you want to investigate the effectiveness of vitamin D in preventing disease. You recruit a group of subjects and ask them if they regularly take vitamin D. You notice that the subjects who take vitamin D exhibit better health on average than those who do not. Does this prove that vitamin D is effective in disease prevention? It does not. There are many differences between the two groups compared in addition to vitamin D consumption. People who take vitamin D regularly often take other steps to improve their health: exercise, diet, other vitamin supplements, choosing not to smoke. Any one of these factors could be influencing health. As described, this study does not necessarily prove that vitamin D is the key to disease prevention.

Experiments ultimately provide evidence to make decisions, so how could we narrow our focus and make claims of causality? In this section, you will learn important aspects of experimental design.

Designed Experiments

The purpose of an experiment is to investigate the relationship between two variables. When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In a randomized experiment, the researcher manipulates values of the explanatory variable and measures the resulting changes in the response variable. The different values of the explanatory variable may be called treatments. An experimental unit is a single object or individual to be measured.

The main principles we want to follow in experimental design are:

  • Randomization
  • Replication

In order to provide evidence that the explanatory variable is indeed causing the changes in the response variable, it is necessary to isolate the explanatory variable. The researcher must design their experiment in such a way that there is only one difference between groups being compared: the planned treatments. This is accomplished by randomization of experimental units to treatment groups. When subjects are assigned treatments randomly, all of the potential lurking variables are spread equally among the groups. At this point the only difference between groups is the one imposed by the researcher. Different outcomes measured in the response variable, therefore, must be a direct result of the different treatments. In this way, an experiment can show an apparent cause-and-effect connection between the explanatory and response variables.

Recall our previous example of investigating the effectiveness of vitamin D in preventing disease. Individuals in our trial could be randomly assigned, perhaps by flipping a coin, into one of two groups: a control group that receives no treatment and a treatment group that receives extra doses of vitamin D.
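For illustration, here is a minimal Python sketch of that coin-flip style assignment; the subject list is hypothetical.

```python
# Minimal sketch: randomly assign 20 hypothetical recruits to two groups.
import random

subjects = [f"subject_{i}" for i in range(1, 21)]
random.shuffle(subjects)               # the shuffle stands in for the coin flip

control = subjects[:10]                # no treatment
treatment = subjects[10:]              # extra doses of vitamin D
print(control)
print(treatment)
```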

The more cases researchers observe, the more accurately they can estimate the effect of the explanatory variable on the response. In a single study, we replicate by collecting a sufficiently large sample. Additionally, a group of scientists may replicate an entire study to verify an earlier finding. Having individuals experience a treatment more than once, called repeated measures, is often helpful as well.

The power of suggestion can have an important influence on the outcome of an experiment. Studies have shown that the expectation of the study participant can be as important as the actual medication. In one study of performance-enhancing drugs, researchers noted:

Results showed that believing one had taken the substance resulted in [ performance ] times almost as fast as those associated with consuming the drug itself. In contrast, taking the drug without knowledge yielded no significant performance increment. [1]

It is often difficult to isolate the effects of the explanatory variable. To counter the power of suggestion, researchers set aside one treatment group as a control group. This group is given a placebo treatment, a treatment that cannot influence the response variable. The control group helps researchers balance the effects of being in an experiment with the effects of the active treatments. Of course, if you are participating in a study and you know that you are receiving a pill which contains no actual medication, then the power of suggestion is no longer a factor. Blinding in a randomized experiment preserves the power of suggestion. When a person involved in a research study is blinded, he does not know who is receiving the active treatment(s) and who is receiving the placebo treatment. A double-blind experiment is one in which both the subjects and the researchers involved with the subjects are blinded.

Randomized experiments are an essential tool in research. The US Food and Drug Administration typically requires that a new drug can only be marketed after two independently conducted randomized trials confirm its safety and efficacy; the European Medicines Agency has a similar policy. Large randomized experiments in medicine have provided the basis for major public health initiatives. In 1954 approximately 750,000 children participated in a randomized study comparing the polio vaccine with a placebo. In the United States, the results of the study quickly led to the widespread and successful use of the vaccine for polio prevention.

How does sleep deprivation affect your ability to drive? A recent study measured the effects on 19 professional drivers. Each driver participated in two experimental sessions: one after normal sleep and one after 27 hours of total sleep deprivation. The treatments were assigned in random order. In each session, performance was measured on a variety of tasks including a driving simulation.

The Smell & Taste Treatment and Research Foundation conducted a study to investigate whether smell can affect learning. Subjects completed mazes multiple times while wearing masks. They completed the pencil and paper mazes three times wearing floral-scented masks, and three times with unscented masks. Participants were assigned at random to wear the floral mask during the first three trials or during the last three trials. For each trial, researchers recorded the time it took to complete the maze and the subject’s impression of the mask’s scent: positive, negative, or neutral.

More Experimental Design

There are many different experimental designs, from the most basic, a single treatment and control group, to some very complicated designs. In an experimental design setting, when working with more than one explanatory variable or treatment, the variables are often called factors, especially if they are categorical. The values of a factor are often called levels. When there are multiple factors, the combinations of each of the levels are called treatment combinations, or interactions. Some basic designs you may see are:

  • Completely randomized
  • Block design
  • Matched pairs design

Completely Randomized

While very important and an essential research tool, not much explanation is needed for this design.  It involves figuring out how many treatments will be administered and randomly assigning participants to their respective groups.

Block Design 

Researchers sometimes know or suspect that variables, other than the treatment, influence the response. Under these circumstances, they may first group individuals based on this variable into blocks and then randomize cases within each block to the treatment groups. This strategy is often referred to as blocking. For instance, if we are looking at the effect of a drug on heart attacks, we might first split patients in the study into low-risk and high-risk blocks, then randomly assign half the patients from each block to the control group and the other half to the treatment group, as shown in the figure below. This strategy ensures each treatment group has an equal number of low-risk and high-risk patients.

[Figure 1.5: Block design. Fifty-four numbered patients are first grouped into a low-risk block and a high-risk block; each block is then randomly split in half between the control and treatment groups.]
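As a sketch of the strategy in the figure, the following Python snippet blocks made-up patients by risk level and then randomizes within each block; the patient data are invented.

```python
# Minimal sketch: block on risk level, then randomize within each block.
import random

patients = [(f"p{i:02d}", random.choice(["low", "high"])) for i in range(1, 55)]

for risk in ("low", "high"):
    block = [pid for pid, r in patients if r == risk]   # create the block
    random.shuffle(block)                               # randomize within it
    half = len(block) // 2
    control, treatment = block[:half], block[half:]
    print(risk, "control:", len(control), "treatment:", len(treatment))
```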

Matched Pairs

A matched pairs design is one where we have very similar individuals (or even the same individual) receiving two different treatments (or treatment vs. control) and then comparing their results. This design is very powerful; however, it can be hard to find many alike individuals to match up. Some common ways of creating a matched pairs design are twin studies, before-and-after measurements, pre- and post-test situations, or crossover studies. Consider the following example:

In the 2000 Olympics, was the use of a new wetsuit design responsible for an observed increase in swim velocities? In a matched pairs study designed to investigate this question, twelve competitive swimmers swam 1500 meters at maximal speed, once wearing a wetsuit and once wearing a regular swimsuit. The order of wetsuit versus swimsuit was randomized for each of the 12 swimmers. Figure 1.6 shows the average velocity recorded for each swimmer, measured in meters per second (m/s).

Figure 1.6: Average Velocity of Swimmers
swimmer.number   wet.suit.velocity   swim.suit.velocity   velocity.diff
 1               1.57                1.49                 0.08
 2               1.47                1.37                 0.10
 3               1.42                1.35                 0.07
 4               1.35                1.27                 0.08
 5               1.22                1.12                 0.10
 6               1.75                1.64                 0.11
 7               1.64                1.59                 0.05
 8               1.57                1.52                 0.05
 9               1.56                1.50                 0.06
10               1.53                1.45                 0.08
11               1.49                1.44                 0.05
12               1.51                1.41                 0.10

Notice in this data, two sets of observations are uniquely paired so that an observation in one set matches an observation in the other; in this case, each swimmer has two measured velocities, one with a wetsuit and one with a swimsuit. A natural measure of the effect of the wetsuit on swim velocity is the difference between the measured maximum velocities (velocity.diff = wet.suit.velocity − swim.suit.velocity). Even though there are two measurements per individual, using the difference in observations as the variable of interest reduces the problem to a single-sample analysis.
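To show what that paired analysis looks like in practice, here is a minimal Python sketch using the Figure 1.6 velocities; scipy's paired t-test (ttest_rel) is one reasonable choice for testing whether the mean difference is zero.

```python
# Minimal sketch: paired analysis of the wetsuit data from Figure 1.6.
from scipy import stats

wet  = [1.57, 1.47, 1.42, 1.35, 1.22, 1.75, 1.64, 1.57, 1.56, 1.53, 1.49, 1.51]
swim = [1.49, 1.37, 1.35, 1.27, 1.12, 1.64, 1.59, 1.52, 1.50, 1.45, 1.44, 1.41]

diffs = [w - s for w, s in zip(wet, swim)]
print(sum(diffs) / len(diffs))           # mean difference, about 0.078 m/s

t_stat, p_value = stats.ttest_rel(wet, swim)
print(t_stat, p_value)                   # a very small p: wetsuits look faster
```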

A new windshield treatment claims to repel water more effectively. Ten windshields are tested by simulating rain without the new treatment. The same windshields are then treated, and the experiment is run again.  What experiment design is being implemented here?

A new medicine is said to help improve sleep. Eight subjects are picked at random and given the medicine. The mean hours slept for each person were recorded before starting the medication and after. What experiment design is being implemented here?

Image References

Figure 1.5: Kindred Grey (2020). “Block Design.” CC BY-SA 4.0. Retrieved from https://commons.wikimedia.org/wiki/File:Block_Design.png

  • McClung, M. Collins, D. “Because I know it will!”: placebo effects of an ergogenic aid on athletic performance. Journal of Sport & Exercise Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 2013. ↵

Glossary

  • Observational study: Data collection where no variables are manipulated.
  • Experiment: Type of study where variables are manipulated; data are collected in a controlled setting.
  • Explanatory variable: The independent variable in an experiment; the value controlled by researchers.
  • Response variable: The dependent variable in an experiment; the value that is measured for change at the end of an experiment.
  • Treatments: Different values or components of the explanatory variable applied in an experiment.
  • Experimental unit: Any individual or object to be measured.
  • Repeated measures: When an individual goes through a single treatment more than once.
  • Control group: A group in a randomized experiment that receives no (or an inactive) treatment but is otherwise managed exactly as the other groups.
  • Placebo: An inactive treatment that has no real effect on the response variable.
  • Blinding: Not telling participants which treatment they are receiving.
  • Double-blind experiment: Blinding both the subjects of an experiment and the researchers who work with the subjects.
  • Factors: Variables in an experiment.
  • Levels: Certain values of variables in an experiment.
  • Treatment combinations: Combinations of levels of variables in an experiment.
  • Randomization: Dividing participants into treatment groups randomly.
  • Blocking: Grouping individuals based on a variable into "blocks" and then randomizing cases within each block to the treatment groups.
  • Matched pairs design: Very similar individuals (or even the same individual) receive two different treatments (or treatment vs. control), then the differences in results are compared.

Significant Statistics Copyright © 2020 by John Morgan Russell, OpenStaxCollege, OpenIntro is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.


What is an Observational Study: Definition & Examples

By Jim Frost

What is an Observational Study?

An observational study uses sample data to find correlations in situations where the researchers do not control the treatment, or independent variable, that relates to the primary research question. The definition of an observational study hinges on the notion that the researchers only observe subjects and do not assign them to the control and treatment groups. That's the key difference between an observational study and an experiment. These studies are also known as quasi-experiments and correlational studies.

True experiments assign subjects to the experimental groups, where the researchers can manipulate the conditions. Unfortunately, random assignment is not always possible. For these cases, you can conduct an observational study.

In this post, learn about the types of observational studies, why they are susceptible to confounding variables, and how they compare to experiments. I’ll close this post by reviewing a published observational study about vitamin supplement usage.

Observational Study Definition

In an observational study, the researchers only observe the subjects and do not interfere or try to influence the outcomes. In other words, the researchers do not control the treatments or assign subjects to experimental groups. Instead, they observe and measure variables of interest and look for relationships between them. Usually, researchers conduct observational studies when it is difficult, impossible, or unethical to assign study participants to the experimental groups randomly. If you can’t randomly assign subjects to the treatment and control groups, then you observe the subjects in their self-selected states.

Observational Study vs Experiment

Randomized experiments provide stronger evidence about cause and effect than observational studies. Consequently, you should always use a randomized experiment whenever possible. However, if randomization is not possible, science should not come to a halt. After all, we still want to learn things, discover relationships, and make discoveries. For these cases, observational studies are a good alternative to a true experiment. Let's compare the differences between an observational study vs. an experiment.

Random assignment in an experiment reduces systematic differences between experimental groups at the beginning of the study, which increases your confidence that the treatments caused any differences between groups you observe at the end of the study. In contrast, an observational study uses self-formed groups that can have pre-existing differences, which introduces the problem of confounding variables. More on that later!

In a randomized experiment, randomization tends to equalize confounders between groups and, thereby, prevents problems. In my post about random assignment , I describe that process as an elegant solution for confounding variables. You don’t need to measure or even know which variables are confounders, and randomization will still mitigate their effects. Additionally, you can use control variables in an experiment to keep the conditions as consistent as possible. For more detail about the differences, read Observational Study vs. Experiment .

How they compare:

  • Observational study: does not assign subjects to groups. Experiment: randomly assigns subjects to control and treatment groups.
  • Observational study: does not control variables that can affect the outcome. Experiment: administers treatments and controls the influence of other variables.
  • Observational study: correlational findings, where differences might be due to confounders rather than the treatment. Experiment: more confidence that treatments cause the differences in outcomes.

If you’re looking for a middle ground choice between observational studies vs experiments, consider using a quasi-experimental design. These methods don’t require you to randomly assign participants to the experimental groups and still allow you to draw better causal conclusions about an intervention than an observational study. Learn more about Quasi-Experimental Design Overview & Examples .

Related posts : Experimental Design: Definition and Examples , Randomized Controlled Trials (RCTs) , and Control Groups in Experiments

Observational Study Examples


Consider using an observational study when random assignment for an experiment is problematic. This approach allows us to proceed and draw conclusions about effects even though we can’t control the independent variables. The following observational study examples will help you understand when and why to use them.

For example, if you’re studying how depression affects performance of an activity, it’s impossible to assign subjects to the depression and control group randomly. However, you can have subjects with and without depression perform the activity and compare the results in an observational study.

Or imagine trying to assign subjects to cigarette smoking and non-smoking groups randomly?! However, you can observe people in both groups and assess the differences in health outcomes in an observational study.

Suppose you’re studying a treatment for a disease. Ideally, you recruit a group of patients who all have the disease, and then randomly assign them to the treatment and control group. However, it’s unethical to withhold the treatment, which rules out a control group. Instead, you can compare patients who voluntarily do not use the medicine to those who do use it.

In all these observational study examples, the researchers do not assign subjects to the experimental groups. Instead, they observe people who are already in these groups and compare the outcomes. Hence, the scientists must use an observational study vs. an experiment.

Types of Observational Studies

The observational study definition states that researchers only observe the outcomes and do not manipulate or control factors. Despite this limitation, there are various types of observational studies.

The following experimental designs are three standard types of observational studies.

  • Cohort Study: A longitudinal observational study that follows a group who share a defining characteristic. These studies frequently determine whether exposure to a risk factor affects an outcome over time.
  • Case-Control Study: A retrospective observational study that compares two existing groups, the case group with the condition and the control group without it. Researchers compare the groups looking for potential risk factors for the condition.
  • Cross-Sectional Study: Takes a snapshot of a moment in time so researchers can understand the prevalence of outcomes and correlations between variables at that instant.

Qualitative research studies are usually observational in nature, but they collect non-numeric data and do not perform statistical analyses.

Retrospective studies must be observational.

Later in this post, we’ll closely examine a quantitative observational study example that assesses vitamin supplement consumption and how that affects the risk of death. It’s possible to use random assignment to place each subject in either the vitamin treatment group or the control group. However, the study assesses vitamin consumption in 40,000 participants over the course of two decades. It’s unrealistic to enforce the treatment and control protocols over such a long time for so many people!

Drawbacks of Observational Studies

While observational studies get around the inability to assign subjects randomly, this approach opens the door to the problem of confounding variables. A confounding variable, or confounder, correlates with both the experimental groups and the outcome variable. Because there is no random process that equalizes the experimental groups in an observational study, confounding variables can systematically differ between groups when the study begins. Consequently, confounders can be the actual cause for differences in outcome at the end of the study rather than the primary variable of interest. If an experiment does not account for confounding variables, confounders can bias the results and create spurious correlations .

Performing an observational study can decrease the internal validity of your study but increase the external validity. Learn more about internal and external validity .

Let’s see how this works. Imagine an observational study that compares people who take vitamin supplements to those who do not. People who use vitamin supplements voluntarily will tend to have other healthy habits that exist at the beginning of the study. These healthy habits are confounding variables. If there are differences in health outcomes at the end of the study, it’s possible that these healthy habits actually caused them rather than the vitamin consumption itself. In short, confounders confuse the results because they provide alternative explanations for the differences.

Despite the limitations, an observational study can be a valid approach. However, you must ensure that your research accounts for confounding variables. Fortunately, there are several methods for doing just that!

Learn more about Correlation vs. Causation: Understanding the Differences .

Accounting for Confounding Variables in an Observational Study

Because observational studies don’t use random assignment, confounders can be distributed disproportionately between conditions. Consequently, experimenters need to know which variables are confounders, measure them, and then use a method to account for them. It involves more work, and the additional measurements can increase the costs. And there’s always a chance that researchers will fail to identify a confounder, not account for it, and produce biased results. However, if randomization isn’t an option, then you probably need to consider an observational study.

Trait matching and statistically controlling confounders using multivariate procedures are two standard approaches for incorporating confounding variables.

Related post : Causation versus Correlation in Statistics

Matching in Observational Studies


Matching is a technique that involves selecting study participants with similar characteristics outside the variable of interest or treatment. Rather than using random assignment to equalize the experimental groups, the experimenters do it by matching observable characteristics. For every participant in the treatment group, the researchers find a participant with comparable traits to include in the control group. Matching subjects facilitates valid comparisons between those groups. The researchers use subject-area knowledge to identify characteristics that are critical to match.

For example, a vitamin supplement study using matching will select subjects who have similar health-related habits and attributes. The goal is that vitamin consumption will be the primary difference between the groups, which helps you attribute differences in health outcomes to vitamin consumption. However, the researchers are still observing participants who decide whether they consume supplements.

Matching has some drawbacks. The experimenters might not be aware of all the relevant characteristics they need to match. In other words, the groups might be different in an essential aspect that the researchers don’t recognize. For example, in the hypothetical vitamin study, there might be a healthy habit or attribute that affects the outcome that the researchers don’t measure and match. These unmatched characteristics might cause the observed differences in outcomes rather than vitamin consumption.
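As a toy illustration of the idea, the sketch below greedily pairs each treated subject with the untreated subject whose (made-up) health score is closest; real studies match on many characteristics and use more careful algorithms.

```python
# Minimal sketch: greedy nearest-neighbor matching on one invented covariate.
treated   = {"t1": 62, "t2": 70, "t3": 55}              # subject: health score
untreated = {"u1": 71, "u2": 54, "u3": 63, "u4": 80}

matches = {}
available = dict(untreated)
for name, score in treated.items():
    best = min(available, key=lambda u: abs(available[u] - score))
    matches[name] = best         # closest available control
    del available[best]          # match without replacement

print(matches)                   # {'t1': 'u3', 't2': 'u1', 't3': 'u2'}
```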

Learn more about Matched Pairs Design: Uses & Examples .

Using Multiple Regression in Observational Studies

Random assignment and matching use different methods to equalize the experimental groups in an observational study. However, statistical techniques, such as multiple regression analysis , don’t try to equalize the groups but instead use a model that accounts for confounding variables. These studies statistically control for confounding variables.

In multiple regression analysis, including a variable in the model holds it constant while you vary the variable/treatment of interest. For information about this property, read my post When Should I Use Regression Analysis?

As with matching, the challenge is to identify, measure, and include all confounders in the regression model. Failure to include a confounding variable in a regression model can cause omitted variable bias to distort your results.
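To make the mechanics concrete, here is a hedged Python sketch (not the published study's code) that simulates one confounder and compares a naive regression with one that statistically controls for it, using the statsmodels formula API; every variable name and number is invented.

```python
# Minimal sketch: statistically controlling for a confounder with regression.
# All data are simulated; the true treatment effect is zero.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
exercise = rng.normal(size=n)                        # confounder
vitamins = (exercise + rng.normal(size=n) > 0) * 1   # users tend to exercise more
health = 2 * exercise + rng.normal(size=n)           # vitamins have no real effect

df = pd.DataFrame({"health": health, "vitamins": vitamins, "exercise": exercise})

naive = smf.ols("health ~ vitamins", data=df).fit()
adjusted = smf.ols("health ~ vitamins + exercise", data=df).fit()

# The naive coefficient looks large; holding exercise constant shrinks it
# toward the true effect of zero.
print(naive.params["vitamins"], adjusted.params["vitamins"])
```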

Next, we’ll look at a published observational study that uses multiple regression to account for confounding variables.

Related post : Independent and Dependent Variables in a Regression Model

Vitamin Supplement Observational Study Example


Mursu et al. (2011)* use a longitudinal observational study that ran 22 years to assess differences in death rates for subjects who used vitamin supplements regularly compared to those who did not use them. This study used surveys to record the characteristics of approximately 40,000 participants. The surveys asked questions about potential confounding variables such as demographic information, food intake, health details, physical activity, and, of course, supplement intake.

Because this is an observational study, the subjects decided for themselves whether they were taking vitamin supplements. Consequently, it’s safe to assume that supplement users and non-users might be different in other ways. From their article, the researchers found the following pre-existing differences between the two groups:

Supplement users had a lower prevalence of diabetes mellitus, high blood pressure, and smoking status; a lower BMI and waist to hip ratio, and were less likely to live on a farm. Supplement users had a higher educational level, were more physically active and were more likely to use estrogen replacement therapy. Also, supplement users were more likely to have a lower intake of energy, total fat, and monounsaturated fatty acids, saturated fatty acids and to have a higher intake of protein, carbohydrates, polyunsaturated fatty acids, alcohol, whole grain products, fruits, and vegetables.

Whew! That’s a long list of differences! Supplement users were different from non-users in a multitude of ways that are likely to affect their risk of dying. The researchers must account for these confounding variables when they compare supplement users to non-users. If they do not, their results can be biased.

This example illustrates a key difference between an observational study vs experiment. In a randomized experiment, the randomization would have equalized the characteristics of those the researchers assigned to the treatment and control groups. Instead, the study works with self-sorted groups that have numerous pre-existing differences!

Using Multiple Regression to Statistically Control for Confounders

To account for these initial differences in the vitamin supplement observational study, the researchers use regression analysis and include the confounding variables in the model.

The researchers present three regression models. The simplest model accounts only for age and caloric intake. Next, are two models that include additional confounding variables beyond age and calories. The first model adds various demographic information and seven health measures. The second model includes everything in the previous model and adds several more specific dietary intake measures. Using statistical significance as a guide for specifying the correct regression model , the researchers present the model with the most variables as the basis for their final results.

It’s instructive to compare the raw results and the final regression results.

Raw results

The raw differences in death risks for consumers of folic acid, vitamin B6, magnesium, zinc, copper, and multivitamins are NOT statistically significant. However, the raw results show a significant reduction in the death risk for users of B complex, C, calcium, D, and E.

However, those are the raw results for the observational study, and they do not control for the long list of differences between the groups that exist at the beginning of the study. After using the regression model to control for the confounding variables statistically, the results change dramatically.

Adjusted results

Of the 15 supplements that the study tracked in the observational study, researchers found consuming seven of these supplements were linked to a statistically significant INCREASE in death risk ( p-value < 0.05): multivitamins (increase in death risk 2.4%), vitamin B6 (4.1%), iron (3.9%), folic acid (5.9%), zinc (3.0%), magnesium (3.6%), and copper (18.0%). Only calcium was associated with a statistically significant reduction in death risk of 3.8%.

In short, the raw results suggest that those who consume supplements either have the same or lower death risks than non-consumers. However, these results do not account for the multitude of healthier habits and attributes in the group that uses supplements.

In fact, these confounders seem to produce most of the apparent benefits in the raw results because, after you statistically control the effects of these confounding variables, the results worsen for those who consume vitamin supplements. The adjusted results indicate that most vitamin supplements actually increase your death risk!

This research illustrates the differences between an observational study vs experiment. Namely how the pre-existing differences between the groups allow confounders to bias the raw results, making the vitamin consumption outcomes look better than they really are.

In conclusion, if you can’t randomly assign subjects to the experimental groups, an observational study might be right for you. However, be aware that you’ll need to identify, measure, and account for confounding variables in your experimental design.

Jaakko Mursu, PhD; Kim Robien, PhD; Lisa J. Harnack, DrPH, MPH; Kyong Park, PhD; David R. Jacobs Jr, PhD; Dietary Supplements and Mortality Rate in Older Women: The Iowa Women’s Health Study ; Arch Intern Med . 2011;171(18):1625-1633.

Comments

December 30, 2023 at 5:05 am

I see, but our professor required us to indicate what year it was put into the article. May you tell me what year was this published originally? <3


December 29, 2023 at 10:46 am

Hi, may I use your article as a citation for my thesis paper? If so, may I know the exact date you published this article? Thank you!

December 29, 2023 at 2:13 pm

Definitely feel free to cite this article! 🙂

When citing online resources, you typically use an “Accessed” date rather than a publication date because online content can change over time. For more information, read Purdue University’s Citing Electronic Resources .


November 18, 2021 at 10:09 pm

Love your content and has been very helpful!

Can you please advise the question below using an observational data set:

I have three years of observational GPS data collected on athletes (2019/2020/2021). Approximately 14-15 athletes per game and 8 games per year. The GPS software outputs 50+ variables for each athlete in each game, which we have narrowed down to 16 variables of interest from previous research.

2 factors 1) Period (first half, second half, and whole game), 2) Position (two groups with three subgroups in each – forwards (group 1, group 2, group 3) and backs (group 1, group 2, group 3))

16 variables of interest – all numerical and scale variables. Some of these are correlated, but not all.

My understanding is that I can use a oneway ANOVA for each year on it’s own, using one factor at a time (period or position) with post hoc analysis. This is fine, if data meets assumptions and is normally distributed. This tells me any significant interactions between variables of interest with chosen factor. For example, with position factor, do forwards in group 1 cover more total running distance than forwards in group 2 or backs in group 3.

However, I want to go deeper with my analysis. If I want to see if forwards in group 1 cover more total running distance in period 1 than backs in group 3 in the same period, I need an additional factor and the oneway ANOVA does not suit. Therefore I can use a twoway ANOVA instead of 2 oneway ANOVA’s and that solves the issue, correct?

This is complicated further by looking to compare 2019 to 2020 or 2019 to 2021 to identify changes over time, which would introduce a third independent variable.

I believe this would require a threeway ANOVA for this observational data set. 3 factors – Position, Period, and Year?

Are there any issues or concerns you see at first glance?

I appreciate your time and consideration.


April 12, 2021 at 2:02 pm

Could an observational study use a correlational design.

e.g. measuring effects of two variables on happiness, if you’re not intervening.

April 13, 2021 at 12:14 am

Typically, with observational studies, you’d want to include potential confounders, etc. Consequently, I’ve seen regression analysis used more frequently for observational studies to be able to control for other things because you’re not using randomization. You could use correlation to observe the relationship. However, you wouldn’t be controlling for potential confounding variables. Just something to consider.


April 11, 2021 at 1:28 pm

Hi, If I am to administer moderate doses of coffee for a hypothetical experiment, does it raise ethical concerns? Can I use random assignment for it?

April 11, 2021 at 4:06 pm

I don’t see any inherent ethical problems here as long as you describe the participant’s experience in the experiment including the coffee consumption. They key with human subjects is “informed consent.” They’re agreeing to participate based on a full and accurate understanding of what participation involves. Additionally, you as a researcher, understand the process well enough to be able to ensure their safety.

In your study, as long as subjects know they'll be drinking coffee and agree to that, I don't see a problem. It's a proven safe substance for the vast majority of people. If potential subjects are aware of the need to consume coffee, they can determine whether they are OK with that before agreeing to participate.


June 17, 2019 at 4:51 am

Really great article which explains observational and experimental study very well. It presents broad picture with the case study which helped a lot in understanding the core concepts. Thanks


What Is Statistical Analysis?

Statistical analysis helps you pull meaningful insights from data. The process involves working with data and deducing numbers to tell quantitative stories.

Abdishakur Hassan

Statistical analysis is a technique we use to find patterns in data and make inferences about those patterns to describe variability in the results of a data set or an experiment. 

In its simplest form, statistical analysis answers questions about:

  • Quantification — how big/small/tall/wide is it?
  • Variability — growth, increase, decline
  • The confidence level of these variabilities

What Are the 2 Types of Statistical Analysis?

  • Descriptive Statistics:  Descriptive statistical analysis describes the quality of the data by summarizing large data sets into single measures. 
  • Inferential Statistics:  Inferential statistical analysis allows you to draw conclusions from your sample data set and make predictions about a population using statistical tests.

What’s the Purpose of Statistical Analysis?

Using statistical analysis, you can determine trends in the data by calculating your data set’s mean or median. You can also analyze the variation between different data points from the mean to get the standard deviation . Furthermore, to test the validity of your statistical analysis conclusions, you can use hypothesis testing techniques, like P-value, to determine the likelihood that the observed variability could have occurred by chance.


Statistical Analysis Methods

There are two major types of statistical data analysis: descriptive and inferential. 

Descriptive Statistical Analysis

Descriptive statistical analysis describes the quality of the data by summarizing large data sets into single measures. 

Within the descriptive analysis branch, there are two main types: measures of central tendency (i.e. mean, median and mode) and measures of dispersion or variation (i.e. variance , standard deviation and range). 

For example, you can calculate the average exam results in a class using central tendency or, in particular, the mean. In that case, you’d sum all student results and divide by the number of tests. You can also calculate the data set’s spread by calculating the variance. To calculate the variance, subtract each exam result in the data set from the mean, square the answer, add everything together and divide by the number of tests.
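That arithmetic can be reproduced with Python's standard statistics module; the exam scores below are made up, and pvariance divides by the number of tests, matching the description above.

```python
# Minimal sketch: mean and variance of hypothetical exam results.
import statistics

scores = [72, 85, 90, 78, 65]

mean = statistics.mean(scores)                  # central tendency
variance = statistics.pvariance(scores, mean)   # dispersion (divide by n)
print(mean, variance)                           # 78 and 79.6
```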

Inferential Statistics

On the other hand, inferential statistical analysis allows you to draw conclusions from your sample data set and make predictions about a population using statistical tests. 

There are two main types of inferential statistical analysis: hypothesis testing and regression analysis. We use hypothesis testing to test and validate assumptions in order to draw conclusions about a population from the sample data. Popular tests include Z-test, F-Test, ANOVA test and confidence intervals . On the other hand, regression analysis primarily estimates the relationship between a dependent variable and one or more independent variables. There are numerous types of regression analysis but the most popular ones include linear and logistic regression .  

Statistical Analysis Steps  

In the era of big data and data science, there is a rising demand for a more problem-driven approach. As a result, we must approach statistical analysis holistically. We may divide the entire process into five different and significant stages by using the well-known PPDAC model of statistics: Problem, Plan, Data, Analysis and Conclusion.

1. Problem

In the first stage, you define the problem you want to tackle and explore questions about the problem. 

2. Plan

Next is the planning phase. You can check whether data is available or if you need to collect data for your problem. You also determine what to measure and how to measure it. 

3. Data

The third stage involves data collection, understanding the data and checking its quality. 

4. Analysis

Statistical data analysis is the fourth stage. Here you process and explore the data with the help of tables, graphs and other data visualizations.  You also develop and scrutinize your hypothesis in this stage of analysis. 

5. Conclusion

The final step involves interpretations and conclusions from your analysis. It also covers generating new ideas for the next iteration. Thus, statistical analysis is not a one-time event but an iterative process.

Statistical Analysis Uses

Statistical analysis is useful for research and decision making because it allows us to understand the world around us and draw conclusions by testing our assumptions. Statistical analysis is important for various applications, including:

  • Statistical quality control and analysis in product development 
  • Clinical trials
  • Customer satisfaction surveys and customer experience research 
  • Marketing operations management
  • Process improvement and optimization
  • Training needs 


Benefits of Statistical Analysis

Here are some of the reasons why statistical analysis is widespread in many applications and why it’s necessary:

Understand Data

Statistical analysis gives you a better understanding of the data and what they mean. These types of analyses provide information that would otherwise be difficult to obtain by merely looking at the numbers without considering their relationship.

Find Causal Relationships

Statistical analysis can help you investigate causation or establish the precise meaning of an experiment, like when you’re looking for a relationship between two variables.

Make Data-Informed Decisions

Businesses are constantly looking to find ways to improve their services and products . Statistical analysis allows you to make data-informed decisions about your business or future actions by helping you identify trends in your data, whether positive or negative. 

Determine Probability

Statistical analysis is an approach to understanding how the probability of certain events affects the outcome of an experiment. It helps scientists and engineers decide how much confidence they can have in the results of their research, how to interpret their data and what questions they can feasibly answer.

You’ve Got Questions. Our Experts Have Answers. Confidence Intervals, Explained!

What Are the Risks of Statistical Analysis?

Statistical analysis can be valuable and effective, but it’s an imperfect approach. Even if the analyst or researcher performs a thorough statistical analysis, there may still be known or unknown problems that can affect the results. Therefore, statistical analysis is not a one-size-fits-all process. If you want to get good results, you need to know what you’re doing. It can take a lot of time to figure out which type of statistical analysis will work best for your situation .

Thus, you should remember that our conclusions drawn from statistical analysis don’t always guarantee correct results. This can be dangerous when making business decisions. In marketing , for example, we may come to the wrong conclusion about a product . Therefore, the conclusions we draw from statistical data analysis are often approximated; testing for all factors affecting an observation is impossible.



Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H o ) and alternate (H a ) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

  • an estimate of the difference in average height between the two groups.
  • a p -value showing how likely you are to see this difference if the null hypothesis of no difference is true.
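For the height example, a two-sample t test is one common choice. The sketch below uses invented heights in centimeters and produces exactly the two quantities listed above: an estimate of the difference in means and a p-value.

```python
# Minimal sketch: two-sample t test for the height example (data invented).
from scipy import stats

men   = [178, 182, 175, 180, 172, 177, 185, 179]
women = [165, 170, 162, 168, 171, 160, 166, 169]

estimate = sum(men) / len(men) - sum(women) / len(women)  # difference in means
t_stat, p_value = stats.ttest_ind(men, women)

print(estimate, p_value)   # a p below 0.05 would lead us to reject H0
```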

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).


The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article


Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved August 7, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/


Bernoulli trials: an experiment with the following characteristics:

  • There are only two possible outcomes, called "success" and "failure," for each trial.
  • The probability p of a success is the same for any trial (so the probability q = 1 − p of a failure is the same for any trial).

Binomial experiment: a statistical experiment that satisfies:

  • There are a fixed number of trials, n.
  • There are only two possible outcomes, called "success" and "failure," for each trial. The letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial.
  • The n trials are independent and are repeated using identical conditions.

Geometric experiment:

  • There are one or more Bernoulli trials with all failures except the last one, which is a success.
  • In theory, the number of trials could go on forever; there must be at least one trial.
  • The probability p of a success and the probability q of a failure do not change from trial to trial.

Hypergeometric experiment:

  • You take samples from two groups.
  • You are concerned with a group of interest, called the first group.
  • You sample without replacement from the combined groups.
  • Each pick is not independent, since sampling is without replacement.
  • You are not dealing with Bernoulli trials.

Hypergeometric probability: a discrete random variable characterized by:

  • A fixed number of trials.
  • The probability of success is not the same from trial to trial.

Poisson probability distribution: a discrete random variable with these characteristics:

  • The probability that the event occurs in a given interval is the same for all intervals.
  • The events occur with a known mean and independently of the time since the last event.

Random variable (RV):

  • The domain of the random variable (RV) is not necessarily a numerical set; the domain may be expressed in words; for example, if X = hair color then the domain is {black, blond, gray, green, orange}.
  • We can tell what specific value x the random variable X takes only after performing the experiment.
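As a quick illustration of the binomial conditions above (a fixed number of independent trials, two outcomes, constant p), here is a small Python simulation; the parameter values are arbitrary.

```python
# Minimal sketch: simulate one binomial experiment with n trials and
# success probability p (values chosen arbitrarily).
import random

def binomial_successes(n, p):
    """Count successes in n independent trials, each succeeding with probability p."""
    return sum(1 for _ in range(n) if random.random() < p)

print(binomial_successes(n=10, p=0.3))   # number of successes in one experiment
```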


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics
  • Publication date: Sep 19, 2013
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics/pages/4-key-terms

© Jun 23, 2022 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


Statistical experiment


A statistical experiment is a random or nondeterministic experiment. Its features are that:

  • Each experiment is capable of being repeated indefinitely under essentially unchanged conditions.
  • Although we are in general not able to state what a particular outcome will be, we are able to describe the set of all possible outcomes of the experiment.
  • As the experiment is performed repeatedly, the individual outcomes seem to occur in a haphazard manner. However, as the experiment is repeated a large number of times, a definite pattern or regularity appears.
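That long-run regularity is easy to see by simulation; in the sketch below (with an arbitrarily chosen number of tosses), the relative frequency of heads settles near 0.5 even though individual tosses are haphazard.

```python
# Minimal sketch: repeated coin tosses show a stable long-run pattern.
import random

tosses = [random.choice("HT") for _ in range(10_000)]
print(tosses.count("H") / len(tosses))   # close to 0.5 after many repetitions
```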

IMAGES

  1. PPT

    statistical experiment definition

  2. PPT

    statistical experiment definition

  3. PPT

    statistical experiment definition

  4. PPT

    statistical experiment definition

  5. Statistical significance of experiment

    statistical experiment definition

  6. Experimental Probability

    statistical experiment definition

COMMENTS

  1. Experimental Design: Definition and Types

    An experiment is a data collection procedure that occurs in controlled conditions to identify and understand causal relationships between variables. Researchers can use many potential designs. The ultimate choice depends on their research question, resources, goals, and constraints. In some fields of study, researchers refer to experimental ...

  2. Experiment

    An experiment is a method to investigate the cause and effect relationship between two variables. Learn about the types, criteria, and examples of experiments in the statistics glossary.

  3. Experiment (probability theory)

    Random experiments are often conducted repeatedly, so that the collective results may be subjected to statistical analysis.A fixed number of repetitions of the same experiment can be thought of as a composed experiment, in which case the individual repetitions are called trials.For example, if one were to toss the same coin one hundred times and record each result, each toss would be ...

  4. Observational studies and experiments (article)

    Actually, the term is "Sample Survey" and you may search online for it. I think the difference lies in the aim of the three types of studies, sample surveys want to get data for a parameter while observational studies and experiments want to convert some data into information, i.e., correlation and causation respectively.

  4. Chapter 1 Principles of Experimental Design

    (Statistical) design of experiments provides the principles and methods for planning experiments and tailoring the data acquisition to an intended analysis. Design and analysis of an experiment are best considered as two aspects of the same enterprise: the goals of the analysis strongly inform an appropriate design, and the implemented design ...

  5. Statistical Design of Experiments (DoE)

    Abstract. In a cause-effect relationship, design of experiments (DoE) is a means of determining the interrelationship to the required accuracy and scope with the lowest possible expenditure of time, material, and other resources. In experiments, the question concerning which type and level of effect the influencing ...

  6. Theory of Statistical Experiments

    About this book. By a statistical experiment we mean the procedure of drawing a sample with the intention of making a decision. The sample values are to be regarded as the values of a random variable defined on some measurable space, and the decisions made are to be functions of this random variable. Although the roots of this notion of ...

  7. Experimental Design in Statistics

    In the field of statistics, experimental design means the process of designing a statistical experiment, which is an experiment that is objective, controlled, and quantitative. An experiment is a ...

  8. 4.1: Probability Experiments and Sample Spaces

    An experiment is a planned operation carried out under controlled conditions. If the result is not predetermined, then the experiment is said to be a chance experiment. Flipping one fair coin twice is an example of an experiment. A result of an experiment is called an outcome. The sample space of an experiment is the set of all possible ...

  9. What Is Design of Experiments (DOE)?

    Quality Glossary Definition: Design of experiments. Design of experiments (DOE) is defined as a branch of applied statistics that deals with planning, conducting, analyzing, and interpreting controlled tests to evaluate the factors that control the value of a parameter or group of parameters. DOE is a powerful data collection and analysis tool ...

  10. Statistics

    Statistics - Sampling, Variables, Design: Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments. The methods of experimental design are widely used in the fields of agriculture, medicine, biology, marketing research, and industrial production.

  11. Introduction to experiment design (video)

    You use blocking to minimize the potential for extraneous variables to influence your experimental result. Let's use the experiment example that Mr. Khan used in the video. To verify the effect of the pill, we need to make sure that the person's gender, health, or other personal traits don't affect the result.

  12. Choosing the Right Statistical Test

    When to perform a statistical test. You can perform statistical tests on data that have been collected in a statistically valid manner, either through an experiment or through observations made using probability sampling methods. For a statistical test to be valid, your sample size needs to be large enough to approximate the true distribution of the population being studied.

  13. 1.4 Designed Experiments

    Designed Experiments. The purpose of an experiment is to investigate the relationship between two variables. When one variable causes change in another, we call the first variable the explanatory variable. The affected variable is called the response variable. In a randomized experiment, the researcher manipulates values of the explanatory ...

  14. What is an Observational Study: Definition & Examples

    The definition of an observational study hinges on the notion that the researchers only observe subjects and do not assign them to control and treatment groups. That's the key difference between an observational study and an experiment. These studies are also known as quasi-experiments and correlational studies.

  15. Statistical experiments and science experiments

    Statistical experiments provide estimates of coefficients, and from there one should be able to generalize to conditions within and outside the design space. Scientific experiments provide a different type of generalizability, one that relies on scientific knowledge.

  16. What Is Statistical Analysis? (Definition, Methods)

    Statistical analysis is an approach to understanding how the probability of certain events affects the outcome of an experiment. It helps scientists and engineers decide how much confidence they can have in the results of their research, how to interpret their data and what questions they can feasibly answer.

  17. Hypothesis Testing

    Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps: state your null and alternate hypotheses, collect data, perform a statistical test, and present the findings in your results and discussion section.

  18. Ch. 4 Key Terms

    Binomial Experiment: a statistical experiment that satisfies the following three conditions: there are a fixed number of trials, n; there are only two possible outcomes, called "success" and "failure," for each trial; the letter p denotes the probability of a success on one trial, and q denotes the probability of a failure on one trial. A short simulation of such an experiment is sketched after this list.
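
The three binomial-experiment conditions in the last entry are easy to make concrete in code. Below is a minimal Python sketch; the values n = 10 and p = 0.5, and the use of uniform random draws, are assumptions for the example, not part of the definition. It repeats a binomial experiment many times and compares the observed frequency of each success count with the exact probability C(n, k) * p^k * q^(n - k).

import math
import random

# Minimal sketch of a binomial experiment: n fixed trials, two outcomes per
# trial, success probability p and failure probability q = 1 - p.
# The values of n and p are illustrative assumptions.
n, p = 10, 0.5
q = 1 - p
random.seed(1)  # reproducible run

def run_experiment() -> int:
    """Run one binomial experiment: count successes in n independent trials."""
    return sum(random.random() < p for _ in range(n))

# Repeat the whole experiment many times and tally how often each count occurs.
repeats = 100_000
counts = [0] * (n + 1)
for _ in range(repeats):
    counts[run_experiment()] += 1

# Compare observed frequencies with the exact binomial probabilities
# P(k successes) = C(n, k) * p**k * q**(n - k).
for k in range(n + 1):
    exact = math.comb(n, k) * p**k * q**(n - k)
    print(f"k={k:2d}: observed {counts[k] / repeats:.4f}, exact {exact:.4f}")

The observed proportions settle close to the exact probabilities, exercising all three conditions at once: a fixed number of trials n, two outcomes per trial, and a constant success probability p with q = 1 - p.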
