Bayesian Inference in Python: A Comprehensive Guide with Examples


Data-driven decision-making has become essential across various fields, from finance and economics to medicine and engineering. Understanding probability and statistics is crucial for making informed choices today. Bayesian inference, a powerful tool in probabilistic reasoning, allows us to update our beliefs about an event based on new evidence.

Bayes’s theorem, a fundamental concept in probability theory, forms the foundation of Bayesian inference. This theorem provides a way to calculate the probability of an event occurring given prior knowledge and observed data. Combining our initial beliefs with the likelihood of the evidence, we can arrive at a more accurate posterior probability.

Bayesian inference is a statistical method based on Bayes’s theorem, which updates the probability of an event as new data becomes available. It is widely used in various fields, such as finance, medicine, and engineering, to make predictions and decisions based on prior knowledge and observed data. In Python, Bayesian inference can be implemented using libraries like NumPy and Matplotlib to generate and visualize posterior distributions.

This article will explore Bayesian inference and its implementation using Python, a popular programming language for data analysis and scientific computing. We will start by understanding the fundamentals of Bayes’s theorem and formula, then move on to a step-by-step guide on implementing Bayesian inference in Python. Along the way, we will discuss a real-world example of predicting website conversion rates to illustrate the practical application of this powerful technique.


What is Bayesian Inference?

Bayesian inference is based on Bayes’s theorem: we start from a prior probability for an event and update that probability as new evidence becomes available. Let us look at the formula of Bayes’s theorem.

Bayes Theorem Formula
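The original figure is not reproduced here; for reference, the standard form of Bayes’s theorem for an event \(A\) given evidence \(B\) is

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]

where \(P(A)\) is the prior probability, \(P(B \mid A)\) the likelihood of the evidence, \(P(B)\) the overall probability of the evidence, and \(P(A \mid B)\) the posterior probability.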

Implementing Bayesian Inference in Python

Let us try to implement the same in Python with the code below.
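The article's original listing is not shown here; the following is a minimal sketch of how such an implementation might look, using NumPy and Matplotlib as mentioned above. The sensitivity and false-positive numbers are illustrative assumptions, not values from the article.

```python
# Minimal sketch: apply Bayes' theorem for a binary hypothesis and plot how
# the posterior changes as the prior varies. Numbers are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

def posterior(prior, sensitivity, false_positive_rate):
    """P(H | positive evidence) via Bayes' theorem."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

priors = np.linspace(0.01, 0.99, 100)
posteriors = posterior(priors, sensitivity=0.9, false_positive_rate=0.1)

plt.plot(priors, posteriors)
plt.xlabel("Prior probability P(H)")
plt.ylabel("Posterior probability P(H | E)")
plt.title("Bayes' theorem: posterior as a function of the prior")
plt.show()
```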

Let us look at the output of the same.

Bayes Theorem Plot

Real-World Example: Predicting Website Conversion Rates with Bayesian Inference

Let us now look at the case of a website. We are trying to estimate the conversion rate, that is, the proportion of visitors who go on to buy the product. We have also defined some parameters for the model.
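The article's code and parameter values are not reproduced here; the sketch below shows one common way to set such a model up, a Beta prior updated with binomial data. The visitor and conversion counts and the prior parameters are illustrative assumptions, chosen so that the posterior mean lands near the 5% figure reported below.

```python
# Sketch of the conversion-rate example (parameter values are illustrative,
# not the article's exact ones): Beta prior + binomial data -> Beta posterior.
import numpy as np
import matplotlib.pyplot as plt

# Assumed data: 1000 visitors, 50 purchases
visitors, conversions = 1000, 50

# Weakly informative Beta prior on the conversion rate
prior_alpha, prior_beta = 2, 20

# Conjugate update: Beta(alpha + conversions, beta + non-conversions)
post_alpha = prior_alpha + conversions
post_beta = prior_beta + (visitors - conversions)

samples = np.random.beta(post_alpha, post_beta, size=100_000)
print(f"Posterior mean conversion rate: {samples.mean():.3f}")  # roughly 0.05

plt.hist(samples, bins=100, density=True)
plt.xlabel("Conversion rate")
plt.ylabel("Posterior density")
plt.show()
```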

Let us look at the output of the above code and try to deduce some information from it.

Bayesian Inference Output

According to our predictions, the probability of conversion is around 5%.

Here you go! Now, you know a lot more about Bayesian inference. In this article, we learned what Bayesian inference is and also touched upon how to implement it in the Python programming language. We also learned about a very simple case study where we calculated the probability of customers’ conversions if they visited a particular website. Bayesian inference, as mentioned, is also used heavily in the fields of finance, economics, and engineering.

Hope you enjoyed reading it!!


Python Companion to Statistical Thinking in the 21st Century

Bayesian Statistics in Python


In this chapter we will introduce how to perform basic Bayesian computations using Python.

Applying Bayes’ theorem: A simple example

A person has a cough and flu-like symptoms, and gets a PCR test for COVID-19, which comes back positive. What is the likelihood that they actually have COVID-19, as opposed to a regular cold or flu? We can use Bayes’ theorem to compute this. Let’s say that the local rate of symptomatic individuals who actually are infected with COVID-19 is 7.4% (as reported on July 10, 2020 for San Francisco); thus, our prior probability that someone with symptoms actually has COVID-19 is .074. The RT-PCR test used to identify COVID-19 RNA is highly specific (that is, it very rarely reports the presence of the virus when it is not present); for our example, we will say that the specificity is 99%. Its sensitivity is not known, but probably is no higher than 90%. First let’s look at the probability of disease given a single positive test.
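A minimal sketch of this calculation (the book's own code is not shown here), using the prior, specificity, and sensitivity stated above:

```python
# Posterior probability of COVID-19 given one positive test, via Bayes' theorem.
sensitivity = 0.90      # P(positive test | COVID-19)
specificity = 0.99      # P(negative test | no COVID-19)
prior = 0.074           # P(COVID-19) among symptomatic individuals

numerator = sensitivity * prior
marginal = sensitivity * prior + (1 - specificity) * (1 - prior)
posterior = numerator / marginal
print(round(posterior, 3))  # about 0.88
```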

The high specificity of the test, along with the relatively high base rate of the disease, means that most people who test positive actually have the disease. Now let’s plot the posterior as a function of the prior. Let’s first create a function to compute the posterior, and then apply this with a range of values for the prior.
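A sketch of how such a function might look, applied to a range of prior values (the exact range plotted in the book's figure may differ):

```python
# Wrap the Bayes' theorem computation in a function and apply it to a range
# of prior probabilities.
import numpy as np
import matplotlib.pyplot as plt

def compute_posterior(prior, sensitivity=0.90, specificity=0.99):
    return (sensitivity * prior) / (sensitivity * prior + (1 - specificity) * (1 - prior))

prior_values = np.arange(0.001, 0.5, 0.001)
posterior_values = compute_posterior(prior_values)

plt.plot(prior_values, posterior_values)
plt.xlabel("Prior P(COVID-19)")
plt.ylabel("Posterior P(COVID-19 | positive test)")
plt.show()
```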

[Figure: posterior probability of COVID-19 given a positive test, plotted as a function of the prior probability.]

This figure highlights a very important general point about diagnostic testing: Even when the test has high specificity, if a condition is rare then most positive test results will be false positives.

Estimating posterior distributions

In this example we will look at how to estimate entire posterior distributions. We will implement the drug testing example from the book. In that example, we administered a drug to 100 people, and found that 64 of them responded positively to the drug. What we want to estimate is the probability distribution for the proportion of responders, given the data. For simplicity we started with a uniform prior; that is, all proportions of responding are equally likely to begin with. In addition, we will use a discrete probability distribution; that is, we will estimate the posterior probability for each particular proportion of responders, in steps of 0.01. This greatly simplifies the math and still retains the main idea.
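A sketch of this grid approximation (the book's own code is not shown here), using SciPy's binomial distribution for the likelihood:

```python
# Grid approximation: uniform prior over proportions in steps of 0.01,
# binomial likelihood for 64 responders out of 100, normalised posterior.
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

p_grid = np.arange(0, 1.01, 0.01)
prior = np.ones_like(p_grid) / len(p_grid)      # uniform prior
likelihood = binom.pmf(64, 100, p_grid)         # P(64 of 100 | p)
posterior = prior * likelihood
posterior /= posterior.sum()                    # normalise

plt.plot(p_grid, likelihood / likelihood.sum(), label="likelihood (scaled)")
plt.plot(p_grid, posterior, "--", label="posterior")
plt.xlabel("Proportion of responders")
plt.legend()
plt.show()
```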

[Figure: likelihood and posterior distribution for the proportion of responders, based on 64 responders out of 100 and a uniform prior.]

The plot shows that the posterior and likelihood are virtually identical, which is due to the fact that the prior is uniform across all possible values. Now let’s look at a case where the prior is not uniform. Let’s say that we now run a larger study of 1000 people with the same treatment, and we find that 312 of the 1000 individuals respond to the treatment. In this case, we can use the posterior from the earlier study of 100 people as the prior for our new study. This is what we sometimes refer to as Bayesian updating .
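A sketch of this updating step (again, not the book's own code), reusing the same grid approximation:

```python
# Bayesian updating: the posterior from the 100-person study becomes the prior
# for the 1000-person study (312 responders out of 1000).
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

p_grid = np.arange(0, 1.01, 0.01)
posterior_1 = binom.pmf(64, 100, p_grid)        # uniform prior -> posterior proportional to likelihood
posterior_1 /= posterior_1.sum()

likelihood_2 = binom.pmf(312, 1000, p_grid)
posterior_2 = posterior_1 * likelihood_2        # study-1 posterior acts as the new prior
posterior_2 /= posterior_2.sum()

plt.plot(p_grid, posterior_1, label="prior (posterior from study 1)")
plt.plot(p_grid, likelihood_2 / likelihood_2.sum(), label="likelihood (study 2, scaled)")
plt.plot(p_grid, posterior_2, "--", label="posterior")
plt.xlabel("Proportion of responders")
plt.legend()
plt.show()
```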

[Figure: Bayesian updating, showing the prior (the posterior from the first study), the likelihood from the second study (312 of 1000 responders), and the resulting posterior.]

Here we see two important things. First, we see that the prior is substantially wider than the likelihood, which occurs because there is much more data going into the likelihood (1000 data points) compared to the prior (100 data points), and more data reduces our uncertainty. Second, we see that the posterior is much closer to the value observed for the second study than for the first, which occurs for the same reason — we put greater weight on the estimate that is more precise due to a larger sample.

Bayes factors

There are no convenient off-the-shelf tools for estimating Bayes factors using Python, so we will use the rpy2 package to access the BayesFactor library in R. Let’s compute a Bayes factor for a t-test comparing the amount of reported alcohol consumption between smokers and non-smokers. First, let’s set up the NHANES data and collect a sample of 150 smokers and 150 nonsmokers.

SmokeNow    AvgAlcohol
0           5.748000
1           11.875333

Now let’s use functions from R to perform a standard t-test as well as compute a Bayes Factor for this comparison.
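The book's code is not reproduced here; below is a hedged sketch of how this could be done with rpy2. It assumes R, rpy2, and the BayesFactor R package are installed, and the variable names and placeholder data stand in for the actual NHANES sample.

```python
# Hedged sketch (not the book's exact code): call R's BayesFactor::ttestBF
# through rpy2. The arrays below are placeholders standing in for the sampled
# NHANES smoker / non-smoker alcohol values.
import numpy as np
from rpy2.robjects import FloatVector
from rpy2.robjects.packages import importr

rng = np.random.default_rng(1)
alcohol_smokers = rng.gamma(2.0, 5.0, size=150)      # placeholder data
alcohol_nonsmokers = rng.gamma(2.0, 3.0, size=150)   # placeholder data

stats = importr("stats")
BayesFactor = importr("BayesFactor")

x = FloatVector(alcohol_smokers)
y = FloatVector(alcohol_nonsmokers)

print(stats.t_test(x, y))              # standard (Welch) t-test in R
print(BayesFactor.ttestBF(x=x, y=y))   # JZS Bayes factor for the comparison
```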

This shows that the difference between these groups is significant, and the Bayes factor suggests fairly strong evidence for a difference.

Statistics: Data analysis and modelling

Chapter 16: Introduction to Bayesian hypothesis testing

In this chapter, we will introduce an alternative to the Frequentist null-hypothesis significance testing procedure employed up to now, namely a Bayesian hypothesis testing procedure. This also consists of comparing statistical models. What is new here is that Bayesian models contain a prior distribution over the values of the model parameters. In doing so for both the null and the alternative model, Bayesian model comparisons provide a more direct measure of the relative evidence of the null model compared to the alternative. We will introduce the Bayes Factor as the primary measure of evidence in Bayesian model comparison. We then go on to discuss “default priors”, which can be useful in a Bayesian testing procedure. We end with an overview of some objections to the traditional Frequentist method of hypothesis testing, and a comparison between the two approaches.

16.1 Hypothesis testing, relative evidence, and the Bayes factor

In the Frequentist null-hypothesis significance testing procedure, we defined a hypothesis test in terms of comparing two nested models, a general MODEL G and a restricted MODEL R which is a special case of MODEL G. Moreover, we defined the testing procedure in terms of determining the probability of a test result, or one more extreme, given that the simpler MODEL R is the true model. This was necessary because MODEL G is too vague to determine the sampling distribution of the test statistic.

By supplying a prior distribution to parameters, Bayesian models can be “vague” whilst not suffering from the problem that they effectively make no predictions. As we saw for the prior predictive distributions in Figure 15.3 , even MODEL 1, which assumes all possible values of the parameter \(\theta\) are equally likely, still provides a valid predicted distribution of the data. Because any Bayesian model with a valid prior distribution provides a valid prior predictive distribution, which then also provides a valid value for the marginal likelihood, we do not have to worry about complications that arise when comparing models in the Frequentist tradition, such as that the likelihood of one model will always be higher than the other because we need to estimate an additional parameter by maximum likelihood. The relative marginal likelihood of the data assigned by each model, which can be stated as a marginal likelihood ratio analogous to the likelihood ratio of Chapter 2 , provides a direct measure of the relative evidence for both models. The marginal likelihood ratio is also called the Bayes factor, and can be defined for two general Bayesian models as: \[\begin{equation} \text{BF}_{12} = \text{BayesFactor}(\text{MODEL 1}, \text{MODEL 2}) = \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \tag{16.1} \end{equation}\] where \(p(Y_1,\ldots,Y_n|\text{MODEL } j)\) denotes the marginal likelihood of observed data \(Y_1,\ldots,Y_n\) according to MODEL \(j\) .

The Bayes factor is a central statistic of interest in Bayesian hypothesis testing. It is a direct measure of the relative evidence for two models. Its importance can also be seen when we consider the ratio of the posterior probabilities for two models, which is also called the posterior odds. In a Bayesian framework, we can assign probabilities not just to data and parameters, but also to whole models. These probabilities reflect our belief that a model is “true” in the sense that it provides a better account of the data than other models. Before observing data, we can assign a prior probability \(p(\text{model } j)\) to a model, and we can update this to a posterior probability \(p(\text{model } j|Y_1,\ldots,Y_n)\) after observing data \(Y_1,\ldots,Y_n\). If the marginal likelihood \(p(Y_1,\ldots,Y_n|\text{MODEL } j)\) is larger than the overall probability of the data \(p(Y_1,\ldots,Y_n)\) (averaged over all models under consideration), the posterior probability is higher than the prior probability, and hence our belief in the model would increase. If the marginal likelihood is smaller than the overall probability of the data, the posterior probability is lower than the prior probability, and hence our belief in the model would decrease. We can compare the relative change in our belief for two models by considering the posterior odds ratio, which is just the ratio of the posterior probability of two models, and computed by multiplying the ratio of the prior probabilities of the models (the prior odds ratio) by the marginal likelihood ratio: \[\begin{aligned} \frac{p(\text{MODEL 1}|Y_1,\ldots,Y_n)}{p(\text{MODEL 2}|Y_1,\ldots,Y_n)} &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL 1})}{p(Y_1,\ldots,Y_n|\text{MODEL 2})} \times \frac{p(\text{MODEL 1})}{p(\text{MODEL 2})} \\ \text{posterior odds} &= \text{Bayes factor} \times \text{prior odds} \end{aligned}\]

In terms of the relative evidence that the data provides for the two models, the Bayes factor is all that matters, as the prior probabilities do not depend on the data. Moreover, if we assign an equal prior probability to each model, then the prior odds ratio would equal 1, and hence the posterior odds ratio is identical to the Bayes factor.

In a Frequentist framework, we would evaluate the magnitude of the likelihood ratio by considering its place within the sampling distribution under the assumption that one of the models is true. Although in principle we might be able to determine the sampling distribution of the Bayes factor in a similar manner, there is no need. A main reason for going through all this work in the Frequentist procedure was that the models are on unequal footing, with the likelihood ratio always favouring a model with additional parameters. The Bayes Factor does not inherently favour a more general model compared to a restricted one. Hence, we can interpret its value “as is”. The Bayes factor is a continuous measure of relative evidential support, and there is no real need for classifications such as “significant” and “non-significant”. Nevertheless, some guidance in interpreting the magnitude might be useful. One convention is the classification provided by Jeffreys ( 1939 ) in Table 16.1 . Because small values below 1, when the Bayes factor favours the second model, can be difficult to discern, the table also provides the corresponding values of the logarithm of the Bayes factor ( \(\log \text{BF}_{1,2}\) ). On a logarithmic scale, any value above 0 favours the first model, and any value below 0 the second one. Moreover, magnitudes above and below 0 can be assigned a similar meaning.

Table 16.1: Interpretation of the values of the Bayes factor (after Jeffreys, 1939).
\(\text{BF}_{1,2}\) \(\log \text{BF}_{1,2}\) Interpretation
> 100 > 4.61 Extreme evidence for MODEL 1
30 – 100 3.4 – 4.61 Very strong evidence for MODEL 1
10 – 30 2.3 – 3.4 Strong evidence for MODEL 1
3 – 10 1.1 – 2.3 Moderate evidence for MODEL 1
1 – 3 0 – 1.1 Anecdotal evidence for MODEL 1
1 0 No evidence
1/3 – 1 -1.1 – 0 Anecdotal evidence for MODEL 2
1/10 – 1/3 -2.3 – -1.1 Moderate evidence for MODEL 2
1/30 – 1/10 -3.4 – -2.3 Strong evidence for MODEL 2
1/100 – 1/30 -4.61 – -3.4 Very strong evidence for MODEL 2
< 1/100 < -4.61 Extreme evidence for MODEL 2

The Bayes factor is a general measure that can be used to compare any Bayesian models. We do not have to focus on nested models, as we did with null-hypothesis significance testing. But such nested model comparisons are often of interest. For instance, when considering Paul’s psychic abilities, fixing \(\theta = .5\) is a useful model of an octopus without psychic abilities, while a model that allows \(\theta\) to take other values is a useful model of a (somewhat) psychic octopus. For the first model, assigning prior probability to \(\theta\) is simple: the prior probability of \(\theta = .5\) is \(p(\theta = .5) = 1\) , and \(p(\theta \neq .5) = 0\) . For the second model, we need to consider how likely each possible value of \(\theta\) is. Figure 15.3 shows two choices for this prior distribution, which are both valid representations of the belief that \(\theta\) can be different from .5. These choices will give a different marginal likelihood, and hence a different value of the Bayes factor when comparing them to the restricted null-model

\[\text{MODEL 0}: \theta = .5\] The Bayes factor comparing MODEL 1 to MODEL 0 is

\[\text{BF}_{1,0} = 12.003\] which indicates that the data (12 out of 14 correct predictions) is roughly 12 times as likely under MODEL 1 compared to MODEL 0, which in the classification of Table 16.1 means strong evidence for MODEL 1. For MODEL 2, the Bayes factor is

\[\text{BF}_{2,0} = 36.409\] which indicates that the data is roughly 36 times as likely under MODEL 2 compared to MODEL 0, which would be classified as very strong evidence for MODEL 2. In both cases, the data favours the alternative model to the null model and may be taken as sufficient to reject MODEL 0. However, the strength of the evidence varies with the choice of prior distribution of the alternative model. This is as it should be. A model such as MODEL 2, which places stronger belief on higher values of \(\theta\) , is more consistent with Paul’s high number of correct predictions.
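As a quick check of the first of these values, the marginal likelihood under MODEL 1 is just the binomial likelihood averaged over the uniform prior, which can be computed numerically (a sketch, not code from the book):

```python
# Verify BF_{1,0} for MODEL 1 (uniform prior on theta) vs MODEL 0 (theta = .5),
# for 12 correct predictions out of 14.
from scipy.stats import binom
from scipy.integrate import quad

k, n = 12, 14

# Marginal likelihood under MODEL 1: integrate the binomial likelihood over
# the uniform prior p(theta) = 1 on [0, 1].
marginal_m1, _ = quad(lambda theta: binom.pmf(k, n, theta), 0, 1)

# Marginal likelihood under MODEL 0: the likelihood at theta = .5.
marginal_m0 = binom.pmf(k, n, 0.5)

print(marginal_m1 / marginal_m0)   # approximately 12.0
```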

Bayesian hypothesis testing with Bayes factors is, at its heart, a model comparison procedure. Bayesian models consist of a likelihood function and a prior distribution. A different prior distribution means a different model, and therefore a different result of the model comparison. Because there are an infinite number of alternative prior distributions to the one of the null model, there really isn’t a single test of the null hypothesis \(H_0: \theta = .5\) . The prior distribution of MODEL 1, where each possible value of \(\theta\) is equally likely, is the Bayesian equivalent of the alternative hypothesis in a null-hypothesis significance test, and as such might seem a natural default against which to compare the null hypothesis. But there is nothing to force this choice, and other priors are in principle equally valid, as long as they reflect your a priori beliefs about likely values of the parameter. Notice the “a priori” specification in the last sentence: it is vital that the prior distribution is chosen before observing the data. If you choose the prior distribution to match the data after having looked at it, the procedure loses some of its meaning as a hypothesis test, even if the Bayes factor is still an accurate reflection of the evidential support of the models.

16.2 Parameter estimates and credible intervals

Maximum likelihood estimation provides a single point-estimate for each parameter. In a Bayesian framework, estimation involves updating prior beliefs to posterior beliefs. What we end up with is a posterior distribution over the parameter values. If you want to report a single estimate, you could choose one of the measures of location: the mean, median, or mode of the posterior distribution. Unless the posterior is symmetric, these will have different values (see Figure 16.1 ), and one is not necessarily better than the other. I would usually choose the posterior mean, but if the posterior is very skewed, a measure such as the mode or median might provide a better reflection of the location of the distribution.

In addition to reporting an estimate, it is generally also a good idea to consider the uncertainty in the posterior distribution. A Bayesian version of a confidence interval (with a more straightforward interpretation!) is a credible interval . A credible interval is an interval in the posterior distribution which contains a given proportion of the probability mass. A common interval is the Highest Density Interval (HDI), which is the narrowest interval which contains a given proportion of the probability mass. Figure 16.1 shows the 95% HDI of the posterior probability that Paul makes a correct prediction, where the prior distribution was the uniform distribution of MODEL 1 in Figure 15.2 .

Figure 16.1: Posterior distribution for the probability that Paul makes a correct prediction, for MODEL 1 in Figure 15.2 .

A slightly different way to compute a credible interval is as the central credible interval . For such an interval, the excluded left and right tail of the distribution each contain \(\tfrac{\alpha}{2}\) of the probability mass (where e.g.  \(\alpha = .05\) for a 95% credible interval). Unlike the HDI, the central credible interval is not generally the most narrow interval which contains a given proportion of the posterior probability. But it is generally more straightforward to compute. Nevertheless, the HDI is more often reported than the central credible interval.
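A small sketch of both kinds of interval, assuming the Beta(13, 3) posterior that results from a uniform prior and Paul's 12 correct predictions out of 14 (the HDI search below is a simple numerical approximation, not the book's code):

```python
# Central 95% credible interval and 95% HDI for a Beta(13, 3) posterior.
import numpy as np
from scipy.stats import beta

posterior = beta(13, 3)

# Central credible interval: cut off 2.5% of the probability mass in each tail.
central = posterior.ppf([0.025, 0.975])

# HDI: the narrowest interval containing 95% of the probability mass,
# found by a simple search over lower-tail probabilities.
lower_tail = np.linspace(0, 0.05, 2001)
intervals = np.column_stack([posterior.ppf(lower_tail), posterior.ppf(lower_tail + 0.95)])
hdi = intervals[np.argmin(intervals[:, 1] - intervals[:, 0])]

print("central:", np.round(central, 3))
print("HDI:    ", np.round(hdi, 3))
```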

A nice thing about credible intervals is that they have a straightforward interpretation: the (subjective) probability that the true value of a parameter lies within an \(x\) % credible interval is \(x\) %. Compare this to the correct interpretation of an \(x\) % Frequentist confidence interval, which is that for infinite samples from the DGP, and computing an infinite number of corresponding confidence intervals, \(x\) % of those intervals would contain the true value of the parameter.

16.3 A Bayesian t-test

As discussed above, Bayesian hypothesis testing concerns comparing models with different prior distributions for model parameters. If one model, the “null model”, restricts a parameter to take a specific value, such as \(\theta = .5\) , or \(\mu = 0\) , while another model allows the parameter to take different values, we compare a restricted model to a more general one, and hence we can think of the model comparison as a Bayesian equivalent to a null-hypothesis significance test. The prior distribution assigned to the parameter in the more general alternative model will determine the outcome of the test, and hence it is of the utmost importance to choose this sensibly. This, however, is not always easy. Therefore, much work has been conducted to derive sensible default priors to enable researchers to conduct Bayesian hypothesis tests without requiring them to define prior distributions which reflect their own subjective beliefs.

Rouder, Speckman, Sun, Morey, & Iverson ( 2009 ) developed a default prior distribution to test whether two groups have a different mean. The test is based on the two-group version of the General Linear Model (e.g. Section 7.2 ):

\[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i \quad \quad \epsilon_i \sim \textbf{Normal}(0, \sigma_\epsilon)\] where \(X_{1,i}\) is a contrast-coded predictor with the values \(X_{1i} = \pm \tfrac{1}{2}\) for the different groups. Remember that with this contrast code, the slope \(\beta_1\) reflects the difference between the group means, e.g.  \(\beta_1 = \mu_1 - \mu_2\) , and the intercept represents the grand mean \(\beta_0 = \frac{\mu_1 + \mu_2}{2}\) . Testing for group differences involves a test of the following hypotheses:

\[\begin{aligned} H_0\!: & \quad \beta_1 = 0 \\ H_1\!: & \quad \beta_1 \neq 0 \\ \end{aligned}\]

To do this in a Bayesian framework, we need prior distributions for all the model parameters ( \(\beta_0\) , \(\beta_1\) , and \(\sigma_\epsilon\) ). Rouder et al. ( 2009 ) propose to use so-called uninformative priors for \(\beta_0\) and \(\sigma_\epsilon\) (effectively meaning that for these parameters, “anything goes”). The main consideration is then the prior distribution for \(\beta_1\) . Rather than defining a prior distribution for \(\beta_1\) directly, they propose to define a prior distribution for \(\frac{\beta_1}{\sigma_\epsilon}\) , which is the difference between the group means divided by the standard deviation of the dependent variable within each group. This is a measure of effect-size and is also known as Cohen’s \(d\) :

\[\text{Cohen's } d = \frac{\mu_1 - \mu_2}{\sigma_\epsilon} \quad \left(= \frac{\beta_1}{\sigma_\epsilon}\right)\] Defining the prior distribution for the effect-size is more convenient than defining the prior distribution for the difference between the means, as the latter difference is dependent on the scale of the dependent variable, which makes it difficult to define a general prior distribution suitable for all two-group comparisons. The “default” prior distribution they propose is a so-called scaled Cauchy distribution:

\[\frac{\beta_1}{\sigma_\epsilon} \sim \mathbf{Cauchy}(r)\] The Cauchy distribution is identical to a \(t\) -distribution with one degree of freedom ( \(\text{df} = 1\) ). The scaling factor \(r\) can be used to change the width of the distribution, so that either smaller or larger effect sizes become more probable. Examples of the distribution, with three common values for the scaling factor \(r\) (“medium”: \(r = \frac{\sqrt{2}}{2}\) , “wide”: \(r = 1\) , and “ultrawide”: \(r = \sqrt{2}\) ), are depicted in Figure 16.2 .
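A sketch reproducing the gist of Figure 16.2 with SciPy (not the book's own code); the Cauchy scale parameter plays the role of \(r\):

```python
# The three "default" scaled Cauchy priors on the effect size.
import numpy as np
from scipy.stats import cauchy
import matplotlib.pyplot as plt

effect_size = np.linspace(-4, 4, 500)
for label, r in [("medium", np.sqrt(2) / 2), ("wide", 1.0), ("ultrawide", np.sqrt(2))]:
    plt.plot(effect_size, cauchy.pdf(effect_size, loc=0, scale=r),
             label=f"{label} (r = {r:.3f})")

plt.xlabel(r"effect size $\beta_1 / \sigma_\epsilon$")
plt.ylabel("prior density")
plt.legend()
plt.show()
```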

Figure 16.2: Scaled Cauchy prior distributions on the effect size \(\frac{\beta_1}{\sigma_\epsilon}\)

Rouder et al. ( 2009 ) call the combination of the priors for the effect size and error variance the Jeffreys-Zellner-Siow prior (JZS prior). The “default” Bayesian t-test is to compare the model with these priors to one which assumes \(\beta_1 = 0\) , i.e. a model with a prior \(p(\beta_1 = 0) = 1\) and \(p(\beta_1 \neq 0) = 0\) , whilst using the same prior distributions for the other parameters ( \(\beta_0\) and \(\sigma_\epsilon\) ).

As an example, we can apply the Bayesian t-test to the data from the Tetris study analysed in Chapter 7 . Comparing the Tetris+Reactivation condition to the Reactivation-Only condition, and setting the scale of the prior distribution for the effect size in the alternative MODEL 1 to \(r=1\) , provides a Bayes factor comparing the alternative hypothesis \(H_1\) ( \(\beta_1 \neq 0\) ) to the null-hypothesis \(H_0\) ( \(\beta_1 = 0\) ) of \(\text{BF}_{1,0} = 17.225\) , which can be interpreted as strong evidence against the null hypothesis.
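If you want to run such a test in Python, the pingouin package provides a JZS Bayes factor for a given t statistic. The sketch below is hedged: the package is an assumption rather than something used in this book, and the t value and group sizes are placeholders, not the actual numbers from the Tetris study.

```python
# Hedged sketch: default (JZS) Bayesian t-test via pingouin, if installed.
import pingouin as pg

t_stat, n1, n2 = 2.9, 18, 18          # hypothetical values for illustration only
bf10 = pg.bayesfactor_ttest(t_stat, nx=n1, ny=n2, r=1.0)
print(f"BF10 = {bf10:.3f}")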

As we indicated earlier, the value of the Bayes factor depends on the prior distribution for the tested parameter in the model representing the alternative hypothesis. This dependence is shown in Figure 16.3 by varying the scaling factor \(r\) .

Figure 16.3: Bayes factor \(\text{BF}_{1,0}\) testing equivalence of the means of the Tetris+Reactivation and Reactivation-Only conditions for different values of the scaling factor \(r\) of the scaled Cauchy distribution.

As this figure shows, the Bayes factor is small for values of \(r\) close to 0. The lower the value of \(r\) , the less wide the resulting Cauchy distribution becomes. In the limit, as \(r\) reaches 0, the prior distribution in the alternative model becomes the same as that of the null model (i.e., assigning only probability to the value \(\beta_1 = 0\) ). This makes the models indistinguishable, and the Bayes factor would be 1, regardless of the data. As \(r\) increases in value, we see that the Bayes factor quickly rises, showing support for the alternative model. For this data, the Bayes factor is largest for a scaling factor just below \(r=1\) . When the prior distribution becomes wider than this, the Bayes factor decreases again. This is because the prior distribution then effectively assigns too much probability to high values of the effect size, and as a result lower probability to small and medium values of the effect size. At some point, the probability assigned to the effect size in the data becomes so low, that the null model will provide a better account of the data than the alternative model. A plot like the one in Figure 16.3 is useful to inspect the robustness of a test result to the specification of the prior distribution. In this case, the Bayes factor shows strong evidence ( \(\text{BF}_{1,0} > 10\) ) for a wide range of sensible values of \(r\) , and hence one might consider the test result quite robust. You should not use a plot like this to determine the “optimal” choice of the prior distribution (i.e. the one with the highest Bayes factor). If you did this, then the prior distribution would depend on the data, which is sometimes referred to as “double-dipping”. You would then end up with similar issues as in Frequentist hypothesis testing, where substituting an unknown parameter with a maximum likelihood estimate biases the likelihood ratio to favour the alternative hypothesis, which we then needed to correct for by considering the sampling distribution of the likelihood ratio statistic under the assumption that the null hypothesis is true. A nice thing about Bayes factors is that we do not need to worry about such complications. But that changes if you try to “optimise” a prior distribution by looking at the data.

16.4 Bayes factors for General Linear Models

The suggested default prior distributions can be generalized straightforwardly to more complex versions of the General Linear Model, such as multiple regression ( Liang, Paulo, Molina, Clyde, & Berger, 2008 ) and ANOVA models ( Rouder, Morey, Speckman, & Province, 2012 ) , by specifying analogous JZS prior distributions over all parameters. This provides a means to test each parameter in a model individually, as well as computing omnibus tests by comparing a general model to one where the prior distribution allows only a single value (i.e.  \(\beta_j = 0\) ) for multiple parameters.

Table 16.2 shows the results of a Bayesian equivalent to the moderated regression model discussed in Section 6.1.5 . The results generally confirm the results of the Frequentist tests employed there, although evidence for the interaction between fun and intelligence can be classified as “anecdotal”.

Table 16.2: Results of a Bayesian regression analysis for the Speed Dating data (cf Table ) with a default JZS prior with ‘medium’ scaling factor \(r = \sqrt{2}/4\) (for regression models, default scaling factors are \(\sqrt{2}/4\), \(1/2\), and \(\sqrt{2}/2\) for medium, wide, and ultrawide, respectively). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{attr}\) > 1000
\(\texttt{intel}\) > 1000
\(\texttt{fun}\) > 1000
\(\texttt{attr} \times \texttt{intel}\) 37.46
\(\texttt{fun} \times \texttt{intel}\) 2.05

Table 16.3 shows the Bayesian equivalent of the factorial ANOVA reported in Section 8.2.1 . The results show “extreme” evidence for an effect of experimenter belief, and no evidence for an effect of power prime, nor for an interaction between power prime and experimenter belief. In the Frequentist null-hypothesis significance test, the absence of a significant test result can not be taken as direct evidence for the null hypothesis. There is actually no straightforward way to quantify the evidence for the null hypothesis in a Frequentist framework. This is not so for the Bayesian hypothesis tests. Indeed, the Bayes factor directly quantifies the relative evidence for either the alternative or null hypothesis. Hence, we find “moderate” evidence that the null hypothesis is true for power prime, and for the interaction between power prime and experimenter belief. This ability to quantify evidence both for and against the null hypothesis is one of the major benefits of a Bayesian hypothesis testing procedure.

Table 16.3: Results of a Bayesian factorial ANOVA analysis for the social priming data (cf Table ) with a default JZS prior with a ‘medium’ scaling factor of \(r = 1/2\) (for ANOVA models, default scaling factors are \(1/2\), \(\sqrt{2}/2\), and \(1\) for medium, wide, and ultrawide, respectively; this assumes standard effect coding for the contrast-coded predictors, which then matches the priors to those set for the linear regression model). The test of each effect compares the full model to one with that effect excluded.
effect BF
\(\texttt{P}\) 0.127
\(\texttt{B}\) 537.743
\(\texttt{P} \times \texttt{B}\) 0.216

16.5 Some objections to null-hypothesis significance testing

Above, we have presented a Bayesian alternative to the traditional Frequentist null-hypothesis significance testing (NHST) procedure. While still the dominant method of statistical inference in psychology, the appropriateness of the NHST has been hotly debated almost since its inception ( Cohen, 1994 ; Nickerson, 2000 ; Wagenmakers, 2007 ) . One issue is that a significant test result is not the same as a “theoretical” or “practical” significance. For a given true effect not equal to 0, the (expected) \(p\) -value becomes smaller and smaller as the sample size increases, because of the increased power in detecting that effect. As a result, even the smallest effect size will become significant for a sufficiently large sample size. For example, a medicine might result in a significant decrease of a symptom compared to a placebo, even if the effect is hardly noticeable to the patient. I should point out that this is more an issue with testing a “point” null hypothesis (e.g. the hypothesis that the effect is exactly equal to 0), rather than an issue with the Frequentist procedure per se. It is an important limitation of null hypothesis testing procedures in general. A similar objection to these hypotheses is that the null hypothesis is unlikely to ever be exactly true. Thompson ( 1992 ) states the potential issues strongly as:

Statistical significance testing can involve a tautological logic in which tired researchers, having collected data on hundreds of subjects, then, conduct a statistical test to evaluate whether there were a lot of subjects, which the researchers already know, because they collected the data and know they are tired. This tautology has created considerable damage as regards the cumulation of knowledge. ( Thompson, 1992, p. 436 )

There are other objections, which I will go into in the following sections.

16.5.1 The \(p\) -value is not a proper measure of evidential support

It is common practice to interpret the magnitude of the \(p\) -value as an indication of the strength of the evidence against the null hypothesis. That is, a smaller \(p\) -value is taken to indicate stronger evidence against the null hypothesis than a larger \(p\) -value. Indeed, Fisher himself seems to have subscribed to this view ( Wagenmakers, 2007 ) . While it is true that the magnitude is often correlated with the strength of evidence, there are some tricky issues regarding this. If a \(p\) -value were a “proper” measure of evidential support, then if two experiments provide the same \(p\) -value, they should provide the same support against the null hypothesis. But what if the first experiment had a sample size of 10, and the second a sample size of 10,000? Would a \(p\) -value of say \(p=.04\) indicate the same evidence against the null-hypothesis? The general consensus is that sample size is an important consideration in the interpretation of the \(p\) -value, although not always for the same reason. On the one hand, many researchers argue that the \(p\) -value of the larger study provides stronger evidence, possibly because the significant result in the larger study might be less likely due to random sample variability (see e.g. Rosenthal & Gaito, 1963 ) . On the other hand, it can be argued that the smaller study actually provides stronger evidence, because to obtain the same \(p\) -value, the effect size must be larger in the smaller study. Bayesian analysis suggests the latter interpretation is the correct one ( Wagenmakers, 2007 ) . That the same \(p\) -value can indicate a different strength of evidence means that the \(p\) -value does not directly reflect evidential support (at least not without considering the sample size).

Another thing worth pointing out is that, if the null hypothesis is true, any \(p\)-value is equally likely. This is by definition. Remember that the \(p\)-value is defined as the probability of obtaining the same test statistic, or one more extreme, assuming the null-hypothesis is true. A \(p\)-value of say \(p=.04\) indicates that you would expect to find an equal or more extreme value of the test statistic in 4% of all possible replications of the experiment. Conversely, in 4% of all replications you would obtain a \(p\)-value of \(p \leq .04\). For a \(p\)-value of \(p=.1\), you would expect to find a similar or smaller \(p\)-value in 10% of all replications of the experiment. The only distribution for which this relation between the value (\(p\)) and the probability of obtaining a value equal to or smaller than it, \(p(p\text{-value} \leq p)\), holds is the uniform distribution. So, when the null hypothesis is true, there is no reason to expect a large \(p\)-value, because every \(p\)-value is equally likely. When the null hypothesis is false, smaller \(p\)-values are more likely than higher \(p\)-values, especially as the sample size increases. This is shown by simulation for a one-sample t-test in Figure 16.4. Under the null hypothesis (left plot), the distribution of the \(p\)-values is uniform.
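A sketch of such a simulation (not the book's own code):

```python
# Distribution of p-values from a one-sample t-test under the null (delta = 0)
# and under a true effect (delta = .3), for different sample sizes.
import numpy as np
from scipy.stats import ttest_1samp
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def simulate_p_values(delta, n, n_sims=10_000):
    data = rng.normal(loc=delta, scale=1.0, size=(n_sims, n))
    return ttest_1samp(data, popmean=0, axis=1).pvalue

for delta, n in [(0.0, 10), (0.3, 10), (0.3, 50)]:
    plt.hist(simulate_p_values(delta, n), bins=20, density=True,
             histtype="step", label=f"delta = {delta}, n = {n}")

plt.xlabel("p-value")
plt.ylabel("density")
plt.legend()
plt.show()
```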

Figure 16.4: Distribution of \(p\) -values for 10,000 simulations of a one-sample \(t\) -test. \(\delta = \frac{\mu - \mu_0}{\sigma}\) refers to the effect size. Under the null hypothesis (left plot; \(\delta = 0\) ) the distribution of the \(p\) -values is uniform. When the null-hypothesis is false ( \(\delta = .3\) ), the distribution is skewed, with smaller \(p\) -values being more probable, especially when the sample size is larger (compare the middle plot with \(n=10\) to the right-hand plot with \(n=50\) ).

16.5.2 The \(p\) -value depends on researcher intentions

The sampling distribution of a test statistic is the distribution of the values of the statistic calculated for an infinite number of datasets produced by the same Data Generating Process (DGP). The DGP includes all the relevant factors that affect the data, including not only characteristics of the population under study, but also characteristics of the study, such as whether participants were randomly sampled, how many participants were included, which measurement tools were used, etc. Choices such as when to stop collecting data are part of the study design. That means that the same data can have a different \(p\) -value, depending on whether the sample size was fixed a priori, or whether sampling continued until some criterion was reached. The following story, paraphrased from ( Berger & Wolpert, 1988, pp. 30–33 ) , may highlight the issue:

A scientist has obtained 100 independent observations that are assumed to be Normal-distributed with mean \(\mu\) and standard deviation \(\sigma\). In order to test the null hypothesis that \(\mu=0\), the scientist consults a Frequentist statistician. The mean of the observations is \(\overline{Y} = 0.2\), and the sample standard deviation is \(S_Y=1\), hence the \(p\)-value is \(p = .0482\), which is a little lower than the adopted significance level of \(\alpha = .05\). This leads to a rejection of the null hypothesis, and a happy scientist. However, the statistician decides to probe deeper and asks the scientist what she would have done in case the experiment had not yielded a significant result after 100 observations. The scientist replies she would have collected another 100 observations. As such, the implicit sampling plan was not to collect \(n=100\) observations and stop, but rather to first take 100 observations and check whether \(p <.05\), and collect another 100 observations (resulting in \(n=200\)) if not. This is a so-called sequential testing procedure, and requires a different treatment than a fixed-sampling procedure. In controlling the Type 1 error of the procedure as a whole, one would need to consider the possible results after \(n=100\) observations, but also after \(n=200\) observations, which is possible, but not straightforward, as the results after \(n=200\) observations are dependent on the results after \(n=100\) observations. But the clever statistician works it out and then convinces the scientist that the appropriate p-value for this sequential testing procedure is no longer significant. The puzzled and disappointed scientist leaves to collect another 100 observations. After lots of hard work, the scientist returns, and the statistician computes a \(p\)-value for the new data, which is now significant. Just to make sure the sampling plan is appropriately reflected in the calculation, the statistician asks what the scientist would have done if the result had not been significant at this point. The scientist answers “This would depend on the status of my funding; if my grant is renewed, I would test another 100 observations. If my grant is not renewed, I would have had to stop the experiment. Not that this matters, of course, because the data were significant anyway”. The statistician then explains that the correct inference depends on the grant renewal; if the grant is not renewed, the sampling plan stands and no correction is necessary. But if the grant is renewed, the scientist could have collected more data, which calls for a further correction, similar to the first one. The annoyed scientist then leaves and resolves to never again share with the statistician her hypothetical research intentions.

What this story shows is that in considering infinite possible repetitions of a study, everything about the study that might lead to variations in the results should be taken into account. This includes a scientist’s decisions made during each hypothetical replication of the study. As such, the interpretation of the data at hand (i.e., whether the hypothesis test is significant or not significant) depends on hypothetical decisions in situations that did not actually occur. If exactly the same data had been collected by a scientist who would not have collected more observations, regardless of the outcome of the first test, then the result would have been judged significant. So the same data can provide different evidence. This does not mean the Frequentist NHST is inconsistent. The procedure “does what it says on the tin”, namely providing a bound on the rate of Type 1 errors in decisions, when the null hypothesis is true. In considering the accuracy of the decision procedure, we need to consider all situations in which a decision might be made in the context of a given study. This means considering the full design of the study, including the sampling plan, as well as, to some extent, the analysis plan. For instance, if you were to “explore” the data, trying out different ways to analyse the data, by e.g. including or excluding potential covariates and applying different criteria to excluding participants or their responses until you obtain a significant test result for an effect of interest, then the significance level \(\alpha\) for that test needs to be adjusted to account for such a fishing expedition. This fishing expedition is also called p-hacking ( Simmons, Nelson, & Simonsohn, 2011 ) and there really isn’t a suitable correction for it. Although corrections for multiple comparisons exist, which allow you to test all possible comparisons within a single model (e.g. the Scheffé correction), when you go on to consider different models, and different subsets of the data to apply that model to, all bets are off. This, simply put, is just really bad scientific practice. And it would render the \(p\) -value meaningless.

16.5.3 Results of a NHST are often misinterpreted

I have said it before, and I will say it again: the \(p\) -value is the probability of observing a particular value of a test statistic, or one more extreme, given that the null-hypothesis is true. This is the proper, and only, interpretation of the \(p\) -value. It is a tricky one, to be sure, and the meaning of the \(p\) -value is often misunderstood. Some common misconceptions (see e.g., Nickerson, 2000 ) are:

  • The \(p\) -value is the probability that the null-hypothesis is true, given the data, i.e.  \(p = p(H_0|\text{data})\) . This posterior probability can be calculated in a Bayesian framework, but not in a Frequentist one.
  • One minus the \(p\) -value is the probability that the alternative hypothesis is true, given the data, i.e.  \(1-p = p(H_1|\text{data})\) . Again, the posterior probability of the alternative hypothesis can be obtained in a Bayesian framework, when the alternative hypothesis is properly defined by a suitable prior distribution. In the conventional Frequentist NHST, the alternative hypothesis is so poorly defined, that it can’t be assigned any probability (apart from perhaps \(p(H_1) = p(H_1|\text{data}) = 1\) , which does not depend on the data, and just reflects that e.g.  \(-\infty \leq \mu - \mu_0 \leq \infty\) will have some value).
  • The \(p\) -value is the probability that the results were due to random chance. If you take a statistical model seriously, then all results are, to some extent, due to random chance. Trying to work out the probability that something is a probability seems a rather pointless exercise (if you want to know the answer, it is 1. It would have been more fun if the answer was 42, but alas, the scale of probabilities does not allow this particular answer).

Misinterpretations of \(p\) -values are mistakes by practitioners, and do not indicate a problem with NHST itself. However, they do point to a mismatch between what the procedure provides, and what the practitioner would like the procedure to provide. If one desires to know the probability that the null hypothesis is true, or the probability that the alternative hypothesis is true, then one has to use a Bayesian procedure. Unless you consider a wider context, where the truth of hypotheses can be sampled from a distribution, there is no “long-run frequency” for the truth of hypotheses, and hence no Frequentist definition of that probability.

16.6 To Bayes or not to Bayes? A pragmatic view

At this point, you might feel slightly annoyed. Perhaps even very annoyed. We have spent all the preceding chapters focusing on the Frequentist null hypothesis significance testing procedure, and after all that work I’m informing you of these issues. Why? Was all that work for nothing?

No, obviously not. Although much of the criticism regarding the NHST is appropriate, as long as you understand what it does and apply the procedure properly, there is no need to abandon it. The NHST is designed to limit the rate of Type 1 errors (rejecting the null hypothesis when it is true). It does this well. And, when using the appropriate test statistic, in the most powerful way possible. Limiting Type 1 errors is, whilst modest, a reasonable concern in scientific practice. The Bayesian alternative allows you to do more, such as evaluate the relative evidence for and against the null hypothesis, and even calculate the posterior probability of both (as long as you are willing to assign a prior probability to both as well).

An advantage of the NHST is its “objectiveness”: once you have determined a suitable distribution of the data, and decided on a particular value for a parameter to test, there are no other decisions to make apart from setting the significance level of the test. In the Bayesian hypothesis testing procedure, you also need to specify a prior distribution for the parameter of interest in the alternative hypothesis. Although considering what parameter values you would expect if the null hypothesis were false is an inherently important consideration, it is often not straightforward when you start a research project, or rely on measures you have not used before in a particular context. Although much work has been devoted to deriving sensible “default priors”, I don’t believe there is a sensible objective prior applicable to all situations. Given the freedom to choose a prior distribution for the alternative hypothesis, this makes the Bayesian testing procedure inherently subjective. This is perfectly in keeping with the subjectivist interpretation of probability as the rational belief of an agent endowed with (subjective) prior beliefs. Moreover, as you accumulate more and more data, the effect of the prior beliefs “washes out” (as long as you don’t assign a probability of zero to the true parameter value).

My pragmatic answer to the question whether you should use a Bayesian test or a Frequentist one is then the following: if you can define a suitable prior distribution to reflect what you expect to observe in a study, before you actually conduct that study, then use a Bayesian testing procedure. This will allow you to do what you most likely would like to do, namely quantify the evidence for your hypotheses against alternative hypotheses. If you are unable to form any expectations regarding the effects within your study, you probably should consider a traditional NHST to assess whether there is an indication of any effect, and limiting your Type 1 error rate in doing so. In some sense, this is a “last resort”, but in psychology, where quantitative predictions are inherently difficult, something I reluctantly have to rely on quite frequently. Instead of a hypothesis test, you could also consider simply estimating the effect-size in that case, with a suitable credible interval.

16.7 In practice

The steps involved in conducting a Bayesian hypothesis test are not too different from the steps involved in conducting a Frequentist hypothesis test, with the additional step of choosing prior distributions over the values of the model parameters.

Explore the data. Plot distributions of the data within the conditions of the experiment (if any), pairwise scatterplots between numerical predictors and the dependent variable, etc. Consider what model you might use to analyse the data, and assess the validity of the underlying assumptions.

Choose an appropriate general statistical model. In many cases, this will be a version of the GLM or an extension such as a linear mixed-effects model, perhaps using suitably transformed dependent and independent variables.

Choose appropriate prior distributions for the model parameters. This is generally the most difficult part. If you have prior data, then you could base the prior distributions on this. If not, then ideally, formulate prior distributions which reflect your beliefs about the data. You can check whether the prior distributions lead to sensible predictions by simulating data from the resulting model (i.e., computing prior predictive distributions). Otherwise, you can resort to “default” prior distributions.

Conduct the analysis. To test null-hypotheses, compare the general model to a set of restricted models which fix a parameter to a particular value (e.g. 0), and compute the Bayes Factor for each of these comparisons. To help you interpret the magnitude of the Bayes Factor, you can consult Table 16.1 . Where possible, consider conducting a robustness analysis, by e.g. varying the scaling factor of the prior distributions. This will inform you about the extent to which the results hinge on a particular choice of prior, or whether they hold for a range of prior distributions.

Report the results. Make sure that you describe the statistical model, as well as the prior distributions chosen. The latter is crucial, as Bayes Factors are not interpretable without knowing the prior distributions. For example, the results of the analysis in Table 16.2 , with additional results from the posterior parameter distributions, may be reported as:

To analyse the effect of rated attractiveness, intelligence, and fun on the liking of dating partners, we used a Bayesian linear regression analysis ( Rouder & Morey, 2012 ) . In the model, we allowed the effect of attractiveness and fun to be moderated by intelligence. All predictors were mean-centered before entering the analysis. We used a default JZS-prior for all parameters, with a medium scaling factor of \(r = \sqrt{2}/4\) , as recommended by Richard D. Morey & Rouder ( 2018 ) . The analysis showed “extreme” evidence for effects of attractiveness, intelligence, and fun ( \(\text{BF}_{1,0} > 1000\) ; comparing the model to one with a point-prior at 0 for each effect). All effects were positive, with the posterior means of the slopes equalling \(\hat{\beta}_\text{attr} = 0.345\) , 95% HDI [0.309; 0.384], \(\hat{\beta}_\text{intel} = 0.257\) , 95% HDI [0.212; 0.304], and \(\hat{\beta}_\text{fun} = 0.382\) , 95% HDI [0.342; 0.423]. In addition, we found “very strong” evidence for a moderation of the effect of attractiveness by intelligence ( \(\text{BF}_{1,0} = 37.459\) ). For every one-unit increase in rated intelligence, the effect of attractiveness reduced by \(\hat{\beta}_{\text{attr} \times \text{intel}} = 0.043\) , 95% HDI [-0.066; -0.02]. There was only “anecdotal” evidence for a moderation of the effect of fun by intelligence ( \(\text{BF}_{1,0} = 2.052\) ). Although we don’t place too much confidence in this result, it indicates that for every one-unit increase in rated intelligence, the effect of fun increased by \(\hat{\beta}_{\text{fun} \times \text{intel}} = 0.032\) , 95% HDI [0.01; 0.055].

16.8 “Summary”

Figure 16.5: ‘Piled Higher and Deeper’ by Jorge Cham www.phdcomics.com. Source: https://phdcomics.com/comics/archive.php?comicid=905

An Introduction to Data Analysis

11 Bayesian hypothesis testing

This chapter introduces common Bayesian methods of testing what we could call statistical hypotheses . A statistical hypothesis is a hypothesis about a particular model parameter or a set of model parameters. Most often, such a hypothesis concerns one parameter, and the assumption in question is that this parameter takes on a specific value, or some value from a specific interval. Henceforth, we speak just of a “hypothesis” even though we mean a specific hypothesis about particular model parameters. For example, we might be interested in what we will call a point-valued hypothesis , stating that the value of parameter \(\theta\) is fixed to a specific value \(\theta = \theta^*\) . Section 11.1 introduces different kinds of statistical hypotheses in more detail.

Given a statistical hypothesis about parameter values, we are interested in “testing” it. Strictly speaking, the term “testing” should probably be reserved for statistical decision procedures which give clear categorical judgements, such as whether to reject a hypothesis, accept it as true, or to withhold judgement because no decision can be made (yet/currently). While we will encounter such categorical decision routines in this chapter, Bayesian approaches to hypothesis “testing” are first and foremost concerned, not with categorical decisions, but with quantifying evidence in favor of or against the hypothesis in question. (In a second step, using Bayesian decision theory, which also weighs in the utility of different policy choices, we can of course use Bayesian inference for informed decision making.) But instead of speaking of “Bayesian inference to weigh evidence for/against a hypothesis” we will just speak of “Bayesian hypothesis testing” for brevity.

We consider two conceptually distinct approaches within Bayesian hypothesis testing.

  • Estimation-based testing considers just one model. It uses the observed data \(D_\text{obs}\) to retrieve posterior beliefs \(P(\theta \mid D_{\text{obs}})\) and checks whether, a posteriori , our hypothesis is credible.
  • Comparison-based testing uses Bayesian model comparison, in the form of Bayes factors, to compare two models, namely one model that assumes that the hypothesis in question is true, and one model that assumes that the complement of the hypothesis is true.

The main difference between these two approaches is that estimation-based hypothesis testing is simpler (conceptually and computationally), but less informative than comparison-based hypothesis testing. In fact, comparison-based methods give a clearer picture of the quantitative evidence for/against a hypothesis because they explicitly take into account a second alternative to the hypothesis which is to be tested. As we will see in this chapter, the technical obstacles for comparison-based approaches can be overcome. For special but common use cases, like testing directional hypotheses, there are efficient methods of performing comparison-based hypothesis testing.
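As a small illustration of the two approaches (a sketch, not code from this book), consider a point-valued hypothesis \(\theta = 0.5\) in a binomial model with a flat Beta(1, 1) prior; the data below are illustrative:

```python
# Estimation-based vs. comparison-based testing of theta = 0.5 for binomial data.
from scipy.stats import beta

k, n = 12, 14                      # illustrative data: k successes in n trials
theta_star = 0.5
posterior = beta(1 + k, 1 + (n - k))

# Estimation-based: is theta_star credible a posteriori?
central_ci = posterior.ppf([0.025, 0.975])
print("95% credible interval:", central_ci)   # theta_star = 0.5 falls outside it

# Comparison-based: Savage-Dickey density ratio for the nested point hypothesis,
# BF_01 = posterior density at theta_star / prior density at theta_star.
prior = beta(1, 1)
bf_01 = posterior.pdf(theta_star) / prior.pdf(theta_star)
print("BF_01 =", round(bf_01, 3), "  BF_10 =", round(1 / bf_01, 3))
```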

The learning goals for this chapter are:

  • point-valued, ROPE-d and directional hypotheses
  • complement / alternative hypothesis
  • be able to apply Bayesian hypothesis testing to (simple) case studies
  • understand and be able to apply the Savage-Dickey method (and its extension to interval-based hypotheses in terms of encompassing models )
  • become familiar with a Bayesian \(t\) -test model for comparing the means of two groups of metric measurements

Bayesian Hypothesis Testing with PyMC

Austin Rochford

Lately I’ve been reading the excellent, open source book Probabilistic Programming and Bayesian Methods for Hackers . The book’s prologue opens with the following line.

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis.

As a mathematician with training in statistics as well, this comment rings quite true. Even with a firm foundation in the probability theory necessary for Bayesian inference, the calculations involved are often incredibly tedious. An introduction to Bayesian statistics often involves the hazing ritual of showing that, if \(X | \mu \sim N(\mu, \sigma^2)\) and \(\mu \sim N(\mu_0, \sigma_0^2)\) where \(\sigma^2\) , \(\mu_0\) , and \(\sigma_0^2\) are known, then

\[\mu | X \sim N \left(\frac{\frac{X}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}},\ \left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}\right).\]

If you want to try it yourself, Bayes' Theorem and this polynomial identity will prove helpful.

Probabilistic Programming and Bayesian Methods for Hackers emphasizes the utility of PyMC both for eliminating tedious, error-prone calculations and for approximating posterior distributions in models that have no exact solution. I was recently reminded of PyMC’s ability to eliminate tedious, error-prone calculations during my statistics final, which contained a problem on Bayesian hypothesis testing . This post aims to present both the exact theoretical analysis of this problem and its approximate solution using PyMC.

The problem is as follows.

Given that \(X | \mu \sim N(\mu, \sigma^2)\) , where \(\sigma^2\) is known, we wish to test the hypothesis \(H_0: \mu = 0\) vs.  \(H_A: \mu \neq 0\) . To do so in a Bayesian framework, we need prior probabilities on each of the hypotheses and a distribution on the parameter space of the alternative hypothesis. We assign the hypotheses equal prior probabilities, \(\pi_0 = \frac{1}{2} = \pi_A\) , which indicates that prior to observing the value of \(X\) , we believe that each hypothesis is equally likely to be true. We also endow the alternative parameter space, \(\Theta_A = (-\infty, 0) \cup (0, \infty)\) , with a \(N(0, \sigma_0^2)\) distribution, where \(\sigma_0^2\) is known.

The Bayesian framework for hypothesis testing relies on the calculation of the posterior odds of the hypotheses,

\[\textrm{Odds}(H_A|x) = \frac{P(H_A | x)}{P(H_0 | x)} = BF(x) \cdot \frac{\pi_A}{\pi_0},\]

where \(BF(x)\) is the Bayes factor. In our situation, the Bayes factor is

\[BF(x) = \frac{\int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu}{f(x|0)}.\]

The Bayes factor is the Bayesian counterpart of the likelihood ratio , which is ubiquitous in frequentist hypothesis testing. The idea behind Bayesian hypothesis testing is that we should choose whichever hypothesis better explains the observation, so we reject \(H_0\) when \(\textrm{Odds}(H_A) > 1\) , and accept \(H_0\) otherwise. In our situation, \(\pi_0 = \frac{1}{2} = \pi_A\) , so \(\textrm{Odds}(H_A) = BF(x)\) . Therefore we base our decision on the value of the Bayes factor.

In the following sections, we calculate this Bayes factor exactly and approximate it with PyMC. If you’re only interested in the simulation and would like to skip the exact calculation (I can’t blame you) go straight to the section on PyMC approximation .

Exact Calculation

The calculation becomes (somewhat) simpler if we reparametrize the normal distribution using its precision instead of its variance. If \(X\) is normally distributed with variance \(\sigma^2\) , its precision is \(\tau = \frac{1}{\sigma^2}\) . When a normal distribution is parametrized by precision, its probability density function is

\[f(x|\mu, \tau) = \sqrt{\frac{\tau}{2 \pi}} \textrm{exp}\left(-\frac{\tau}{2} (x - \mu)^2\right).\]

Reparametrizing the problem in this way, we get \(\tau = \frac{1}{\sigma^2}\) and \(\tau_0 = \frac{1}{\sigma_0^2}\) , so

\[\begin{align} f(x|\mu) \rho_A(\mu) & = \left(\sqrt{\frac{\tau}{2 \pi}} \textrm{exp}\left(-\frac{\tau}{2} (x - \mu)^2\right)\right) \cdot \left(\sqrt{\frac{\tau_0}{2 \pi}} \textrm{exp}\left(-\frac{\tau_0}{2} \mu^2\right)\right) \\ & = \frac{\sqrt{\tau \cdot \tau_0}}{2 \pi} \cdot \textrm{exp} \left(-\frac{1}{2} \left(\tau (x - \mu)^2 + \tau_0 \mu^2\right)\right). \end{align}\]

Focusing momentarily on the sum of quadratics in the exponent, we rewrite it as \[\begin{align} \tau (x - \mu)^2 + \tau_0 \mu^2 & = \tau x^2 + (\tau + \tau_0) \left(\mu^2 - 2 \frac{\tau}{\tau + \tau_0} \mu x\right) \\ & = \tau x^2 + (\tau + \tau_0) \left(\left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2 - \left(\frac{\tau}{\tau + \tau_0}\right)^2 x^2\right) \\ & = \left(\tau - \frac{\tau^2}{\tau + \tau_0}\right) x^2 + (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2 \\ & = \frac{\tau \tau_0}{\tau + \tau_0} x^2 + (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2. \end{align}\]

Therefore \[\begin{align} \int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu & = \frac{\sqrt{\tau \tau_0}}{2 \pi} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right) \int_{-\infty}^\infty \textrm{exp}\left(-\frac{1}{2} (\tau + \tau_0) \left(\mu - \frac{\tau}{\tau + \tau_0} x\right)^2\right)\ d\mu \\ & = \frac{\sqrt{\tau \tau_0}}{2 \pi} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right) \cdot \sqrt{\frac{2 \pi}{\tau + \tau_0}} \\ & = \frac{1}{\sqrt{2 \pi}} \cdot \sqrt{\frac{\tau \tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right). \end{align}\]

The denominator of the Bayes factor is \[\begin{align} f(x|0) & = \sqrt{\frac{\tau}{2 \pi}} \cdot \textrm{exp}\left(-\frac{\tau}{2} x^2\right), \end{align}\] so the Bayes factor is \[\begin{align} BF(x) & = \frac{\frac{1}{\sqrt{2 \pi}} \cdot \sqrt{\frac{\tau \tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{1}{2} \left(\frac{\tau \tau_0}{\tau + \tau_0}\right) x^2\right)}{\sqrt{\frac{\tau}{2 \pi}} \cdot \textrm{exp}\left(-\frac{\tau}{2} x^2\right)} \\ & = \sqrt{\frac{\tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(-\frac{\tau}{2} \left(\frac{\tau_0}{\tau + \tau_0} - 1\right) x^2\right) \\ & = \sqrt{\frac{\tau_0}{\tau + \tau_0}} \cdot \textrm{exp}\left(\frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2\right). \end{align}\]

From above, we reject the null hypothesis whenever \(BF(x) > 1\) , which is equivalent to \[\begin{align} \textrm{exp}\left(\frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2\right) & > \sqrt{\frac{\tau + \tau_0}{\tau_0}}, \\ \frac{1}{2} \left(\frac{\tau^2}{\tau + \tau_0}\right) x^2 & > \frac{1}{2} \log\left(\frac{\tau + \tau_0}{\tau_0}\right),\textrm{ and} \\ x^2 & > \left(\frac{\tau + \tau_0}{\tau^2}\right) \cdot \log\left(\frac{\tau + \tau_0}{\tau_0}\right). \end{align}\]

As you can see, this calculation is no fun. I’ve even left out a lot of details that are only really clear once you’ve done this sort of calculation many times. Let’s see how PyMC can help us avoid this tedium.

PyMC Approximation

PyMC is a Python module that uses Markov chain Monte Carlo methods (and others) to fit Bayesian statistical models. Markov chains , Monte Carlo methods , and Markov chain Monte Carlo methods are each important topics in their own right; for our purposes, what matters is that PyMC provides a set of tools to approximate marginal and posterior distributions of Bayesian statistical models.

To solve this problem, we will use PyMC to approximate \(\int_{\Theta_A} f(x|\mu) \rho_A(\mu)\ d\mu\) , the numerator of the Bayes factor. This quantity is the marginal distribution of the observation, \(X\) , under the alternative hypothesis.

We begin by importing the necessary Python packages.
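
The post’s import cell is not reproduced above, so here is a plausible set of imports for the rest of the walkthrough; it assumes the legacy PyMC 2 interface that the narration matches (current PyMC releases use a different, model-context API):

```python
# Assumed imports for the sketches below (PyMC 2-style API).
import numpy as np
import pymc                                   # legacy PyMC 2
import scipy.optimize as opt                  # used later for the critical value
from scipy.stats import gaussian_kde, norm    # KDE and exact normal density
import matplotlib.pyplot as plt
```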

In order to use PyMC to approximate the Bayes factor, we must fix numeric values of \(\sigma^2\) and \(\sigma_0^2\) . We use the values \(\sigma^2 = 1\) and \(\sigma_0^2 = 9\) .
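
In code, and in the precision parametrization used by the exact calculation, this might look like the following (the variable names are assumptions, not the post’s originals):

```python
sigma2 = 1.0            # sigma^2, the known sampling variance of X given mu
sigma02 = 9.0           # sigma_0^2, the prior variance of mu under H_A
tau = 1.0 / sigma2      # precision of X given mu
tau0 = 1.0 / sigma02    # precision of the prior on mu
```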

We now initialize the random variable \(\mu\) .
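
A sketch of that initialization in the PyMC 2 API; since \(\mu \sim N(0, \sigma_0^2)\), its precision is tau0:

```python
mu = pymc.Normal('mu', mu=0, tau=tau0)   # prior on mu under the alternative hypothesis
```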

Note that PyMC’s normal distribution is parametrized in terms of the precision and not the variance. When we initialize the variable x , we use the variable mu as its mean.
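
Continuing the sketch, x is normal with mean mu and precision tau:

```python
x = pymc.Normal('x', mu=mu, tau=tau)     # X | mu ~ N(mu, sigma^2)
```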

PyMC now knows that the distribution of x depends on the value of mu and will respect this relationship in its simulation.

We now instantiate a Markov chain Monte Carlo sampler, and use it to sample from the distributions of mu and x .
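
A sketch of those two steps, so the burn-in discussion below has something concrete to refer to:

```python
mcmc = pymc.MCMC([mu, x])
mcmc.sample(50000, 10000)   # 50,000 iterations, first 10,000 discarded as burn-in
```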

In the second line above, we tell PyMC to run the Markov chain Monte Carlo simulation for 50,000 steps, using the first 10,000 steps as burn-in, and then count each of the last 40,000 steps towards the sample. The burn-in period is the number of samples we discard from the beginning of the Markov chain Monte Carlo algorithm. A burn-in period is necessary to assure that the algorithm has converged to the desired distribution before we sample from it.

Finally, we may obtain samples from the distribution of \(X\) under the alternative hypothesis.
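
In PyMC 2 the retained draws can be read off the trace (the name x_samples is carried through the rest of the post):

```python
x_samples = mcmc.trace('x')[:]   # 40,000 draws of X under the alternative hypothesis
```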

From our exact calculations, we expect that, given the alternative hypothesis, \(X \sim N\left(0, \frac{\tau \tau_0}{\tau + \tau_0}\right)\) (in the precision parametrization, i.e., with variance \(\sigma^2 + \sigma_0^2\)). The following chart shows both the histogram derived from x_samples and the probability density function of \(X\) under the alternative hypothesis.

Histogram of the x_samples draws with the exact density of X under the alternative hypothesis overlaid.

It’s always nice to see agreement between theory and simulation. The problem now is that we need to evaluate the probability density function of \(X|H_A\) at the observed point \(X = x_0\) , but the simulation only gives us samples of \(X\) (summarized by the histogram computed from x_samples ), not its density. Enter kernel density estimation , a nonparametric method for estimating the probability density function of a random variable from samples. Fortunately, SciPy provides an excellent module for kernel density estimation .
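
A minimal sketch of that step with SciPy’s gaussian_kde:

```python
x_pdf = gaussian_kde(x_samples)   # kernel density estimate of X | H_A
```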

We define two functions, the first of which gives the simulated Bayes factor, and the second of which gives the exact Bayes factor.
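
The post’s function bodies are not shown above; a sketch consistent with its description and with the names used later (bayes_factor_sim, bayes_factor_exact) is:

```python
def bayes_factor_sim(x_obs):
    """Simulated BF: KDE of X | H_A at x_obs over the exact density f(x_obs | mu = 0)."""
    return x_pdf.evaluate(x_obs) / norm.pdf(x_obs, 0, np.sqrt(sigma2))

def bayes_factor_exact(x_obs):
    """Closed-form BF(x) derived in the exact calculation above."""
    return np.sqrt(tau0 / (tau + tau0)) * np.exp(0.5 * tau**2 / (tau + tau0) * x_obs**2)
```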

The following figure shows the excellent agreement between the simulated and calculated Bayes factors.

Simulated Bayes factor plotted against the exact Bayes factor.

The only reason it’s possible to see the red graph of simulated Bayes factor behind the blue graph of the exact Bayes factor is that we’ve doubled the width of the red graph. In fact, on the interval \([0, 2]\) , the maximum relative error of the simulated Bayes factor is approximately 0.9%.

We now use the simulated Bayes factor to approximate the critical value for rejecting the null hypothesis.
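
A sketch of that search; the helper name matches the one the next paragraph refers to, and the bracket [0, 2] is an assumption that comfortably contains the crossing point:

```python
def bayes_factor_crit_helper(x_obs):
    # float() collapses the length-one array returned by the KDE
    return float(bayes_factor_sim(x_obs)) - 1.0

x_crit = opt.brentq(bayes_factor_crit_helper, 0.0, 2.0)
print(x_crit)                       # roughly 1.6 for sigma^2 = 1, sigma_0^2 = 9
print(bayes_factor_exact(x_crit))   # should be very close to one
```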

The SciPy function opt.brentq is used to find a solution of the equation bayes_factor_sim(x) = 1 , which is equivalent to finding a zero of bayes_factor_crit_helper . We plug x_crit into bayes_factor_exact in order to verify that we have, in fact, found the critical value.

This value is quite close to one, so we have in fact approximated the critical point well.

It’s interesting to note that we have used PyMC in a somewhat odd way here, to approximate the marginal distribution of \(X\) under the alternative hypothesis. A much more typical use of PyMC and its Markov chain Monte Carlo would be to fix an observed value of \(X\) and approximate the posterior distribution of \(\mu\) given this observation.


A Review of Bayesian Hypothesis Testing and Its Practical Implementations

Associated Data

The sleep and the ToothGrowth datasets are built in R. The Poisson repeated-measures dataset is simulated according to Appendix A .

We discuss hypothesis testing and compare different theories in light of observed or experimental data as fundamental endeavors in the sciences. Issues associated with the p -value approach and null hypothesis significance testing are reviewed, and the Bayesian alternative based on the Bayes factor is introduced, along with a review of computational methods and sensitivity related to prior distributions. We demonstrate how Bayesian testing can be practically implemented in several examples, such as the t -test, two-sample comparisons, linear mixed models, and Poisson mixed models by using existing software. Caveats and potential problems associated with Bayesian testing are also discussed. We aim to inform researchers in the many fields where Bayesian testing is not in common use of a well-developed alternative to null hypothesis significance testing and to demonstrate its standard implementation.

1. Introduction

Hypothesis testing is an important tool in modern research. It is applied in a wide range of fields, from forensic analysis, business intelligence, and manufacturing quality control, to the theoretical framework of assessing the plausibility of theories in physics, psychology, and fundamental science [ 1 , 2 , 3 , 4 , 5 ]. The task of comparing competing theories based on data is essential to scientific activity, and therefore, the mechanism of conducting these comparisons requires thoughtful consideration [ 6 , 7 ].

The dominant approach for these comparisons is based on hypothesis testing using a p -value, which is the probability, under repeated sampling, of obtaining a test statistic at least as extreme as the observed under the null hypothesis [ 4 , 8 ]. Records of conceptualizing the p -value date back at least two hundred years before Ronald Fisher established the p -value terminology and technique [ 9 , 10 , 11 ]. These records are an indication of how compelling and popular the approach is, and the long history explains the widespread acceptance of a decision rule with a fixed type I error rate, which further resulted in the adoption of a 5% significance-level cutoff. Despite its prevalence, there has been an intense debate about the misuse of the p -value approach [ 7 , 12 ]. The major criticisms about the p -value are its inability to quantify evidence for the null hypothesis and its tendency to overestimate the evidence against the null hypothesis [ 4 ]. For example, a possible decision based on the p -value is the rejection of the null hypothesis but not its acceptance. Under the null hypothesis, the  p -value will have a uniform [0, 1] distribution regardless of the sample size. This is by construction. The Bayesian approach behaves rather differently under the null hypothesis, and increasing sample sizes will provide increasing degrees of evidence in favor of the null hypothesis [ 13 ].

Besides the misuse, the hypothesis testing approach based on the p -value can be easily misinterpreted. A list of twenty-five examples of misinterpretations in classical hypothesis testing is provided in [ 14 ]. Eighteen of these items are directly related to misunderstanding of the p -value, and the others are related to p -values in the context of confidence intervals and statistical power. Some of these points are also shared in [ 15 ], including the common misconceptions that a nonsignificant difference means that there is no difference between groups and that the p -value represents the chance of a type I error. The author also highlights an alternative approach, based on the Bayes factor as a measure of true evidential meaning about the hypotheses [ 16 , 17 ]. Alan Turing independently discovered this quantity, as his private pages show, around the same time as Jeffreys [ 16 , 18 , 19 ]. Other authors have also recommended the Bayes factor as a better solution to hypothesis testing compared with the practice of p -values and null hypothesis significance testing (NHST), specifically criticizing the p -value’s dependence on hypothetical data, which are likely to be manipulated by the researcher’s intentions [ 8 ].

While the majority of the issues with classical hypothesis testing are crucial and widely known, a less acknowledged but important misinterpretation happens when two or more results are compared by their degrees of statistical significance [ 20 ]. To illustrate this issue, consider the following example introduced in [ 14 ]. Suppose two independent studies have effect estimates and standard errors of 25 ± 10 and 10 ± 10. In that case, the first study has a mean that is 2.5 standard errors away from 0, being statistically significant at an alpha level of 1%. The second study has a mean that is 1 standard error away from 0 and is not statistically significant at the same alpha level. It is tempting to conclude that the results of the studies are very different. However, the estimated difference in treatment effects is 25 − 10 = 15, with a standard error of \(\sqrt{10^2 + 10^2} \approx 14\). Thus, the mean of 15 units is less than 1 standard error away from 0, indicating that the difference between the studies is not statistically significant. If a third independent study with a much larger sample size had an effect estimate of 2.5 ± 1.0, then it would have a mean that is 2.5 standard errors away from 0 and indicate statistical significance at an alpha level of 1%, as in the first study. In this case, the difference between the results of the third and the first studies would be 22.5 with a standard error of \(\sqrt{10^2 + 1^2} \approx 10\). Thus, the mean of 22.5 units would be more than 2 standard errors away from 0, indicating a statistically significant difference between the studies. Therefore, the researchers in [ 20 ] recommend that the statistical significance of the difference between means be considered, rather than the difference between the significance levels of the two hypotheses.

To prevent the misuse and misinterpretation of p -values, the American Statistical Association (ASA) issued a statement clarifying six principles for the proper use and interpretation of classical significance testing [ 12 ]: (i) p -values can indicate how incompatible the data are with a specified statistical model; (ii) p -values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone; (iii) scientific conclusions and business or policy decisions should not be based only on whether a p -value passes a specific threshold; (iv) proper inference requires full reporting and transparency; (v) p -value, or statistical significance, does not measure the size of an effect or the importance of a result; and (vi) by itself, a  p -value does not provide a good measure of evidence regarding a model or hypothesis.

The profound criticism of the p -value approach has promoted the consideration and development of alternative methods for hypothesis testing [ 4 , 8 , 12 , 21 ]. The Bayes factor is one such instance [ 18 , 22 ], since it only depends on the observed data and allows an evaluation of the evidence in favor of the null hypothesis. The seminal paper by Kass and Raftery [ 17 ] discusses the Bayes factor along with technical and computational aspects and presents several applications in which the Bayes factor can solve problems that cannot be addressed by the p -value approach. Our review differs in that it is targeted towards researchers in fields where the p -value is still in dominant use, and there are many such fields where this is the case. Our emphasis is to provide these researchers with an understanding of the methodology and potential issues, and a review of the existing tools to implement the Bayes factor in statistical practice.

Two potential issues for the implementation of the Bayes factor are the computation of integrals related to the marginal probabilities that are required to evaluate them and the subjectivity regarding the choosing of the prior distributions [ 7 , 17 ]. We will review these issues in Section 2 and Section 3 , respectively. Despite these difficulties, there are many advantages to the use of the Bayes factor, including (i) the quantification of the evidence in favor of the null hypothesis [ 15 ], (ii) the ease of combining Bayes factors across experiments, (iii) the possibility of updating results when new data are available, (iv) interpretable model comparisons, and (v) the availability of open-source tools to compute Bayes factors in a variety of practical applications.

This paper aims to provide examples of practical implementations of the Bayes factor in different scenarios, highlighting the availability of tools for its computation for those with a basic understanding of statistics. In addition, we bring attention to the over-reliance on the classical p -value approach for hypothesis testing and its inherent pitfalls. The remainder of the article is structured as follows. In  Section 2 , we define the Bayes factor and discuss technical aspects, including its numerical computation. In  Section 3 , we discuss prior distributions and the sensitivity of the Bayes factor to prior distributions. Section 4 presents several applications of the Bayes factor using open-source code involving R software. We illustrate the computation of the Bayes factor using a variety of approximation techniques. Section 5 presents a discussion and summary.

2. Bayes Factor Definition and Technical Aspects

2.1. Definition

The Bayes factor is defined as the ratio of the probability of the observed data, conditional on two competing hypotheses or models. Given the same data \(D\) and two hypotheses \(H_0\) and \(H_1\), it is defined as

\[BF_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)}. \tag{1}\]

If there is no previous knowledge in favor of one theory over the other, i.e., the hypotheses \(H_0\) and \(H_1\) are equally probable a priori (\(p(H_1) = p(H_0)\)), the Bayes factor represents the ratio of the data-updated knowledge about the hypotheses, i.e., the Bayes factor is equal to the posterior odds, where the posterior probability is defined as the conditional probability of the hypothesis given the data. Using the definition of conditional probability and under the assumption that the hypotheses are equally probable a priori,

\[BF_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{p(H_1 \mid D)}{p(H_0 \mid D)} \cdot \frac{p(H_0)}{p(H_1)} = \frac{p(H_1 \mid D)}{p(H_0 \mid D)}. \tag{2}\]

Based on Equation (2), we can interpret the Bayes factor as the extent to which the data update the prior odds, and therefore, quantify the support for one model over another. A Bayes factor value smaller than one indicates that the data are more likely under the denominator model than they are under the numerator model. A model with the highest Bayes factor shows the relatively highest amount of evidence in favor of the model compared to the other models. Similarly, by switching the indices in (1), \(BF_{01}\) is defined as

\[BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{1}{BF_{10}},\tag{3}\]

where larger values of \(BF_{01}\) represent higher evidence in favor of the null hypothesis.

The Bayes factor can be viewed as a summary of the evidence given by data in support of one hypothesis in contrast to another [ 7 , 17 ]. Reporting Bayes factors can be guided by setting customized thresholds according to particular applications. For example, Evett [ 1 ] argued that for forensic evidence alone to be conclusive in a criminal trial, it would require a Bayes factor of at least 1000 rather than the value of 100 suggested by the Jeffreys scale of interpretation [ 18 ]. A generally accepted table provided in [ 17 ] is replicated in Table 1 , and other similar tables are available in [ 21 ]. Thus, using the Bayes factor can result in reporting evidence in favor of the alternative hypothesis, evidence in favor of the null hypothesis, or reporting that the data are inconclusive.

General-purpose interpretation of Bayes factor values from [ 17 ].

\(BF_{10}\)     Interpretation of Evidence against \(H_0\)
1 to 3          Not worth more than a bare mention
3 to 20         Positive
20 to 150       Strong
>150            Very Strong

The Bayes factor can avoid the drawbacks associated with p -values and assess the strength of evidence in favor of the null model along with various additional advantages. First, Bayes factors inherently include a penalty for complex models to prevent overfitting. Such a penalty is implicit in the integration over parameters required to obtain marginal likelihoods. Second, the Bayes factor can be applied in statistical settings that do not satisfy common regularity conditions [ 17 ].

Despite its apparent advantages, there are a few disadvantages to the Bayes factor approach. First, the choice of a prior distribution is subjective [ 4 , 7 , 17 ] and might be a concern for some researchers. However, the authors in [ 7 ] challenge this criticism, claiming that there is nothing about the data, by itself, that assures it counts as evidence. The pathway from data to evidence is filled with subjective evaluations when combining the theoretical viewpoint with the research question. Therefore, the Bayesian approach makes its assumptions explicit through the prior and likelihood statements. A way to avoid the explicit selection of prior densities is through the use of the Bayesian information criterion (BIC), which yields a rough approximation to the Bayes factor that can be interpreted against the scale in Table 1.

Another potential disadvantage is the computational difficulty of evaluating marginal likelihoods, and this is discussed in Section 2.2 . However, the issue is being mitigated by the growth of computational power and the availability of open-source statistical tools for this computation. Examples of these tools are BayesFactor , brms , and BFpack R packages [ 23 , 24 , 25 ]; and JASP [ 26 ] software. In  Section 4 , we illustrate the required R scripting for a number of examples widely used in data analysis. As Python has become increasingly popular among quantitative practitioners [ 27 , 28 ], R packages for the computation of Bayes factors can be imported into Python using the rpy2 package [ 29 ]. Thanks to these advancements, Bayes factors are gradually gaining wider attention in research [ 30 , 31 , 32 ].
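
For Python-oriented readers, a minimal sketch of that route is shown below; it assumes a working R installation with the BayesFactor package plus rpy2, and the ttestBF/extractBF calls come from the R package’s documented interface (the same paired t -test is discussed in Section 4.1):

```python
# Hedged sketch: calling the BayesFactor R package from Python via rpy2.
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

importr("BayesFactor")          # load the R package
robjects.r("data(sleep)")       # built-in R dataset used in Section 4.1

bf = robjects.r(
    "extractBF(ttestBF(x = sleep$extra[sleep$group == 1], "
    "y = sleep$extra[sleep$group == 2], paired = TRUE))$bf"
)
print(bf[0])   # about 17.26, matching the value reported in Section 4.1
```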

2.2. Computation of the Bayes Factor

To calculate the Bayes factor, both the numerator and the denominator in the Bayes factor definition ( 1 ) (the marginal likelihood of the data under a given model) involve integrals over the parameter space:

\[p(D \mid H_k) = \int p(D \mid \theta_k, H_k)\, p(\theta_k \mid H_k)\, d\theta_k,\tag{4}\]

where \(\theta_k\) is the parameter vector under the hypothesis \(H_k\), and \(p(\theta_k \mid H_k)\) is the prior probability density function of the parameter vector for the hypothesis \(H_k\). It is typical for (4) to be an integral over many dimensions, so the computational problem can be difficult.

If we assume the data are a random sample from an exponential family distribution and assume conjugate priors, it is possible to solve the integral in ( 4 ) analytically. Without conjugacy, these integrals are often intractable, and numerical methods are needed. Many available numerical integration techniques are inefficient for such integrals because it is difficult to find the regions where the probability mass accumulates in higher dimensions. For regular problems in the large-sample setting, the probability mass will accumulate and tend to peak around the maximum likelihood estimator (MLE) [ 17 , 33 ]. This notion underlies the Laplace approximation and its variations, which can be used to obtain an approximation to the Bayes factor. These methods rely on a quadratic approximation to the logarithm of the integrand obtained using a Taylor expansion about the MLE and a matching normal distribution. Laplace’s methods are usually fast but not very accurate. An alternative approximation known as the Savage–Dickey density ratio [ 34 ] can be applied in the case of nested models: when testing a constrained model against an unrestricted alternative, the Bayes factor is approximated by dividing the value of the posterior density of the alternative model, evaluated at the hypothesized value, by the prior for the same model evaluated at the same point [ 35 ].

For the general case of Bayes factor computation, it is common to resort to sampling-based numerical procedures adapted to the context of marginal likelihood computation as in ( 4 ). Evans and Swartz [ 36 ] reviewed several numerical strategies for assessing the integral related to the Bayes factor and later published a book on the topic [ 37 ]. Among the methods for estimating the marginal likelihood integral, the bridge sampling technique has gained prominence [ 38 ]. The method encompasses three special cases, namely the “naïve” [ 33 ] or “simple” [ 17 ] Monte Carlo estimator, importance sampling, and the generalized harmonic mean estimator. The bridge sampling estimate stands out for not being dominated by samples from the tails of the distribution [ 33 ]. An R package called bridgesampling, which estimates such integrals with the bridge sampling algorithm for Bayesian models implemented in Stan [ 39 ] or JAGS [ 40 ], is available [ 41 ]. In Section 4 , we provide examples of using the bridgesampling package and the BayesFactor R package [ 23 ] to enable the computation of Bayes factors for several important experimental designs.
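
As a toy illustration of what these estimators are doing (this is not code from the paper), the “simple” Monte Carlo estimator averages the likelihood over draws from the prior; the conjugate normal example below, with made-up data and prior scale, shows the idea:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(0.4, 1.0, size=20)      # illustrative observations, D ~ N(mu, 1)

def simple_mc_marginal_likelihood(data, prior_sd, n_draws=100_000):
    """p(D | H) ~= (1/S) * sum_s p(D | theta_s), with theta_s drawn from the prior."""
    mu_draws = rng.normal(0.0, prior_sd, size=n_draws)
    # log-likelihood of the whole dataset for every prior draw of mu
    loglik = stats.norm.logpdf(data[:, None], loc=mu_draws, scale=1.0).sum(axis=0)
    # average on the log scale (log-sum-exp) for numerical stability
    return np.exp(np.logaddexp.reduce(loglik) - np.log(n_draws))

m1 = simple_mc_marginal_likelihood(data, prior_sd=3.0)          # H1: mu unrestricted
m0 = np.exp(stats.norm.logpdf(data, loc=0.0, scale=1.0).sum())  # H0: mu = 0
print("BF10 is approximately", m1 / m0)
```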

3. Prior Elicitation and Sensitivity Analysis

Based on its definition in ( 1 ), the Bayes factor is a ratio of the marginal likelihood of two competing models. The marginal likelihood for a model class is a weighted average of the likelihood over all the parameter values represented by the prior distribution [ 42 ]. Therefore, carefully choosing priors and conducting a prior sensitivity analysis play an essential role when using Bayes factors as a model selection tool. This section briefly discusses the prior distributions, prior elicitation, and prior sensitivity analysis.

3.1. Prior Distributions

In Bayesian statistical inference, a prior probability distribution (or simply, the prior) expresses one’s beliefs or prior knowledge about an uncertain quantity before the data are collected. The unknown quantity may be a parameter of the model or a latent variable. In Bayesian hierarchical models, we have more than one level of prior distribution, corresponding to the hierarchical model structure. The parameters of a prior distribution are called hyperparameters. We can either assume values for the hyperparameters or assume a probability distribution for them, which is referred to as a hyperprior.

It is common to categorize priors into four types: informative priors, weakly informative priors, uninformative priors, and improper priors [ 43 ]. The Bayes factor computation requires proper priors, i.e., a prior distribution that integrates to 1. Various available software provide default priors, but it is the researchers’ responsibility to perform sensitivity analysis to check the impact of applying different priors.

3.2. Prior Elicitation

The prior distribution is an important ingredient of the Bayesian paradigm and must be designed coherently to make Bayesian inference operational [ 44 ]. Priors can be elicited using multiple methods, e.g., from past information, such as previous experiments, or elicited purely from the experts’ subjective assessments. When no prior information is available, an uninformative prior can be assumed, and most of the model information that is given by the posterior will come from the likelihood function itself. Priors can also be chosen according to some principles, such as symmetry or maximum entropy, given constraints. Examples are the Jeffreys prior [ 18 ] and Bernardo’s reference prior [ 45 ]. When a family of conjugate priors exists, choosing a prior from that family simplifies the calculation of the posterior distribution.

With the advancement of computational power, ad hoc searching for priors can be done more systemically. Hartmann et al. [ 46 ] utilized the prior predictive distribution implied by the model to automatically transform experts’ judgments about plausible outcome values to suitable priors on the parameters. They also provided computational strategies to perform inference and guidelines to facilitate practical use. Their methodology can be summarized as follows: (i) define the parametric model for observable data conditional on the parameters θ and a prior distribution with hyperparameters λ for the parameters θ , (ii) obtain experts’ beliefs or probability for each mutually exclusive data category partitioned from the overall data space, (iii) model the elicited probabilities from step 2 as a function of the hyperparameters λ , (iv) perform iterative optimization of the model from step 3 to obtain an estimate for λ best describing the expert opinion within the chosen parametric family of prior distributions, and (v) evaluate how well the predictions obtained from the optimal prior distribution can describe the elicited expert opinion. Prior predictive tools relying on machine learning methods can be useful when dealing with hierarchical modeling where a grid search method is not possible [ 47 ].

3.3. Sensitivity Analysis

In the Bayesian approach, it is important to evaluate the impact of prior assumptions. This is performed through a sensitivity analysis where the prior is perturbed, and the change in the results is examined. Various authors have demonstrated how priors affect Bayes factors and provided ways to address the issue. When comparing two nested models in a low dimensional parameter space, the authors in [ 48 ] propose a point mass prior Bayes factor approach. The point mass prior distribution for the Bayes factor is computed for a grid of extra parameter values introduced by a generalized alternative model. The resulting Bayes factor is obtained by averaging the point mass prior Bayes factor over the prior distribution of the extra parameter(s).

For binomial data, Ref. [ 42 ] shows the impact of different priors on the probability of success. The authors used four different priors: (i) a uniform distribution, (ii) the Jeffreys prior, which is a proper Beta(0.5,0.5) distribution, (iii) the Haldane prior by assuming a Beta(0,0) distribution (an improper prior), and (iv) an informative prior. The uniform, Jeffreys, and Haldane priors are noninformative in some sense. Although the resulting parameter estimation is similar in all four scenarios, the resulting Bayes factor and posterior probability of H 1 vary. Using the four different priors produces very different Bayes factors with values of 0.09 for the Haldane, 0.6 for the Jeffreys, 0.91 for the uniform, and  1.55 for the informative prior. The corresponding posterior probabilities of H 1 are 0.08 (Haldane), 0.38  (Jeffreys), 0.48 (uniform), and  0.61 (informative). In this example, the sensitivity analysis reveals that the effect of the priors on the posterior distribution is different from their effect on the Bayes factor. The authors emphasize that Bayes factors should be calculated, ideally, for a wide range of plausible priors whenever used as a model selection tool. Besides using the Bayes factor based on prior predictive distribution, they also suggest seeking agreement with the other model selection criterion designed to assess local model generalizability (i.e., based on posterior predictive distribution).

The author in [ 49 ] describes several interesting points with regard to prior sensitivity. The author views prior sensitivity analysis in theory testing as an opportunity rather than a burden. They argue that it is an attractive feature of a model evaluation measure when psychological models containing quantitatively instantiated theories are sensitive to priors. Ref.  [ 49 ] believes that using an informative prior expressing a psychological theory and evaluating models using prior sensitivity measures can serve to advance knowledge. Finally, sensitivity analysis is accessible through an interactive Shiny application developed by the authors in [ 50 ]. The software is designed to help users understand how to assess the substantive impact of prior selection in an interactive way.

4. Applications of the Bayes Factor Using R Packages

In this section, we illustrate how to calculate Bayes factors using various techniques available in R, including the R package BayesFactor [ 23 ]. Various authors have used this package to compute Bayes factors in different settings such as linear correlations, Bayesian t -tests, analysis of variance (ANOVA), linear regression, single proportions, and contingency tables [ 51 , 52 , 53 , 54 ]. Comparisons between Bayesian and frequentist approaches are provided in the vignettes of [ 23 ]. We provide the R code to compute the Bayes factor for a one-sample t -test, a multiway ANOVA, a repeated-measures design, and a Poisson generalized linear mixed model (GLM).

4.1. One-Sample t-Test

The authors in [ 52 ] derived the Jeffreys–Zellner–Siow (JZS) Bayes factor as a function of the t -score and the sample size. To illustrate how the ttestBF function of the BayesFactor package performs a Bayesian paired t -test, they analyzed the sleep dataset [ 55 ], which records the increase in sleep (in hours) after taking each of two drugs, compared to regular nights when no drug was administered. The Bayesian paired t -test can evaluate whether the levels of effectiveness of the two drugs are significantly different (the null hypothesis being that the standardized effect size is zero) [ 7 , 52 ].

Let \(y_1, \ldots, y_n \overset{i.i.d.}{\sim} N(\sigma\delta, \sigma^2)\), where the standardized effect size is given by \(\delta = \mu/\sigma\), \(\mu\) is a grand mean, and \(\sigma^2\) is the error variance. We test the following hypotheses:

\[H_0: \delta = 0 \quad \text{versus} \quad H_1: \delta \neq 0.\]

The following script of R code implements the Bayesian paired t -test and presents the p -value of the classical approach for comparison.

(R code listing: Bayesian paired t-test on the sleep data using ttestBF from the BayesFactor package; not reproduced here.)

The value r = 0.707 (\(\sqrt{2}/2\)) denotes the scale of the Cauchy prior distribution on \(\delta\). The Bayes factor value of 17.259 favors the alternative hypothesis, indicating that the effect size is significant in this case. Using the interpretation in Table 1, the evidence against the null hypothesis is “positive”. The classical p -value of around 0.3% also favors the alternative and would usually be considered strong evidence against the null hypothesis.

For this example, the Bayes factor can also be computed by employing a bridge sampling estimate. The R packages bridgesampling and R2jags use concepts of object-oriented programming and were developed with methods to interact with customizable Markov chain Monte Carlo routines [ 41 , 56 ]. That is to say, a model coded by hand in JAGS can be fed to bridgesampling ’s bridge_sampler function to obtain the log marginal likelihood for the model. The source code (assuming the same priors as in [ 23 ]) is available at https://osf.io/3yc8q/ (accessed on 28 December 2021). The Bayes factor value in [ 41 ] for the sleep data is 17.260, which is almost identical to the BayesFactor package result, 17.259. Both the BayesFactor and bridgesampling packages suit the analysis needs. On the one hand, no additional programming knowledge is required to call the functions in the BayesFactor package, thanks to its user-friendly default prior settings. On the other hand, bridgesampling along with JAGS allows for more sophisticated customization and flexibility in model specification, which makes it more feasible to conduct sensitivity analyses.

4.2. Multiway ANOVA

Consider a two-way ANOVA model \(M_1: y_{ijk} = \mu + \sigma(\tau_i + \beta_j + \gamma_{ij}) + \epsilon_{ijk}\), for \(i = 1, \dots, a\), \(j = 1, \dots, b\), and \(k = 1, \dots, n\), where \(y_{ijk}\) is the response for the \(k\)th subject at the \(i\)th level of Factor 1 and the \(j\)th level of Factor 2, \(\mu\) is the overall mean effect, \(\tau_i\) is the standardized effect size of the \(i\)th level of Factor 1, \(\beta_j\) is the standardized effect size of the \(j\)th level of Factor 2, \(\gamma_{ij}\) is the standardized effect size of the interaction between the two factors, and \(\epsilon_{ijk}\) is white noise with mean zero and variance \(\sigma^2\). We consider comparing the full top-level model \(M_1\) versus \(M_0: y_{ijk} = \mu + \epsilon_{ijk}\).

Equivalently, the competing models can be expressed in the matrix-vector form as in [ 53 ], i.e.,

\[M_1: \mathbf{y} = \mu \mathbf{1} + \sigma\,(X_\tau \boldsymbol{\tau} + X_\beta \boldsymbol{\beta} + X_\gamma \boldsymbol{\gamma}) + \boldsymbol{\epsilon} \quad \text{versus} \quad M_0: \mathbf{y} = \mu \mathbf{1} + \boldsymbol{\epsilon},\]

where \(\mathbf{y}\) is a column vector of \(N\) observations, \(\mathbf{1}\) is a column vector of \(N\) ones, \(\boldsymbol{\tau}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\gamma}\) are column vectors of standardized effect parameters of length \(a\), \(b\), and \(ab\), respectively, the \(X\)'s are design matrices, and \(\boldsymbol{\epsilon} \mid \sigma^2 \sim N(\mathbf{0}, \sigma^2 I)\).

The anovaBF function of the BayesFactor package compares these linear models (including the reduced models). The  ToothGrowth dataset [ 57 ] is used to study the effects of vitamin C dosage and supplement type on tooth growth in guinea pigs. The  anovaBF function allows the model comparison (single-factor models, additive model, and full model) against the null model (intercept only). The following script of R code implements the multiway ANOVA.

(R code listing: multiway ANOVA on the ToothGrowth data using anovaBF; not reproduced here.)

The percentage, e.g., ±2.73%, is the proportional Monte Carlo error estimate on the Bayes factor. The Bayes factor value of \(7.94 \times 10^{14}\) suggests, according to Table 1 , very strong evidence in favor of the full model.

It is worth noting that the one-way ANOVA with two levels is consistent with the two-sample t -test, when using the default priors. For example, considering the sleep data example, one can check that:

(R code listing comparing ttestBF and anovaBF on the sleep data; not reproduced here.)

Both calls return the same Bayes factor value (although the dataset is not meant for independent tests).

4.3. Repeated-Measures Design

Linear mixed-effects models extend simple linear models to allow both fixed effects (parameters that do not vary across subjects) and random effects (parameters that are themselves random variables); they are particularly useful when the data are dependent, multilevel, hierarchical, longitudinal, or correlated. In relation to the previous model in Section 4.2 , a linear mixed-effects model \(M_1\) adds the standardized subject-specific random effect \(b_k\). We now consider comparing the model that includes the condition effect and the subject-specific random effect against the null model that retains only the random effect.

We take the sleep dataset as an example and specify the argument whichRandom in the anovaBF function of the BayesFactor package so that it computes the Bayes factor for such a repeated-measures design (also called a within-subjects design). The following script of R code implements the one-way repeated-measures design, where the dataset needs to be in the long format: one column for the continuous response variable, one column for the subject indicator, and another column for the categorical variable indicating the levels.

(R code listing: repeated-measures anovaBF call with whichRandom set to the subject indicator; not reproduced here.)

This code generates a Bayes factor of about 11.468 in favor of the alternative hypothesis. The conclusion inferred from the repeated-measures design is consistent with the earlier paired t -test result. One limitation of the anovaBF function is that it only constructs the Bayes factor for the homoscedastic case.

4.4. Poisson Mixed-Effects Model

A Poisson mixed-effects GLM models a discrete count that is repeatedly measured under several conditions for each subject, e.g., in longitudinal studies [ 58 ]. The model assumes that the response variable follows a Poisson distribution at the first level. Unlike the case of normally distributed repeated-measures data, software for calculating Bayes factors has not been extensively discussed and developed in the context of Bayesian Poisson models. Thus, we illustrate code for sampling the posterior using JAGS, and then the Savage–Dickey density ratio is used to approximate the Bayes factor.

When testing a nested model against an unrestricted alternative, the Bayes factor simplifies, computationally and graphically, to the ratio obtained by dividing the value of the posterior distribution over the parameters of the alternative model, evaluated at the hypothesized value, by the prior for the same model evaluated at the same point [ 35 ]; this is the Savage–Dickey density ratio [ 34 ]. We demonstrate the use of the Savage–Dickey density ratio described in [ 59 ]. We consider fitting a Poisson mixed-effects model to a simulated dataset obtained from Appendix A . We note that the Poisson first level of this example can be changed to many other specifications from the exponential family (e.g., binomial or exponential) with only minor alterations to the code below. With data in the repeated-measures setting, the counts obtained from a given subject can be associated; thus, the standard independence assumption is violated, which is a feature of repeated-measures data.
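
Before turning to the paper's JAGS implementation, a short, purely illustrative Python sketch shows how the ratio is evaluated from posterior draws of the tested effect; the parameter name, the normal prior, and the fake draws are all assumptions made for the example:

```python
import numpy as np
from scipy import stats

def savage_dickey_bf10(effect_samples, prior_sd, null_value=0.0):
    """BF10 = prior density at the null value / posterior density at the null value."""
    posterior_at_null = stats.gaussian_kde(effect_samples)(null_value)[0]
    prior_at_null = stats.norm.pdf(null_value, loc=0.0, scale=prior_sd)
    return prior_at_null / posterior_at_null

# illustrative posterior draws of a condition effect centered away from zero
draws = np.random.default_rng(1).normal(0.2, 0.05, size=4000)
print(savage_dickey_bf10(draws, prior_sd=1.0))
```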

We utilize the JAGS software and rjags R package [ 60 ] to fit the model and the polspline R package to approximate the log posterior distribution [ 61 ] required to evaluate the Savage–Dickey density ratio.

(R code listing: JAGS model for the Poisson mixed-effects model and Savage–Dickey computation via rjags and polspline; not reproduced here.)

The data are simulated from 48 subjects, and a count is simulated for each of two conditions on each subject. On the log scale, conditional on the random effects, the mean in condition one is set to \(\alpha_1 = 2\) when the data are simulated, and the corresponding value is \(\alpha_2 = 2.2\) for the second condition. Thus, the data are simulated under the alternative hypothesis. After fitting the model to the simulated data, the Bayes factor in favor of the alternative is \(BF_{10} = 25.247\), indicating strong evidence in favor of the alternative.

A sensitivity analysis using JAGS or Stan is convenient by passing different parameter values or changing the families of prior distributions. We specified five different prior distributions for the nuisance parameters (the intercept and the precision of the random effects) in the model statement and then examined the Bayes factors computed via the Savage–Dickey density ratio approximation. Four additional combinations of priors are shown in Table 2 . Some variation in the value of the Bayes factor is observed though the conclusion remains stable across these prior specifications.

Prior sensitivity analysis for the Poisson repeated-measures data.

Report    beta              tau_b                  \(BF_{01}\)    \(BF_{10}\)
1         dnorm(0, 0.01)    dgamma(0.01, 0.01)     0.040          25.247
2         dnorm(0, 0.1)     dgamma(0.01, 0.01)     0.054          18.377
3         dnorm(0, 0.01)    dgamma(2, 2)           0.042          24.059
4         dnorm(0, 0.1)     dgamma(2, 2)           0.032          30.859
5         dnorm(0, 0.5)     dgamma(1, 4)           0.023          42.816

5. Discussion

We have addressed the activity of hypothesis testing in light of empirical data. Several issues with the classical p -value and NHST approaches were reviewed, with the aim of reaching researchers who rarely use Bayesian testing and for whom NHST is still the dominant vehicle for hypothesis testing. We noted that the debate about the overuse of the p -value has been long-lasting, and there are many discussions of its misuse and misinterpretation in the literature.

Following the third principle of the ASA’s statement on p -values (i.e., research practice, business, or policy decisions should not rely solely on a p -value passing an arbitrary threshold), a Bayesian alternative method based on the Bayes factor was introduced, and the advantages and disadvantages of this approach were discussed. One possible caveat of the Bayes factor is its numerical computation, which has been mitigated by advances in computational resources. We reviewed computational methods employed to approximate the marginal likelihoods, such as the bridge sampling estimator, which has an R package implementation available as an open-source solution.

Issues related to prior distributions were discussed, and we recommended a careful choice of priors via elicitation, combined with prior sensitivity analysis, when using Bayes factors as a model selection tool. Bayesian analysis and hypothesis testing are appealing, but going directly from NHST to Bayesian hypothesis testing may require a challenging leap. Thus, we showed how, using existing software, one can practically implement the discussed Bayesian approach, and provided examples of the usage of packages intended to compute the Bayes factor, namely in applications of the one-sample t -test, multiway ANOVA, repeated-measures designs, and the Poisson mixed-effects model.

The Bayes factor is only one of many aspects of Bayesian analysis, and it serves as a bridge to Bayesian inference for researchers interested in testing. The Bayes factor can provide evidence in favor of the null hypothesis and is a relatively intuitive approach for communicating statistical evidence with a meaningful interpretation. The relationships between the Bayes factor and other aspects of the posterior distribution, for example, the overlap of Bayesian highest posterior density intervals, form a topic of interest, and we will report on this issue in another manuscript.

Appendix A. Poisson Repeated-Measures Data Simulation

The sim_Poisson R function returns a repeated-measures data frame in the long format with 2 n rows and three columns. The three columns are the subject ID, the count response variable, and the condition levels.

(R code listing: the sim_Poisson data-simulation function; not reproduced here.)



Data Science from Scratch (ch7) - Hypothesis and Inference

Connecting probability and statistics to hypothesis testing and inference

Table of contents

  • Central Limit Theorem
  • Hypothesis Testing
  • Confidence Intervals
  • Connecting dots with Python

This is a continuation of my progress through Data Science from Scratch by Joel Grus. We’ll use a classic coin-flipping example in this post because it is simple to illustrate with both concept and code . The goal of this post is to connect the dots between several concepts including the Central Limit Theorem, Hypothesis Testing, p-Values and confidence intervals, using python to build our intuition.

Central Limit Theorem

Terms like “null” and “alternative” hypothesis are used quite frequently, so let’s set some context. The “null” is the default position. The “alternative”, alt for short, is something we’re comparing to the default (null).

The classic coin-flipping exercise is to test the fairness of a coin. If a coin is fair, it’ll land on heads 50% of the time (and tails 50% of the time). Let’s translate this into hypothesis testing language:

Null Hypothesis : Probability of landing on Heads = 0.5.

Alt Hypothesis : Probability of landing on Heads != 0.5.

Each coin flip is a Bernoulli trial , which is an experiment with two outcomes - outcome 1, “success” (probability p ), and outcome 0, “fail” (probability 1 - p ). The reason it’s a Bernoulli trial is that there are only two outcomes with a coin flip (heads or tails). Read more about Bernoulli here .

Here’s the code for a single Bernoulli Trial:
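
The post’s snippet is not shown above; a sketch consistent with the book’s bernoulli_trial helper (using Python’s built-in random module) is:

```python
import random

def bernoulli_trial(p: float) -> int:
    """Returns 1 with probability p and 0 with probability 1 - p."""
    return 1 if random.random() < p else 0
```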

When you sum independent Bernoulli trials , you get a Binomial(n,p) random variable, a variable whose possible values have a probability distribution. The central limit theorem says that as n, the number of independent Bernoulli trials, gets large, the Binomial distribution approaches a normal distribution.

Here’s the code for when you sum all the Bernoulli Trials to get a Binomial random variable:
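
Again a sketch in the book’s style, building on bernoulli_trial above:

```python
def binomial(n: int, p: float) -> int:
    """Returns the sum of n bernoulli(p) trials."""
    return sum(bernoulli_trial(p) for _ in range(n))
```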

Note : A single ‘success’ in a Bernoulli trial is ‘x’. Summing up all those x’s into X gives a Binomial random variable. Success doesn’t imply desirability, nor does “failure” imply undesirability. They’re just terms to count the cases we’re looking for (i.e., the number of heads in multiple coin flips to assess a coin’s fairness).

Given that our null is (p = 0.5) and alt is (p != 0.5), we can run some independent bernoulli trials, then sum them up to get a binomial random variable.
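
The two runs discussed below come from calls like these (the exact counts vary from run to run):

```python
binomial(1000, 0.5)   # e.g. 510
binomial(1000, 0.5)   # e.g. 495
```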


Each bernoulli_trial is an experiment with either 0 or 1 as its outcome. The binomial function sums up n bernoulli(0.5) trials. We ran both twice and got different results. Each bernoulli experiment can be a success (1) or a fail (0); summing them into a binomial random variable means we flip a coin with heads probability p = 0.5 a total of 1,000 times and count the heads.

In the first 1,000 flips we got 510 heads; in the second, 495. We can repeat this process many times to get a distribution , and we can plot that distribution to reinforce our understanding. To do this we’ll use the binomial_histogram function. This function picks points from a Binomial(n,p) random variable and plots their histogram.
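
The post shows this function as an image; the sketch below is a plausible reconstruction in the book’s style. It defines normal_cdf early (we will reuse it later) so the normal approximation can be drawn over the histogram.

```python
from collections import Counter
import math
import matplotlib.pyplot as plt

def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Cumulative distribution function of the normal, via the error function."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def binomial_histogram(p: float, n: int, num_points: int) -> None:
    """Picks num_points samples from a Binomial(n, p) and plots their histogram
    together with the normal approximation."""
    data = [binomial(n, p) for _ in range(num_points)]
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8, color='0.75')
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))
    # line chart of the normal approximation to the binomial probabilities
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)
    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()

binomial_histogram(0.5, 1000, 10000)
```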

This plot is then rendered:

Histogram of Binomial(1000, 0.5) draws with the normal approximation overlaid.

What we did was sum up independent bernoulli_trial (s) of 1,000 coin flips, where the probability of heads is p = 0.5, to create a binomial random variable. We then repeated this a large number of times (N = 10,000) and plotted a histogram of the distribution of all the binomial random variables. And because we did it so many times, the histogram approximates a normal distribution (the smooth bell-shaped curve).

Just to demonstrate how this works, we can generate several binomial random variables:
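
For instance (a one-liner assumed here, not the post’s exact code):

```python
[binomial(1000, 0.5) for _ in range(5)]   # five draws, each typically near 500
```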


If we do this 10,000 times, we’ll generate the above histogram. You’ll notice that because we are testing whether the coin is fair, the probability of heads (success) should be at 0.5 and, from 1,000 coin flips, the mean ( mu ) should be 500.

We have another function that can help us calculate normal_approximation_to_binomial :
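
A sketch of that helper, which converts a Binomial(n, p) into the mu and sigma of its normal approximation:

```python
import math
from typing import Tuple

def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:
    """Returns mu and sigma corresponding to a Binomial(n, p)."""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)   # (500.0, ~15.8114)
```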

When calling the function with our parameters, we get a mean mu of 500 (from 1,000 coin flips) and a standard deviation sigma of 15.8114. This means that 68% of the time the binomial random variable will be 500 +/- 15.8114, and 95% of the time it’ll be 500 +/- 31.6228 (see the 68-95-99.7 rule ).

Hypothesis Testing

Now that we have seen the results of our “coin fairness” experiment plotted on a binomial distribution (approximately normal), we will be interested, for the purpose of testing our hypothesis, in the probability that its realized value (the binomial random variable) lies within or outside a particular interval .

This means we’ll be interested in questions like:

  • What’s the probability that the binomial(n,p) is below a threshold?
  • Above a threshold?
  • Between an interval?
  • Outside an interval?

First, the normal_cdf (normal cumulative distribution function), which we learned about in a previous post , is the probability of a variable being below a certain threshold.

Here, the probability of heads for a ‘fair coin’ is 0.5 ( mu = 500, sigma = 15.8113), and we want to find the probability that X falls below 490, which comes out to roughly 26%.
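
Reusing the normal_cdf defined in the histogram sketch above, the call the text describes looks like this (the alias follows the book’s naming):

```python
normal_probability_below = normal_cdf

normal_probability_below(490, mu_0, sigma_0)   # roughly 0.2635
```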

On the other hand, the normal_probability_above , probability that X falls above 490 would be 1 - 0.2635 = 0.7365 or roughly 74%.
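
And its complement, sketched in the same style:

```python
def normal_probability_above(lo: float, mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) variable is greater than lo."""
    return 1 - normal_cdf(lo, mu, sigma)

normal_probability_above(490, mu_0, sigma_0)   # roughly 0.7365
```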

To make sense of this we need to recall the binomial distribution, which approximates the normal distribution, and draw a vertical line at 490.

Binomial distribution (mu = 500, sigma = 15.8113) with a vertical line at the threshold 490.

We’re asking: given the binomial distribution with mu at 500 and sigma at 15.8113, what is the probability that a binomial random variable falls below the threshold (left of the line)? The answer is approximately 26%, and correspondingly the probability of falling above the threshold (right of the line) is approximately 74%.

Between interval

We may also wonder about the probability of a binomial random variable falling between 490 and 520 :

Binomial distribution with vertical lines at 490 and 520.

Here is the function to calculate this probability and it comes out to approximately 63%. note : Bear in mind the full area under the curve is 1.0 or 100%.
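
A sketch of that function and the call behind the 63% figure:

```python
def normal_probability_between(lo: float, hi: float,
                               mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) variable is between lo and hi."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

normal_probability_between(490, 520, mu_0, sigma_0)   # roughly 0.6335
```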

Finally, the area outside of the interval should be 1 - 0.6335 = 0.3665:
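
Its complement, in the same style:

```python
def normal_probability_outside(lo: float, hi: float,
                               mu: float = 0, sigma: float = 1) -> float:
    """The probability that an N(mu, sigma) variable is not between lo and hi."""
    return 1 - normal_probability_between(lo, hi, mu, sigma)

normal_probability_outside(490, 520, mu_0, sigma_0)   # roughly 0.3665
```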

In addition to the above, we may also be interested in finding (symmetric) intervals around the mean that account for a certain level of likelihood , for example, 60% probability centered around the mean.

For this operation we would use the inverse_normal_cdf :

First we’d have to find the cutoffs where the upper and lower tails each contain 20% of the probability. We calculate normal_upper_bound and normal_lower_bound and use those to calculate the normal_two_sided_bounds .
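
A sketch of those helpers in the book’s style, reusing normal_cdf from above (inverse_normal_cdf inverts it by binary search):

```python
def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1,
                       tolerance: float = 0.00001) -> float:
    """Find an approximate inverse of normal_cdf using binary search."""
    if mu != 0 or sigma != 1:
        # if not standard, compute for the standard normal and rescale
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0          # normal_cdf(-10) ~ 0, normal_cdf(10) ~ 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z
        else:
            hi_z = mid_z
    return mid_z

def normal_upper_bound(probability: float, mu: float = 0, sigma: float = 1) -> float:
    """Returns the z for which P(Z <= z) = probability."""
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability: float, mu: float = 0, sigma: float = 1) -> float:
    """Returns the z for which P(Z >= z) = probability."""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability: float, mu: float = 0, sigma: float = 1):
    """Returns the symmetric (about the mean) bounds that contain the
    specified probability."""
    tail_probability = (1 - probability) / 2
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)
    return lower_bound, upper_bound

normal_two_sided_bounds(0.60, mu_0, sigma_0)   # roughly (486.69, 513.31)
```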

So if we wanted to know what the cutoff points were for a 60% probability around the mean and standard deviation ( mu = 500, sigma = 15.8113), it would be between 486.69 and 513.31 .

Said differently, this means roughly 60% of the time, we can expect the binomial random variable to fall between 486 and 513.

Significance and Power

Now that we have a handle on the binomial normal distribution, thresholds (left and right of the mean), and cut-off points, we want to make a decision about significance . Probably the most important part of statistical significance is that it is a decision to be made, not a standard that is externally set.

Significance is a decision about how willing we are to make a type 1 error (false positive), which we explored in a previous post . The convention is to set it to a 5% or 1% willingness to make a type 1 error. Suppose we say 5%.

We would say that out of 1,000 coin flips, 95% of the time, we’d get between 469 and 531 heads on a “fair coin” and 5% of the time, outside of this 469-531 range.

If we recall our hypotheses:

Null Hypothesis : Probability of landing on Heads = 0.5 (fair coin)

Alt Hypothesis : Probability of landing on Heads != 0.5 (biased coin)

Each test consists of 1,000 Bernoulli trials; whenever the number of heads falls outside the 469-531 range, we’ll reject the null hypothesis that the coin is fair. And we’ll be wrong (a false positive) 5% of the time: it’s a false positive when we incorrectly reject the null hypothesis even though it’s actually true.

We also want to avoid making a type-2 error (false negative), where we fail to reject the null hypothesis, when it’s actually false.

Note : It’s important to keep in mind that terms like significance and power are used to describe tests , in our case, the test of whether a coin is fair or not. Each test is the sum of 1,000 independent Bernoulli trials.

For a test run at the 5% significance level, we assume that out of 1,000 coin flips a fair coin will land on heads between 469 and 531 times, and we’ll determine the coin is fair. For the 5% of the time it lands outside of this range, we’ll determine the coin to be “unfair” and we’ll be wrong, because it actually is fair.

To calculate the power of the test, we’ll take the assumed mu and sigma and their 95% bounds (based on the assumption that the probability of the coin landing on heads is 0.5, i.e., a fair coin). We’ll determine the lower and upper bounds:
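Reusing the sketched helpers from above, those bounds might be computed like this:

```python
# 95% bounds assuming the coin is fair (p = 0.5)
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
print(lo, hi)  # roughly 469, 531
```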

If the coin were actually biased , we should reject the null, but sometimes we fail to. Let’s suppose the actual probability that the coin lands on heads is 55% ( biased towards heads):

Using the same range 469 - 531, where the coin is assumed ‘fair’ with mu at 500 and sigma at 15.8113:

Binomial distribution with the 95% significance bounds at 469 and 531

If the coin, in fact, had a bias towards heads (p = 0.55), the distribution would shift right, but if our 95% significance test remains the same, we get:

Type-2 error region under the shifted distribution

The probability of making a type-2 error is 11.345%. This is the probability that we see the coin’s count of heads land within the previous 469-531 interval and conclude we should accept the null hypothesis (that the coin is fair), failing to notice that the distribution has shifted because the coin is actually biased towards heads.

The other way to arrive at this is to find the probability, under the new mu and sigma (new distribution), that X (number of successes) will fall below 531.

So the probability of making a type-2 error or the probability that the new distribution falls below 531 is approximately 11.3%.

The power of the test is 1 minus the probability of a type-2 error (1 - 0.113 = 0.887), or 88.7%.
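A sketch of that calculation, reusing the helpers and the (lo, hi) bounds above:

```python
# actual bias towards heads: p = 0.55
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

# type-2 error: the shifted distribution still lands inside the original 'fair' interval
type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(type_2_probability, power)  # roughly 0.113, 0.887
```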

Finally, we may be interested in increasing the power of the test. Instead of using the normal_two_sided_bounds function to find the cut-off points (i.e., 469 and 531), we could use a one-sided test that rejects the null hypothesis (‘fair coin’) only when X (the number of heads in 1,000 coin flips) is much larger than 500.

Here’s the code, using normal_upper_bound :
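The original code isn’t shown; a sketch of the one-sided version, reusing the helpers above:

```python
# one-sided test: put all 5% of the rejection region in the upper tail
hi = normal_upper_bound(0.95, mu_0, sigma_0)
print(hi)  # roughly 526

type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(type_2_probability, power)  # roughly 0.064, 0.936
```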

This shifts the upper bound from 531 to 526, putting more probability in the upper tail, and the probability of a type-2 error goes down from 11.3% to roughly 6.4%.

One-sided test with the upper bound shifted to 526

And the new (greater) power of the test is 1.0 - 0.064 = 0.936, or 93.6% (up from 88.7% above).

p-Values represent another way of deciding whether to accept or reject the Null Hypothesis. Instead of choosing bounds, thresholds or cut-off points, we could compute the probability, assuming the Null Hypothesis is true, that we would see a value as extreme as the one we just observed.

Here is the code:
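A minimal sketch of such a function, reusing normal_probability_above and normal_probability_below from above:

```python
def two_sided_p_value(x, mu=0, sigma=1):
    """How likely is a value at least as extreme as x (in either direction),
    if the values come from a Normal(mu, sigma)?"""
    if x >= mu:
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        return 2 * normal_probability_below(x, mu, sigma)

print(two_sided_p_value(529.5, 500, 15.8113))  # roughly 0.062
```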

If we wanted to compute, assuming we have a “fair coin” ( mu = 500, sigma = 15.8113), what is the probability of seeing a value like 530? ( note : We use 529.5 instead of 530 below due to continuity correction )

Answer: approximately 6.2%

The p-value, 6.2% is higher than our (hypothetical) 5% significance, so we don’t reject the null. On the other hand, if X was slightly more extreme, 532, the probability of seeing that value would be approximately 4.3%, which is less than 5% significance, so we would reject the null.

For one-sided tests, we would use the normal_probability_above and normal_probability_below functions created above:
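A sketch of how those one-sided p-values might be wired up:

```python
# for one-sided tests the p-value is just the corresponding tail probability
upper_p_value = normal_probability_above
lower_p_value = normal_probability_below

print(upper_p_value(529.5, 500, 15.8113))  # roughly 0.031
```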

Under the two_sided_p_value test, the extreme value of 529.5 had a 6.2% probability of showing up, which was not low enough to reject the null hypothesis.

However, with a one-sided test, upper_p_value for the same threshold is now 3.1% and we would reject the null hypothesis.

Confidence Intervals

A third approach to deciding whether to accept or reject the null is to use confidence intervals. We’ll use 530 heads, as we did in the p-values example.

The confidence interval for a coin flipping heads 530 (out of 1,000) times is (0.4991, 0.5609). Since this interval contains p = 0.5 (heads 50% of the time, assuming a fair coin), we do not reject the null.
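A sketch of that interval calculation, reusing normal_two_sided_bounds from above (the square root gives the standard error of the estimated proportion):

```python
import math

# observed 530 heads out of 1,000 flips
p_hat = 530 / 1000
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)   # standard error of the estimate
print(normal_two_sided_bounds(0.95, p_hat, sigma))  # roughly (0.4991, 0.5609)
```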

If the extreme value were more extreme at 540, we would arrive at a different conclusion:

Here we would be 95% confident that the mean of this distribution is contained between 0.5091 and 0.5709, and this interval does not contain 0.500 (albeit by a slim margin), so we reject the null hypothesis that this is a fair coin.

Note : confidence intervals are statements about the interval, not about the probability p. We interpret the confidence interval as follows: if you were to repeat the experiment many times, 95% of the time the “true” parameter, in our example p = 0.5, would lie within the observed confidence interval.

Connecting the Dots

We used several Python functions to build intuition around statistical hypothesis testing. To highlight the “from scratch” aspect of the book, here is a diagram tying together the various Python functions used in this post:

Diagram connecting the functions used in this post

This post is part of an ongoing series where I document my progress through Data Science from Scratch by Joel Grus .


bayesian-testing 0.7.0

pip install bayesian-testing

Released: Jun 29, 2024

Bayesian A/B testing with simple probabilities.


License: MIT License (MIT)

Author: Matus Baniar

Tags ab testing, bayes, bayesian statistics

Requires: Python <4.0.0, >=3.7.1


Project description


Bayesian A/B testing

bayesian_testing is a small package for quick evaluation of A/B (or A/B/C/...) tests using a Bayesian approach.

Implemented tests:

  • Input data - binary data ( [0, 1, 0, ...] )
  • Designed for conversion-like data A/B testing.
  • Input data - normal data with unknown variance
  • Designed for normal data A/B testing.
  • Input data - lognormal data with zeros
  • Designed for revenue-like data A/B testing.
  • Input data - normal data with zeros
  • Designed for profit-like data A/B testing.
  • Input data - categorical data with numerical categories
  • Designed for discrete data A/B testing (e.g. dice rolls, star ratings, 1-10 ratings, etc.).
  • Input data - non-negative integers ( [1, 0, 3, ...] )
  • Designed for poisson data A/B testing.
  • Input data - exponential data (non-negative real numbers)
  • Designed for exponential data A/B testing (e.g. session/waiting time, time between events, etc.).

Implemented evaluation metrics:

  • Expected value from the posterior distribution for a given variant.
  • Probability that a given variant is best among all variants.
  • By default, the best is equivalent to the greatest (from a data/metric point of view), however it is possible to change this by using min_is_best=True in the evaluation method (this can be useful if we try to find the variant while minimizing the tested measure).
  • "Risk" of choosing particular variant over other variants in the test.
  • Measured in the same units as a tested measure (e.g. positive rate or average value).

Probability of Being Best and Expected Loss are calculated using simulations from posterior distributions (considering given data).

Installation

bayesian_testing can be installed using pip:

Alternatively, you can clone the repository and use poetry manually:

Basic Usage

The primary features are classes:

  • BinaryDataTest
  • NormalDataTest
  • DeltaLognormalDataTest
  • DeltaNormalDataTest
  • DiscreteDataTest
  • PoissonDataTest
  • ExponentialDataTest

All test classes support two methods to insert the data:

  • add_variant_data - Adding raw data for a variant as a list of observations (or numpy 1-D array).
  • add_variant_data_agg - Adding aggregated variant data (this can be practical for a large data, as the aggregation can be done already on a database level).

Both methods for adding data allow specification of prior distributions (see details in respective docstrings). Default prior setup should be sufficient for most of the cases (e.g. cases with unknown priors or large amounts of data).

To get the results of the test, simply call the method evaluate .

Probability of being best and expected loss are approximated using simulations, hence the evaluate method can return slightly different values for different runs. To stabilize it, you can set the sim_count parameter of the evaluate to a higher value (default value is 20K), or even use the seed parameter to fix it completely.

Class for a Bayesian A/B test for the binary-like data (e.g. conversions, successes, etc.).
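As a brief illustration of the workflow described above (the import path and exact argument names here are assumptions based on the class and method names listed in this README, so double-check them against the package documentation):

```python
import numpy as np
from bayesian_testing.experiments import BinaryDataTest

# simulate some binary (conversion) data for two variants
rng = np.random.default_rng(52)
data_a = rng.binomial(n=1, p=0.05, size=1500)   # ~5% conversion rate
data_b = rng.binomial(n=1, p=0.06, size=1500)   # ~6% conversion rate

test = BinaryDataTest()

# raw observations for variant A, aggregated counts for variant B
test.add_variant_data("A", data_a)
test.add_variant_data_agg("B", totals=1500, positives=int(data_b.sum()))

# fixing the seed makes the simulated probabilities reproducible across runs
results = test.evaluate(sim_count=20000, seed=52)
print(results)
```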

Class for a Bayesian A/B test for the normal data.

Class for a Bayesian A/B test for delta-lognormal data (log-normal with zeros). Delta-lognormal data is a typical case of revenue-per-session data, where many sessions have 0 revenue but the non-zero values are positive and possibly log-normally distributed. To handle this data, the calculation combines a binary Bayes model for zero vs. non-zero "conversions" with a log-normal model for the non-zero values.

Note : Alternatively, DeltaNormalDataTest can be used for a case when conversions are not necessarily positive values.

Class for a Bayesian A/B test for the discrete data with finite number of numerical categories (states), representing some value. This test can be used for instance for dice rolls data (when looking for the "best" of multiple dice) or rating data (e.g. 1-5 stars or 1-10 scale).

Class for a Bayesian A/B test for the poisson data.

Note: since we set min_is_best=True (because received goals are "bad"), probability and loss are in favor of variants with lower posterior means.

Class for a Bayesian A/B test for the exponential data.

Development

To set up a development environment, use Poetry and pre-commit :

To be implemented

Additional metrics:

  • Potential Value Remaining
  • Credible Interval

Notes:

  • The bayesian_testing package itself depends only on the numpy package.
  • Work on this package (including default priors selection) was inspired mainly by the Coursera course Bayesian Statistics: From Concept to Data Analysis .



Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers



1 Introduction

2 An Example

2.1 Is C Faster than Python?

2.2 The Initial Experiment

(Per-trial outcomes from the initial experiment, summarized:)

# of trials (N)             28
Frequency [C faster] (n)    17
Proportion [C faster] (r)   60.71%

2.3 Formulating the Hypotheses

3 The Frequentist Approach

3.1 The General Process

3.2 Choosing the Right Test for the Example

3.3 Calculating the Test Statistic

3.4 Interpreting the Test Statistic

r (observed)                  0.6071   (n = 17, N = 28)
r_0 (expected)                0.5000
Difference                    0.1071
StdErr                        0.0945
z-score                       1.1339
z-critical (conf. level 0.95) 1.6449
p-value                       0.1284
Odds under H_0                1.0000
Odds under H_1                1.5455
Odds ratio (O.R.)             1.5455
Lower conf. limit             0.4517
Reject H_0?                   No

4 The Bayesian Approach

4.1 Prior Knowledge

We know from past work that C programs are generally faster than their Java counterparts. The past work shows C is faster than Java between 40% and 80% of the time. We have valid reasons for believing that the C results could be even better relative to Python than they were relative to Java.


4.2 Some Terminology

4.3 Bayes Factor

4.4 Interpretation Guidance for Bayes Factor


5 Calculating Bayes Factor

5.1 Marginal Likelihoods for the Initial Experiment

r              40%      50%      60%      70%      80%
P[r]           0.2000   0.2000   0.2000   0.2000   0.2000
H_0 or H_1?    H_0      H_0      H_1      H_1      H_1
P[r|H_0]       0.5000   0.5000   0.0000   0.0000   0.0000
P[r|H_1]       0.0000   0.0000   0.3333   0.3333   0.3333

P[H_0] = 0.4000
P[H_1] = 0.6000
Prior Odds_10 = 1.5000
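To make the arithmetic behind this table concrete, here is a small Python sketch (not code from the paper) that reproduces the marginal likelihoods and the Bayes factor for the initial experiment, 17 "C faster" outcomes in 28 trials:

```python
from math import comb

# discrete prior over the proportion r of trials in which C beats Python
r_values = [0.40, 0.50, 0.60, 0.70, 0.80]
prior_r = [0.20] * 5                      # P[r] = 0.2 for each value
in_h1 = [False, False, True, True, True]  # H1: C faster more than half the time

n, k = 28, 17                             # initial experiment: 17 of 28 trials favored C

def binom_lik(k, n, r):
    """Likelihood of k successes in n trials when the true proportion is r."""
    return comb(n, k) * r**k * (1 - r)**(n - k)

# hypothesis priors and within-hypothesis conditional priors P[r|H]
p_h0 = sum(p for p, h1 in zip(prior_r, in_h1) if not h1)   # 0.4
p_h1 = sum(p for p, h1 in zip(prior_r, in_h1) if h1)       # 0.6

# marginal likelihood of the data under each hypothesis
m_h0 = sum(p / p_h0 * binom_lik(k, n, r)
           for r, p, h1 in zip(r_values, prior_r, in_h1) if not h1)
m_h1 = sum(p / p_h1 * binom_lik(k, n, r)
           for r, p, h1 in zip(r_values, prior_r, in_h1) if h1)

bayes_factor_10 = m_h1 / m_h0
posterior_odds_10 = (p_h1 / p_h0) * bayes_factor_10
print(round(bayes_factor_10, 4), round(posterior_odds_10, 4))  # roughly 1.7909 and 2.6863
```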

5.2 Conditional Likelihoods of an Observation under a Specific Proportion

5.3 Evaluating Marginal Likelihood Integrals

5.4 Final Step: Bayes Factor


6 Posterior Model

6.1 Transforming the Prior Distribution to the Posterior Distribution


6.2 Answering Additional Questions Using the Posterior Distribution

6.3 Posterior Odds

7 Building Evidence through Replications

7.1 Research Trajectory: More Datasets, More Samples

Study                 Acronym   # of trials (N)   Freq. C faster (n)
Initial Experiment    Exp       28                17
Replication 1         Rep1      25                19
Replication 2         Rep2      12                8
Replication 3         Rep3      12                9
Replication 4         Rep4      11                8
Replication 5         Rep5      10                7
Replication 6         Rep6      5                 4

7.2 Analyzing Replications with the Frequentist Approach

Study   # of trials   C faster   Prop. (r = n/N)   Sig. (p-val)   Effect size (O.R.)   Rej. H_0?   L.C.L.    Param. uncert. (Max. - L.C.L.)
Exp1    28            17         60.71%            0.1284         1.5455               No          45.17%    34.83%
Rep1    25            19         76.00%            0.0047         3.1667               Yes         59.55%    20.45%
Rep2    12            8          66.67%            0.1241         2.0000               No          42.93%    37.07%
Rep3    12            9          75.00%            0.0416         3.0000               Yes         51.26%    28.74%
Rep4    11            8          72.73%            0.0658         2.6667               No          47.93%    32.07%
Rep5    10            7          70.00%            0.1030         2.3333               No          43.99%    36.01%
Rep6    10            8          80.00%            0.0289         4.0000               Yes         53.99%    26.01%
Rep7    5             4          80.00%            0.1875*        4.0000               No          N/A       N/A


7.3 What Else Can We Do?

7.4 Analyzing Replications Incrementally with the Bayesian Approach


Study   # of trials   C faster   Prop. (r = n/N)   BF_10     Odds_10 (posterior odds)   Param. uncert. (stdev)
Prior   N/A           N/A        N/A               N/A       1.5                        61.25%
Exp1    28            17         60.71%            1.7909    2.6863                     38.99%
Rep1    25            19         76.00%            18.7150   50.275                     27.82%
Rep2    12            8          66.67%            1.8118    91.089                     25.10%
Rep3    12            9          75.00%            3.9063    355.861                    22.13%
Rep4    11            8          72.73%            2.9604    1053.49                    19.76%
Rep5    10            7          70.00%            2.1815    2298.21                    17.97%
Rep6    10            8          80.00%            5.0382    11578.9                    15.69%
Rep7    5             4          80.00%            2.2754    26346.1                    14.96%
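The posterior odds column above is just the running product of the prior odds and each study's Bayes factor; a quick sketch using the BF values copied from the table:

```python
# Bayes factors for Exp1 and Rep1 through Rep7, taken from the table above
bayes_factors = [1.7909, 18.7150, 1.8118, 3.9063, 2.9604, 2.1815, 5.0382, 2.2754]

odds = 1.5  # prior odds in favor of H1
for bf in bayes_factors:
    odds *= bf          # Bayesian updating: posterior odds = prior odds * BF
    print(round(odds, 1))
# the running product ends at roughly 26,346, matching the final row (up to rounding)
```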


Study   # of trials   C faster   Prop. (r = n/N)   BF_10     Odds_10 (posterior odds)   Param. uncert. (stdev)
Prior   N/A           N/A        N/A               N/A       4                          37.42%
Exp1    28            17         60.71%            2.0768    8.3073                     22.53%
Rep1    25            19         76.00%            11.5243   95.735                     23.22%
Rep2    12            8          66.67%            1.8039    172.695                    22.59%
Rep3    12            9          75.00%            3.2797    566.392                    23.45%
Rep4    11            8          72.73%            2.6662    1510.10                    23.14%
Rep5    10            7          70.00%            2.0792    3139.76                    22.50%
Rep6    10            8          80.00%            4.3488    13654.3                    20.25%
Rep7    5             4          80.00%            2.1534    29403.1                    18.76%

8 Conclusions

8.1 Further Reading, Tools, and Software Engineering Applications

8.2 Takeaways



9.1.8 Bayesian Hypothesis Testing

To be more specific, according to the MAP test, we choose $H_0$ if and only if $P(H_0|y) \geq P(H_1|y)$, that is, when the posterior probability of $H_0$ is at least as large as that of $H_1$.

$\quad$ $H_0$: $X=1$, $\quad$ $H_1$: $X=-1$.

  • in Example 9.10 , we arrived at the following decision rule: We choose $H_0$ if and only if \begin{align} y \geq c, \end{align} where \begin{align} c=\frac{\sigma^2}{2} \ln \left(\frac{1-p}{p}\right). \end{align} Since $Y|H_0 \; \sim \; N(1, \sigma^2)$, \begin{align} P( \textrm{choose }H_1 | H_0)&=P(Y \lt c|H_0)\\ &=\Phi\left(\frac{c-1}{\sigma} \right)\\ &=\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)-\frac{1}{\sigma}\right). \end{align} Since $Y|H_1 \; \sim \; N(-1, \sigma^2)$, \begin{align} P( \textrm{choose }H_0 | H_1)&=P(Y \geq c|H_1)\\ &=1-\Phi\left(\frac{c+1}{\sigma} \right)\\ &=1-\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)+\frac{1}{\sigma}\right). \end{align} Figure 9.4 shows the two error probabilities for this example. Therefore, the average error probability is given by \begin{align} P_e &=P( \textrm{choose }H_1 | H_0) P(H_0)+ P( \textrm{choose }H_0 | H_1) P(H_1)\\ &=p \cdot \Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)-\frac{1}{\sigma}\right)+(1-p) \cdot \left[ 1-\Phi\left(\frac{\sigma}{2} \ln \left(\frac{1-p}{p}\right)+\frac{1}{\sigma}\right)\right]. \end{align}

Figure 9.4: The two error probabilities for this example.
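As a quick numerical illustration of the error-probability formulas above (a sketch using SciPy's normal CDF for $\Phi$; the values of $p$ and $\sigma$ below are example inputs, not from the text):

```python
import math
from scipy.stats import norm

def average_error_probability(p, sigma):
    """P_e for the MAP test above: X = 1 with probability p, X = -1 otherwise,
    observed through additive N(0, sigma^2) noise."""
    c = (sigma**2 / 2) * math.log((1 - p) / p)            # decision threshold
    p_choose_h1_given_h0 = norm.cdf((c - 1) / sigma)       # Y < c although X = 1
    p_choose_h0_given_h1 = 1 - norm.cdf((c + 1) / sigma)   # Y >= c although X = -1
    return p * p_choose_h1_given_h0 + (1 - p) * p_choose_h0_given_h1

print(average_error_probability(p=0.6, sigma=0.5))  # roughly 0.022 for these example inputs
```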

Minimum Cost Hypothesis Test:

$\quad$ $H_0$: There is no fire, $\quad$ $H_1$: There is a fire.

$\quad$ $C_{10}$: The cost of choosing $H_1$, given that $H_0$ is true. $\quad$ $C_{01}$: The cost of choosing $H_0$, given that $H_1$ is true.

$\quad$ $H_0$: No intruder is present. $\quad$ $H_1$: There is an intruder.

  • First note that \begin{align} P(H_0|y)=1-P(H_1|y)=0.95 \end{align} The posterior risk of accepting $H_1$ is \begin{align} P(H_0|y) C_{10} =0.95 C_{10}. \end{align} We have $C_{01}=10 C_{10}$, so the posterior risk of accepting $H_0$ is \begin{align} P(H_1|y) C_{01} &=(0.05) (10 C_{10})\\ &=0.5 C_{10}. \end{align} Since $P(H_0|y) C_{10} \geq P(H_1|y) C_{01}$, we accept $H_0$, so no alarm message needs to be sent.




Validation of Bayesian Hypothesis for AB test

I have been following this methodology to implement Bayesian A/B testing with Python on a new search engine feature that helps users find products more accurately. To be more specific, I split my users across three groups:

  • control (A)
  • disabled (B)
  • enabled (C)

Users come from different age ranges, genders, and countries, so they are randomly sampled into those groups to minimise the variance related to these differences.

Here are some counts:

Now, the control and disabled groups are effectively the same, but I wanted a way to be confident in my A/B/C statistical validation; my theory was that A = B, so there should be no difference, improvement, or loss. The A/B test has been running for two weeks and I have:

At this sample size, the prior does not seem to have a big influence, but I am using a beta prior. When comparing A vs. B, B vs. C, and A vs. C, I found that:

I don't feel like I can trust any of those results because B is beating A where I would expect the model to not be able to tell (confidence <90% at least), so I must have misunderstood something.

Any help or explanations would be greatly appreciated


  • A potential sense check: is your "disabled" group truly equal to your control group, in the sense that it does not go through different computation steps (e.g. a client-side redirect) to arrive at what you believe is the same user experience? Sections 8 and 9 of this paper explain potential problems on this front in more detail. –  B.Liu, Mar 14, 2022 at 8:55

I assume that you use the independence assumption, but with your sample sizes I tend to distrust it! There must be some subgroups in your data; you did not give many details, but maybe country, age, or something else. There might be variation in churn rate across such subgroups, and if the distribution of these variables differs between your three treatment groups, that contributes to the differences you have observed. The standard errors computed from the binomial distribution will then be too small ( unobserved heterogeneity ). Some calculations with your data:

so your disabled group is almost at the midpoint between control and enabled. One idea could be to treat control and disabled as two control groups; the difference between them is then really caused by the variance (including non-binomial variance), and that could be a basis for an alternative analysis. For now, here is a stored Google Scholar search for papers about analysis with two control groups. I will look into it ... but I am out of time now.

EDIT (after the question included more information): Since there are additional variables like age range, gender, and country, these can be controlled for, which will help to control for and estimate (binomial) overdispersion. One way to do this is using a mixed-effects logistic regression, which is discussed in many posts on this site, for instance:

How can I deal with overdispersion in a logistic (binomial) glm using R? ,

Overdispersion in logistic regression ,

difference between mixed effect logistic regression and logistic regression


  • Thanks for the quick reply. Yes, there are a lot of parameters (age, country, gender) creating potential subgroups, but since the A/B/C groups are randomly sampled, I was hoping the variance between the subgroups would be minimal; otherwise, how can we ever be sure that the difference seen between B and C is due to the test/feature and not to those subgroup parameters? I will look for an answer in the papers you sent. Thanks! –  mb_SW, Mar 11, 2022 at 7:00
  • Please add this new information to the post as an edit; we want to keep all information in the posts themselves and not spread over comments! That way, more people will also see your post, so it can help you. Otherwise, I think a major problem in this case is overdispersion (relative to the binomial), and including other covariates will help with that, making it possible to estimate overdispersion. –  kjetil b halvorsen ♦, Mar 11, 2022 at 13:26
  • (+1) In general, using the additional information is almost always beneficial because, if anything, it allows us to "explain out some of the variance" due to these additional explanatory features. While small at times, this can also highlight interactions we didn't originally anticipate. –  usεr11852, Apr 7, 2022 at 9:41



Hypothesis testing explained in 4 parts

Yuzheng Sun, PhD

As data scientists, we are expected to understand Hypothesis Testing well, but often we don't. This is mainly because our textbooks blend two schools of thought, p-value and significance testing vs. hypothesis testing, inconsistently.

For example, some questions are not obvious unless you have thought through them before:

Are power or beta dependent on the null hypothesis?

Can we accept the null hypothesis? Why?

How does MDE change with alpha holding beta constant?

Why do we use standard error in Hypothesis Testing but not the standard deviation?

Why can’t we be specific about the alternative hypothesis so we can properly model it?

Why is the fundamental tradeoff of the Hypothesis Testing about mistake vs. discovery, not about alpha vs. beta?

Addressing this problem is not easy; the topic of Hypothesis Testing is convoluted. In this article, we will introduce 10 concepts incrementally, aided by visualizations and intuitive explanations. After reading it, you will have clear answers to the questions above, understand these concepts at a first-principles level, and be able to explain them well to your stakeholders.

We break this article into four parts.

Set up the question properly using core statistical concepts, and connect them to Hypothesis Testing, while striking a balance between technical correctness and simplicity. Specifically, 

We emphasize a clear distinction between the standard deviation and the standard error, and why the latter is used in Hypothesis Testing

We explain fully when you can "accept" a hypothesis, when you should say "fail to reject" instead of "accept", and why

Introduce alpha, type I error, and the critical value with the null hypothesis

Introduce beta, type II error, and power with the alternative hypothesis

Introduce minimum detectable effects and the relationship between the factors with power calculations , with a high-level summary and practical recommendations

Part 1 - Hypothesis Testing, the central limit theorem, population, sample, standard deviation, and standard error

In Hypothesis Testing, we begin with a null hypothesis , which generally asserts that there is no effect between our treatment and control groups. Commonly, this is expressed as the difference in means between the treatment and control groups being zero.

The central limit theorem suggests an important property of this difference in means — given a sufficiently large sample size, the underlying distribution of this difference in means will approximate a normal distribution, regardless of the population's original distribution. There are two notes:

1. The distribution of the population for the treatment and control groups can vary, but the observed means (when you observe many samples and calculate many means) are always normally distributed with a large enough sample. Below is a chart, where the n=10 and n=30 correspond to the underlying distribution of the sample means.

Central Limit Theorem

2. Pay attention to “the underlying distribution”. Standard deviation vs. standard error is a potentially confusing concept. Let’s clarify.

Standard deviation vs. Standard error

Let’s declare our null hypothesis as having no treatment effect. Then, to simplify, let’s propose the following normal distribution with a mean of 0 and a standard deviation of 1 as the range of possible outcomes with probabilities associated with this null hypothesis.

Standard Deviation v Standard Error

The language around population, sample, group, and estimators can get confusing. Again, to simplify, let’s forget that the null hypothesis is about the mean estimator, and declare that we can either observe the mean hypothesis once or many times. When we observe it many times, it forms a sample*, and our goal is to make decisions based on this sample.

* For technical folks, the observation is actually about a single sample, many samples are a group, and the difference in groups is the distribution we are talking about as the mean hypothesis. The red curve represents the distribution of the estimator of this difference, and then we can have another sample consisting of many observations of this estimator. In my simplified language, the red curve is the distribution of the estimator, and the blue curve with sample size is the repeated observations of it. If you have a better way to express these concepts without causing confusion, please suggest one.

This probability density function means that if there is one realization from this distribution, the realization can be anywhere on the x-axis, with the relative likelihood given on the y-axis.

If we draw multiple observations , they form a sample . Each observation in this sample follows the property of this underlying distribution – more likely to be close to 0, and equally likely to be on either side, which makes the odds of positive and negative cancel each other out, so the mean of this sample is even more centered around 0.

We use the standard error to represent the error of our “sample mean” . 

The standard error = the standard deviation of the observed sample / sqrt (sample size). 

For a sample size of 30, the standard error is roughly 0.18. Compared with the underlying distribution, the distribution of the sample mean is much narrower.
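A quick simulation (a NumPy sketch, not code from the post) makes the distinction concrete: individual observations scatter with a standard deviation of about 1, while means of samples of size 30 scatter with a standard error of about 0.18.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 30, 100_000

# draw many samples of size n from the null distribution N(0, 1)
samples = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print(samples.std())        # ~1.0  -> standard deviation of individual observations
print(sample_means.std())   # ~0.18 -> standard error of the sample mean
print(1 / np.sqrt(n))       # ~0.1826, the analytical standard error
```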

Standard Deviation and Standard Error 2 Images

In Hypothesis Testing, we try to draw some conclusions – is there a treatment effect or not? – based on a sample. So when we talk about alpha and beta, which are the probabilities of type I and type II errors , we are talking about the probabilities based on the plot of sample means and standard error .

Part 2, The null hypothesis: alpha and the critical value

From Part 1, we stated that a null hypothesis is commonly expressed as the difference in means between the treatment and control groups being zero.

Without loss of generality*, let’s assume the underlying distribution of our null hypothesis is mean 0 and standard deviation 1

Then the sample mean of the null hypothesis is 0 and the standard error of 1/√ n, where n is the sample size.

When the sample size is 30, this distribution has a standard error of ≈0.18 and looks like the chart below. 

Null Hypothesis YZ

*: A note for the technical readers: The null hypothesis is about the difference in means, but here, without complicating things, we made the subtle change to just draw the distribution of this “estimator of this difference in means”. Everything below speaks to this “estimator”.

The reason we have the null hypothesis is that we want to make judgments, particularly whether a  treatment effect exists. But in the world of probabilities, any observation, and any sample mean can happen, with different probabilities. So we need a decision rule to help us quantify our risk of making mistakes.

The decision rule is, let’s set a threshold. When the sample mean is above the threshold, we reject the null hypothesis; when the sample mean is below the threshold, we accept the null hypothesis.

Accepting a hypothesis vs. failing to reject a hypothesis

It’s worth noting that you may have heard of “we never accept a hypothesis, we just fail to reject a hypothesis” and be subconsciously confused by it. The deep reason is that modern textbooks do an inconsistent blend of Fisher’s significance testing and Neyman-Pearson’s Hypothesis Testing definitions and ignore important caveats ( ref ). To clarify:

First of all, we can never “prove” a particular hypothesis given any observations, because there are infinitely many true hypotheses (with different probabilities) given an observation. We will visualize it in Part 3.

Second, “accepting” a hypothesis does not mean that you believe in it, but only that you act as if it were true. So technically, there is no problem with “accepting” a hypothesis.

But, third, when we talk about p-values and confidence intervals, “accepting” the null hypothesis is at best confusing. The reason is that “the p-value above the threshold” just means we failed to reject the null hypothesis. In the strict Fisher’s p-value framework, there is no alternative hypothesis. While we have a clear criterion for rejecting the null hypothesis (p < alpha), we don't have a similar clear-cut criterion for "accepting" the null hypothesis based on beta.

So the dangers in calling “accepting a hypothesis” in the p-value setting are:

Many people misinterpret “accepting” the null hypothesis as “proving” the null hypothesis, which is wrong; 

“Accepting the null hypothesis” is not rigorously defined, and doesn’t speak to the purpose of the test, which is about whether or not we reject the null hypothesis. 

In this article, we will stay consistent within the Neyman-Pearson framework , where “accepting” a hypothesis is legal and necessary. Otherwise, we cannot draw any distributions without acting as if some hypothesis was true.

You don’t need to know the name Neyman-Pearson to understand anything, but pay attention to our language, as we choose our words very carefully to avoid mistakes and confusion.

So far, we have constructed a simple world of one hypothesis as the only truth, and a decision rule with two potential outcomes – one of the outcomes is “reject the null hypothesis when it is true” and the other outcome is “accept the null hypothesis when it is true”. The likelihoods of both outcomes come from the distribution where the null hypothesis is true.

Later, when we introduce the alternative hypothesis and MDE, we will gradually walk into the world of infinitely many alternative hypotheses and visualize why we cannot “prove” a hypothesis.

We save the distinction between the p-value/significance framework vs. Hypothesis Testing in another article where you will have the full picture.

Type I error, alpha, and the critical value

We’re able to construct a distribution of the sample mean for this null hypothesis using the standard error. Since we only have the null hypothesis as the truth of our universe, we can only make one type of mistake – falsely rejecting the null hypothesis when it is true. This is the type I error , and the probability is called alpha . Suppose we want alpha to be 5%. We can calculate the threshold required to make it happen. This threshold is called the critical value . Below is the chart we further constructed with our sample of 30.

Type I Error Alpha Critical Value

In this chart, alpha is the blue area under the curve. The critical value is 0.3. If our sample mean is above 0.3, we reject the null hypothesis. We have a 5% chance of making the type I error.
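That 0.3 critical value can be reproduced directly (a SciPy sketch, not the post's own code): it is the 95th percentile of the null distribution of the sample mean.

```python
import numpy as np
from scipy.stats import norm

n = 30
se = 1 / np.sqrt(n)                      # standard error of the sample mean under the null

alpha = 0.05
critical_value = norm.ppf(1 - alpha, loc=0, scale=se)
print(critical_value)                    # roughly 0.30

# sanity check: the probability mass above the critical value under the null is alpha
print(1 - norm.cdf(critical_value, loc=0, scale=se))  # 0.05
```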

Type I error: Falsely rejecting the null hypothesis when the null hypothesis is true

Alpha: The probability of making a Type I error

Critical value: The threshold to determine whether the null hypothesis is to be rejected or not

Part 3, The alternative hypothesis: beta and power

You may have noticed in part 2 that we only spoke to Type I error – rejecting the null hypothesis when it is true. What about the Type II error – falsely accepting the null hypothesis when it is not true?

But it is weird to call “accepting” false unless we know the truth. So we need an alternative hypothesis which serves as the alternative truth. 

Alternative hypotheses are theoretical constructs

There is an important concept that most textbooks fail to emphasize – that is, you can have infinitely many alternative hypotheses for a given null hypothesis, we just choose one. None of them are more special or “real” than the others. 

Let’s visualize it with an example. Suppose we observed a sample mean of 0.51, what is the true alternative hypothesis?

Alternative hypotheses theoretical

With this visualization, you can see why we have “infinitely many alternative hypotheses” because, given the observation, there is an infinite number of alternative hypotheses (plus the null hypothesis) that can be true, each with different probabilities. Some are more likely than others, but all are possible.

Remember, alternative hypotheses are a theoretical construct. We choose one particular alternative hypothesis to calculate certain probabilities. By now, we should have more understanding of why we cannot “accept” the null hypothesis given an observation. We can’t prove that the null hypothesis is true, we just fail to accept it given the observation and our pre-determined decision rule. 

We will fully reconcile this idea of picking one alternative hypothesis out of the world of infinite possibilities when we talk about MDE. The idea of “accept” vs. “fail to reject” is deeper, and we won’t cover it fully in this article. We will do so when we have an article about the p-value and the confidence interval.

Type II error and Beta

For the sake of simplicity and easy comparison, let's choose an alternative hypothesis with a mean of 0.5 and a standard deviation of 1. Again, with a sample size of 30, the standard error is ≈0.18. There are now two potential "truths" in our simple universe.

Type II Error and Beta

Remember from the null hypothesis, we want alpha to be 5% so the corresponding critical value is 0.30. We modify our rule as follows:

If the observation is above 0.30, we reject the null hypothesis and accept the alternative hypothesis ; 

If the observation is below 0.30, we accept the null hypothesis and reject the alternative hypothesis .

Reject alternative and accept null

With the introduction of the alternative hypothesis, the alternative “(hypothesized) truth”, we can call “accepting the null hypothesis and rejecting the alternative hypothesis” a mistake – the Type II error. We can also calculate the probability of this mistake. This is called beta, which is illustrated by the red area below.

Null hypothesis alternative hypothesis

From the visualization, we can see that beta is conditional on the alternative hypothesis and the critical value. Let’s elaborate on these two relationships one by one, very explicitly, as both of them are important.

First, Let’s visualize how beta changes with the mean of the alternative hypothesis by setting another alternative hypothesis where mean = 1 instead of 0.5

Sample Size 30 for Null and Alternative Hypothesis

Beta changes from 13.7% to 0.0%. Namely, beta is the probability of falsely rejecting a particular alternative hypothesis when we assume it is true. When we assume a different alternative hypothesis is true, we get a different beta. So strictly speaking, beta only speaks to the probability of falsely rejecting a particular alternative hypothesis when it is true . Nothing else. It's only under other conditions that "rejecting the alternative hypothesis" implies "accepting" the null hypothesis or "failing to accept the null hypothesis". We will further elaborate when we talk about the p-value and the confidence interval in another article. But what we have talked about so far is true and enough for understanding power.
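A short sketch (again using SciPy, not the post's code) reproduces both numbers: beta is the mass of the alternative distribution that falls below the critical value.

```python
import numpy as np
from scipy.stats import norm

n = 30
se = 1 / np.sqrt(n)
critical_value = norm.ppf(0.95, loc=0, scale=se)   # ~0.30, derived from the null hypothesis

for alt_mean in (0.5, 1.0):
    # beta: probability the sample mean falls below the critical value
    # when this particular alternative hypothesis is true
    beta = norm.cdf(critical_value, loc=alt_mean, scale=se)
    print(alt_mean, round(beta, 3), round(1 - beta, 3))
# mean 0.5 -> beta ~0.137 (power ~0.863); mean 1.0 -> beta ~0.000 (power ~1.0)
```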

Second, there is a relationship between alpha and beta. Namely, given the null hypothesis and the alternative hypothesis, alpha would determine the critical value, and the critical value determines beta. This speaks to the tradeoff between mistake and discovery. 

If we tolerate more alpha, we will have a smaller critical value, and for the same beta, we can detect a smaller alternative hypothesis

If we tolerate more beta, we can also detect a smaller alternative hypothesis. 

In short, if we tolerate more mistakes (either Type I or Type II), we can detect a smaller true effect. Mistake vs. discovery is the fundamental tradeoff of Hypothesis Testing.

So tolerating more mistakes leads to more chance of discovery. This is the concept of MDE that we will elaborate on in part 4.

Finally, we’re ready to define power. Power is an important and fundamental topic in statistical testing, and we’ll explain the concept in three different ways.

Three ways to understand power

First, the technical definition of power is 1−β. It represents, given an alternative hypothesis and given our null hypothesis, sample size, and decision rule (alpha = 0.05), the probability that we accept this particular alternative hypothesis. We visualize it as the yellow area below.

Understand Power Hypothesis

Second, power is really intuitive in its definition. A real-world example is trying to determine the most popular car manufacturer in the world. If I observe one car and see one brand, my observation is not very powerful. But if I observe a million cars, my observation is very powerful. Powerful tests mean that I have a high chance of detecting a true effect.

Third, to illustrate the two concepts concisely, let’s run a visualization by just changing the sample size from 30 to 100 and see how power increases from 86.3% to almost 100%.

Same size from 30 to 100

As the graph shows, power increases with sample size . The reason is that the distributions of both the null hypothesis and the alternative hypothesis become narrower as their sample means get more accurate; holding alpha constant, the critical value shrinks, so we are less likely to make a type II error.  

Type II error: Failing to reject the null hypothesis when the alternative hypothesis is true

Beta: The probability of making a type II error

Power: The ability of the test to detect a true effect when it’s there

Part 4, Power calculation: MDE

The relationship between MDE, the alternative hypothesis, and power

Now, we are ready to tackle the most nuanced definition of them all: Minimum detectable effect (MDE). First, let’s make the sample mean of the alternative hypothesis explicit on the graph with a red dotted line.

Relationship between MDE

What if we keep the same sample size but want power to be 80%? This is when we recall from the previous chapter that "alternative hypotheses are theoretical constructs". We can pick a different alternative that corresponds to 80% power. After some calculations, we find that it is the alternative hypothesis with mean = 0.45 (if we keep the standard deviation at 1).

MDE Alternative Hypothesis pt 2

This is where we reconcile the concept of “infinitely many alternative hypotheses” with the concept of minimum detectable delta. Remember that in statistical testing, we want more power. The “ minimum ” in the “ minimum detectable effect”, is the minimum value of the mean of the alternative hypothesis that would give us 80% power. Any alternative hypothesis with a mean to the right of MDE gives us sufficient power.

In other words, there are indeed infinitely many alternative hypotheses to the right of this mean 0.45. The particular alternative hypothesis with a mean of 0.45 gives us the minimum value where power is sufficient. We call it the minimum detectable effect, or MDE.

Not enough power MDE

The complete definition of MDE from scratch

Let’s go through how we derived MDE from the beginning:

We fixed the distribution of sample means of the null hypothesis, and fixed sample size, so we can draw the blue distribution

For our decision rule, we require alpha to be 5%. We derived that the critical value shall be 0.30 to make 5% alpha happen

We fixed the alternative hypothesis to be normally distributed with a standard deviation of 1 so the standard error is 0.18, the mean can be anywhere as there are infinitely many alternative hypotheses

For our decision rule, we require beta to be 20% or less, so our power is 80% or more. 

We derived that the minimum value of the observed mean of the alternative hypothesis that we can detect with our decision rule is 0.45. Any value above 0.45 would give us sufficient power.
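Putting the derivation above into code (a sketch under the same assumptions, not the post's implementation): the MDE is the alternative mean whose sampling distribution puts exactly 80% of its mass above the critical value.

```python
import numpy as np
from scipy.stats import norm

def mde(n, alpha=0.05, power=0.80, sd=1.0):
    """Minimum detectable effect for a one-sided test of the mean."""
    se = sd / np.sqrt(n)
    critical_value = norm.ppf(1 - alpha, loc=0, scale=se)   # from the null hypothesis
    # shift the alternative mean until P(sample mean > critical value) = power
    return critical_value + norm.ppf(power) * se

print(mde(30))   # roughly 0.45, as derived above
# try mde(100) to see how a larger sample shrinks the MDE (discussed below)
```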

How MDE changes with sample size

Now, let’s tie everything together by increasing the sample size, holding alpha and beta constant, and see how MDE changes.

How MDE changes with sample size

Narrower distribution of the sample mean + holding alpha constant -> smaller critical value from 0.3 to 0.16

+ holding beta constant -> MDE decreases from 0.45 to 0.25

This is the other key takeaway:  The larger the sample size, the smaller of an effect we can detect, and the smaller the MDE.

This is a critical takeaway for statistical testing. It suggests that even for companies without large sample sizes, A/B testing can reliably detect large treatment effects.

Statistical Power Curve

Summary of Hypothesis Testing

Let’s review all the concepts together.

Assuming the null hypothesis is correct:

Alpha: When the null hypothesis is true, the probability of rejecting it

Critical value: The threshold to determine rejecting vs. accepting the null hypothesis

Assuming an alternative hypothesis is correct:

Beta: When the alternative hypothesis is true, the probability of rejecting it

Power: The chance that a real effect will produce significant results

Power calculation:

Minimum detectable effect (MDE): Given sample sizes and distributions, the minimum mean of alternative distribution that would give us the desired alpha and sufficient power (usually alpha = 0.05 and power >= 0.8)

Relationship among the factors, all else equal: Larger sample, more power; Larger sample, smaller MDE

Everything we talk about is under the Neyman-Pearson framework. There is no need to mention the p-value and significance under this framework. Blending the two frameworks is the inconsistency brought by our textbooks. Clarifying the inconsistency and correctly blending them are topics for another day.

Practical recommendations

That’s it. But it’s only the beginning. In practice, there are many crafts in using power well, for example:

Why peeking introduces a behavior bias, and how to use sequential testing to correct it

Why having multiple comparisons affects alpha, and how to use Bonferroni correction

The relationship between sample size, duration of the experiment, and allocation of the experiment?

Treat your allocation as a resource for experimentation, understand when interaction effects are okay, and when they are not okay, and how to use layers to manage

Practical considerations for setting an MDE

Also, in the above examples, we fixed the distribution, but in reality, the variance of the distribution plays an important role. There are different ways of calculating the variance and different ways to reduce variance, such as CUPED, or stratified sampling.

Related resources:

How to calculate power with an uneven split of sample size: https://blog.statsig.com/calculating-sample-sizes-for-a-b-tests-7854d56c2646

Real-life applications: https://blog.statsig.com/you-dont-need-large-sample-sizes-to-run-a-b-tests-6044823e9992


Objective Bayesian multiple testing for k normal populations

  • Research Article
  • Published: 29 July 2024

Sang Gil Kang and Yongku Kim

This article proposes objective Bayesian multiple testing procedures for a normal model. The challenging task of considering all the configurations of true and false null hypotheses is addressed here by ordering the null hypotheses based on their Bayes factors. This approach reduces the size of the compared models for posterior search from \(2^k\) to \(k+1\) , for k null hypotheses. Furthermore, the consistency of the proposed multiple testing procedures is established and their behavior is analyzed with simulated and real examples. In addition, the proposed procedures are compared with classical and Bayesian multiple testing procedures in all the possible configurations of true and false ordered null hypotheses.



Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

Abramovich, F., & Angelini, C. (2006). Bayesian maximum a posteriori multiple testing procedure. Sankhya, 68, 436–460.

Abramovich, F., Grinshtein, V., & Pensky, M. (2007). On optimality of Bayesian testimation in the normal means problem. The Annals of Statistics, 35, 2261–2286.

Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.

Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91, 109–122.

Casella, G., & Moreno, E. (2006). Objective Bayesian variable selection. Journal of the American Statistical Association, 101, 157–167.

Dunnett, C., & Tamhane, A. C. (1991). Step-down multiple tests for comparing treatments with a control in unbalanced one-way layouts. Statistics in Medicine, 11, 1057–1063.

Dunnett, C., & Tamhane, A. C. (1992). A step-up multiple test procedure. Journal of the American Statistical Association, 87, 162–170.

Efron, B. (2003). Robbins, empirical Bayes and microarrays. The Annals of Statistics, 31, 366–378.

Efron, B., Tibshirani, R., Storey, J. D., & Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96, 1151–1160.

Finner, H. (1993). On a monotonicity problem in step-down multiple test procedures. Journal of the American Statistical Association, 88, 920–923.

Girón, F. J., Martínez, M. L., Moreno, E., & Torres, F. (2006). Objective testing procedures in linear models. Scandinavian Journal of Statistics, 33, 765–784.

Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. Wiley.

Holm, S. (1999). Multiple confidence sets based on stagewise tests. Journal of the American Statistical Association, 94, 489–495.

Hsu, J. C. (1996). Multiple comparison: Theory and methods. Chapman & Hall/CRC.

Kang, S. G., Lee, W. D., & Kim, Y. (2022). Objective Bayesian variable selection in the linear regression model. Journal of Statistical Computation and Simulation, 92, 1133–1157.

Liu, W. (1997). Stepwise tests when the test statistics are independent. The Australian Journal of Statistics, 39, 169–177.

Moreno, E., Bertolino, F., & Racugno, W. (1998). An intrinsic limiting procedure for model selection and hypotheses testing. Journal of the American Statistical Association, 93, 1451–1460.

Morris, C. M. (1987). Discussion on ‘Reconciling Bayesian and frequentist evidence in the one-sided testing problem’ (by Casella and Berger). Journal of the American Statistical Association, 82, 106–111.

Romano, A. (1977). Applied statistics for science and industry. Allyn and Bacon.

Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. The Annals of Statistics, 30, 239–257.

Sarkar, S. K., & Chen, J. (2004). A Bayesian stepwise multiple testing procedure. Technical Report, Temple University.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479–498.

Tamhane, A. C., Liu, W., & Dunnett, C. W. (1998). A generalized step-up-down multiple test procedure. The Canadian Journal of Statistics, 26, 55–68.

Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley.


Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2023-00240494) and the Learning & Academic research institution for Master's/PhD students, and Postdocs (LAMP) Program of the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (RS-2023-00301914).

Author information

Authors and Affiliations

Department of Computer and Data Information, Sangji University, Wonju, 26339, Korea

Sang Gil Kang

Department of Statistics, Kyungpook National University, Daegu, 41566, Korea

KNU G-LAMP Research Center, Institute of Basic Sciences, Kyungpook National University, Daegu, 41566, Korea

Yongku Kim


Corresponding author

Correspondence to Yongku Kim.

Ethics declarations

Conflict of interest

We have no conflicts of interest to disclose.

Appendix 1: Proof of Lemma 1

Consider the models \(M_{i1}\),

and the models \(M_{i2}\),

where \(c_{i1}\) and \(c_{i2}\) are arbitrary positive constants, \(\varvec{\theta }_{i1}=\tau _i\), and \(\varvec{\theta }_{i2}=(\mu _i,\sigma _i^2)\). For the minimal training sample vector \(\mathbf{z}\), we have

Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{i2}\) becomes

Hence Lemma 1 is proved.

Appendix 2: A brief summary of intrinsic priors

We give a brief summary of the intrinsic Bayes factor and the intrinsic methodology. The key idea is to split the sample \(\mathbf{x}\) into two parts, \(\mathbf{x}=(x(l), x(n-l))\). The part \(x(l)=(x_1,\cdots ,x_l)\), the training sample, is used to convert the improper prior \(\pi _i^N(\theta _i)\) into a proper distribution

where \(m_i^N(x(l))=\int f_i(x(l) \vert \theta _i)\pi _i^N(\theta _i)\,d\theta _i\), \(i=1,2\). The Bayes factor is then computed using the remaining data \(x(n-l)\), with \(\pi _i(\theta _i \vert x(l))\) as the prior distribution. The resulting partial Bayes factor is

The partial Bayes factor \(B_{21}(x(n-l) \vert x(l))\) is well defined only if \(x(l)\) is such that \(0< m_i^N(x(l))<\infty\), \(i=1,2\). If, in addition, there is no proper subsample of \(x(l)\) for which the second inequality holds, \(x(l)\) is called a minimal training sample.

The partial Bayes factor depends on the specific training sample \(x(l)\). To avoid the difficulty of choosing \(x(l)\), Berger and Pericchi (1996) proposed using a minimal training sample to compute \(B_{21}(x(n-l) \vert x(l))\) and then averaging over all the minimal training samples contained in the sample. This gives the arithmetic intrinsic Bayes factor (AIBF) of \(M_2\) against \(M_1\) as

where \(L\) is the number of minimal training samples \(x(l)\) contained in \(\mathbf{x}\).
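For reference, these quantities can be written compactly in the notation above; this is the standard form in which the intrinsic Bayes factor is usually presented, following Berger and Pericchi (1996):

\[
\pi _i(\theta _i \mid x(l)) = \frac{f_i(x(l) \mid \theta _i)\,\pi _i^N(\theta _i)}{m_i^N(x(l))}, \qquad i=1,2,
\]

\[
B_{21}(x(n-l) \mid x(l)) = \frac{\int f_2(x(n-l) \mid \theta _2, x(l))\,\pi _2(\theta _2 \mid x(l))\,d\theta _2}{\int f_1(x(n-l) \mid \theta _1, x(l))\,\pi _1(\theta _1 \mid x(l))\,d\theta _1} = B_{21}^{N}(\mathbf{x})\,B_{12}^{N}(x(l)),
\]

\[
B_{21}^{AI}(\mathbf{x}) = B_{21}^{N}(\mathbf{x})\,\frac{1}{L}\sum _{l=1}^{L} B_{12}^{N}(x(l)),
\]

where \(B_{21}^{N}\) denotes the Bayes factor computed directly with the improper priors \(\pi _i^N\).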

Note that the intrinsic Bayes factor is not an actual Bayes factor. Further, the stability of the AIBF is a matter of concern: for a given sample \(\mathbf{x}\), the number of minimal training samples might be small, and minor changes in the data could cause this number to vary substantially. Moreover, the equality \(B_{21}^{AI}(\mathbf{x})=1/B_{12}^{AI}(\mathbf{x})\) is not necessarily satisfied, so the coherence condition does not hold.

For coherence, it is important to know whether \(B_{21}^{AI}(\mathbf{x})\) corresponds to an actual Bayes factor for some sensible prior; if so, the consistency of \(B_{21}^{AI}(\mathbf{x})\) is assured. With the so-called intrinsic priors, this question was answered asymptotically by Berger and Pericchi (1996): there are priors \(\pi _1^I(\theta _1)\) and \(\pi _2^I(\theta _2)\) for which the corresponding Bayes factor

and \(B_{21}^{AI}(\mathbf{x})\) are asymptotically equivalent under the two models \(M_1\) and \(M_2\). Note that if we use intrinsic priors for computing the Bayes factor, instead of the improper priors we started from, coherence is assured.
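To make the averaging step and the coherence remark concrete, here is a small numerical sketch for the simplest case of testing a single normal mean with known unit variance, \(M_1:\mu =0\) versus \(M_2:\mu \ne 0\), with the improper flat prior \(\pi _2^N(\mu )=1\), so that a single observation is a minimal training sample. The setup, the flat prior, and the helper names are illustrative assumptions chosen for brevity; they are not the k-population models of the paper. The product \(B_{21}^{AI}\,B_{12}^{AI}\) comes out different from one, which is the lack of coherence noted above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.4, scale=1.0, size=10)   # toy data

def m1N(y):
    """Marginal of the data under M1: mu = 0 (no free parameter)."""
    return np.prod(norm.pdf(y, loc=0.0, scale=1.0))

def m2N(y):
    """Marginal under M2 with the improper flat prior pi^N(mu) = 1."""
    val, _ = quad(lambda mu: np.prod(norm.pdf(y, loc=mu, scale=1.0)),
                  -np.inf, np.inf)
    return val

def B21N(y):
    """Bayes factor of M2 against M1 computed with the improper priors."""
    return m2N(y) / m1N(y)

# Minimal training samples here are the single observations x_i.
B21_AI = B21N(x) * np.mean([m1N(np.atleast_1d(xi)) / m2N(np.atleast_1d(xi)) for xi in x])
B12_AI = (1.0 / B21N(x)) * np.mean([m2N(np.atleast_1d(xi)) / m1N(np.atleast_1d(xi)) for xi in x])

print("B21_AI          :", B21_AI)
print("B12_AI          :", B12_AI)
print("B21_AI * B12_AI :", B21_AI * B12_AI)   # generally != 1: the AIBF is not coherent
```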

Berger and Pericchi (1996) showed that intrinsic priors satisfy the functional equations

The expectations in these equations are taken with respect to \(f(x(l)\vert \theta _1)\) and \(f(x(l)\vert \theta _2)\), respectively; \(\psi _2(\theta _1)\) denotes the limit of the maximum likelihood estimator \(\hat{\theta }_2(\mathbf{x})\) under model \(M_1\) at the point \(\theta _1\), and \(\psi _1(\theta _2)\) denotes the limit of the maximum likelihood estimator \(\hat{\theta }_1(\mathbf{x})\) under model \(M_2\) at the point \(\theta _2\). The main difficulty with intrinsic priors is that they might not be unique (Moreno, 1997). In nested models, Eq. (21) reduces to a single equation with two unknown functions, so it is apparent that the solution is not unique. A procedure for choosing a specific intrinsic prior was given in Moreno et al. (1998).

Appendix 3: Proof of Lemma 2

We first compute the Bayes factor for comparing model \(M_{i2}\) versus \(M_{i1}\) with the intrinsic prior \(\pi ^I(\varvec{\theta }_{i2})\). Now

Integrating (22) with respect to \(\tau _i\), we obtain

Next we have

Integrating (24) with respect to \(\mu _i\), we get

where \(s_i^2=\sum _{j=1}^{n_i} (x_{ij}-\bar{x}_i)^2\). Let \(w=\sigma _i^2\) and \(z=\tau _i^2/\sigma _i^2\). Integrating (25) with respect to \(w\), we get

Hence Lemma 2 is proved.

Appendix 4: Proof of Theorem 1

Consider the models \(M_{(i)}\),

and the models \(M_{(k)}\),

where \(c_{(i)}\) and \(c_{(k)}\) are arbitrary positive constants, \(\varvec{\theta }_{(i)}=(\delta _{(1)},\cdots ,\delta _{(i)},\tau _{(1)},\cdots ,\tau _{(k)})\), and \(\varvec{\theta }_{(k)}=(\mu _{(1)},\cdots ,\mu _{(k)},\sigma _{(1)},\cdots ,\sigma _{(k)})\). For the minimal training sample vector \(\mathbf{z}\), we have

Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{(k)}\) becomes

Hence Theorem 1 is proved.

Appendix 5: Proof of Theorem 3

Consider the models \(M_{(0)}\),

and the models \(M_{(i)}\),

where \(c_{(0)}\) and \(c_{(i)}\) are arbitrary positive constants, \(\varvec{\theta }_{(0)}=(\tau _{(1)},\cdots ,\tau _{(k)})\), and \(\varvec{\theta }_{(i)}=(\mu _{(1)},\cdots ,\mu _{(i)},\sigma _{(1)},\cdots ,\sigma _{(k)})\). For the minimal training sample vector \(\mathbf{z}\), we have

Therefore, the conditional intrinsic prior of \(\varvec{\theta }_{(i)}\) becomes

Hence Theorem 3 is proved.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Kang, S.G., Kim, Y. Objective Bayesian multiple testing for k normal populations. J. Korean Stat. Soc. (2024). https://doi.org/10.1007/s42952-024-00281-4


Received: 15 February 2024

Accepted: 07 July 2024

Published: 29 July 2024

DOI: https://doi.org/10.1007/s42952-024-00281-4


Keywords

  • Bayes factor
  • Intrinsic prior
  • Model selection
  • Multiple hypothesis testing

IMAGES

  1. Hypothesis Testing In Machine Learning While Using Python- Tutorial

  2. An Interactive Guide to Hypothesis Testing in Python

  3. Bayesian AB Testing using Python

  4. Statistical Hypothesis Testing- Data Science with Python

  5. Bayesian Inference in Python

  6. Bayesian Analysis with Python

VIDEO

  1. Bayesian Statistics 11172023

  2. Bayesian Data Science by Simulation

  3. Lecture 7: Bayesian Detection

  4. Hypothesis Testing Using Python : Case Study

  5. Bayesian linear regression, gradient descent linear regression in tamil (unit2)-AL3451 #ML

  6. How to do Bayesian Hyperparameter optimization with Python

COMMENTS

  1. Bayesian Inference in Python: A Comprehensive Guide with Examples

    Let us look at the output of the same. Bayes Theorem Plot Bayes Theorem Output Real-World Example: Predicting Website Conversion Rates with Bayesian Inference. Let us now look at the case of a website.

  2. Bayesian Statistics in Python

    Here we see two important things. First, we see that the prior is substantially wider than the likelihood, which occurs because there is much more data going into the likelihood (1000 data points) compared to the prior (100 data points), and more data reduces our uncertainty.

  3. Chapter 16 Introduction to Bayesian hypothesis testing

    16.1 Hypothesis testing, relative evidence, and the Bayes factor. In the Frequentist null-hypothesis significance testing procedure, we defined a hypothesis test in terms of comparing two nested models, a general MODEL G and a restricted MODEL R which is a special case of MODEL G.

  4. 11 Bayesian hypothesis testing

    This chapter introduces common Bayesian methods of testing what we could call statistical hypotheses. A statistical hypothesis is a hypothesis about a particular model parameter or a set of model parameters.

  5. Bayesian Hypothesis Testing with PyMC

    Lately I've been reading the excellent, open source book Probabilistic Programming and Bayesian Methods for Hackers. The book's prologue opens with the following line: The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis.

  6. A Review of Bayesian Hypothesis Testing and Its Practical

    where larger values of \(BF_{01}\) represent higher evidence in favor of the null hypothesis. The Bayes factor can be viewed as a summary of the evidence given by data in support of one hypothesis in contrast to another [7,17]. Reporting Bayes factors can be guided by setting customized thresholds according to particular applications.

  7. Data Science from Scratch (ch7)

    Table of contents. Central Limit Theorem; Hypothesis Testing; p-Values; Confidence Intervals; Connecting dots with Python; Overview. This is a continuation of my progress through Data Science from Scratch by Joel Grus.

  8. Bayesian A/B Testing from Scratch

    Explore and run machine learning code with Kaggle Notebooks | Using data from No attached data sources

  9. 17 Statistical Hypothesis Tests in Python (Cheat Sheet)

    Quick-reference guide to the 17 statistical hypothesis tests that you need in applied machine learning, with sample code in Python. Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project. In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests.

  10. bayesian-testing · PyPI

    Bayesian A/B testing. bayesian_testing is a small package for quick evaluation of A/B (or A/B/C/...) tests using a Bayesian approach. Implemented tests: BinaryDataTest. Input data - binary data ([0, 1, 0, ...]); designed for conversion-like A/B testing. NormalDataTest. Input data - normal data with unknown variance; designed for normal data A/B testing. (A from-scratch sketch of this kind of conversion test appears after this list.)

  11. Bayesian Thinking in Modern Data Science

    Bayes' Theorem (Source: Eric Castellanos Blog) where: P(A|B) is the posterior probability of the hypothesis. P(B|A) is the likelihood of the evidence given the hypothesis. P(A) is the prior probability of the hypothesis. P(B) is the total probability of the evidence.

  12. Bayesian Hypothesis Testing Illustrated: An Introduction for Software

    Bayesian approaches have been on the radar of the software engineering research community since as early as 1999. Chunali's doctoral work [] applied Bayesian thinking for calibrating cost estimation models with prior expert knowledge. In 2010, Sridrahan and Namin [] delivered a tutorial at the International Conference on Software Engineering to advocate the use of Bayesian approaches for ...

  13. How to Perform Hypothesis Testing in Python (With Examples)

    A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis. This tutorial explains how to perform the following hypothesis tests in Python: One sample t-test; Two sample t-test; Paired samples t-test; Let's jump in!

  14. Bayesian Hypothesis Testing

    Example: Suppose that the random variable \(X\) is transmitted over a communication channel. Assume that the received signal is given by \(Y = X + W\), ...

  15. 2020-05-28-01-Introduction-to-hypothesis-testing.ipynb

    Permutation sampling is a great way to simulate the hypothesis that two variables have identical probability distributions. This is often a hypothesis you want to test, so in this exercise, you will write a function to generate a permutation sample from two data sets.

  16. Validation of Bayesian Hypothesis for AB test

    I have been following this methodology to implement a Bayesian A/B test with Python on a new search engine feature that helps users find products more accurately. To be more precise, I split my users across three groups: control (A), disabled (B), enabled (C). Users come from different age ranges, genders, and countries, so they are randomly sampled into those groups to minimise the variance related ...

  17. Hypothesis Testing explained in 4 parts

    As data scientists, Hypothesis Testing is expected to be well understood, but often not in reality. It is mainly because our textbooks blend two schools of thought - p-value and significance testing vs. hypothesis testing - inconsistently.

  18. Design analysis is not just about statistical significance and power

    I think this frequentist idea is really important even when using Bayes in fields like psychology and linguistics, because our experiments are in fact replicable and should have the expected frequentist properties.

  19. Objective Bayesian multiple testing for k normal populations

    Within this set, we restrict our search for the most plausible family of true and false null hypotheses. In the objective Bayesian approach to multiple testing, the underlying nonnested models must be encompassed in one of two ways: by encompassing all the models into the full model with all the possible alternative hypotheses or by encompassing the simple model with only the null hypotheses ...
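Several of the resources above (items 8, 10, and 16) revolve around Bayesian A/B testing of conversion-like, binary outcomes. As promised in item 10, here is a minimal from-scratch sketch of that idea using a conjugate Beta-Binomial model; the variant names, visitor and conversion counts, and the Beta(1, 1) prior are made-up illustrative choices, and the simulation-based comparison is just one common way to summarize the posteriors.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)

# Hypothetical conversion data for two variants: (visitors, conversions).
data = {"A": (1000, 48), "B": (1000, 63)}

# Beta(1, 1) prior; with binomial data the posterior is Beta(1 + conversions, 1 + misses).
posteriors = {
    name: beta(1 + conv, 1 + visitors - conv)
    for name, (visitors, conv) in data.items()
}

# Posterior summaries for each variant's conversion rate.
for name, post in posteriors.items():
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{name}: posterior mean {post.mean():.4f}, 95% credible interval ({lo:.4f}, {hi:.4f})")

# Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors.
draws_a = posteriors["A"].rvs(100_000, random_state=rng)
draws_b = posteriors["B"].rvs(100_000, random_state=rng)
print("P(B beats A) ≈", np.mean(draws_b > draws_a))
```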