Teach yourself statistics

Hypothesis Test of a Proportion (Small Sample)

This lesson explains how to test a hypothesis about a proportion when a simple random sample has fewer than 10 successes or 10 failures, a situation that often occurs with small samples. (In a previous lesson, we showed how to conduct a hypothesis test for a proportion when a simple random sample includes at least 10 successes and 10 failures.)

The approach described in this lesson is appropriate, as long as the sample includes at least one success and one failure. The key steps are:

  • Formulate the hypotheses to be tested. This means stating the null hypothesis and the alternative hypothesis.
  • Determine the sampling distribution of the proportion. If the sample proportion is the outcome of a binomial experiment, the sampling distribution will be binomial. If it is the outcome of a hypergeometric experiment, the sampling distribution will be hypergeometric.
  • Specify the significance level. (Researchers often set the significance level equal to 0.05 or 0.01, although other values may be used.)
  • Based on the hypotheses, the sampling distribution, and the significance level, define the region of acceptance.
  • Test the null hypothesis. If the sample proportion falls within the region of acceptance, do not reject the null hypothesis; otherwise, reject the null hypothesis.

The following examples illustrate how to test hypotheses with small samples. The first example involves a binomial experiment; the second, a hypergeometric experiment.

Example 1: Sampling With Replacement

Suppose an urn contains 30 marbles. Some marbles are red, and the rest are green. A researcher hypothesizes that the urn contains 15 or more red marbles. The researcher randomly samples five marbles, with replacement, from the urn. Two of the selected marbles are red, and three are green. Based on the sample results, should the researcher reject the null hypothesis? Use a significance level of 0.20.

Solution: There are five steps in conducting a hypothesis test, as described in the previous section. We work through each of the five steps below:

Null hypothesis: P >= 0.50

Alternative hypothesis: P < 0.50

Given those inputs (a binomial distribution where the true population proportion is equal to 0.50), the sampling distribution of the proportion can be determined. It appears in the table below, which shows individual probabilities for single events and cumulative probabilities for multiple events. (Elsewhere on this website, we showed how to compute binomial probabilities that form the body of the table.)

Number of red marbles in sample | Sample proportion | Probability | Cumulative probability
0 | 0.0 | 0.03125 | 0.03125
1 | 0.2 | 0.15625 | 0.1875
2 | 0.4 | 0.3125 | 0.5
3 | 0.6 | 0.3125 | 0.8125
4 | 0.8 | 0.15625 | 0.96875
5 | 1.0 | 0.03125 | 1.00
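The probabilities in the table can be reproduced in a few lines of Python (a sketch, assuming `scipy` is available):

```python
# Sampling distribution of the number of red marbles under H0 (P = 0.50),
# for a binomial experiment with n = 5 draws (sampling with replacement).
from scipy.stats import binom

n, p = 5, 0.50
for k in range(n + 1):
    print(k, k / n, binom.pmf(k, n, p), binom.cdf(k, n, p))

# P(0 or 1 red marbles) -- the actual significance level of the test
print(binom.cdf(1, n, p))  # 0.1875
```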
  • Specify the significance level. The significance level was set at 0.20. (This means that the probability of making a Type I error is 0.20, assuming that the null hypothesis is true.)

Because the sampling distribution is discrete, we cannot define a region of acceptance whose significance level is exactly 0.20; however, we can define one for which the significance level is no more than 0.20. From the table, we see that if the true population proportion is equal to 0.50, we would be very unlikely to pick 0 or 1 red marbles in our sample of 5 marbles; the probability of selecting 0 or 1 red marbles is 0.1875. Therefore, if we let the significance level equal 0.1875, we can define the region of rejection as any sampled outcome that includes only 0 or 1 red marble (i.e., a sample proportion equal to 0 or 0.20), and the region of acceptance as any sampled outcome that includes at least 2 red marbles. This is equivalent to a sample proportion greater than or equal to 0.40.

  • Test the null hypothesis. Since the sample proportion (0.40) falls within the region of acceptance, we cannot reject the null hypothesis.

Example 2: Sampling Without Replacement

The Acme Advertising company has 25 clients. Account executives at Acme claim that 80 percent of these clients are very satisfied with the service they receive. To test that claim, Acme's CEO commissions a survey of 10 clients. Survey participants are randomly sampled, without replacement, from the client population. Six of the ten sampled customers (i.e., 60 percent) say that they are very satisfied. Based on the sample results, should the CEO accept or reject the hypothesis that 80 percent of Acme's clients are very satisfied? Use a significance level of 0.10.

Null hypothesis: P >= 0.80

Alternative hypothesis: P < 0.80

Given those inputs (a hypergeometric distribution where 20 of 25 clients are very satisfied), the sampling distribution of the proportion can be determined. It appears in the table below, which shows individual probabilities for single events and cumulative probabilities for multiple events. (Elsewhere on this website, we showed how to compute hypergeometric probabilities that form the body of the table.)

Number of satisfied clients in sample | Sample proportion | Probability | Cumulative probability
4 or less | 0.4 or less | 0.00 | 0.00
5 | 0.5 | 0.00474 | 0.00474
6 | 0.6 | 0.05929 | 0.06403
7 | 0.7 | 0.23715 | 0.30119
8 | 0.8 | 0.38538 | 0.68656
9 | 0.9 | 0.25692 | 0.94348
10 | 1.0 | 0.05652 | 1.00
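These hypergeometric probabilities can also be checked in Python (a sketch, assuming `scipy` is available; note scipy's parameter order is population size, successes in the population, sample size):

```python
# Sampling distribution of the number of very satisfied clients in a sample
# of 10 drawn without replacement from 25 clients, 20 of whom are satisfied.
from scipy.stats import hypergeom

M, K, n = 25, 20, 10          # population size, successes in population, sample size
rv = hypergeom(M, K, n)
for k in range(5, n + 1):
    print(k, rv.pmf(k), rv.cdf(k))

# P(6 or fewer very satisfied clients) -- the actual significance level
print(rv.cdf(6))  # about 0.064
```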
  • Specify the significance level. The significance level was set at 0.10. (This means that the probability of making a Type I error is 0.10, assuming that the null hypothesis is true.)

Because the sampling distribution is discrete, we cannot define a region of acceptance whose significance level is exactly 0.10; however, we can define one for which the significance level is no more than 0.10. From the table, we see that if the true proportion of very satisfied clients is equal to 0.80, we would be very unlikely to find fewer than 7 very satisfied clients in our sample; the probability of having 6 or fewer very satisfied clients in the sample is 0.064. Therefore, if we let the significance level equal 0.064, we can define the region of rejection as any sampled outcome that includes 6 or fewer very satisfied clients, and the region of acceptance as any sampled outcome that includes 7 or more very satisfied clients. This is equivalent to a sample proportion greater than or equal to 0.70.

  • Test the null hypothesis. Since the sample proportion (0.60) falls outside the region of acceptance, we reject the null hypothesis at the 0.064 level of significance.


Best Practices for Using Statistics on Small Sample Sizes


A common misconception is that you cannot use statistics with small sample sizes. Put simply, this is wrong: there are appropriate statistical methods to deal with small sample sizes.

Although one researcher’s “small” is another’s “large,” when I refer to small sample sizes I mean studies with typically between 5 and 30 users total, a size very common in usability studies.

But user research isn’t the only field that deals with small sample sizes. Studies involving fMRIs, which cost a lot to operate, have limited sample sizes [pdf], as do studies using laboratory animals.

While there are equations that allow us to properly handle small “n” studies, it’s important to know that there are limitations to these smaller sample studies: you are limited to seeing big differences or big “effects.”

To put it another way, statistical analysis with small samples is like making astronomical observations with binoculars. You are limited to seeing big things: planets, stars, moons and the occasional comet. But just because you don’t have access to a high-powered telescope doesn’t mean you cannot conduct astronomy. Galileo, in fact, discovered Jupiter’s moons with a telescope with the same power as many of today’s binoculars.

The same holds for statistics: just because you don’t have a large sample size doesn’t mean you cannot use statistics. Again, the key limitation is that you are limited to detecting large differences between designs or measures.

Fortunately, in user-experience research we are often most concerned about these big differences—differences users are likely to notice, such as changes in the navigation structure or the improvement of a search results page.

Here are the procedures we’ve tested for common, small-sample user research; we will cover them all at the UX Boot Camp in Denver next month.

If you need to compare completion rates, task times, or rating-scale data for two independent groups, there are two procedures you can use for small and large sample sizes. The right one depends on the type of data you have: continuous or discrete-binary.

Comparing Means: If your data is generally continuous (not binary), such as task time or rating scales, use the two-sample t-test. It’s been shown to be accurate for small sample sizes.
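As a minimal sketch of that comparison (the task times below are hypothetical, and `scipy` is assumed; `equal_var=False` gives the Welch variant, which does not assume equal variances):

```python
# Two-sample t-test on small samples of task times, in seconds.
from scipy import stats

design_a = [42, 55, 38, 60, 51]   # hypothetical task times, design A
design_b = [65, 70, 58, 84, 72]   # hypothetical task times, design B

# Welch's t-test: does not assume the two groups have equal variances
t, p = stats.ttest_ind(design_a, design_b, equal_var=False)
print(t, p)
```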

Comparing Two Proportions: If your data is binary (pass/fail, yes/no), then use the N-1 Two-Proportion Test. This is a variation on the better-known chi-square test (it is algebraically equivalent to the N-1 chi-square test). When expected cell counts fall below one, the Fisher exact test tends to perform better. The online calculator handles this for you, and we discuss the procedure in Chapter 5 of Quantifying the User Experience.
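A sketch of the N-1 adjustment: the usual pooled two-proportion z-statistic is scaled by √((N−1)/N), where N is the combined sample size. The completion counts below are hypothetical, and `scipy` is assumed for the normal tail probability.

```python
# N-1 two-proportion test: the pooled two-proportion z-statistic
# multiplied by sqrt((N - 1) / N).
import math
from scipy.stats import norm

def n1_two_proportion_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)          # pooled proportion
    N = n1 + n2
    z = (p1 - p2) * math.sqrt((N - 1) / N) / math.sqrt(
        p * (1 - p) * (1 / n1 + 1 / n2))
    return z, 2 * norm.sf(abs(z))      # two-sided p-value

# e.g. 11 of 12 users complete a task on design A vs. 5 of 12 on design B
z, p = n1_two_proportion_test(11, 12, 5, 12)
print(z, p)
```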

Confidence Intervals

When you want to know the plausible range for the user population from a sample of data, you’ll want to generate a confidence interval. While the confidence interval width will be rather wide (usually 20 to 30 percentage points), the upper or lower boundary of the interval can be very helpful in establishing how often something will occur in the total user population.

For example, if you wanted to know whether users would read a sheet that said “Read this first” when installing a printer, and six out of eight users didn’t read the sheet in an installation study, you’d know that at least 40% of all users would likely skip the sheet, a substantial proportion.

There are three approaches to computing confidence intervals, depending on whether your data is binary, task-time, or continuous.

Confidence interval around a mean: If your data is generally continuous (not binary), such as rating scales, order amounts in dollars, or the number of page views, the confidence interval is based on the t-distribution (which takes into account sample size).

Confidence interval around task-time: Task-time data is positively skewed, with a lower boundary of 0 seconds. It’s not uncommon for some users to take 10 to 20 times longer than other users to complete the same task. To handle this skew, the time data needs to be log-transformed; the confidence interval is computed on the log data, then transformed back when reporting. The online calculator handles all this.
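The log-transform procedure can be sketched as follows (the task times are hypothetical, and `scipy` is assumed for the t critical value):

```python
# t-based confidence interval on log task times, exponentiated back to seconds.
import math
import statistics
from scipy import stats

times = [40, 36, 53, 56, 110, 48, 34, 44, 30, 40]   # hypothetical task times (s)
logs = [math.log(t) for t in times]
n = len(logs)
mean = statistics.mean(logs)
sd = statistics.stdev(logs)                          # sample standard deviation
margin = stats.t.ppf(0.975, n - 1) * sd / math.sqrt(n)

# back-transform the log-scale interval to seconds
low, high = math.exp(mean - margin), math.exp(mean + margin)
print(low, high)
```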

Confidence interval around a binary measure: For an accurate confidence interval around binary measures like completion rate or yes/no questions, the Adjusted Wald interval performs well for all sample sizes.
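The Adjusted Wald interval adds roughly two successes and two failures (z²/2 of each) before computing the ordinary Wald interval. A sketch, applied to the printer-sheet example above (`scipy` assumed for the normal quantile):

```python
# Adjusted Wald confidence interval for a proportion.
import math
from scipy.stats import norm

def adjusted_wald(x, n, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    p_adj = (x + z * z / 2) / (n + z * z)   # adjusted proportion
    n_adj = n + z * z                       # adjusted sample size
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# 6 of 8 users did not read the "Read this first" sheet
low, high = adjusted_wald(6, 8)
print(low, high)   # the lower bound is roughly 40%
```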

Point Estimates (The Best Averages)

The “best” estimate for reporting an average time or average completion rate for any study may vary depending on the study goals. Keep in mind that even the “best” single estimate will still differ from the actual average, so using confidence intervals provides a better method for estimating the unknown population average.

For the best overall average for small sample sizes, we have two recommendations for task-time and completion rates, and a more general recommendation for all sample sizes for rating scales.

Completion Rate: For small-sample completion rates, there are only a few possible values for each task. For example, with five users attempting a task, the only possible outcomes are 0%, 20%, 40%, 60%, 80% and 100% success. It’s not uncommon to have 100% completion rates with five users. There’s something about reporting perfect success at this sample size that doesn’t resonate well. It sounds too good to be true.

We experimented [pdf] with several estimators at small sample sizes and found that the LaPlace estimator and the simple proportion (referred to as the Maximum Likelihood Estimator) generally work well for the usability test data we examined. When you want the best estimate, the calculator will generate it based on our findings.
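The LaPlace estimator is a one-liner: add one success and one failure to the observed counts. A minimal sketch:

```python
# LaPlace estimator for a small-sample completion rate:
# (successes + 1) / (n + 2).
def laplace_estimate(successes, n):
    return (successes + 1) / (n + 2)

# 5 of 5 users complete the task: report about 86% rather than a perfect 100%
print(laplace_estimate(5, 5))
```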

Rating Scales: Rating scales are a funny type of metric, in that most of them are bounded on both ends (e.g., 1 to 5, 1 to 7 or 1 to 10), unless you are Spinal Tap of course. For small and large sample sizes, we’ve found reporting the mean to be a better average than the median [pdf]. There are in fact many ways to report the scores from rating scales, including top-two boxes. The one you report depends on both the sensitivity of the metric and what’s used in your organization.

Average Time: One long task time can skew the arithmetic mean and make it a poor measure of the middle. In such situations, the median is a better indicator of the typical or “average” time. Unfortunately, the median tends to be less accurate and more biased than the mean when sample sizes are less than about 25. In these circumstances, the geometric mean (the average of the log values, transformed back) tends to be a better measure of the middle. When sample sizes get above 25, the median works fine.
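The geometric mean computation can be sketched in a few lines (the task times are hypothetical; note how one 110-second time pulls the arithmetic mean up):

```python
# Geometric mean of task times: average the log values, then transform back.
import math

times = [40, 36, 53, 56, 110, 48, 34, 44, 30, 40]   # hypothetical task times (s)
geo_mean = math.exp(sum(math.log(t) for t in times) / len(times))
arith_mean = sum(times) / len(times)

# the geometric mean sits below the arithmetic mean, which the long
# 110-second time has pulled upward
print(geo_mean, arith_mean)
```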


8.4 Small Sample Tests for a Population Mean

Learning Objective

  • To learn how to apply the five-step test procedure for tests of hypotheses concerning a population mean when the sample size is small.

In the previous section hypothesis testing for population means was described in the case of large samples. The statistical validity of the tests was ensured by the Central Limit Theorem, with essentially no assumptions on the distribution of the population. When sample sizes are small, as is often the case in practice, the Central Limit Theorem does not apply. One must then impose stricter assumptions on the population to give statistical validity to the test procedure. One common assumption is that the population from which the sample is taken has a normal probability distribution to begin with. Under such circumstances, if the population standard deviation is known, then the test statistic (x̄ − μ0) ∕ (σ∕√n) still has the standard normal distribution, as in the previous two sections. If σ is unknown and is approximated by the sample standard deviation s, then the resulting test statistic (x̄ − μ0) ∕ (s∕√n) follows Student’s t-distribution with n − 1 degrees of freedom.
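The second statistic is easy to compute directly. A sketch, using the summary values from the tennis-racket example later in this section (x̄ = 169, μ0 = 179, s = 10.39, n = 5):

```python
# Small-sample test statistic for a population mean when sigma is unknown:
# t = (xbar - mu0) / (s / sqrt(n)), with n - 1 degrees of freedom.
import math

def t_statistic(xbar, mu0, s, n):
    return (xbar - mu0) / (s / math.sqrt(n))

t = t_statistic(169, 179, 10.39, 5)
print(t)   # about -2.152
```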

Standardized Test Statistics for Small Sample Hypothesis Tests Concerning a Single Population Mean

The first test statistic ( σ known) has the standard normal distribution.

The second test statistic ( σ unknown) has Student’s t -distribution with n − 1 degrees of freedom.

The population must be normally distributed.

The distribution of the second standardized test statistic (the one containing s) and the corresponding rejection region for each form of the alternative hypothesis (left-tailed, right-tailed, or two-tailed) are shown in Figure 8.11 "Distribution of the Standardized Test Statistic and the Rejection Region". This is just like Figure 8.4 "Distribution of the Standardized Test Statistic and the Rejection Region", except that now the critical values are from the t-distribution. Figure 8.4 still applies to the first standardized test statistic (the one containing σ), since it follows the standard normal distribution.

Figure 8.11 Distribution of the Standardized Test Statistic and the Rejection Region


The p-value of a test of hypotheses for which the test statistic has Student’s t-distribution can be computed using statistical software, but it is impractical to do so using tables, since that would require 30 tables analogous to Figure 12.2 "Cumulative Normal Probability", one for each degree of freedom from 1 to 30. Figure 12.3 "Critical Values of " can be used to approximate the p-value of such a test, and this is typically adequate for making a decision using the p-value approach to hypothesis testing, although not always. For this reason the tests in the two examples in this section will be made following the critical value approach to hypothesis testing summarized at the end of Section 8.1 "The Elements of Hypothesis Testing", but after each one we will show how the p-value approach could have been used.

The price of a popular tennis racket at a national chain store is $179. Portia bought five of the same racket at an online auction site for the following prices:

Assuming that the auction prices of rackets are normally distributed, determine whether there is sufficient evidence in the sample, at the 5% level of significance, to conclude that the average price of the racket is less than $179 if purchased at an online auction.

Step 1. The assertion for which evidence must be provided is that the average online price μ is less than the average price in retail stores, so the hypothesis test is

H0: μ = 179 vs. Ha: μ < 179 @ α = 0.05

Step 2. The sample is small and the population standard deviation is unknown. Thus the test statistic is

T = (x̄ − μ0) ∕ (s∕√n)

and has the Student t-distribution with n − 1 = 5 − 1 = 4 degrees of freedom.

Step 3. From the data we compute x̄ = 169 and s = 10.39. Inserting these values into the formula for the test statistic gives T = (169 − 179) ∕ (10.39∕√5) ≈ −2.152.

Step 4. Since the symbol in Ha is “<”, this is a left-tailed test, so there is a single critical value, −tα = −t0.05[df = 4]. Reading from the row labeled df = 4 in Figure 12.3 "Critical Values of ", its value is −2.132. The rejection region is (−∞, −2.132].

Step 5. As shown in Figure 8.12 "Rejection Region and Test Statistic for " the test statistic falls in the rejection region. The decision is to reject H 0 . In the context of the problem our conclusion is:

The data provide sufficient evidence, at the 5% level of significance, to conclude that the average price of such rackets purchased at online auctions is less than $179.

Figure 8.12 Rejection Region and Test Statistic for Note 8.42 "Example 10"


To perform the test in Note 8.42 "Example 10" using the p-value approach, look in the row in Figure 12.3 "Critical Values of " with the heading df = 4 and search for the two t-values that bracket the unsigned value 2.152 of the test statistic. They are 2.132 and 2.776, in the columns with headings t0.050 and t0.025. They cut off right tails of area 0.050 and 0.025, so because 2.152 is between them it must cut off a tail of area between 0.050 and 0.025. By symmetry, −2.152 cuts off a left tail of area between 0.050 and 0.025, hence the p-value corresponding to t = −2.152 is between 0.025 and 0.05. Although its precise value is unknown, it must be less than α = 0.05, so the decision is to reject H0.
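Both the critical value and the exact p-value for this example can be checked with software (a sketch, assuming `scipy` is available):

```python
# Example 10: left-tailed t-test with t = -2.152, df = 4, alpha = 0.05.
from scipy import stats

df, alpha = 4, 0.05
t_crit = -stats.t.ppf(1 - alpha, df)   # left-tail critical value
p_value = stats.t.cdf(-2.152, df)      # area in the left tail

# t_crit is about -2.132; the p-value lands between 0.025 and 0.05
print(t_crit, p_value)
```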

A small component in an electronic device has two small holes where another tiny part is fitted. In the manufacturing process the average distance between the two holes must be tightly controlled at 0.02 mm, else many units would be defective and wasted. Many times throughout the day quality control engineers take a small sample of the components from the production line, measure the distance between the two holes, and make adjustments if needed. Suppose at one time four units are taken and the distances are measured as

Determine, at the 1% level of significance, if there is sufficient evidence in the sample to conclude that an adjustment is needed. Assume the distances of interest are normally distributed.

Step 1. The assumption is that the process is under control unless there is strong evidence to the contrary. Since a deviation of the average distance to either side is undesirable, the relevant test is

H0: μ = 0.02 vs. Ha: μ ≠ 0.02 @ α = 0.01,

where μ denotes the mean distance between the holes.

Step 2. The sample is small and the population standard deviation is unknown. Thus the test statistic is T = (x̄ − μ0) ∕ (s∕√n), and has the Student t-distribution with n − 1 = 4 − 1 = 3 degrees of freedom.

Step 3. From the data we compute x̄ = 0.02075 and s = 0.00171. Inserting these values into the formula for the test statistic gives T = (0.02075 − 0.02) ∕ (0.00171∕√4) ≈ 0.877.

Step 4. Since the symbol in Ha is “≠”, this is a two-tailed test, so there are two critical values, ±tα∕2 = ±t0.005[df = 3]. Reading from the row in Figure 12.3 "Critical Values of " labeled df = 3, their values are ±5.841. The rejection region is (−∞, −5.841] ∪ [5.841, ∞).

Step 5. As shown in Figure 8.13 "Rejection Region and Test Statistic for " the test statistic does not fall in the rejection region. The decision is not to reject H 0 . In the context of the problem our conclusion is:

The data do not provide sufficient evidence, at the 1% level of significance, to conclude that the mean distance between the holes in the component differs from 0.02 mm.

Figure 8.13 Rejection Region and Test Statistic for Note 8.43 "Example 11"


To perform the test in Note 8.43 "Example 11" using the p-value approach, look in the row in Figure 12.3 "Critical Values of " with the heading df = 3 and search for the two t-values that bracket the value 0.877 of the test statistic. In fact 0.877 is smaller than the smallest number in the row, 0.978, in the column with heading t0.200. The value 0.978 cuts off a right tail of area 0.200, so because 0.877 is to its left it must cut off a tail of area greater than 0.200. Thus the p-value, which is double the area cut off (since the test is two-tailed), is greater than 0.400. Although its precise value is unknown, it must be greater than α = 0.01, so the decision is not to reject H0.
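The exact two-sided p-value for this example can likewise be computed directly (a sketch, assuming `scipy` is available):

```python
# Example 11: two-tailed t-test with t = 0.877, df = 3.
from scipy import stats

# two-sided p-value: double the area in the upper tail beyond 0.877
p_value = 2 * stats.t.sf(0.877, 3)
print(p_value)   # greater than 0.400, far above alpha = 0.01
```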

Key Takeaways

  • There are two formulas for the test statistic in testing hypotheses about a population mean with small samples. One test statistic follows the standard normal distribution, the other Student’s t -distribution.
  • The population standard deviation is used if it is known, otherwise the sample standard deviation is used.
  • Either five-step procedure, critical value or p -value approach, is used with either test statistic.

Find the rejection region (for the standardized test statistic) for each hypothesis test based on the information given. The population is normally distributed.

  • H 0 : μ = 27 vs. H a : μ < 27 @ α = 0.05 , n = 12, σ = 2.2.
  • H 0 : μ = 52 vs. H a : μ ≠ 52 @ α = 0.05 , n = 6, σ unknown.
  • H 0 : μ = − 105 vs. H a : μ > − 105 @ α = 0.10 , n = 24, σ unknown.
  • H 0 : μ = 78.8 vs. H a : μ ≠ 78.8 @ α = 0.10 , n = 8, σ = 1.7.
  • H 0 : μ = 17 vs. H a : μ < 17 @ α = 0.01 , n = 26, σ = 0.94.
  • H 0 : μ = 880 vs. H a : μ ≠ 880 @ α = 0.01 , n = 4, σ unknown.
  • H 0 : μ = − 12 vs. H a : μ > − 12 @ α = 0.05 , n = 18, σ = 1.1.
  • H 0 : μ = 21.1 vs. H a : μ ≠ 21.1 @ α = 0.05 , n = 23, σ unknown.

Find the rejection region (for the standardized test statistic) for each hypothesis test based on the information given. The population is normally distributed. Identify the test as left-tailed, right-tailed, or two-tailed.

  • H 0 : μ = 141 vs. H a : μ < 141 @ α = 0.20 , n = 29, σ unknown.
  • H 0 : μ = − 54 vs. H a : μ < − 54 @ α = 0.05 , n = 15, σ = 1.9.
  • H 0 : μ = 98.6 vs. H a : μ ≠ 98.6 @ α = 0.05 , n = 12, σ unknown.
  • H 0 : μ = 3.8 vs. H a : μ > 3.8 @ α = 0.001 , n = 27, σ unknown.
  • H 0 : μ = − 62 vs. H a : μ ≠ − 62 @ α = 0.005 , n = 8, σ unknown.
  • H 0 : μ = 73 vs. H a : μ > 73 @ α = 0.001 , n = 22, σ unknown.
  • H 0 : μ = 1124 vs. H a : μ < 1124 @ α = 0.001 , n = 21, σ unknown.
  • H 0 : μ = 0.12 vs. H a : μ ≠ 0.12 @ α = 0.001 , n = 14, σ = 0.026.

A random sample of size 20 drawn from a normal population yielded the following results: x̄ = 49.2, s = 1.33.

  • Test H 0 : μ = 50 vs. H a : μ ≠ 50 @ α = 0.01 .
  • Estimate the observed significance of the test in part (a) and state a decision based on the p -value approach to hypothesis testing.

A random sample of size 16 drawn from a normal population yielded the following results: x̄ = −0.96, s = 1.07.

  • Test H 0 : μ = 0 vs. H a : μ < 0 @ α = 0.001 .

A random sample of size 8 drawn from a normal population yielded the following results: x̄ = 289, s = 46.

  • Test H 0 : μ = 250 vs. H a : μ > 250 @ α = 0.05 .

A random sample of size 12 drawn from a normal population yielded the following results: x̄ = 86.2, s = 0.63.

  • Test H 0 : μ = 85.5 vs. H a : μ ≠ 85.5 @ α = 0.01 .

Applications

Researchers wish to test the efficacy of a program intended to reduce the length of labor in childbirth. The accepted mean labor time in the birth of a first child is 15.3 hours. The mean length of the labors of 13 first-time mothers in a pilot program was 8.8 hours with standard deviation 3.1 hours. Assuming a normal distribution of times of labor, test, at the 10% level of significance, whether the mean labor time for all women following this program is less than 15.3 hours.

A dairy farm uses the somatic cell count (SCC) report on the milk it provides to a processor as one way to monitor the health of its herd. The mean SCC from five samples of raw milk was 250,000 cells per milliliter with standard deviation 37,500 cells/ml. Test whether these data provide sufficient evidence, at the 10% level of significance, to conclude that the mean SCC of all milk produced at the dairy exceeds that in the previous report, 210,250 cells/ml. Assume a normal distribution of SCC.

Six coins of the same type are discovered at an archaeological site. If their weights on average are significantly different from 5.25 grams then it can be assumed that their provenance is not the site itself. The coins are weighed and have mean 4.73 g with sample standard deviation 0.18 g. Perform the relevant test at the 0.1% (1/10th of 1%) level of significance, assuming a normal distribution of weights of all such coins.

An economist wishes to determine whether people are driving less than in the past. In one region of the country the number of miles driven per household per year in the past was 18.59 thousand miles. A sample of 15 households produced a sample mean of 16.23 thousand miles for the last year, with sample standard deviation 4.06 thousand miles. Assuming a normal distribution of household driving distances per year, perform the relevant test at the 5% level of significance.

The recommended daily allowance of iron for females aged 19–50 is 18 mg/day. A careful measurement of the daily iron intake of 15 women yielded a mean daily intake of 16.2 mg with sample standard deviation 4.7 mg.

  • Assuming that daily iron intake in women is normally distributed, perform the test that the actual mean daily intake for all women is different from 18 mg/day, at the 10% level of significance.
  • The sample mean is less than 18, suggesting that the actual population mean is less than 18 mg/day. Perform this test, also at the 10% level of significance. (The computation of the test statistic done in part (a) still applies here.)

The target temperature for a hot beverage the moment it is dispensed from a vending machine is 170°F. A sample of ten randomly selected servings from a new machine undergoing a pre-shipment inspection gave mean temperature 173°F with sample standard deviation 6.3°F.

  • Assuming that temperature is normally distributed, perform the test that the mean temperature of dispensed beverages is different from 170°F, at the 10% level of significance.
  • The sample mean is greater than 170, suggesting that the actual population mean is greater than 170°F. Perform this test, also at the 10% level of significance. (The computation of the test statistic done in part (a) still applies here.)

The average number of days to complete recovery from a particular type of knee operation is 123.7 days. From his experience a physician suspects that use of a topical pain medication might be lengthening the recovery time. He randomly selects the records of seven knee surgery patients who used the topical medication. The times to total recovery were:

  • Assuming a normal distribution of recovery times, perform the relevant test of hypotheses at the 10% level of significance.
  • Would the decision be the same at the 5% level of significance? Answer either by constructing a new rejection region (critical value approach) or by estimating the p -value of the test in part (a) and comparing it to α .

A 24-hour advance prediction of a day’s high temperature is “unbiased” if the long-term average of the error in prediction (true high temperature minus predicted high temperature) is zero. The errors in predictions made by one meteorological station for 20 randomly selected days were:

  • Assuming a normal distribution of errors, test the null hypothesis that the predictions are unbiased (the mean of the population of all errors is 0) versus the alternative that it is biased (the population mean is not 0), at the 1% level of significance.
  • Would the decision be the same at the 5% level of significance? The 10% level of significance? Answer either by constructing new rejection regions (critical value approach) or by estimating the p -value of the test in part (a) and comparing it to α .

Pasteurized milk may not have a standardized plate count (SPC) above 20,000 colony-forming bacteria per milliliter (cfu/ml). The mean SPC for five samples was 21,500 cfu/ml with sample standard deviation 750 cfu/ml. Test the null hypothesis that the mean SPC for this milk is 20,000 versus the alternative that it is greater than 20,000, at the 10% level of significance. Assume that the SPC follows a normal distribution.

One water quality standard for water that is discharged into a particular type of stream or pond is that the average daily water temperature be at most 18°C. Six samples taken throughout the day gave the data:

The sample mean x̄ = 18.15 exceeds 18, but perhaps this is only sampling error. Determine whether the data provide sufficient evidence, at the 10% level of significance, to conclude that the mean temperature for the entire day exceeds 18°C.

Additional Exercises

A calculator has a built-in algorithm for generating a random number according to the standard normal distribution. Twenty-five numbers thus generated have mean 0.15 and sample standard deviation 0.94. Test the null hypothesis that the mean of all numbers so generated is 0 versus the alternative that it is different from 0, at the 20% level of significance. Assume that the numbers do follow a normal distribution.

At every setting a high-speed packing machine delivers a product in amounts that vary from container to container with a normal distribution of standard deviation 0.12 ounce. To compare the amount delivered at the current setting to the desired amount of 64.1 ounces, a quality inspector randomly selects five containers and measures the contents of each, obtaining sample mean 63.9 ounces and sample standard deviation 0.10 ounce. Test whether the data provide sufficient evidence, at the 5% level of significance, to conclude that the mean of all containers at the current setting is less than 64.1 ounces.

A manufacturing company receives a shipment of 1,000 bolts of nominal shear strength 4,350 lb. A quality control inspector selects five bolts at random and measures the shear strength of each. The data are:

  • Assuming a normal distribution of shear strengths, test the null hypothesis that the mean shear strength of all bolts in the shipment is 4,350 lb versus the alternative that it is less than 4,350 lb, at the 10% level of significance.
  • Estimate the p -value (observed significance) of the test of part (a).
  • Compare the p -value found in part (b) to α = 0.10 and make a decision based on the p -value approach. Explain fully.

A literary historian examines a newly discovered document possibly written by Oberon Theseus. The mean average sentence length of the surviving undisputed works of Oberon Theseus is 48.72 words. The historian counts words in sentences between five successive periods in the document in question to obtain a mean average sentence length of 39.46 words with standard deviation 7.45 words. (Thus the sample size is five.)

  • Determine if these data provide sufficient evidence, at the 1% level of significance, to conclude that the mean average sentence length in the document is less than 48.72.
  • Estimate the p -value of the test.
  • Based on the answers to parts (a) and (b), state whether or not it is likely that the document was written by Oberon Theseus.
  • Z ≤ −1.645
  • T ≤ −2.571 or T ≥ 2.571
  • Z ≤ −1.645 or Z ≥ 1.645
  • T ≤ −0.855
  • T ≤ −2.201 or T ≥ 2.201
  • T = −2.690, df = 19, −t0.005 = −2.861, do not reject H0.
  • 0.01 < p-value < 0.02, α = 0.01, do not reject H0.
  • T = 2.398, df = 7, t0.05 = 1.895, reject H0.
  • 0.01 < p-value < 0.025, α = 0.05, reject H0.

T = −7.560, df = 12, −t0.10 = −1.356, reject H0.

T = −7.076, df = 5, −t0.0005 = −6.869, reject H0.

  • T = −1.483, df = 14, −t0.05 = −1.761, do not reject H0;
  • T = −1.483, df = 14, −t0.10 = −1.345, reject H0;
  • T = 2.069, df = 6, t0.10 = 1.44, reject H0;
  • T = 2.069, df = 6, t0.05 = 1.943, reject H0.

T = 4.472, df = 4, t0.10 = 1.533, reject H0.

T = 0.798, df = 24, t0.10 = 1.318, do not reject H0.

  • T = −1.773, df = 4, −t0.05 = −2.132, do not reject H0.
  • 0.05 < p-value < 0.10
  • α = 0.05, do not reject H0


Hypothesis Testing for Means & Proportions


The procedure for hypothesis testing is based on the ideas described above. Specifically, we set up competing hypotheses, select a random sample from the population of interest and compute summary statistics. We then determine whether the sample data supports the null or alternative hypotheses. The procedure can be broken down into the following five steps.  

  • Step 1. Set up hypotheses and select the level of significance α.

H 0 : Null hypothesis (no change, no difference);  

H 1 : Research hypothesis (investigator's belief); α =0.05

 

Upper-tailed, Lower-tailed, Two-tailed Tests

The research or alternative hypothesis can take one of three forms. An investigator might believe that the parameter has increased, decreased or changed. For example, an investigator might hypothesize:  

H1: μ > μ0, where μ0 is the comparator or null value (e.g., μ0 = 191 in our example about weight in men in 2006) and an increase is hypothesized - this type of test is called an upper-tailed test; H1: μ < μ0, where a decrease is hypothesized and this is called a lower-tailed test; or H1: μ ≠ μ0, where a difference is hypothesized and this is called a two-tailed test.

The exact form of the research hypothesis depends on the investigator's belief about the parameter of interest and whether it has possibly increased, decreased or is different from the null value. The research hypothesis is set up by the investigator before any data are collected.

 

  • Step 2. Select the appropriate test statistic.  

The test statistic is a single number that summarizes the sample information. An example of a test statistic is the Z statistic computed as follows:

Z = (x̄ − μ0) / (s / √n)

When the sample size is small, we will use t statistics (just as we did when constructing confidence intervals for small samples). As we present each scenario, alternative test statistics are provided along with conditions for their appropriate use.
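The Z statistic above can be sketched as a small function. The numbers in the usage line are hypothetical (chosen only for illustration), not data from the module:

```python
import math

def z_statistic(xbar, mu0, s, n):
    """One-sample Z statistic: how many standard errors the sample
    mean lies from the null value mu0."""
    return (xbar - mu0) / (s / math.sqrt(n))

# Hypothetical inputs: sample mean 103.29, null mean 100, sd 16, n = 64
z = z_statistic(103.29, 100, 16, 64)  # 1.645 standard errors above the null
```

For small samples, the same formula is computed but compared against a t distribution with n − 1 degrees of freedom instead of the standard normal.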

  • Step 3.  Set up decision rule.  

The decision rule is a statement that tells us under what circumstances to reject the null hypothesis. The decision rule is based on specific values of the test statistic (e.g., reject H0 if Z > 1.645). The decision rule for a specific test depends on three factors: the research or alternative hypothesis, the test statistic and the level of significance. Each is discussed below.

  • The decision rule depends on whether an upper-tailed, lower-tailed, or two-tailed test is proposed. In an upper-tailed test the decision rule has investigators reject H 0 if the test statistic is larger than the critical value. In a lower-tailed test the decision rule has investigators reject H 0 if the test statistic is smaller than the critical value.  In a two-tailed test the decision rule has investigators reject H 0 if the test statistic is extreme, either larger than an upper critical value or smaller than a lower critical value.
  • The exact form of the test statistic is also important in determining the decision rule. If the test statistic follows the standard normal distribution (Z), then the decision rule will be based on the standard normal distribution. If the test statistic follows the t distribution, then the decision rule will be based on the t distribution. The appropriate critical value will be selected from the t distribution again depending on the specific alternative hypothesis and the level of significance.  
  • The third factor is the level of significance. The level of significance which is selected in Step 1 (e.g., α =0.05) dictates the critical value.   For example, in an upper tailed Z test, if α =0.05 then the critical value is Z=1.645.  

The following figures illustrate the rejection regions defined by the decision rule for upper-, lower- and two-tailed Z tests with α=0.05. Notice that the rejection regions are in the upper, lower and both tails of the curves, respectively. The decision rules are written below each figure.

Rejection Region for Upper-Tailed Z Test (H1: μ > μ0) with α = 0.05

The decision rule is: Reject H0 if Z > 1.645.

 

 

Critical values for upper-tailed Z tests:

α        Z
0.10     1.282
0.05     1.645
0.025    1.960
0.010    2.326
0.005    2.576
0.001    3.090
0.0001   3.719

[Figure: standard normal distribution with lower-tail rejection region at −1.645, α = 0.05]

Rejection Region for Lower-Tailed Z Test (H1: μ < μ0) with α = 0.05

The decision rule is: Reject H0 if Z < −1.645.

Critical values for lower-tailed Z tests:

α        Z
0.10     −1.282
0.05     −1.645
0.025    −1.960
0.010    −2.326
0.005    −2.576
0.001    −3.090
0.0001   −3.719

[Figure: standard normal distribution with two-tailed rejection regions]

Rejection Region for Two-Tailed Z Test (H1: μ ≠ μ0) with α = 0.05

The decision rule is: Reject H 0 if Z < -1.960 or if Z > 1.960.

Critical values for two-tailed Z tests:

α        Z
0.20     1.282
0.10     1.645
0.05     1.960
0.010    2.576
0.001    3.291
0.0001   3.819

The complete table of critical values of Z for upper-, lower- and two-tailed tests can be found in a standard Z table.

Critical values of t for upper-, lower- and two-tailed tests can be found in a standard t table.
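In place of tables, the critical values above can be reproduced with scipy's inverse-CDF (percent point) functions. A brief sketch, assuming scipy is available:

```python
from scipy import stats

alpha = 0.05
upper = stats.norm.ppf(1 - alpha)        # 1.645: upper-tailed critical value
lower = stats.norm.ppf(alpha)            # -1.645: lower-tailed critical value
two_tail = stats.norm.ppf(1 - alpha / 2) # 1.960: two-tailed (reject if |Z| exceeds)

# The same idea gives t critical values, e.g. with 24 degrees of freedom:
t_upper = stats.t.ppf(1 - alpha, df=24)  # 1.711
```

Each entry in the tables above corresponds to `stats.norm.ppf` evaluated at 1 − α (upper tail), α (lower tail), or 1 − α/2 (two-tailed).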

  • Step 4. Compute the test statistic.  

Here we compute the test statistic by substituting the observed sample data into the test statistic identified in Step 2.

  • Step 5. Conclusion.  

The final conclusion is made by comparing the test statistic (which is a summary of the information observed in the sample) to the decision rule. The final conclusion will be either to reject the null hypothesis (because the sample data are very unlikely if the null hypothesis is true) or not to reject the null hypothesis (because the sample data are not very unlikely).  

If the null hypothesis is rejected, then an exact significance level is computed to describe the likelihood of observing the sample data assuming that the null hypothesis is true. The exact level of significance is called the p-value and it will be less than the chosen level of significance if we reject H 0 .

Statistical computing packages provide exact p-values as part of their standard output for hypothesis tests. In fact, when using a statistical computing package, the steps outlined above can be abbreviated. The hypotheses (step 1) should always be set up in advance of any analysis and the significance criterion should also be determined (e.g., α = 0.05). Statistical computing packages will produce the test statistic (usually reporting the test statistic as t) and a p-value. The investigator can then determine statistical significance using the following: If p < α, then reject H0.

 

 

  • Step 1. Set up hypotheses and determine level of significance

H0: μ = 191;   H1: μ > 191;   α = 0.05

The research hypothesis is that weights have increased, and therefore an upper tailed test is used.

  • Step 2. Select the appropriate test statistic.

Because the sample size is large (n > 30), the appropriate test statistic is

Z = (x̄ − μ0) / (s / √n)

  • Step 3. Set up decision rule.  

In this example, we are performing an upper-tailed test (H1: μ > 191), with a Z test statistic and selected α = 0.05. Reject H0 if Z > 1.645.

  • Step 4. Compute the test statistic.

We now substitute the sample data into the formula for the test statistic identified in Step 2.

  • Step 5. Conclusion.

We reject H0 because 2.38 > 1.645. We have statistically significant evidence, at α = 0.05, to show that the mean weight in men in 2006 is more than 191 pounds.

Because we rejected the null hypothesis, we now approximate the p-value, which is the likelihood of observing the sample data if the null hypothesis is true. An alternative definition of the p-value is the smallest level of significance at which we can still reject H0. In this example, we observed Z = 2.38 and for α = 0.05 the critical value was 1.645. Because 2.38 exceeded 1.645, we rejected H0 and reported a statistically significant increase in mean weight at a 5% level of significance.

Using the table of critical values for upper-tailed tests, we can approximate the p-value. If we select α = 0.025, the critical value is 1.960, and we still reject H0 because 2.38 > 1.960. If we select α = 0.010, the critical value is 2.326, and we still reject H0 because 2.38 > 2.326. However, if we select α = 0.005, the critical value is 2.576, and we cannot reject H0 because 2.38 < 2.576. Therefore, the smallest α at which we still reject H0 is 0.010; this is the p-value. A statistical computing package would produce a more precise p-value, which would be between 0.005 and 0.010. Here we are approximating the p-value and would report p < 0.010.
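The p-value approximated above from the table can be checked directly from the observed Z statistic. A sketch using scipy (the module itself works from tables):

```python
from scipy import stats

z = 2.38                   # observed test statistic from the example
p = 1 - stats.norm.cdf(z)  # upper-tailed p-value: P(Z >= 2.38)

# p is about 0.0087, consistent with the table-based bracket 0.005 < p < 0.010
```

This confirms the reported "p < 0.010" and shows why the test rejects at α = 0.05 and α = 0.01 but not at α = 0.005.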

In all tests of hypothesis, there are two types of errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H 0 when in fact it is true. This is also called a false positive result (as we incorrectly conclude that the research hypothesis is true when in fact it is not). When we run a test of hypothesis and decide to reject H 0 (e.g., because the test statistic exceeds the critical value in an upper tailed test) then either we make a correct decision because the research hypothesis is true or we commit a Type I error. The different conclusions are summarized in the table below. Note that we will never know whether the null hypothesis is really true or false (i.e., we will never know which row of the following table reflects reality).

Table - Conclusions in Test of Hypothesis

                 Do Not Reject H0     Reject H0
H0 is True       Correct Decision     Type I Error
H0 is False      Type II Error        Correct Decision

In the first step of the hypothesis test, we select a level of significance, α, and α= P(Type I error). Because we purposely select a small value for α, we control the probability of committing a Type I error. For example, if we select α=0.05, and our test tells us to reject H 0 , then there is a 5% probability that we commit a Type I error. Most investigators are very comfortable with this and are confident when rejecting H 0 that the research hypothesis is true (as it is the more likely scenario when we reject H 0 ).

When we run a test of hypothesis and decide not to reject H 0 (e.g., because the test statistic is below the critical value in an upper tailed test) then either we make a correct decision because the null hypothesis is true or we commit a Type II error. Beta (β) represents the probability of a Type II error and is defined as follows: β=P(Type II error) = P(Do not Reject H 0 | H 0 is false). Unfortunately, we cannot choose β to be small (e.g., 0.05) to control the probability of committing a Type II error because β depends on several factors including the sample size, α, and the research hypothesis. When we do not reject H 0 , it may be very likely that we are committing a Type II error (i.e., failing to reject H 0 when in fact it is false). Therefore, when tests are run and the null hypothesis is not rejected we often make a weak concluding statement allowing for the possibility that we might be committing a Type II error. If we do not reject H 0 , we conclude that we do not have significant evidence to show that H 1 is true. We do not conclude that H 0 is true.


 The most common reason for a Type II error is a small sample size.


Content ©2017. All Rights Reserved. Date last modified: November 6, 2017. Wayne W. LaMorte, MD, PhD, MPH


Indian J Psychol Med, v.42(1); Jan-Feb 2020

Sample Size and its Importance in Research

Chittaranjan Andrade

Clinical Psychopharmacology Unit, Department of Clinical Psychopharmacology and Neurotoxicology, National Institute of Mental Health and Neurosciences, Bengaluru, Karnataka, India

The sample size for a study needs to be estimated at the time the study is proposed; too large a sample is unnecessary and unethical, and too small a sample is unscientific and also unethical. The necessary sample size can be calculated, using statistical software, based on certain assumptions. If no assumptions can be made, then an arbitrary sample size is set for a pilot study. This article discusses sample size and how it relates to matters such as ethics, statistical power, the primary and secondary hypotheses in a study, and findings from larger vs. smaller samples.

Studies are conducted on samples because it is usually impossible to study the entire population. Conclusions drawn from samples are intended to be generalized to the population, and sometimes to the future as well. The sample must therefore be representative of the population. This is best ensured by the use of proper methods of sampling. The sample must also be adequate in size – in fact, no more and no less.

SAMPLE SIZE AND ETHICS

A sample that is larger than necessary will be better representative of the population and will hence provide more accurate results. However, beyond a certain point, the increase in accuracy will be small and hence not worth the effort and expense involved in recruiting the extra patients. Furthermore, an overly large sample would inconvenience more patients than might be necessary for the study objectives; this is unethical. In contrast, a sample that is smaller than necessary would have insufficient statistical power to answer the primary research question, and a statistically nonsignificant result could merely be because of inadequate sample size (Type 2 or false negative error). Thus, a small sample could result in the patients in the study being inconvenienced with no benefit to future patients or to science. This is also unethical.

In this regard, inconvenience to patients refers to the time that they spend in clinical assessments and to the psychological and physical discomfort that they experience in assessments such as interviews, blood sampling, and other procedures.

ESTIMATING SAMPLE SIZE

So how large should a sample be? In hypothesis testing studies, this is mathematically calculated, conventionally, as the sample size necessary to be 80% certain of identifying a statistically significant outcome should the hypothesis be true for the population, with P for statistical significance set at 0.05. Some investigators power their studies for 90% instead of 80%, and some set the threshold for significance at 0.01 rather than 0.05. Both choices are uncommon because the necessary sample size becomes large, and the study becomes more expensive and more difficult to conduct. Many investigators increase the sample size by 10%, or by whatever proportion they can justify, to compensate for expected dropout, incomplete records, biological specimens that do not meet laboratory requirements for testing, and other study-related problems.

Sample size calculations require assumptions about expected means and standard deviations, or event risks, in different groups, or about expected effect sizes. For example, a study may be powered to detect an effect size of 0.5, or a response rate of 60% with drug vs. 40% with placebo.[ 1 ] When no guesstimates or expectations are possible, pilot studies are conducted on a sample that is arbitrary in size but what might be considered reasonable for the field.

The sample size may need to be larger in multicenter studies because of statistical noise (due to variations in patient characteristics, nonspecific treatment characteristics, rating practices, environments, etc. between study centers).[ 2 ] Sample size calculations can be performed manually or using statistical software; online calculators that provide free service can easily be identified by search engines. G*Power is an example of a free, downloadable program for sample size estimation. The manual and tutorial for G*Power can also be downloaded.

PRIMARY AND SECONDARY ANALYSES

The sample size is calculated for the primary hypothesis of the study. What is the difference between the primary hypothesis, primary outcome and primary outcome measure? As an example, the primary outcome may be a reduction in the severity of depression, the primary outcome measure may be the Montgomery-Asberg Depression Rating Scale (MADRS) and the primary hypothesis may be that reduction in MADRS scores is greater with the drug than with placebo. The primary hypothesis is tested in the primary analysis.

Studies almost always have many hypotheses; for example, that the study drug will outperform placebo on measures of depression, suicidality, anxiety, disability and quality of life. The sample size necessary for adequate statistical power to test each of these hypotheses will be different. Because a study can have only one sample size, it can be powered for only one outcome, the primary outcome. Therefore, the study would be either overpowered or underpowered for the other outcomes. These outcomes are therefore called secondary outcomes, and are associated with secondary hypotheses, and are tested in secondary analyses. Secondary analyses are generally considered exploratory because when many hypotheses in a study are each tested at a P < 0.05 level for significance, some may emerge statistically significant by chance (Type 1 or false positive errors).[ 3 ]
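The multiplicity problem described above can be quantified: if k independent hypotheses are each tested at P < 0.05, the chance of at least one false positive grows quickly with k. A minimal sketch of this calculation (assuming independent tests, which real secondary outcomes rarely are exactly):

```python
# Familywise probability of at least one Type 1 (false positive) error
# when k independent hypotheses are each tested at alpha = 0.05
alpha, k = 0.05, 10
fwer = 1 - (1 - alpha) ** k  # about 0.401: a 40% chance of a chance finding
```

With ten secondary analyses, a "significant" result is as likely as not to appear somewhere by chance alone, which is why such analyses are treated as exploratory.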

INTERPRETING RESULTS

Here is an interesting question. A test of the primary hypothesis yielded a P value of 0.07. Might we conclude that our sample was underpowered for the study and that, had our sample been larger, we would have identified a significant result? No! The reason is that larger samples will more accurately represent the population value, whereas smaller samples could be off the mark in either direction – towards or away from the population value. In this context, readers should also note that no matter how small the P value for an estimate is, the population value of that estimate remains the same.[ 4 ]

On a parting note, it is unlikely that population values will be null. That is, for example, that the response rate to the drug will be exactly the same as that to placebo, or that the correlation between height and age at onset of schizophrenia will be zero. If the sample size is large enough, even such small differences between groups, or trivial correlations, would be detected as being statistically significant. This does not mean that the findings are clinically significant.

Financial support and sponsorship

Conflicts of interest

There are no conflicts of interest.


25.3 - Calculating Sample Size

Before we learn how to calculate the sample size that is necessary to achieve a hypothesis test with a certain power, it might behoove us to understand the effect that sample size has on power. Let's investigate by returning to our IQ example.

Example 25-3

Let \(X\) denote the IQ of a randomly selected adult American. Assume, a bit unrealistically again, that \(X\) is normally distributed with unknown mean \(\mu\) and (a strangely known) standard deviation of 16. This time, instead of taking a random sample of \(n=16\) students, let's increase the sample size to \(n=64\). And, while setting the probability of committing a Type I error to \(\alpha=0.05\), test the null hypothesis \(H_0:\mu=100\) against the alternative hypothesis that \(H_A:\mu>100\).

What is the power of the hypothesis test when \(\mu=108\), \(\mu=112\), and \(\mu=116\)?

Setting \(\alpha\), the probability of committing a Type I error, to 0.05, implies that we should reject the null hypothesis when the test statistic \(Z\ge 1.645\), or equivalently, when the observed sample mean is 103.29 or greater:

\( \bar{x} = \mu + z \left(\dfrac{\sigma}{\sqrt{n}} \right) = 100 +1.645\left(\dfrac{16}{\sqrt{64}} \right) = 103.29\)

Therefore, the power function \(K(\mu)\), when \(\mu>100\) is the true value, is:

\( K(\mu) = P(\bar{X} \ge 103.29 | \mu) = P \left(Z \ge \dfrac{103.29 - \mu}{16 / \sqrt{64}} \right) = 1 - \Phi \left(\dfrac{103.29 - \mu}{2} \right)\)

Therefore, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=108\) is 0.9907, as calculated here:

\(K(108) = 1 - \Phi \left( \dfrac{103.29-108}{2} \right) = 1- \Phi(-2.355) = 0.9907 \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=112\) is greater than 0.9999, as calculated here:

\( K(112) = 1 - \Phi \left( \dfrac{103.29-112}{2} \right) = 1- \Phi(-4.355) = 0.9999\ldots \)

And, the probability of rejecting the null hypothesis at the \(\alpha=0.05\) level when \(\mu=116\) is greater than 0.999999, as calculated here:

\( K(116) = 1 - \Phi \left( \dfrac{103.29-116}{2} \right) = 1- \Phi(-6.355) = 0.999999... \)
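The three power calculations above follow the same recipe: find the cutoff for the sample mean, then evaluate the normal tail probability at each alternative. A sketch in Python (scipy's exact critical value 1.6449 is used in place of the rounded 1.645, so results agree with the text to about four decimals):

```python
import math
from scipy import stats

mu0, sigma, n, alpha = 100, 16, 64, 0.05
# Cutoff for the sample mean: reject H0 when x-bar exceeds about 103.29
cutoff = mu0 + stats.norm.ppf(1 - alpha) * sigma / math.sqrt(n)

def power(mu):
    """K(mu): probability the sample mean exceeds the cutoff
    when the true mean is mu."""
    return 1 - stats.norm.cdf((cutoff - mu) / (sigma / math.sqrt(n)))

# power(108) is about 0.9907, power(112) and power(116) are nearly 1
```

Evaluating `power` at 108, 112, and 116 reproduces \(K(108) = 0.9907\), \(K(112) > 0.9999\), and \(K(116) > 0.999999\).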

In summary, in the various examples throughout this lesson, we have calculated the power of testing \(H_0:\mu=100\) against \(H_A:\mu>100\) for two sample sizes ( \(n=16\) and \(n=64\)) and for three possible values of the mean ( \(\mu=108\), \(\mu=112\), and \(\mu=116\)). Here's a summary of our power calculations:

As you can see, our work suggests that for a given value of the mean \(\mu\) under the alternative hypothesis, the larger the sample size \(n\), the greater the power \(K(\mu)\) . Perhaps there is no better way to see this than graphically by plotting the two power functions simultaneously, one when \(n=16\) and the other when \(n=64\):

As this plot suggests, if we are interested in increasing our chance of rejecting the null hypothesis when the alternative hypothesis is true, we can do so by increasing our sample size \(n\). This benefit is perhaps even greatest for values of the mean that are close to the value of the mean assumed under the null hypothesis. Let's take a look at two examples that illustrate the kind of sample size calculation we can make to ensure our hypothesis test has sufficient power.

Example 25-4

Let \(X\) denote the crop yield of corn measured in the number of bushels per acre. Assume (unrealistically) that \(X\) is normally distributed with unknown mean \(\mu\) and standard deviation \(\sigma=6\). An agricultural researcher is working to increase the current average yield from 40 bushels per acre. Therefore, he is interested in testing, at the \(\alpha=0.05\) level, the null hypothesis \(H_0:\mu=40\) against the alternative hypothesis that \(H_A:\mu>40\). Find the sample size \(n\) that is necessary to achieve 0.90 power at the alternative \(\mu=45\).

As is always the case, we need to start by finding a threshold value \(c\), such that if the sample mean is larger than \(c\), we'll reject the null hypothesis:

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.05\) level, the following statement must hold (using our typical \(Z\) transformation):

\(c = 40 + 1.645 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

But, that's not the only condition that \(c\) must meet, because \(c\) also needs to be defined to ensure that our power is 0.90 or, alternatively, that the probability of a Type II error is 0.10. That would happen if there was a 10% chance that our test statistic fell short of \(c\) when \(\mu=45\), as the following drawing illustrates in blue:

This illustration suggests that in order for our hypothesis test to have 0.90 power, the following statement must hold (using our usual \(Z\) transformation):

\(c = 45 - 1.28 \left( \dfrac{6}{\sqrt{n}} \right) \) (**)

Aha! We have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(40+1.645\left(\frac{6}{\sqrt{n}}\right)=45-1.28\left(\frac{6}{\sqrt{n}}\right)\) \(\Rightarrow 5=(1.645+1.28)\left(\frac{6}{\sqrt{n}}\right), \qquad \Rightarrow 5=\frac{17.55}{\sqrt{n}}, \qquad n=(3.51)^2=12.3201\approx 13\)

Now that we know we will set \(n=13\), we can solve for our threshold value c :

\( c = 40 + 1.645 \left( \dfrac{6}{\sqrt{13}} \right)=42.737 \)

So, in summary, if the agricultural researcher collects data on \(n=13\) corn plots, and rejects his null hypothesis \(H_0:\mu=40\) if the average crop yield of the 13 plots is greater than 42.737 bushels per acre, he will have a 5% chance of committing a Type I error and a 10% chance of committing a Type II error if the population mean \(\mu\) were actually 45 bushels per acre.
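The two-equation solution above reduces to the familiar closed form \(n = ((z_\alpha + z_\beta)\,\sigma/\delta)^2\), rounded up. A sketch of the corn-yield calculation (scipy's exact quantiles 1.6449 and 1.2816 replace the rounded 1.645 and 1.28, so the intermediate value differs slightly from the text's 12.3201 but rounds up to the same \(n\)):

```python
import math
from scipy import stats

mu0, mu1, sigma = 40, 45, 6
z_alpha = stats.norm.ppf(0.95)  # alpha = 0.05, one-sided
z_beta = stats.norm.ppf(0.90)   # power = 0.90

n_exact = ((z_alpha + z_beta) * sigma / (mu1 - mu0)) ** 2
n = math.ceil(n_exact)                        # rounds up to 13 plots
c = mu0 + z_alpha * sigma / math.sqrt(n)      # cutoff, about 42.74 bushels
```

With \(n = 13\) and cutoff \(c \approx 42.74\), the test has the desired 5% Type I and 10% Type II error rates at \(\mu = 45\).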

Example 25-5

Consider \(p\), the true proportion of voters who favor a particular political candidate. A pollster is interested in testing at the \(\alpha=0.01\) level, the null hypothesis \(H_0:p=0.5\) against the alternative hypothesis that \(H_A:p>0.5\). Find the sample size \(n\) that is necessary to achieve 0.80 power at the alternative \(p=0.55\).

In this case, because we are interested in performing a hypothesis test about a population proportion \(p\), we use the \(Z\)-statistic:

\(Z = \dfrac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \)

Again, we start by finding a threshold value \(c\), such that if the observed sample proportion is larger than \(c\), we'll reject the null hypothesis:

That is, in order for our hypothesis test to be conducted at the \(\alpha=0.01\) level, the following statement must hold:

\(c = 0.5 + 2.326 \sqrt{ \dfrac{(0.5)(0.5)}{n}} \) (**)

But, again, that's not the only condition that c must meet, because \(c\) also needs to be defined to ensure that our power is 0.80 or, alternatively, that the probability of a Type II error is 0.20. That would happen if there was a 20% chance that our test statistic fell short of \(c\) when \(p=0.55\), as the following drawing illustrates in blue:

This illustration suggests that in order for our hypothesis test to have 0.80 power, the following statement must hold:

\(c = 0.55 - 0.842 \sqrt{ \dfrac{(0.55)(0.45)}{n}} \) (**)

Again, we have two (asterisked (**)) equations and two unknowns! All we need to do is equate the equations, and solve for \(n\). Doing so, we get:

\(0.5+2.326\sqrt{\dfrac{0.5(0.5)}{n}}=0.55-0.842\sqrt{\dfrac{0.55(0.45)}{n}} \\ 2.326\dfrac{\sqrt{0.25}}{\sqrt{n}}+0.842\dfrac{\sqrt{0.2475}}{\sqrt{n}}=0.55-0.5 \\ \dfrac{1}{\sqrt{n}}(1.5818897)=0.05 \qquad \Rightarrow n\approx \left(\dfrac{1.5818897}{0.05}\right)^2 = 1000.95 \approx 1001 \)

Now that we know we will set \(n=1001\), we can solve for our threshold value \(c\):

\(c = 0.5 + 2.326 \sqrt{\dfrac{(0.5)(0.5)}{1001}}= 0.5367 \)

So, in summary, if the pollster collects data on \(n=1001\) voters, and rejects his null hypothesis \(H_0:p=0.5\) if the proportion of sampled voters who favor the political candidate is greater than 0.5367, he will have a 1% chance of committing a Type I error and a 20% chance of committing a Type II error if the population proportion \(p\) were actually 0.55.

Incidentally, we can always check our work! Conducting the survey and subsequent hypothesis test as described above, the probability of committing a Type I error is:

\(\alpha= P(\hat{p} >0.5367 \text { if } p = 0.50) = P(Z > 2.3257) = 0.01 \)

and the probability of committing a Type II error is:

\(\beta = P(\hat{p} <0.5367 \text { if } p = 0.55) = P(Z < -0.846) = 0.199 \)

just as the pollster had desired.
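The proportion calculation can be sketched the same way, equating the two expressions for \(c\) and solving for \(n\). This uses the text's rounded quantiles (2.326 and 0.842) so it reproduces the numbers above:

```python
import math

p0, p1 = 0.50, 0.55
z_alpha, z_beta = 2.326, 0.842  # alpha = 0.01, power = 0.80

# Solve 0.5 + z_alpha*sqrt(p0(1-p0)/n) = 0.55 - z_beta*sqrt(p1(1-p1)/n)
lhs = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
n = math.ceil((lhs / (p1 - p0)) ** 2)             # rounds up to 1001 voters
c = p0 + z_alpha * math.sqrt(p0 * (1 - p0) / n)   # cutoff, about 0.537
```

Both the required \(n = 1001\) and the threshold proportion \(c \approx 0.537\) match the worked solution.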

We've illustrated several sample size calculations. Now, let's summarize the information that goes into a sample size calculation. In order to determine a sample size for a given hypothesis test, you need to specify:

  • The desired \(\alpha\) level, that is, your willingness to commit a Type I error.
  • The desired power or, equivalently, the desired \(\beta\) level, that is, your willingness to commit a Type II error.
  • A meaningful difference from the value of the parameter that is specified in the null hypothesis.
  • The standard deviation of the sample statistic or, at least, an estimate of the standard deviation (the "standard error") of the sample statistic.
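These four ingredients can be packaged into one small helper for the one-sided normal-mean case. A sketch under the same assumptions as Example 25-4 (known \(\sigma\), one-sided Z test); the function name is illustrative, not from the lesson:

```python
import math
from scipy import stats

def sample_size(alpha, power, delta, sd):
    """Smallest n for a one-sided Z test that rejects H0 with the
    requested power when the true mean differs from the null by delta."""
    z_a = stats.norm.ppf(1 - alpha)  # Type I error tolerance
    z_b = stats.norm.ppf(power)      # power (1 - Type II error)
    return math.ceil(((z_a + z_b) * sd / delta) ** 2)

# Example 25-4's inputs recover n = 13
n_corn = sample_size(alpha=0.05, power=0.90, delta=5, sd=6)
```

Each argument corresponds to one bullet above: `alpha`, `power`, the meaningful difference `delta`, and the standard deviation `sd`.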


Everything to Know About Sample Size Determination

A step-by-step interactive guide including common pitfalls


Designing a trial involves considering and balancing a wide variety of clinical, logistical and statistical factors.

One decision out of many is how large a study needs to be to have a reasonable chance of success. Sample size determination is the process by which trialists can find the ideal number of participants to balance the statistical and practical aspects that inform study design. 

In this interactive webinar, we provide a comprehensive overview of sample size determination, outline the key steps to finding an appropriate sample size, and cover several common pitfalls researchers fall into when choosing the sample size for their study.

In this free webinar we will cover:

  • What is sample size determination?
  • A step-by-step guide to sample size determination
  • Common sample size pitfalls and solutions

+ Q&A about your sample size issues!

In most clinical trials, the sample size is determined by targeting a predefined statistical power - one minus the Type II error rate, i.e. the probability of obtaining a significant p-value under a given treatment effect.

Power calculations require pre-study knowledge of the study design, statistical error rates, nuisance parameters (such as the variance) and the effect size, with each of these adding complexity.
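The most common such calculation can be sketched in a few lines. The function below uses the standard normal-approximation formula for comparing two means - a hypothetical sketch of the general idea, not nQuery's implementation; every name in it (`n_per_group`, `delta`, `sigma`) is invented for illustration:

```python
from math import ceil
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided Type I error rate
    z_beta = norm.ppf(power)           # power = 1 - Type II error rate
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detecting a standardized effect of 0.5 with 80% power at two-sided alpha = 0.05:
n = n_per_group(delta=0.5, sigma=1.0)  # 63 per group (the exact t-based answer is 64)
```

Every input here is exactly the pre-study knowledge described above: error rates (`alpha`, `power`), a nuisance parameter (`sigma`) and an effect size (`delta`). The normal approximation slightly understates the exact t-based answer.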

Sample size determination has a number of common pitfalls that can lead to inappropriately small or large sample sizes, with causes ranging from poor design decisions to misspecified nuisance parameters and inappropriately chosen effect sizes.

In this interactive webinar, we explore these pitfalls and some solutions for avoiding them, helping to maximise the efficiency of your clinical trial.




Test difference between samples with very small sample size

Suppose I have observed 3 realizations of two non-negative, integer random variables $X$ and $Y$. Nothing is known about their underlying distribution. The results were $x = \{4,~8,~2\}$, and $y = \{22,~11,~8\}$. Hence, $\overline{x} = 14/3$ and $\overline{y} = 41/3$, for a sample mean difference of $9$.

My question is: is there any way to conduct a meaningful statistical hypothesis test in order to decide whether $H_0: E(X) = E(Y)$ can be rejected?

My first idea was to use a two-sample t-test or Welch-test, but I guess the sample size is way too low for that. Is there anything one can reasonably test?

  • hypothesis-testing
  • small-sample


  • You may find the following thread informative: is-there-a-minimum-sample-size-required-for-the-t-test-to-be-valid . –  gung - Reinstate Monica, Jun 25, 2013
  • What makes you say the sample size is too low? The big problem is that you can't check the assumptions, but a two-sample t-test can actually be performed with two observations in each sample - or indeed, if you're prepared to assume equality of variance, with as few as 3 observations total! –  Glen_b, Jun 26, 2013

2 Answers

There are potentially a number of ways of testing whether these two samples differ, but all will probably have low power. You could use a t-test, but its validity will depend on whether the underlying populations are normally distributed and have equal variances. With so few data, you really can't check that very well, so you have to rely entirely on prior knowledge (of which you say you have none) and the assumptions you are willing to make. Given that your sample variances are $9.3$ and $54.3$, I would not want to assume equal variances (although, again, with so few data they actually could be equal), so the Satterthwaite-Welch correction seems appropriate. If you aren't willing to assume the populations are exactly normal (since the central limit theorem cannot cover you with samples this small), you could use the Mann-Whitney U test . As it happens, that test gives a lower p-value ($.12$) than the corrected t-test ($.16$). The question, then, is what you want to conclude from these results. My opinion is that using a rigid $.05$ cutoff is typically not appropriate (see my answer here: When to use Fisher and Neyman-Pearson framework? ); so I would say this result is somewhat ambiguous, but you might find it provides some evidence against the null, depending on how plausible the null is a priori.
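For reference, both figures quoted above can be reproduced with scipy - a sketch, not part of the original answer; the exact Mann-Whitney p-value depends on how the implementation handles the tied 8's:

```python
from scipy.stats import ttest_ind, mannwhitneyu

x = [4, 8, 2]
y = [22, 11, 8]

# Welch (Satterthwaite-corrected) t-test: no equal-variance assumption.
t_stat, p_welch = ttest_ind(x, y, equal_var=False)          # p approx. 0.16

# Mann-Whitney U test: no normality assumption; with the tied 8's,
# scipy falls back to a normal approximation, as R's wilcox.test does.
u_stat, p_mw = mannwhitneyu(x, y, alternative="two-sided")  # p approx. 0.12
```
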


  • How did you compute your Mann-Whitney p-value? Did you use the normal approximation? Adjust for ties? The exact permutation test gives a p-value of 0.10, and since the MW test is a permutation test it should give a p-value of 0.10 as well (but I could see a small difference coming from an approximation rather than an exact test). –  Greg Snow, Jun 25, 2013
  • @GregSnow, I just typed the data into R & ran wilcox.test(x, y) . –  gung - Reinstate Monica, Jun 25, 2013
  • @Greg The M-W is not a permutation test of the mean, so it should be no surprise that it gives a (slightly) different p-value. With some of the permutations, such as $((2,4,22), (8,8,11))$, the order of the means ($28/3, 27/3$) differs from the order of the medians ($4, 8$) or the rank sums ($9, 12$). How you adjust for ties can make a substantial difference, too, depending on what test statistic is used in the permutation test. Gung: talking about "power" may be a little dicey when no alternative hypothesis is clearly formulated. –  whuber ♦, Jun 25, 2013
  • @whuber, There are only 20 possible combinations for splitting 6 observations into 2 groups of 3 each, so any p-value from an exact permutation test will be a multiple of 0.05. Since the data have two 8's, one in each group, any permutation that switches the two 8's will be equivalent, giving a p-value of 0.10 (though it should be doubled to 0.2 for a 2-sided test) for any statistic that measures location, whether it be mean, median, minimum, sum of ranks, 42nd percentile, etc. A permutation test based on the ratio of variances could give 0.05 for a 1-sided test. –  Greg Snow, Jun 25, 2013
  • @gung I think the big problem with the Mann-Whitney on this tiny a sample size (3,3) is that - no matter how vastly separated the samples are, and even without any ties - you simply can't get a p-value below 0.1 (two-tailed). At least a suitable parametric test (if its assumptions could be argued to be reasonable) has the possibility of reaching an interesting p-value. One needs to bring some assumptions (not necessarily normality) to this small a sample, and try to justify them. –  Glen_b, Jun 26, 2013

You say that nothing is known about the distribution, but you also say that the values are non-negative integers, so that tells you something. Can you learn more about the theory?

An exact permutation test on your data gives a p-value of 0.10 and does not require assumptions about the distribution (just testing that it is identical for the 2 groups).
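That exact test is easy to carry out by brute force, since there are only $\binom{6}{3} = 20$ ways to relabel the pooled data. Here is a sketch in Python (not the answerer's original code):

```python
from itertools import combinations

x = [4, 8, 2]
y = [22, 11, 8]
pooled = x + y
obs_sum = sum(x)  # 14; with equal group ssizes, comparing group sums
                  # is equivalent to comparing group means

# Enumerate all C(6, 3) = 20 ways of relabelling three pooled values as "x",
# and count those at least as extreme (as small) as the observed sum.
relabelings = list(combinations(range(len(pooled)), len(x)))
count = sum(1 for idx in relabelings
            if sum(pooled[i] for i in idx) <= obs_sum)

p_one_sided = count / len(relabelings)   # 2/20 = 0.10
p_two_sided = min(1.0, 2 * p_one_sided)  # 0.20 by doubling, per the comment above
```

Only the observed labelling $\{4,8,2\}$ and its twin $\{4,2,8\}$ (swapping the two tied 8's) reach a group sum as low as 14, hence 2/20 = 0.10 one-sided.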

If your data represent counts (non-negative integers) then a Poisson model may be appropriate (which gives a p-value less than 0.001), but make sure that the assumptions are reasonable.
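One standard exact version of such a Poisson comparison - an assumption on my part, since the answer does not say which Poisson test was used - conditions on the total count: under equal rates and equal numbers of observation periods, the first group's total is Binomial(total, 1/2).

```python
from scipy.stats import binom

sx, sy = 14, 41  # totals of x = {4, 8, 2} and y = {22, 11, 8}
total = sx + sy  # 55

# Under H0 (equal Poisson rates, equal exposure),
# sx | total ~ Binomial(total, 0.5); sx = 14 is far below the expected 27.5.
p_one_sided = binom.cdf(sx, total, 0.5)
p_two_sided = min(1.0, 2 * p_one_sided)  # well below 0.001, matching the answer
```
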

As @gung mentions, there are many ways to look at this, and which is best will depend on the source of the data and the science that underlies it. Learning about the data, talking to the person who gave you the data, and consulting other experts will be of the most benefit.

If you really cannot learn anything more about the underlying distribution or approximations to it then you may have to resort to SnowsCorrectlySizedButOtherwiseUselessTestOfAnything (a function in the TeachingDemos package for R) which will give a p-value without requiring any assumptions about your data. But note that that function is considered less useful than its documentation.


