Sample Size
The term sample size refers to the number of participants, or other units of observation, in a study. For example, if a Researcher recruits 2000 people to fill out a survey, we would say that the sample size of the study is 2000. Sample size is important because it affects how confident we can be that the results of a study tell us something about reality. There are a number of free online tools you can use to calculate sample size.
Background
Researchers use statistical analysis to help determine whether certain observations in our data (e.g., an association between variables, a difference between groups) are real, or simply the result of chance.
Let’s consider a simple leafleting experiment. Imagine we randomly assign each participant either to a treatment condition, where they are given a leaflet that advocates for veganism, or to a control condition, where they are given a leaflet on an unrelated topic. A few months later, we administer a Food Frequency Questionnaire to assess how many animal products participants in each group consumed over the previous two days. Suppose that at follow-up, participants in the control condition report consuming an average of 12.1 servings, while participants in the treatment condition report consuming an average of 11.9 servings. What conclusion should we draw here? Do these results tell us whether the leafleting intervention had an effect on consumption?
Looking at these results, we might be tempted to conclude that the leafleting intervention reduced consumption of animal products. We cannot say this conclusively, however, because it is also possible that the observed difference between conditions is simply the result of chance. To assess whether the leafleting intervention had a real effect on consumption, Researchers would typically conduct a statistical significance test. If they observe that the result is statistically significant (i.e., p < .05, conventionally), they might conclude that the difference between conditions is real. In contrast, if they observe that the result is not statistically significant, they might conclude that the difference between conditions is probably due to chance.
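For illustration, here is a minimal sketch of such a test in Python using simulated data; the group means, standard deviation, sample sizes, and use of the SciPy library are assumptions made for the example, not details from an actual study:

    # Minimal sketch of a two-group significance test on simulated follow-up data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Hypothetical servings of animal products reported at follow-up.
    control = rng.normal(loc=12.1, scale=1.5, size=100)    # unrelated leaflet
    treatment = rng.normal(loc=11.9, scale=1.5, size=100)  # vegan leaflet

    # Independent-samples t-test comparing the two group means.
    result = stats.ttest_ind(treatment, control)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")

    # Conventionally, p < .05 is treated as statistically significant.
    print("significant" if result.pvalue < 0.05 else "not significant")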
A critical feature of significance tests is that they are sensitive to sample size. With a large sample size, there is an increased likelihood that we will detect an effect if one exists, even if it is small. In contrast, with a small sample size, it can be difficult to conclusively detect small effects in our data, even if they are real. This is an important consideration for animal advocacy research because we often care about small effects (e.g., small changes in diet).
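To make this concrete, the sketch below computes the statistical power of a two-group comparison for a small effect at two different sample sizes; the effect size, sample sizes, and use of the statsmodels library are illustrative assumptions:

    # Illustration of how sample size affects statistical power for a small effect.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    small_effect = 0.2  # a small standardized effect (Cohen's d), chosen for illustration

    for n_per_group in (50, 500):
        power = analysis.power(effect_size=small_effect, nobs1=n_per_group, alpha=0.05)
        print(f"n = {n_per_group} per group -> power ~ {power:.2f}")

    # With 50 people per group, power is low (well under 50%): a real but small
    # effect would usually go undetected. With 500 per group, power is high
    # (close to 90%), so the same small effect would usually be detected.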
Although a larger sample size is more useful, collecting one can be costly and time-consuming. To strike the right balance between statistical confidence and budget constraints, one strategy is to calculate a minimum detectable effect (the smallest effect that you care about detecting) and then collect a sample just large enough to detect that effect.
How to Determine Sample Size
To determine the sample size required for the leafleting experiment above, we would first specify the minimum detectable effect that we expect and/or care about finding. For this study design, the effect size would be calculated as:
d = (M2 – M1) / SD
In this formula, (M2 – M1) is the difference in consumption between the treatment and control groups, and SD is the standard deviation (i.e., a measure of the amount of variation in consumption scores between participants).
To calculate the minimum detectable effect, we must first decide the smallest difference between the treatment and control groups (i.e., M2 – M1) that we would consider practically important. This decision usually does not have a statistical or mathematical basis and is fairly subjective, but sometimes we can base it on the effect that would make an intervention cost-effective. For example, if we decided we did not care about anything less than a mean difference of 1 serving (and assuming a standard deviation of 1.5 servings), our minimum detectable effect would be 0.67, which is considered large in social science research. Using this calculator,1 we find we would need only roughly 75 participants to have the power to conclusively test whether or not this effect exists. In contrast, if we decided we did not care about anything less than a mean difference of 0.5 servings (and again assuming a standard deviation of 1.5 servings), our minimum detectable effect would be 0.33, which is moderate. We would need approximately 300 participants to have the power to conclusively test whether or not this effect exists. The smaller the effect we care about, the larger the required sample size.
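The same calculation can be sketched in code. The snippet below assumes the statsmodels library, 80% power, and a two-sided alpha of .05 (common, but not the only possible, choices), and mirrors the numbers in the example above:

    # Reproducing the sample-size estimates above with a power calculation.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    sd = 1.5  # assumed standard deviation of servings

    for mean_difference in (1.0, 0.5):
        d = mean_difference / sd  # minimum detectable effect: d = (M2 - M1) / SD
        n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
        print(f"difference = {mean_difference} servings -> d = {d:.2f}, "
              f"n ~ {n_per_group:.0f} per group ({2 * n_per_group:.0f} total)")

    # A 1-serving difference (d ~ 0.67) requires roughly 36 participants per group
    # (about 75 in total); a 0.5-serving difference (d ~ 0.33) requires roughly
    # 145 per group (about 290 in total, i.e. approximately 300).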
In addition to specifying the minimum difference between groups, we also need to specify or estimate the standard deviation of our dependent variable. This can be done based on past research, or by conducting a pilot study. For example, if we conducted a pilot study and observed that the standard deviation is closer to 4.0 servings than to 1.5 servings, our minimum detectable effect (assuming we are again trying to detect a mean difference of 0.5 servings) would become 0.13, which is small. We would need at least 2000 participants to have the power to conclusively test whether or not this effect exists. Thus, correctly estimating the standard deviation is critical to determining the required sample size.2 The larger the standard deviation, the larger the required sample size.
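Continuing the sketch above with the same assumed library, power, and alpha, we can see how a larger standard deviation shrinks the effect size and inflates the required sample size:

    # How the assumed standard deviation changes the required sample size
    # for the same 0.5-serving difference.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    mean_difference = 0.5  # smallest difference in servings we care about

    for sd in (1.5, 4.0):  # 4.0 servings = hypothetical pilot-study estimate
        d = mean_difference / sd
        n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
        print(f"SD = {sd} -> d = {d:.2f}, n ~ {n_per_group:.0f} per group "
              f"({2 * n_per_group:.0f} total)")

    # With SD = 4.0 the effect shrinks to d ~ 0.13, and the total required sample
    # size rises to roughly 2,000 participants.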
Resources
Online tools for calculating sample size:
http://clincalc.com/Stats/SampleSize.aspx
http://www.gpower.hhu.de/en.html
Interactive tools to understand statistical power and significance testing:
http://rpsychologist.com/d3/NHST/
http://fivethirtyeight.com/features/science-isnt-broken/#part2
Analogies for understanding statistical power:
http://www.graphpad.com/guides/prism/7/statistics/index.htm?stat_statistical_power.htm
http://eric.ed.gov/?id=ED441009
Further reading on statistical power and significance testing:
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Howell, D. (2012). Statistical methods for psychology. Cengage Learning.
Greenland et al. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31, 337–350.