Find your way through the jungle of Statistical Hypothesis Tests
In many behavioural experiments we want to compare an outcome measure across different groups of subjects or different experimental conditions. But even after several years of data analysis, I still have to remind myself which statistical test is the right one for even a simple hypothesis test. The fact that different analysis frameworks implement the tests differently complicates the issue further. That's why I composed a decision tree for the common situation where we compare the mean of a continuous dependent variable (i.e. the outcome measure) across the levels of one or more categorical variables.
The questions that you typically have to ask yourself are:
- How many factors are included in the design? = How many categorical variables do I have?
- How many levels does each factor have? = How many conditions do I have?
- Do I have a between-subjects or a within-subjects design? = Am I comparing one group or several groups?
- Are the measures dependent or independent?
- Do I have a repeated-measures design? Do I need to account for random effects for subjects?
- Does my data fulfill the criteria for a parametric test (normal distribution, equal variances, etc.)? A quick check is sketched below.
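For the last question, a first check in Python could look like the following. This is a minimal sketch with made-up data, using scipy's Shapiro-Wilk test for normality and Levene's test for equality of variances; the sample sizes and distribution parameters are arbitrary:

```python
import numpy as np
from scipy import stats

# made-up samples for two hypothetical conditions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=0.5, scale=1.2, size=30)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
print(stats.shapiro(group_a))
print(stats.shapiro(group_b))

# Levene: the null hypothesis is that the groups have equal variances
print(stats.levene(group_a, group_b))
```

If either test rejects its null hypothesis, the non-parametric branches below are the safer choice.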
The overview below should give you some guidance on which test to use. I have also included the names of the corresponding test implementations in Python and R.
Decision tree for Statistical Hypothesis Tests
one factor, two levels
- independent measurements
  - parametric test
    - t-test
      - python: scipy.stats.ttest_ind
      - R: t.test
  - non-parametric test
    - Mann-Whitney U test
      - python: scipy.stats.mannwhitneyu
      - R: wilcox.test (Mann-Whitney-Wilcoxon test)
- dependent measurements
  - parametric test
    - paired t-test (equivalent to a one-sample t-test on the differences, or a GLM with random effects for each subject)
      - python: scipy.stats.ttest_rel
      - R: t.test(paired=TRUE)
  - non-parametric test
    - Wilcoxon signed-rank test
      - python: scipy.stats.wilcoxon
      - R: wilcox.test(paired=TRUE)
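To make the branch above concrete, here is a minimal Python sketch with made-up data; the group sizes and effect size are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
cond_a = rng.normal(0.0, 1.0, size=25)  # hypothetical condition A
cond_b = rng.normal(0.4, 1.0, size=25)  # hypothetical condition B

# independent measurements (two separate groups)
print(stats.ttest_ind(cond_a, cond_b))     # parametric: two-sample t-test
print(stats.mannwhitneyu(cond_a, cond_b))  # non-parametric: Mann-Whitney U test

# dependent measurements (the same subjects in both conditions)
print(stats.ttest_rel(cond_a, cond_b))     # parametric: paired t-test
print(stats.wilcoxon(cond_a, cond_b))      # non-parametric: Wilcoxon signed-rank test
```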
one factor, multiple levels
- independent measurements
  - parametric test
    - one-way ANOVA
      - python: statsmodels.formula.api.ols
      - python: scipy.stats.f_oneway
      - R: lm
  - non-parametric test
    - Kruskal-Wallis test
      - python: scipy.stats.kruskal
      - R: kruskal.test
- dependent measurements
  - parametric test
    - repeated-measures one-way ANOVA (with random effects for subjects)
      - python: statsmodels.stats.anova.AnovaRM (only implemented for fully balanced within-subject designs)
      - R: lme4 (lmer)
  - non-parametric test
    - Friedman test
      - python: scipy.stats.friedmanchisquare
      - R: friedman.test
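Again, a minimal Python sketch of this branch with made-up data for three conditions; the long-format reshaping at the end is what AnovaRM expects:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(2)
cond_a = rng.normal(0.0, 1.0, size=20)  # hypothetical condition A
cond_b = rng.normal(0.3, 1.0, size=20)  # hypothetical condition B
cond_c = rng.normal(0.6, 1.0, size=20)  # hypothetical condition C

# independent measurements
print(stats.f_oneway(cond_a, cond_b, cond_c))  # parametric: one-way ANOVA
print(stats.kruskal(cond_a, cond_b, cond_c))   # non-parametric: Kruskal-Wallis

# dependent measurements (every subject measured in every condition)
print(stats.friedmanchisquare(cond_a, cond_b, cond_c))  # non-parametric: Friedman

# parametric repeated-measures ANOVA; AnovaRM needs long-format, fully balanced data
long_df = pd.DataFrame({
    "subject": np.tile(np.arange(20), 3),
    "condition": np.repeat(["a", "b", "c"], 20),
    "value": np.concatenate([cond_a, cond_b, cond_c]),
})
print(AnovaRM(long_df, depvar="value", subject="subject", within=["condition"]).fit())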
two factors, multiple levels
- independent measurements
  - parametric test
    - two-way ANOVA
      - python: statsmodels.formula.api.ols
      - R: lme4 (lmer)
      - R: aov (not recommended)
  - non-parametric test
    - Scheirer-Ray-Hare test
      - python and R: not available
      - alternative: build a linear mixed model by hand and obtain p-values via bootstrapping
- dependent measurements
  - parametric test
    - repeated-measures two-way ANOVA
      - python: statsmodels.stats.anova.AnovaRM (only implemented for fully balanced within-subject designs)
      - python: statsmodels.formula.api.mixedlm (statsmodels does not support crossed random effects, i.e. only one grouping factor)
      - R: lme4 (lmer)
  - non-parametric test
    - build a linear mixed model by hand and obtain p-values via bootstrapping
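A minimal Python sketch of the two parametric options above, with simulated data for two hypothetical 2-level factors. The same data frame is reused for both calls purely for brevity; in a real analysis the OLS version would be used for independent measurements only:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(40), 4),
    "stimulus": np.tile(np.repeat(["low", "high"], 2), 40),  # hypothetical factor 1
    "task": np.tile(["easy", "hard"], 80),                   # hypothetical factor 2
})
df["value"] = rng.normal(0.0, 1.0, size=len(df)) + (df["task"] == "hard") * 0.5

# independent measurements: two-way ANOVA table from an OLS fit
ols_fit = smf.ols("value ~ C(stimulus) * C(task)", data=df).fit()
print(sm.stats.anova_lm(ols_fit, typ=2))

# dependent measurements: mixed model with a random intercept per subject
mixed_fit = smf.mixedlm("value ~ C(stimulus) * C(task)", data=df, groups=df["subject"]).fit()
print(mixed_fit.summary())
```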
more than two factors
- independent measurements
  - parametric test
    - n-way ANOVA
      - python and R: see above for two factors
  - non-parametric test
    - build a linear mixed model by hand and obtain p-values via bootstrapping (sketched below)
- dependent measurements
  - parametric test
    - n-way repeated-measures ANOVA
      - python and R: see above for two factors
  - non-parametric test
    - python and R: not implemented
    - build a linear mixed model by hand and obtain p-values via bootstrapping (sketched below)
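Several branches above fall back on "build a linear mixed model by hand and do bootstrapping". As a rough illustration of what that could mean, here is a minimal Python sketch of a cluster bootstrap (resampling whole subjects with replacement) around statsmodels' mixedlm; the data, model, and number of iterations are all made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(30), 2),
    "condition": np.tile(["a", "b"], 30),
})
df["value"] = rng.normal(0.0, 1.0, size=len(df)) + (df["condition"] == "b") * 0.4

def condition_effect(data):
    """Fit a random-intercept model and return the fixed effect of condition b."""
    fit = smf.mixedlm("value ~ C(condition)", data=data, groups=data["subject"]).fit()
    return fit.params["C(condition)[T.b]"]

observed = condition_effect(df)

# cluster bootstrap: resample whole subjects with replacement, then refit
subjects = df["subject"].unique()
boot_effects = []
for i in range(200):  # more iterations would be used in practice
    sampled = rng.choice(subjects, size=len(subjects), replace=True)
    resampled = pd.concat(
        [df[df["subject"] == s].assign(subject=j) for j, s in enumerate(sampled)],
        ignore_index=True,
    )
    boot_effects.append(condition_effect(resampled))

lo, hi = np.percentile(boot_effects, [2.5, 97.5])
print(f"effect = {observed:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

If the bootstrap confidence interval excludes zero, the effect is significant at the corresponding level.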