How to master an ANOVA: Examples in Python and R
!! New !! This tutorial is now also accessible as an interactive Jupyter Notebook in two parts: the first part with a Python kernel and the second part with an R kernel.
In one of my previous blog posts I talked about how to pick the right statistical hypothesis test for your experimental design. One of the most heavily used families of tests for psychological and, more generally, experimental research is the Analysis of Variance (ANOVA). Most analysis frameworks have built-in implementations for ANOVAs – with different strengths and limitations. But in order to set up and interpret your ANOVA correctly, it is necessary to understand it in the more general context of linear models and linear regression.
Here, I have written a tutorial on how to conduct an ANOVA in Python and R and how to inspect the underlying model matrices. Note that I didn’t include the assumption checks or post-hoc tests that you would typically want to do. Furthermore, the examples below include 3- and 4-factorial ANOVAs to demonstrate the underlying principles, but in practice you might want to break down your design: such ‘big’ ANOVAs are not recommended, because they don’t account for the multiple tests involved.
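For illustration, a minimal sketch of such assumption checks and a post-hoc test in Python could look like the following. This is not part of the tutorial’s analyses; it assumes a dataframe df with columns ‘group’ and ‘my_value’ and a fitted OLS model my_model_fit like the ones we build below:
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# normality of the model residuals (Shapiro-Wilk test)
print(stats.shapiro(my_model_fit.resid))
# homogeneity of variances across groups (Levene's test)
print(stats.levene(*[g['my_value'].values for _, g in df.groupby('group')]))
# pairwise post-hoc comparisons between groups (Tukey HSD)
print(pairwise_tukeyhsd(df['my_value'], df['group']))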
All code for the examples below can also be found in my GitHub repository.
The imports that I used for all Python code below are the following:
import os
import random
import numpy as np
import pandas as pd
import patsy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.regression.mixed_linear_model import MixedLMResults
from scipy import stats
import seaborn as sns
1-way ANOVA in Python (between-subject factor)
Let’s start with an example, where we are comparing an outcome measure in three different groups of subjects (healthy controls and two groups of patients, 10 subjects per group). Here, I’m simulating data with an effect of group and plotting the data to inspect it.
# information on experimental design
group_list = ['control','patient1','patient2']
subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
# read data into dataframe
df_1way = pd.DataFrame(columns=["group", "my_value"])
my_row = 0
for ind_g, group in enumerate(group_list):
    for sub in subs_list:
        # generate random value here as example
        my_val = np.random.normal(ind_g, 1, 1)[0]
        df_1way.loc[my_row] = [group, my_val]
        my_row = my_row + 1
# inspect data
sns.catplot(x="group", y="my_value", data=df_1way, dodge=True, kind='violin', aspect=3)
plt.show()
In order to conduct an ANOVA, we need to perform three steps: 1) generate a model that fits our design, 2) fit our data to the model to obtain the parameter estimates, 3) derive the statistics using a summary function of the model fit. In Python, these steps are implemented in the statsmodels library. The general function to perform a linear regression (which underlies an ANOVA) is ols. You can specify your model for ols using the same formula syntax that is used in R. If you conduct a 1-way ANOVA, i.e. you only have one categorical factor in your design, you can also use the f_oneway function. If you run the code below, you will see that both give an identical result.
# generate model for linear regression
my_model = smf.ols(formula='my_value ~ group', data=df_1way)
# fit model to data to obtain parameter estimates
my_model_fit = my_model.fit()
# print summary of linear regression
print(my_model_fit.summary())
# show anova table
anova_table = sm.stats.anova_lm(my_model_fit, typ=2)
print(anova_table)
                            OLS Regression Results
==============================================================================
Dep. Variable:               my_value   R-squared:                       0.378
Model:                            OLS   Adj. R-squared:                  0.332
Method:                 Least Squares   F-statistic:                     8.220
Date:                Sun, 03 Nov 2019   Prob (F-statistic):            0.00163
Time:                        11:53:16   Log-Likelihood:                -36.303
No. Observations:                  30   AIC:                             78.61
Df Residuals:                      27   BIC:                             82.81
Df Model:                           2
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            -0.1754      0.271     -0.648      0.522      -0.730       0.380
group[T.patient1]     0.7830      0.383      2.047      0.051      -0.002       1.568
group[T.patient2]     1.5511      0.383      4.054      0.000       0.766       2.336
==============================================================================
Omnibus:                        0.172   Durbin-Watson:                   2.579
Prob(Omnibus):                  0.918   Jarque-Bera (JB):                0.098
Skew:                          -0.115   Prob(JB):                        0.952
Kurtosis:                       2.840   Cond. No.                         3.73
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                sum_sq    df         F    PR(>F)
group        12.029330   2.0  8.219594  0.001629
Residual     19.757175  27.0       NaN       NaN
# compare p-value to f_oneway analysis
F, p = stats.f_oneway(df_1way[df_1way['group'] == 'control'].my_value, df_1way[df_1way['group'] == 'patient1'].my_value, df_1way[df_1way['group'] == 'patient2'].my_value)
print(p)
0.0016293255849487813
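To see where these numbers come from, we can also compute the F-statistic by hand from the between-group and within-group sums of squares (a quick sketch, reusing df_1way from above):
# compute the 1-way ANOVA F-statistic by hand from sums of squares
grand_mean = df_1way['my_value'].astype(float).mean()
group_vals = [g.astype(float) for _, g in df_1way.groupby('group')['my_value']]
# between-group sum of squares, k - 1 degrees of freedom
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in group_vals)
# within-group (residual) sum of squares, N - k degrees of freedom
ss_within = sum(((g - g.mean()) ** 2).sum() for g in group_vals)
k, N = len(group_vals), len(df_1way)
F_manual = (ss_between / (k - 1)) / (ss_within / (N - k))
print(F_manual)  # matches the F-statistic from ols and f_oneway above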
2-way ANOVA in Python (between-subject factors)
In the next example, we are extending our design to include the native language of the subjects as an additional factor. This means that we are still in a fully between-subject design and each data point comes from a different subject. The function call to ols with the * operator will model both main effects for group and language and their interaction.
# information on experimental design
group_list = ['control','patient1','patient2']
language_list = ['English', 'German', 'French']
subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
# read data into dataframe
df_2way = pd.DataFrame(columns=["group", "language", "my_value"])
my_row = 0
for ind_g, group in enumerate(group_list):
    for ind_l, lan in enumerate(language_list):
        for sub in subs_list:
            # generate random value here as example
            my_val = np.random.normal(ind_g + ind_l, 1, 1)[0]
            df_2way.loc[my_row] = [group, lan, my_val]
            my_row = my_row + 1
# plot data
sns.catplot(x="language", y="my_value", data=df_2way, dodge=True, hue='group', kind='violin', aspect=3)
plt.show()
# fit model to data to obtain parameter estimates
my_model_fit = smf.ols(formula='my_value ~ group * language', data=df_2way).fit()
# print summary of linear regression
print(my_model_fit.summary())
# show anova table
print(sm.stats.anova_lm(my_model_fit, typ=2))
                            OLS Regression Results
==============================================================================
Dep. Variable:               my_value   R-squared:                       0.630
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     17.25
Date:                Sun, 03 Nov 2019   Prob (F-statistic):           1.07e-14
Time:                        11:53:17   Log-Likelihood:                -115.57
No. Observations:                  90   AIC:                             249.1
Df Residuals:                      81   BIC:                             271.6
Df Model:                           8
Covariance Type:            nonrobust
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                               -0.3168      0.291     -1.088      0.280      -0.896       0.263
group[T.patient1]                        0.5484      0.412      1.331      0.187      -0.271       1.368
group[T.patient2]                        2.0457      0.412      4.966      0.000       1.226       2.865
language[T.French]                       2.1441      0.412      5.205      0.000       1.324       2.964
language[T.German]                       0.8381      0.412      2.034      0.045       0.018       1.658
group[T.patient1]:language[T.French]     0.6454      0.583      1.108      0.271      -0.514       1.805
group[T.patient2]:language[T.French]    -0.8923      0.583     -1.532      0.130      -2.051       0.267
group[T.patient1]:language[T.German]     0.6441      0.583      1.106      0.272      -0.515       1.803
group[T.patient2]:language[T.German]     0.0308      0.583      0.053      0.958      -1.128       1.190
==============================================================================
Omnibus:                        0.018   Durbin-Watson:                   1.767
Prob(Omnibus):                  0.991   Jarque-Bera (JB):                0.061
Skew:                           0.026   Prob(JB):                        0.970
Kurtosis:                       2.883   Cond. No.                         13.9
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                   sum_sq    df          F        PR(>F)
group           46.585744   2.0  27.450065  7.911130e-10
language        63.784731   2.0  37.584352  2.838481e-12
group:language   6.750387   4.0   1.988790  1.039990e-01
Residual        68.732903  81.0        NaN           NaN
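As an aside, replacing the * with a + in the formula models only the two main effects, without the interaction term:
# model main effects only, without the group-by-language interaction
my_model_add = smf.ols(formula='my_value ~ group + language', data=df_2way).fit()
print(sm.stats.anova_lm(my_model_add, typ=2))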
Patsy
At this point I would like to mention that statsmodels internally uses the patsy library to convert the specified formula to a model matrix. This is useful, because we can access and visualize the underlying model matrix. You can modify the design matrix, for example, to change the coding scheme for factorial categories from ‘treatment’ to ‘sum’, or to use a different reference level.
# use Patsy to construct the model above
model_matrix = patsy.dmatrix("group * language", df_2way)
# visualize the model matrix
plt.imshow(model_matrix, aspect='auto')
plt.show()
# use sum coding scheme for the factors
plt.imshow(patsy.dmatrix("C(group, Sum) * C(language, Sum)", df_2way), aspect='auto')
plt.show()
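To change the reference level mentioned above, Patsy’s Treatment coding accepts an explicit reference category. A small sketch, reusing df_2way, with ‘patient1’ instead of the default ‘control’ as reference:
# use 'patient1' as the reference level for the group factor
plt.imshow(patsy.dmatrix("C(group, Treatment(reference='patient1')) * language", df_2way), aspect='auto')
plt.show()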
2-way Repeated measures ANOVA in Python (within-subject factors)
Let’s look at a different design, where we have repeated measures for each subject, which is common in psychological experiments. In this case we need to include random effects for each subject. We can conduct an ANOVA on such a design using mixedlm. In the examples here, we are modeling a random intercept for each subject, but by passing the re_formula option, we can also include a random slope for each subject (see the sketch at the end of this section). If we only have a within-subject design, we can also use the AnovaRM function in Python; however, only fully balanced within-subject designs are supported. One general limitation of the Python implementations is that crossed random effects are not supported, so we can only specify one factor to model the random effects.
In the example below, I’m simulating data from a single-group design with two factors: all subjects performed three different tasks before and after a treatment. Note that in the data simulation I’m introducing the factor sub_id, which is unique for each subject and differs from the subject ID that we defined in the folder system (only in combination with the ‘group’ string does that subject ID give a unique identifier, as in the BIDS folder format).
# information on experimental design
subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
task_list = ['task1', 'task2', 'task3']
condition_list = ['pre', 'post']
# read data into dataframe
df_2way_rm = pd.DataFrame(columns=["sub_id", "task", "condition", "my_value"])
my_row = 0
# unique subject-ID as additional factor
sub_id = 0
for sub in subs_list:
    sub_id = sub_id + 1
    for ind_t, task in enumerate(task_list):
        for ind_c, con in enumerate(condition_list):
            # generate random value here as example
            my_val = np.random.normal(ind_t + ind_c, 1, 1)[0]
            df_2way_rm.loc[my_row] = [sub_id, task, con, my_val]
            my_row = my_row + 1
# conduct ANOVA using mixedlm
my_model_fit = smf.mixedlm("my_value ~ task * condition", df_2way_rm, groups=df_2way_rm["sub_id"]).fit()
# get random effects
my_model_fit.random_effects
# get fixed effects (no f-test implemented)
my_model_fit.summary()
mydir/site-packages/statsmodels/regression/mixed_linear_model.py:2094: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
Model:             MixedLM   Dependent Variable:   my_value
No. Observations:  60        Method:               REML
No. Groups:        10        Scale:                0.9879
Min. group size:   6         Likelihood:           -83.2026
Max. group size:   6         Converged:            Yes
Mean group size:   6.0
--------------------------------------------------------------------------
                                Coef.  Std.Err.    z    P>|z| [0.025 0.975]
--------------------------------------------------------------------------
Intercept                       0.889     0.314  2.828  0.005  0.273  1.505
task[T.task2]                   1.240     0.445  2.790  0.005  0.369  2.111
task[T.task3]                   2.452     0.445  5.516  0.000  1.581  3.323
condition[T.pre]               -0.466     0.445 -1.049  0.294 -1.337  0.405
task[T.task2]:condition[T.pre] -0.412     0.629 -0.655  0.512 -1.644  0.820
task[T.task3]:condition[T.pre] -1.049     0.629 -1.668  0.095 -2.281  0.183
Group Var                       0.000     0.169
# conduct ANOVA using AnovaRM
my_model_fit = AnovaRM(df_2way_rm, 'my_value', 'sub_id', within=['task', 'condition']).fit()
print(my_model_fit.anova_table)
                  F Value  Num DF  Den DF    Pr > F
task            14.983562     2.0    18.0  0.000148
condition       11.910377     1.0     9.0  0.007262
task:condition   1.676113     2.0    18.0  0.215013
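As mentioned above, the re_formula option of mixedlm adds random slopes. Here is a sketch of a model with a random intercept plus a per-subject random slope for condition (again using df_2way_rm):
# random intercept and a per-subject random slope for condition
my_model_rs = smf.mixedlm("my_value ~ task * condition", df_2way_rm,
                          groups=df_2way_rm["sub_id"],
                          re_formula="~condition").fit()
print(my_model_rs.summary())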
4-way ANOVA with between-group and within-group factors (repeated measures)
If we wanted to conduct a mixed-model ANOVA that includes between-subject factors (group and language) and within-subject factors (task and condition), we can do this using the mixedlm function, similar to the example shown above. (Note that the model call below includes only group, language, and condition as fixed effects; task would enter the formula in exactly the same way, as the R example further down shows.)
group_list = ['control','patient1','patient2']
language_list = ['English', 'German', 'French']
subs_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10']
task_list = ['task1', 'task2', 'task3']
condition_list = ['pre', 'post']
# read data into dataframe
df_full = pd.DataFrame(columns=["group", "language", "sub_id", "task", "condition", "my_value"])
my_row = 0
# unique subject-ID
sub_id = 0
for ind_g, group in enumerate(group_list):
    for ind_l, lan in enumerate(language_list):
        for sub in subs_list:
            sub_id = sub_id + 1
            for ind_t, task in enumerate(task_list):
                for ind_c, con in enumerate(condition_list):
                    # generate random value here as example
                    my_val = np.random.normal(ind_c + ind_t, 1, 1)[0]
                    df_full.loc[my_row] = [group, lan, sub_id, task, con, my_val]
                    my_row = my_row + 1
# conduct ANOVA using mixedlm
my_model_fit = smf.mixedlm("my_value ~ group * language * condition", df_full, groups=df_full["sub_id"]).fit()
# get random effects
my_model_fit.random_effects
# get fixed effects
my_model_fit.summary()
mydir/site-packages/statsmodels/regression/mixed_linear_model.py:2094: ConvergenceWarning: The MLE may be on the boundary of the parameter space.
  warnings.warn(msg, ConvergenceWarning)
Model:             MixedLM   Dependent Variable:   my_value
No. Observations:  540       Method:               REML
No. Groups:        90        Scale:                1.7224
Min. group size:   6         Likelihood:           -913.2042
Max. group size:   6         Converged:            Yes
Mean group size:   6.0
--------------------------------------------------------------
                    Coef.  Std.Err.    z    P>|z| [0.025 0.975]
--------------------------------------------------------------
Intercept           2.088     0.240  8.716  0.000  1.619  2.558
group[T.patient1]  -0.242     0.339 -0.713  0.476 -0.906  0.423
group[T.patient2]   0.031     0.339  0.092  0.927 -0.633  0.695
language[T.French] -0.073     0.339 -0.216  0.829 -0.737  0.591
language[T.German] -0.333     0.339 -0.982  0.326 -0.997  0.331
(...) table continues ...
Move from Python to R
As demonstrated above, most linear models can be successfully implemented in Python. The only limitation is that crossed random effects are not supported. If this is needed, or for other reasons, we might want to run our analysis in R instead. An easy way to convert between both frameworks is to write out the dataframe in csv format, which can be read by both Python and R. Here, we write out the data for two of the ANOVAs that we conducted above, to demonstrate that the results in R are exactly the same.
df_2way.to_csv('df_2way.csv', index=False)
df_full.to_csv('df_full.csv', index=False)
2-way ANOVA in R (between-subject factors)
We will start with the 2-way between-subjects ANOVA, which can be conducted with the lm function in R. We can also access the underlying model matrix and inspect it to verify that the same model is applied in both Python and R. Similar to Patsy, we can also change the coding scheme for the categorical factors.
# ! R code now, not Python!
library(readr)
# read in data from 2-way ANOVA with between-subject factors
df_2way <- read_csv("df_2way.csv")
# fit linear model and get parameter estimates
model_fit <- lm(my_value ~ group * language, df_2way)
# display anova table
anova(model_fit)
# display results of linear regression
summary(model_fit)
Parsed with column specification:
cols(
  group = col_character(),
  language = col_character(),
  my_value = col_double()
)
A anova: 4 × 5
                Df    Sum Sq    Mean Sq  F value       Pr(>F)
group            2 46.585744 23.2928720 27.45006 7.911130e-10
language         2 63.784731 31.8923655 37.58435 2.838481e-12
group:language   4  6.750387  1.6875967  1.98879 1.039990e-01
Residuals       81 68.732903  0.8485544       NA           NA
Call:
lm(formula = my_value ~ group * language, data = df_2way)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8680 -0.6640  0.1173  0.5952  2.5528

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                  -0.31684    0.29130  -1.088   0.2800
grouppatient1                 0.54840    0.41196   1.331   0.1869
grouppatient2                 2.04575    0.41196   4.966 3.73e-06 ***
languageFrench                2.14408    0.41196   5.205 1.44e-06 ***
languageGerman                0.83813    0.41196   2.034   0.0452 *
grouppatient1:languageFrench  0.64535    0.58260   1.108   0.2713
grouppatient2:languageFrench -0.89227    0.58260  -1.532   0.1295
grouppatient1:languageGerman  0.64412    0.58260   1.106   0.2722
grouppatient2:languageGerman  0.03082    0.58260   0.053   0.9579
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9212 on 81 degrees of freedom
Multiple R-squared:  0.6302,    Adjusted R-squared:  0.5937
F-statistic: 17.25 on 8 and 81 DF,  p-value: 1.065e-14
# access the underlying model matrix
my_glm <- model.matrix(model_fit)
# inspect the model matrix
image(t(my_glm))
# change coding scheme
model_fit <- lm(my_value ~ group * language, df_2way, contrasts = list(group = "contr.sum", language = "contr.sum"))
# display anova table
anova(model_fit)
# display results of linear regression
summary(model_fit)
A anova: 4 × 5
                Df    Sum Sq    Mean Sq  F value       Pr(>F)
group            2 46.585744 23.2928720 27.45006 7.911130e-10
language         2 63.784731 31.8923655 37.58435 2.838481e-12
group:language   4  6.750387  1.6875967  1.98879 1.039990e-01
Residuals       81 68.732903  0.8485544       NA           NA
Call:
lm(formula = my_value ~ group * language, data = df_2way, contrasts = list(group = "contr.sum",
    language = "contr.sum"))

Residuals:
    Min      1Q  Median      3Q     Max
-1.8680 -0.6640  0.1173  0.5952  2.5528

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.58950    0.09710  16.370  < 2e-16 ***
group1           -0.91227    0.13732  -6.643 3.26e-09 ***
group2            0.06595    0.13732   0.480   0.6323
language1        -1.04163    0.13732  -7.585 4.90e-11 ***
language2         1.02015    0.13732   7.429 9.91e-11 ***
group1:language1  0.04756    0.19420   0.245   0.8072
group2:language1 -0.38227    0.19420  -1.968   0.0524 .
group1:language2  0.12986    0.19420   0.669   0.5056
group2:language2  0.34539    0.19420   1.779   0.0791 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9212 on 81 degrees of freedom
Multiple R-squared:  0.6302,    Adjusted R-squared:  0.5937
F-statistic: 17.25 on 8 and 81 DF,  p-value: 1.065e-14
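Analogous to Patsy’s Treatment coding, the reference level of a factor can also be changed in R, for example with relevel (a small sketch):
# ! R code: use 'patient1' instead of 'control' as reference level for group
df_2way$group <- relevel(factor(df_2way$group), ref = "patient1")
model_fit_relevel <- lm(my_value ~ group * language, df_2way)
summary(model_fit_relevel)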
4-way ANOVA in R (between-subject and within-subject factors)
In a very similar fashion, we can perform an ANOVA that includes within-subject factors and random effects. In this case we use the lme4 package, together with lmerTest to obtain p-values for the model terms. A sketch of crossed random effects, which were our main reason for switching to R, follows at the end of this section.
library(lme4)
library(lmerTest)
library(readr)
# read in data from 4-way ANOVA with between-subject and within-subject factors
df_full <- read_csv("df_full.csv")
# get parameter estimates from a linear regression with random effects
my_model_fit <- lmer(my_value ~ group * language * task * condition + (1|sub_id), df_full)
# display results of linear regression
summary(my_model_fit)
Parsed with column specification:
cols(
  group = col_character(),
  language = col_character(),
  sub_id = col_double(),
  task = col_character(),
  condition = col_character(),
  my_value = col_double()
)
Correlation matrix not shown by default, as p = 54 > 12.
Use print(obj, correlation=TRUE) or vcov(obj) if you need it
Linear mixed model fit by REML. t-tests use Satterthwaite's method [lmerModLmerTest]
Formula: my_value ~ group * language * task * condition + (1 | sub_id)
   Data: df_full

REML criterion at convergence: 1507.2

Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.90506 -0.63492 -0.01991  0.63309  2.96099

Random effects:
 Groups   Name        Variance Std.Dev.
 sub_id   (Intercept) 0.008599 0.09273
 Residual             0.999202 0.99960
Number of obs: 540, groups:  sub_id, 90

Fixed effects:
                              Estimate Std. Error
(Intercept)                    0.99426    0.31746
grouppatient1                 -0.16761    0.44895
grouppatient2                  0.09103    0.44895
languageFrench                 0.33475    0.44895
languageGerman                -0.15014    0.44895
tasktask2                      1.54679    0.44704
tasktask3                      1.73569    0.44704
conditionpre                  -0.13461    0.44704
grouppatient1:languageFrench  -0.10274    0.63492
(...) table continues ...
# main and interaction effects
anova(my_model_fit)
# random effects
rand(my_model_fit)
A anova: 15 × 6
                     Sum Sq     Mean Sq NumDF    DenDF     F value       Pr(>F)
group             1.6603036   0.8301518     2  81.0407   0.8308147 4.393706e-01
language          1.9710372   0.9855186     2  81.0407   0.9863055 3.773836e-01
task            367.6811701 183.8405850     2 405.0506 183.9873818 1.426573e-57
condition       118.4001814 118.4001814     1 405.0506 118.4947240 2.220945e-24
group:language   10.8603366   2.7150841     4  81.0407   2.7172521 3.532210e-02
group:task        3.0990351   0.7747588     4 405.0506   0.7753774 5.416750e-01
language:task     2.0196072   0.5049018     4 405.0506   0.5053050 7.318662e-01
(...) table continues ...
A anova: 2 × 6
             npar    logLik      AIC       LRT Df Pr(>Chisq)
<none>         56 -753.6189 1619.238        NA NA         NA
(1 | sub_id)   55 -753.6621 1617.324 0.0862481  1  0.7690022
# access the underlying model matrix for the fixed effects
my_glm_fe <- model.matrix(my_model_fit)
# access the underlying model matrix for the random effects
my_glm_re <- getME(my_model_fit, "Zt")
# inspect matrices
image(t(my_glm_fe))
image(t(my_glm_re))
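Since the missing support for crossed random effects was our main reason to switch to R, here is a sketch of what the lme4 syntax for crossed random intercepts would look like, treating task as a second random factor (whether that is sensible depends on your design):
# crossed random intercepts for subject and task (not possible in statsmodels)
my_model_crossed <- lmer(my_value ~ group * language * condition + (1|sub_id) + (1|task), df_full)
summary(my_model_crossed)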
That’s it!
As a concluding remark, I would like to encourage you to build the model matrix for your ANOVA by hand and to compare it to the automatic matrix generation in Patsy using different coding schemes; a minimal example follows below. General linear models are a powerful statistical tool that is widely used in psychological and neuroimaging data analysis, so it’s worth wrapping your head around the underlying principles. Below, I post some links that I found useful when preparing this tutorial.
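As a starting point, here is a minimal sketch that builds the treatment-coded design matrix for the 1-way example by hand and verifies it against Patsy:
# build the 1-way design matrix by hand (treatment coding, 'control' as reference)
X_patsy = np.asarray(patsy.dmatrix("group", df_1way))
X_manual = np.column_stack([
    np.ones(len(df_1way)),                           # intercept column
    (df_1way['group'] == 'patient1').astype(float),  # dummy for patient1
    (df_1way['group'] == 'patient2').astype(float),  # dummy for patient2
])
print(np.allclose(X_patsy, X_manual))  # True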
Thanks for reading this post :-)
Nicole