The difference between regression analysis and analysis of variance (ANOVA) is one of the most frequent dilemmas among students and researchers. In this post we try to understand what this difference is and which of the two techniques should be preferred.
From the mathematical point of view, linear regression and ANOVA are identical: both break down the total variance of the data into different “portions” and compare these “sub-variances” by means of an F test. In both techniques the dependent variable is continuous; however, in an ANOVA the independent variables can only be categorical, while a regression can use both categorical and continuous independent variables. Thus, ANOVA can be considered a special case of linear regression in which all predictors are categorical.
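This equivalence is easy to check numerically. Below is a minimal sketch with simulated data (the group sizes, means, and seed are all made up for illustration): the F statistic from `scipy.stats.f_oneway` matches the one computed from a least-squares fit on dummy-coded group indicators.

```python
# Check that a one-way ANOVA and a regression on dummy-coded group
# indicators produce the same F statistic (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(140, 10, 30)
g2 = rng.normal(150, 10, 30)
g3 = rng.normal(105, 10, 30)

# One-way ANOVA: a single F statistic and p-value.
f_anova, p_anova = stats.f_oneway(g1, g2, g3)

# Same model as a regression: y = b0 + b1*I(group 1) + b2*I(group 2),
# with group 3 acting as the reference category.
y = np.concatenate([g1, g2, g3])
n = len(g1)
d1 = np.r_[np.ones(n), np.zeros(2 * n)]            # dummy for group 1
d2 = np.r_[np.zeros(n), np.ones(n), np.zeros(n)]   # dummy for group 2
X = np.column_stack([np.ones(3 * n), d1, d2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = resid @ resid                       # residual (within-group) SS
sst = ((y - y.mean()) ** 2).sum()         # total SS
df_model, df_resid = 2, len(y) - 3
f_reg = ((sst - sse) / df_model) / (sse / df_resid)

print(f_anova, f_reg)  # identical up to floating-point error
```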
What actually distinguishes linear regression from ANOVA is the way the results are reported by common statistical software.
Let’s look at an example.
Say you have 3 groups: medical students, engineering students and communication science students. Suppose that for each of the three groups you measure the continuous variable “heart beat before exams” (obviously, a made-up variable).
Let’s make measurements and say that you get the following means:
medical students: 140.3 beats per minute;
engineering students: 150.7 beats per minute;
communication science students: 105 beats per minute.
Now let’s do both the ANOVA and the linear regression.
In the regression analysis, assuming that your statistical software treats the category “communication sciences” as the reference, you will get the following coefficients:
medical students: 35.3
engineering students: 45.7
The intercept: 105.
The regression model provides you with two p-values: one for medical students and another for engineering students. Each p-value tests the null hypothesis that the corresponding coefficient is zero, i.e. that the difference with the mean of the reference category (communication science students) is equal to zero.
Say you get p-values <0.05.
Let’s perform the ANOVA now. ANOVA provides you with a single p-value that tests the null hypothesis “the three means are equal” (or, in other words, “the three samples come from the same population”).
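The ANOVA view in code, a brief sketch with made-up samples (the numbers are hypothetical, chosen so the group means sit near the ones above): `scipy.stats.f_oneway` returns one F statistic and one overall p-value, not one per group.

```python
# One-way ANOVA: a single p-value for "all three group means are equal"
# (toy, hypothetical samples).
from scipy import stats

medical = [138, 142, 141, 139, 143]
engineering = [149, 152, 151, 150, 151]
communication = [104, 106, 105, 103, 107]

f_stat, p_value = stats.f_oneway(medical, engineering, communication)
print(f_stat, p_value)  # one F, one p-value for the overall null hypothesis
```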
All of the above shows the same thing: the intercept of the regression model (105) is the mean of the reference category (“communication science students”).
The two coefficients are nothing more than differences from the reference category. For example, the coefficient of the category “medical students”, 35.3, is simply the increase in mean heart beat compared with the reference category “communication science students”. Therefore 105 (reference) + 35.3 (medical students coefficient) = 140.3 beats per minute, the mean heart rate of future doctors before their exams.
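This correspondence can be verified numerically. A minimal sketch with hypothetical samples (means close to, but not exactly, the numbers above): fitting the dummy-coded model by least squares, the intercept equals the reference-group mean and intercept + coefficient recovers each group mean.

```python
# With dummy coding, the intercept is the reference-group mean and each
# coefficient is the difference from it (hypothetical samples).
import numpy as np

medical = np.array([138., 142., 141., 139., 143.])
engineering = np.array([149., 152., 151., 150., 151.])
communication = np.array([104., 106., 105., 103., 107.])  # reference group

y = np.concatenate([medical, engineering, communication])
d_med = np.r_[np.ones(5), np.zeros(10)]            # dummy: medical
d_eng = np.r_[np.zeros(5), np.ones(5), np.zeros(5)]  # dummy: engineering
X = np.column_stack([np.ones(15), d_med, d_eng])

intercept, b_med, b_eng = np.linalg.lstsq(X, y, rcond=None)[0]

print(intercept)          # equals communication.mean()
print(intercept + b_med)  # equals medical.mean()
print(intercept + b_eng)  # equals engineering.mean()
```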
As you can see, the only difference you can observe is the way in which the results and their conclusions are reported.
Now the question is: by what criterion should you choose between regression and ANOVA?
There is no specific criterion actually.
Obviously, if your research includes a continuous predictor (for example age), you are forced to use a linear regression.
I personally prefer using a regression model, for two reasons:
First, ANOVA only gives you an “overall” effect: it tells you whether the means are equal, but when they are not, it doesn’t tell you which of them differ. The regression model, with a p-value for each coefficient, immediately tells you which means differ from the reference one.
The second reason is that the regression model provides the “estimate of the effect”, i.e. the difference between two averages with the 95% confidence interval.
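An effect estimate of this kind can be computed by hand from the regression fit. A sketch with hypothetical samples (the same kind of toy data as above), estimating the “medical vs communication” difference and its 95% confidence interval from the usual least-squares formulas:

```python
# Effect estimate with a 95% CI: the "medical vs communication" difference
# from a dummy-coded regression fit (hypothetical samples).
import numpy as np
from scipy import stats

medical = np.array([138., 142., 141., 139., 143.])
engineering = np.array([149., 152., 151., 150., 151.])
communication = np.array([104., 106., 105., 103., 107.])  # reference

y = np.concatenate([medical, engineering, communication])
d_med = np.r_[np.ones(5), np.zeros(10)]
d_eng = np.r_[np.zeros(5), np.ones(5), np.zeros(5)]
X = np.column_stack([np.ones(15), d_med, d_eng])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = len(y) - X.shape[1]                  # residual degrees of freedom
sigma2 = resid @ resid / df               # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the estimates
se = np.sqrt(cov[1, 1])                   # std. error of the medical coef.
t_crit = stats.t.ppf(0.975, df)
lo, hi = beta[1] - t_crit * se, beta[1] + t_crit * se

print(beta[1], (lo, hi))  # estimated difference and its 95% CI
```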
Though, to be fair, this information can also be obtained from an ANOVA (for example, through post-hoc comparisons).
It’s a matter of taste.