When fitting a linear regression model, is it necessary to have normally distributed variables?
One of the most common questions asked by researchers who want to analyse their data with a linear regression model is: must the variables, both the dependent variable and the predictors, be normally distributed for the model to be correct? And if they are not, should they be normalized through a transformation, for example a logarithmic one?
The answer is no: the estimation method used in linear regression, ordinary least squares (OLS), does not require the variables to be normally distributed.
So, if you see that a variable is not normally distributed, don't be upset and go ahead: there is no point in trying to normalize everything.
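To make this concrete, here is a minimal simulation sketch using only the Python standard library. The data are made up for illustration: an exponentially distributed (strongly skewed) predictor, a true relationship y = 2 + 3x, and normal errors. The helper names `skewness` and `fit_simple_ols` are hypothetical, not part of any library.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def fit_simple_ols(x, y):
    """Closed-form OLS estimates for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Assumed, simulated data: skewed predictor, normal errors.
x = [rng.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = fit_simple_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("skewness of x:        ", round(skewness(x), 2))
print("skewness of y:        ", round(skewness(y), 2))
print("skewness of residuals:", round(skewness(residuals), 2))
print("estimated slope:      ", round(slope, 2))
```

In a run like this, both x and y should come out clearly skewed, while the residuals stay roughly symmetric and the slope lands close to the true value of 3.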
The only normality check you need to perform, after fitting your regression model, is on the residuals (i.e. the differences between the observed values in the dataset and the values estimated by the regression).
How can you do it? Have your statistical program calculate the residuals and show their distribution, as it would for any other variable.
Alternatively, you can build an ad hoc plot to assess normality (for example a Q-Q plot).
If, on the other hand, you want a more objective decision criterion that does not rely on interpreting a plot, you can run a normality test (for example the Shapiro-Wilk test).
Many statistical programs do almost all of this by default (for example SAS, via PROC REG).
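If your software does not do this for you, the Q-Q comparison is simple to sketch by hand. The example below uses only the Python standard library; `qq_points` and `qq_correlation` are hypothetical helper names, the plotting positions (i + 0.5) / n are one common convention among several, and the two samples are simulated stand-ins for residuals. The correlation between the sorted residuals and the theoretical normal quantiles is used here as a rough numeric summary of how straight the Q-Q line is; it is not a formal test.

```python
import random
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Return (theoretical normal quantile, standardized residual) pairs."""
    n = len(residuals)
    centre = mean(residuals)
    spread = pstdev(residuals)
    observed = sorted((r - centre) / spread for r in residuals)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep the probabilities strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))


def qq_correlation(residuals):
    """Pearson correlation between the two axes of the Q-Q plot."""
    pairs = qq_points(residuals)
    t_mean = mean(t for t, _ in pairs)
    o_mean = mean(o for _, o in pairs)
    cov = sum((t - t_mean) * (o - o_mean) for t, o in pairs)
    t_ss = sum((t - t_mean) ** 2 for t, _ in pairs)
    o_ss = sum((o - o_mean) ** 2 for _, o in pairs)
    return cov / (t_ss * o_ss) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated stand-ins for residuals: one normal sample, one skewed sample.
normal_like = [rng.gauss(0.0, 1.0) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
print("Q-Q correlation, normal residuals:", round(r_normal, 4))
print("Q-Q correlation, skewed residuals:", round(r_skewed, 4))
```

A value very close to 1 means the points lie almost on a straight line. If SciPy is installed, `scipy.stats.shapiro(residuals)` returns the Shapiro-Wilk statistic and its p-value, and `scipy.stats.probplot` computes the Q-Q points for you.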
However, there are at least a couple of reasons why you should take a look at the distribution of your “Y” and of all your “X”:
- highly skewed variables make it more likely that the residuals will, in turn, be non-normal;
- variables with heavy tails or many outliers may call for an analysis of leverage (i.e. how much these extreme points affect the estimates of the regression coefficients).
Thus, for very skewed variables it might be a good idea to transform the data to reduce these harmful effects.
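As a rough illustration of both points, the sketch below computes the leverage of each observation for a model with an intercept and a single predictor, before and after a log transformation. It assumes a simulated log-normal predictor, and `leverages` is a hypothetical helper name. For this one-predictor case the leverage has the closed form h_i = 1/n + (x_i - mean)^2 / Sxx.

```python
import math
import random


def leverages(x):
    """Leverage of each point in a model with an intercept and one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


rng = random.Random(2)  # fixed seed so the run is repeatable

# Assumed, simulated predictor with a long right tail.
x_raw = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
x_log = [math.log(xi) for xi in x_raw]

h_raw = leverages(x_raw)
h_log = leverages(x_log)

print("largest leverage, raw predictor:", round(max(h_raw), 3))
print("largest leverage, log predictor:", round(max(h_log), 3))
print("sum of leverages (raw):         ", round(sum(h_raw), 3))
```

The leverages always sum to the number of estimated coefficients (two here), so a few extreme points on the raw scale take a disproportionate share. A common rule of thumb flags points whose leverage exceeds twice the average, that is 2p/n with p coefficients. After the log transformation the largest leverage should be noticeably smaller.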
In summary: it is a good habit to graphically check the distributions of all variables, both dependent and independent.
If some of them are slightly skewed, keep them as they are.
Highly skewed variables, on the other hand, are usually worth transforming before fitting the model.
After fitting the model, make sure that the residuals are approximately normally distributed, to confirm that the model is technically sound.
I assume that you will then also check the other assumptions of linear regression: linearity, homoscedasticity, and the absence of autocorrelation and multicollinearity.
Before concluding, one more thing to note: the normality of the residuals matters less for the estimates of the regression coefficients themselves, which OLS produces without any normality assumption, than for the confidence intervals and p-values in your output (which is what interests you in most cases), especially when the sample is small.