Outliers in data analysis: keep them or remove them?

The methods used to manage outliers during data analysis are not always correctly applied.
In this post we will try to understand how to avoid biased results and loss of information in our experiments.

When should outliers be eliminated?

Outliers can be safely eliminated if you are sure that they come from mistakes in the collection and/or reporting of the data. For example, if you are dealing with the variable “age” and, after plotting your data, you find a 172-year-old subject, this value obviously cannot be used in the analysis.
What you can do is try to find the correct value. That “172 years” was plausibly meant to be either “17” or “72”.
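
A check of this kind is easy to automate before any modelling. The snippet below is only a minimal sketch: the ages and the plausibility bounds (0 to 120 years) are assumptions chosen for illustration, not values taken from a real dataset.

```python
# Hypothetical ages; 172 is the implausible entry described above.
ages = [23, 45, 31, 172, 67, 52]

# Assumed plausibility bounds for a human age in years (illustrative choice).
MIN_AGE, MAX_AGE = 0, 120


def flag_implausible(values, low, high):
    """Return (index, value) pairs that fall outside the closed range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


suspect = flag_implausible(ages, MIN_AGE, MAX_AGE)
print(suspect)  # [(3, 172)]
```

The flagged entries are candidates to be checked against the source, not values to be silently corrected.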

In my opinion, this is the only case in which the question of whether to use the outlier has a clear answer: NO.

In all other cases you need to explore the data by asking precise questions:
Does the outlier affect the assumptions or the type of the analysis I am about to conduct?
Does the outlier create a statistical association that would not appear without it?
Let’s look at the example below.

There is a hypothetical outlier on the bottom right.

In this case it is evident that, excluding the outlier, the data could be modelled by a simple straight line with a negative slope (red dotted line). If we include the outlier in the analysis and keep the assumption of linearity, the outlier clearly produces a highly biased result (green dotted line): the fitted line suggests that Y decreases much more steeply as X increases.

To confirm this, you can run a leverage analysis of the data, an analysis that shows how much the parameters estimated by the model depend on the hypothetical outlier (maybe I’ll write a dedicated article on this).
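
As a rough illustration of what such an analysis computes, here is a minimal sketch with NumPy. The data are invented to mimic the situation described above (nine points on a gentle downward trend plus one point far to the bottom right). The 2p/n cut-off for leverage and the use of Cook's distance as the influence measure are common conventions that I am assuming here, not the only possible choices.

```python
import numpy as np

# Hypothetical data: nine points on a gentle downward trend plus one
# point far to the right and far below (the suspected outlier, index 9).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([10.0, 9.6, 9.1, 8.7, 8.2, 7.9, 7.4, 7.0, 6.5, -5.0])

# Design matrix of the straight-line model: intercept column plus x.
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule of thumb (an assumption, not a strict test): leverage above 2p/n is high.
threshold = 2 * p / n
flagged = np.where(leverage > threshold)[0]

# Cook's distance combines leverage and residual size into one influence score.
resid = y - H @ y
s2 = resid @ resid / (n - p)
cooks = resid**2 / (p * s2) * leverage / (1 - leverage) ** 2

print("high-leverage indices:", flagged)
print("most influential index:", int(np.argmax(cooks)))
```

Leverage only says that a point sits far from the others along X; Cook's distance adds how strongly the fitted parameters move when that point is dropped.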

The strategy is: analyze your data with the linear model excluding the outlier, then analyze the complete data again (that is, keeping the outlier) using another model. In our example, an exponential model would almost certainly do a good job, as indicated by the blue dotted line in the figure below.
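
The two-step strategy can be sketched as follows. Everything in this example is an assumption made for illustration: the data are synthetic, and I picked the functional form y = a - b·exp(c·x) because it bends downward the way the figure suggests. The rate c is chosen from a grid and a, b are obtained by ordinary least squares for each candidate c, which avoids a nonlinear optimizer.

```python
import numpy as np

# Synthetic data (hypothetical): an accelerating decline plus small noise.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
noise = np.array([0.05, -0.04, 0.03, -0.02, 0.04, -0.05, 0.02, -0.03, 0.01, 0.0])
y = 10.0 - 0.5 * np.exp(0.17 * x) + noise


def sse(y_obs, y_fit):
    """Sum of squared errors between observed and fitted values."""
    return float(np.sum((y_obs - y_fit) ** 2))


# Step 1: straight line fitted WITHOUT the suspected outlier (x = 20).
mask = x < 15
slope_wo, intercept_wo = np.polyfit(x[mask], y[mask], 1)

# For comparison: straight line fitted on ALL the data.
slope_all, intercept_all = np.polyfit(x, y, 1)
sse_lin_all = sse(y, intercept_all + slope_all * x)


# Step 2: exponential model y = a - b * exp(c * x) fitted on ALL the data.
def fit_exponential(x, y, c_grid):
    """Return (sse, a, b, c) for the best rate c found on the grid."""
    best = None
    for c in c_grid:
        A = np.column_stack([np.ones_like(x), -np.exp(c * x)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        err = sse(y, A @ coef)
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], c)
    return best


sse_exp, a, b, c = fit_exponential(x, y, np.linspace(0.01, 0.5, 50))

print(f"line without outlier: slope = {slope_wo:.3f}")
print(f"line with outlier:    slope = {slope_all:.3f}, SSE = {sse_lin_all:.3f}")
print(f"exponential, all data: c = {c:.2f}, SSE = {sse_exp:.3f}")
```

If the alternative model fits the complete data much better than the straight line does, the extreme point is probably part of the phenomenon rather than noise.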

In some cases, it is even more evident that the statistical significance of the test you used is driven by the presence of the outlier, as you can see in the figure below.

It is a modification of the case above, where the green line shows a milder, but almost certainly present, statistical association.
In this last example it is evident that both the slope of the line and the presence of an association are due to the outlier.
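
A quick way to check whether an association depends on a single point is to fit the same model with and without it. The sketch below uses `scipy.stats.linregress` on invented data (a flat cloud of nine points plus one point on the bottom right); the numbers are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical data: Y is essentially flat for the first nine points;
# the last point (index 9) is the suspected outlier on the bottom right.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = np.array([5.1, 4.8, 5.3, 4.9, 5.2, 4.7, 5.0, 5.2, 4.8, 2.0])

# Boolean mask that drops only the suspected outlier.
keep = np.arange(len(x)) != 9

fit_all = stats.linregress(x, y)
fit_wo = stats.linregress(x[keep], y[keep])

for label, fit in [("with outlier", fit_all), ("without outlier", fit_wo)]:
    print(f"{label}: slope={fit.slope:.3f}, r={fit.rvalue:.3f}, p={fit.pvalue:.4f}")
```

With these made-up numbers the association is significant at the usual 0.05 level only when the outlier is included, which is the pattern described above.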

The strategy in this case is to remove the outlier without modelling the data in another way (there is nothing to model: it is evident that the value of Y does not change as X varies).

Even in this case, however, it is good practice to report the presence of this outlier in the results of the study or in the supplementary materials.
