Outliers are the weird ones in a data set. Their values are way off from the rest of the sample. They can really ruin your analysis, especially if you are using methods that are sensitive to extreme values.
Given this, many analysts are inclined to remove these observations. While this may be convenient, it may end up yielding false claims.
How exactly do we deal with these troublemakers?
Understanding outliers
For starters, we have to identify why these values occur in the first place. Some outliers are produced by human or measurement errors, and these are candidates for correction or omission.
On the other hand, there are cases where these aberrations are just true observations from the data set.
Consider a data set containing the net worth of US citizens. The net worth of Bill Gates would not be a measurement error; it is simply an observation of a real yet rare event.
Given this, once you’ve identified potential outliers, check whether they are mere errors which you can omit or correct.
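One way to shortlist potential outliers before checking them by hand is sketched below. It assumes Tukey's 1.5 × IQR fences, which is a common convention and not the only option. The function name `flag_potential_outliers` and the net-worth figures are hypothetical and made up for illustration. The code only flags values; deciding whether a flagged value is an error is still up to you.

```python
import statistics


def flag_potential_outliers(values, k=1.5):
    """Return the values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # Cut points that split the sorted data into four equal parts.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical net-worth figures in thousands of dollars, invented for illustration.
net_worth = [45, 60, 72, 80, 95, 110, 130, 150, 100_000_000]
suspects = flag_potential_outliers(net_worth)
print(suspects)
```

A flagged value is only a suspect. In this made-up sample the extreme figure could be a typo or a real billionaire, and only checking the source of the record can tell you which.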
To keep or not to keep
Now, if the observation turns out to be unusual yet true, you have to assess whether retaining or omitting that data point will benefit your analysis.
There really is no quick fix for outliers. It heavily depends on the context of your analysis as well as the needs of the problem at hand. However, here are some measures you might want to consider:
- Assess the importance of the outlier. Some outliers are produced by events that arise from peculiar conditions. For instance, a decline in a company’s stock value may be due to a controversy that we do not expect to happen regularly. In this case, omitting the outlier may be reasonable.
On the other hand, some extreme values are better left in the data set. For instance, an unusually high earthquake magnitude in a time series should be retained since it could potentially occur again. This also allows such a damaging event to be taken into account in the decisions that the analysis may inform.
- Consider data transformations. There are instances where a proper transformation negates the impact of the outlying value or reduces it to a negligible level. For right-skewed data, a log transform is a common choice. Trying some out might just do the trick.
- Consider reporting both cases. If you are not sure whether omission or retention is the way to go, you may report both the case where the outliers are retained and the case where they are omitted. This way, you retain the insights coming from both states. Doing this may also help you decide whether to omit or retain the outlying values.
These are some soft guidelines you can consider. Again, these are NOT strict rules that you should follow all the time. Dealing with outliers is highly context-dependent. Data analysis is not a straight road; it is an art.
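The transformation and report-both-cases suggestions can be tried out with a few lines of code. The sketch below is only an illustration under stated assumptions: the `sample` values are invented, `summarize` is a hypothetical helper, and the natural log is just one of many possible transformations.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration; 400.0 is the extreme value.
sample = [12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 400.0]
EXTREME = 400.0


def summarize(values):
    """Mean, median, and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


# Transformation: average on the natural-log scale, then convert back.
# The result is the geometric mean, which the extreme value pulls far less.
log_values = [math.log(v) for v in sample]
back_transformed_mean = math.exp(statistics.mean(log_values))

# Reporting both cases: the same summary with and without the extreme value.
with_outlier = summarize(sample)
without_outlier = summarize([v for v in sample if v != EXTREME])

print("mean, raw scale:        ", round(with_outlier["mean"], 2))
print("mean, via log scale:    ", round(back_transformed_mean, 2))
print("summary with outlier:   ", with_outlier)
print("summary without outlier:", without_outlier)
```

Putting the two summaries side by side shows how much a single value drives the mean and the spread, while the median barely moves. Whether that justifies dropping the value still depends on the context of your analysis.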
Weird is NOT wrong
To wrap things up, we see that outliers can provide helpful insights that typical values may not. Therefore, we should not see these extreme values as a mere nuisance.
Instead, we should examine why these values occur. Doing this will also give us the best way to deal with outliers. Let the data speak to you.
Removing the weirdos is not always the way to go. Trying to understand them might help you out more than you think.