## Some Initial Thoughts on Using Least Median of Absolute Deviation in My Data Mining to Reduce Problems with Outliers

A while back I wrote a “cry for help” on this blog about some different forms of linear regression. Since it is a fairly deep topic in statistics, and most of my friends and colleagues are not uber statistics nerds, I didn’t really get any replies. But I have persevered and continued to dive in on my own because, as Khan Academy puts it, struggling with ideas improves the brain like lifting weights.

After looking at some real data that my initial data mining attempts found (using World Factbook data), I have realized that in order to avoid Type I errors (false positives) due to outliers, I need to use a form of “robust regression”. Standard “least squares” regression neatly has a formula that gives the line minimizing the sum of squared residuals (which Khan Academy also explains well), but because every residual is squared, any outlier has a far greater impact on the results than it should. Least Absolute Deviation (LAD), which is more intuitive, also has issues with outliers, just not as much.
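
To make the outlier problem concrete, here is a minimal sketch using made-up numbers (not the World Factbook data). The function name `ols_fit` and the data are assumptions for illustration only. It fits an ordinary least-squares line to points lying exactly on y = 2x + 1, then fits again after one point is replaced by a gross outlier.

```python
def ols_fit(xs, ys):
    """Closed-form least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1
xs = list(range(10))
clean_ys = [2 * x + 1 for x in xs]

# Same data, but the last point is replaced by a gross outlier
dirty_ys = clean_ys[:-1] + [100]

clean_slope, clean_intercept = ols_fit(xs, clean_ys)
dirty_slope, dirty_intercept = ols_fit(xs, dirty_ys)

print("clean fit:", clean_slope, clean_intercept)
print("fit with one outlier:", dirty_slope, dirty_intercept)
```

A single bad point out of ten is enough to pull the fitted slope well away from 2, which is the kind of distortion that can produce a false positive.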

Another commonly used method is “Least Median of Squares” (LMS). It squares all the residuals (the distances between the estimated line of best fit and each of the real data points), takes the median of these squared values (the very middle one), and chooses the line that makes this median as small as possible. Because the median does not care how extreme the largest residuals are, a minority of outliers has little pull on the fit. But I have been wondering whether Least Median of Squares was invented to stick with the same paradigm as traditional regression, and maybe isn’t the most efficient method.
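
A rough sketch of the LMS idea follows. Unlike least squares, LMS has no simple closed-form formula, so this version uses a brute-force approximation: it tries the line through every pair of data points and keeps the one with the smallest median squared residual. The function name `lms_fit` and the data are hypothetical, and checking every pair is only practical for small data sets.

```python
from itertools import combinations
from statistics import median


def lms_fit(xs, ys):
    """Approximate LMS fit: search the lines through all pairs of points.

    Returns (median_squared_residual, slope, intercept) for the best
    candidate, or None if no non-vertical candidate line exists.
    """
    points = list(zip(xs, ys))
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, cannot be written as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        score = median((y - (slope * x + intercept)) ** 2 for x, y in points)
        if best is None or score < best[0]:
            best = (score, slope, intercept)
    return best


# Hypothetical data: y = 2x + 1 with the last point replaced by an outlier
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[-1] = 100

score, slope, intercept = lms_fit(xs, ys)
print("LMS fit:", slope, intercept, "median squared residual:", score)
```

On this toy data the search recovers the line through the nine clean points, because the single outlier cannot move the median residual.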

Instead, maybe it is better to find the Least Median of Absolute Deviation (LMAD), because using the Median Absolute Deviation (MAD) is more common than finding a Median of Squares for univariate data. More importantly, I think it might require less computational power: I suspect taking the absolute value of a number usually just requires changing one bit in a variable (at least for floating-point numbers, where the sign is stored in its own bit), while squaring requires more operations. So if the results from LMAD are not provably worse than the results from LMS, then it would seem that LMAD is likely the better way to go.
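
One observation that may bear on this comparison: squaring is a monotone transformation of non-negative numbers, so when the median is a single middle value (an odd number of points), the median of the squared residuals is just the square of the median absolute residual. Under that assumption the two criteria should rank candidate lines identically. The sketch below checks this on hypothetical data with randomly generated candidate lines. The helper names, the data, and the candidate ranges are all assumptions for illustration, and with an even number of points (where the median averages two values) the equivalence is not guaranteed.

```python
import random
from statistics import median


def median_sq(residuals):
    """LMS criterion: median of the squared residuals."""
    return median(r * r for r in residuals)


def median_abs(residuals):
    """LMAD criterion: median of the absolute residuals."""
    return median(abs(r) for r in residuals)


random.seed(0)

# Hypothetical data: 11 points (odd count) near y = 2x + 1, plus one outlier
xs = list(range(11))
ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]
ys[-1] = 100

# Randomly generated candidate (slope, intercept) pairs
candidates = [(random.uniform(0, 5), random.uniform(-5, 5)) for _ in range(200)]


def residuals(slope, intercept):
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


best_by_sq = min(candidates, key=lambda c: median_sq(residuals(*c)))
best_by_abs = min(candidates, key=lambda c: median_abs(residuals(*c)))

print("best by median of squares:      ", best_by_sq)
print("best by median of abs deviation:", best_by_abs)
```

If both criteria pick the same candidate, as the monotonicity argument suggests they should, then any difference between LMAD and LMS would come down to computational cost rather than the quality of the fit.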

That said, I’m going to see if some professional statisticians, who know a whole hell of a lot more than me, can tell me if there is something I’m missing in my thought process. And who knows, maybe one of them will be interested in coauthoring an academic article on the topic.

#### Post Revisions:

This post has not been revised since publication.

Jacob, the least median method looks interesting. I’ve never tried it, but will keep it in mind for some financial forecasting that I do with statistics.

To deal with outliers, I sometimes remove them based on standard deviation from the mean. I’m sure you know that, for normally distributed data, one SD on either side of the mean covers about 68% of your data points, two SD covers about 95%, and three SD covers about 99.7%. Obviously, those three choices do not always fit what is needed. For example, what if I want to eliminate 10% of the data as outliers? That poses the question: what is the SD for a 90% inclusion range? To answer this, I sometimes use a table that shows SD in 5% increments, which is often helpful.

| Percentage of distribution | Std Dev |
| --- | --- |
| +/- 99.70% | 3 |
| +/- 95.45% | 2 |
| +/- 95% | 1.960 |
| +/- 90% | 1.645 |
| +/- 85% | 1.440 |
| +/- 80% | 1.280 |
| +/- 75% | 1.150 |
| +/- 70% | 1.036 |
| +/- 68.27% | 1 |
| +/- 65% | .935 |
| +/- 60% | .841 |
| +/- 55% | .755 |
| +/- 50% | .675 |
| +/- 45% | .598 |
| +/- 40% | .524 |
| +/- 35% | .452 |
| +/- 30% | .386 |
| +/- 25% | .320 |
| +/- 20% | .254 |
| +/- 15% | .188 |
| +/- 10% | .126 |
| +/- 5% | .0627 |
| +/- 2.5% | .0313 |
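
Multipliers like these can also be computed directly rather than looked up, assuming the data are normally distributed. The sketch below uses the inverse normal CDF from Python's standard library (`statistics.NormalDist`, available in Python 3.8 and later). The function names `z_for_coverage` and `trim_outliers` and the sample data are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist, mean, stdev


def z_for_coverage(coverage):
    """Number of SDs either side of the mean that covers `coverage` (0 to 1)
    of a normal distribution, e.g. 0.90 for a 90% inclusion range."""
    return NormalDist().inv_cdf((1 + coverage) / 2)


def trim_outliers(data, coverage):
    """Keep only the points within the SD range implied by `coverage`.

    The mean and SD are computed from the untrimmed data, so a large
    outlier inflates both of them.
    """
    center = mean(data)
    limit = z_for_coverage(coverage) * stdev(data)
    return [x for x in data if abs(x - center) <= limit]


for pct in (0.95, 0.90, 0.75, 0.50):
    print(f"+/- {pct:.0%}: {z_for_coverage(pct):.3f} SD")

# Hypothetical sample with one obvious outlier
sample = [10, 11, 9, 10, 12, 8, 10, 11, 9, 50]
print(trim_outliers(sample, 0.95))
```

Because the mean and SD are themselves pulled around by the outliers, this kind of trimming can miss points when several outliers sit close together.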

However, I am careful about arbitrarily eliminating outliers since an outlier is often a data point at which something unusual but very real happened. Usually outliers don’t just happen randomly. But, you can still make the case for eliminating them if you are trying to forecast what is NORMAL rather than unusual. In finance, we call outliers nonrecurring items.

The Big Bopper, 29 Oct 15 at 3:44 pm

[…] various f curves of best fits with various underlying error distributions. (As I have realized my idea of removing “outliers” isn’t really solving the problem, because they are often there for a “reason”) […]

Non-Linear “Regression” using a Pattern Recognition Algorithm combined with Monte Carlo Simulations at Jacob J. Walker's Blog, 13 Nov 15 at 12:00 pm