Dr proctor

Reputation: 197

Python statsmodel robust linear regression (RLM) outlier selection

I'm analyzing a set of data and need to find a regression for it. The number of data points is low (~15), so I decided to use robust linear regression for the job. The problem is that the procedure is selecting some points as outliers that do not seem to be that influential. Here is a scatter plot of the data (X vs Y), with each point's influence used as the marker size.

Points B and C (circled in red in the figure) are selected as outliers, while point A, which has much higher influence, is not. Although point A does not change the general trend of the regression, it essentially defines the slope along with the point with the highest X, whereas points B and C only affect the significance of the slope. So my question has two parts: 1) What method does the RLM package use for selecting outliers, given that the most influential point is not selected, and do you know of other packages whose outlier selection works the way I have in mind? 2) Do you think that point A is an outlier?

Upvotes: 2

Views: 3520

Answers (1)

Josef

Reputation: 22897

RLM in statsmodels is limited to M-estimators. The default Huber norm is only robust to outliers in y, but not in x, i.e. not robust to bad influential points.

See for example http://www.statsmodels.org/devel/examples/notebooks/generated/robust_models_1.html line In [51] and after.

Redescending norms like bisquare are able to reject bad influential points, but the solution is a local optimum and needs appropriate starting values. High-breakdown methods that are robust to x-outliers, like LTS, are currently not available in statsmodels nor, AFAIK, anywhere else in Python. R has a more extensive suite of robust estimators that can handle these cases. Some extensions that add more methods and models to statsmodels.robust are in currently stalled pull requests.

In general and to answer the second part of the question:

In specific cases it is often difficult to declare or identify an observation as an outlier. Very often researchers use robust methods to flag outlier candidates that need further investigation. One reason, for example, could be that the "outliers" were sampled from a different population. A purely mechanical, statistical identification might not be appropriate in many cases.

In this example: If we fit a steep slope and drop point A as an outlier, then points B and C might fit reasonably well and would not be identified as outliers. On the other hand, if A is a reasonable point based on extra information, then maybe the relationship is nonlinear. My guess is that LTS would declare A as the only outlier and fit a steep regression line.

Upvotes: 3
