bachts


Using scikit-learn IsolationForest with multiple columns of a pandas rolling object

My problem is as follows: I have a time series dataframe like so:

time   value1   value2   
1      random value here
2
3
4
5

I want to run IsolationForest for outlier detection on a rolling basis over columns value1 and value2 together. My approach so far has been to write a custom function and run it with rolling:

from sklearn.ensemble import IsolationForest
def outlier(data):
    clf = IsolationForest()
    prediction = clf.fit_predict(data[['value1', 'value2']].values)
    return prediction[-1]

and then calling apply on the data itself:

data.rolling(window='2D', on='time').apply(outlier)

This raises KeyError: "None of [Index(['value1', 'value2'], dtype='object', name='time')] are in the [index]", which seems to be because rolling only passes one column at a time to the function.
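A quick way to check that reading is to pass a probe function that only records what it receives. This is a minimal sketch on made-up data; the frame, the `probe` function, and the `seen` list are illustrative and not part of my real code:

```python
import pandas as pd

# Tiny made-up frame with the same column layout as in the question.
df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=4, freq="D"),
        "value1": [1.0, 2.0, 3.0, 4.0],
        "value2": [10.0, 20.0, 30.0, 40.0],
    }
)

seen = []  # (type name, number of dimensions) of every object passed to probe


def probe(window):
    # Record what rolling.apply hands to the function, then return a dummy float.
    seen.append((type(window).__name__, window.ndim))
    return 0.0


df.rolling(window="2D", on="time").apply(probe)
print(set(seen))
```

Every call receives a one-dimensional Series indexed by time (one column's window), so selecting `['value1', 'value2']` inside the function looks those labels up in the time index and fails.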

I then tried setting method='table', which raises ValueError: Data must be 1-dimensional, got ndarray of shape (1000, 2) instead. Apparently method='table' only works when the engine is numba, so I also set that in the apply call:

data.rolling(window='2D', on='time', method='table').apply(outlier, engine='numba', raw=True)

Now the problem is that scikit-learn is not compatible with numba, and I get the error message Untyped global name 'IsolationForest': Cannot determine Numba type of <class 'abc.ABCMeta'>. So I'm wondering: is there a way to run a scikit-learn model with rolling functions, either without using the 'table' method or by somehow making it compatible with numba?
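One possible workaround, sketched here under assumptions rather than offered as a confirmed solution, is to skip rolling.apply entirely and build the trailing time windows by hand, fitting a fresh IsolationForest on each one. The function name `rolling_outlier`, the `min_rows` threshold, the fixed `random_state`, and the demo data are all hypothetical choices; the sketch also assumes `time` is a timezone-naive datetime column:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def rolling_outlier(data, window="2D", min_rows=10, random_state=0):
    """Label each row -1 (outlier) or 1 (inlier) using only its trailing time window."""
    data = data.sort_values("time")
    times = data["time"].to_numpy()  # assumed timezone-naive datetime64 values
    features = data[["value1", "value2"]].to_numpy()

    # Position of the first row inside each trailing window (t - window, t],
    # which mirrors the default closed='right' of a time-based rolling window.
    lower = times - pd.Timedelta(window).to_timedelta64()
    starts = np.searchsorted(times, lower, side="right")

    labels = np.full(len(data), np.nan)
    for i, start in enumerate(starts):
        block = features[start : i + 1]
        if len(block) < min_rows:
            continue  # too few rows to fit a model; leave NaN (hypothetical threshold)
        clf = IsolationForest(random_state=random_state)
        # Fit on both columns of the window, keep the label of the newest row.
        labels[i] = clf.fit_predict(block)[-1]
    return pd.Series(labels, index=data.index, name="outlier")


# Made-up hourly data, only to show the call.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=36, freq=pd.Timedelta(hours=1)),
        "value1": rng.normal(size=36),
        "value2": rng.normal(size=36),
    }
)
result = rolling_outlier(df, window="1D", min_rows=10)
```

This fits one model per row, so it is slow on long series, but it sidesteps both the one-column limitation of apply and the numba requirement of method='table'.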
