Reputation: 1032
I use the following simple IsolationForest setup to detect outliers in a given dataset X of 20K samples and 16 features. I run the following:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X) # Notice I am using the entire dataset X when fitting!!
print(clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
My question is: is it logically correct to fit IsolationForest on the entire dataset X, or should I fit only on train_X?
Upvotes: 0
Views: 1181
Reputation: 1629
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could first fit on the training set only and compare the anomaly scores of the test set against those of the training set. This could tell you whether the test set comes from a similar distribution as your training set.
If the test set scores as anomalous compared to the training set, then you can expect future data to look anomalous in the same way. In this case, I would want more data to get a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
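A minimal sketch of that comparison, assuming a synthetic Gaussian X as a stand-in for the real 20K x 16 dataset (the data, the fixed random_state, and the summary statistics chosen here are illustrative assumptions, not part of the question). It fits on the training split only, then compares mean anomaly scores and the fraction of points flagged as outliers in each split:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the question's X (20K samples, 16 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16))

train_X, test_X = train_test_split(X, train_size=0.8, random_state=0)

# Fit on the training split only, so the test split is genuinely unseen.
clf = IsolationForest(random_state=0).fit(train_X)

# score_samples: lower values mean "more anomalous".
train_scores = clf.score_samples(train_X)
test_scores = clf.score_samples(test_X)

# Fraction of points predicted as outliers (-1) in each split.
train_outlier_rate = float(np.mean(clf.predict(train_X) == -1))
test_outlier_rate = float(np.mean(clf.predict(test_X) == -1))

print(f"mean score   train={train_scores.mean():.4f}  test={test_scores.mean():.4f}")
print(f"outlier rate train={train_outlier_rate:.4f}  test={test_outlier_rate:.4f}")
```

If the two splits give similar numbers, that is some evidence they come from the same distribution, and refitting on all of X for the final model is reasonable. A clearly higher outlier rate on the test split would suggest the training data does not yet cover what 'normal' looks like.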
Perhaps you could use sklearn's TimeSeriesSplit cross-validator in this fashion, fitting on progressively larger training windows, to get a sense of how much data is enough for your problem?
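One way this could look, as a rough sketch: TimeSeriesSplit yields training windows that grow with each fold, each followed by the next unseen block of rows. This assumes the rows of X are in a meaningful order (for example, by collection time), and it again uses synthetic data as a placeholder for the real dataset; the choice of n_splits=5 is arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import TimeSeriesSplit

# Assumption: synthetic stand-in for X, with rows treated as time-ordered.
rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16))

results = []  # (training size, train outlier rate, test outlier rate)
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    # Fit on the expanding training window only.
    clf = IsolationForest(random_state=0).fit(X[train_idx])
    train_rate = float(np.mean(clf.predict(X[train_idx]) == -1))
    test_rate = float(np.mean(clf.predict(X[test_idx]) == -1))
    results.append((len(train_idx), train_rate, test_rate))
    print(f"n_train={len(train_idx):6d}  train_rate={train_rate:.4f}  test_rate={test_rate:.4f}")
```

If the gap between the train and test outlier rates shrinks and stabilizes as the training window grows, that hints at roughly how much data is needed before new data stops looking anomalous.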
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.
Upvotes: 1