Reputation: 25
I built an anomaly detection model using Isolation Forest with the default setting for the contamination parameter (0.1). It works quite well on my current data set, but now I have different files with the same structure but different row counts, and when I run the model on them I no longer get accurate results unless I manually adjust the contamination parameter by trial and error until it fits.
I would like to run the model automatically as soon as I get a new file, but the percentage of outliers varies from file to file, so I can't get good results without changing the contamination parameter each time. Is there a way to calculate a new value for the parameter every time a new file arrives, or is this model not suitable for my use case?
Upvotes: 0
Views: 1034
Reputation: 6299
The contamination parameter is a hyperparameter, so it can be tuned with hyperparameter optimization. The typical approach in scikit-learn for small models or datasets is to use grid search; see the user guide. This assumes that you have a robust quantitative way of evaluating your model's performance.
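As a rough sketch of what that could look like, the snippet below tunes `contamination` with `GridSearchCV`. It assumes you have at least some labeled rows (1 for inliers, -1 for outliers, matching what `IsolationForest.predict` returns) to score against. The synthetic data, the candidate grid, and the choice of F1 on the outlier class as the metric are all illustrative assumptions, not something taken from the question.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for one labeled file (assumption: labels are available).
rng = np.random.RandomState(0)
X_inliers = rng.normal(0, 1, size=(475, 2))
X_outliers = rng.uniform(-6, 6, size=(25, 2))
X = np.vstack([X_inliers, X_outliers])
# IsolationForest.predict returns 1 for inliers and -1 for outliers,
# so the labels use the same encoding.
y = np.r_[np.ones(475, dtype=int), -np.ones(25, dtype=int)]

# Score each candidate by F1 on the outlier class (-1).
outlier_f1 = make_scorer(f1_score, pos_label=-1)

# Candidate values are hypothetical; adjust them to the range you expect.
param_grid = {"contamination": [0.01, 0.02, 0.05, 0.1, 0.2]}

# Stratified, shuffled folds so every test fold contains some outliers.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    IsolationForest(random_state=0),
    param_grid,
    scoring=outlier_f1,
    cv=cv,
)
# y is ignored by IsolationForest.fit but is used by the scorer.
search.fit(X, y)

best_contamination = search.best_params_["contamination"]
print("best contamination:", best_contamination)
print("mean cross-validated F1:", search.best_score_)
```

Without any labels there is nothing for the search to optimize against, which is the caveat in the last sentence of the answer.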
Upvotes: 1