Reputation: 685
I have collected data on how long it takes for a product to be released in a release pipeline. 95% of the data so far takes <400 minutes [outlier = 0]. The remaining 5% falls between [700, 40 000] minutes [outlier = 1]. I want to build a classifier using xgboost which predicts whether an event will be an "outlier" or not. The problem is that outliers are very uncommon: I have about 200 datapoints which are outliers and 3200 datapoints which are not.
Currently, without tuning, my model can predict 98% of [outlier = 0] cases and 67% of [outlier = 1]. It is important for me that the model does not perform worse on detecting [outlier = 0], since 95% of the data is in this set, but I want to see if I can still tune the model to increase performance on detecting [outlier = 1].
So I have two variables:
ratio_wrong_0 = len(predicted_wrong_0) / len(true_0)
ratio_wrong_1 = len(predicted_wrong_1) / len(true_1)
I want to keep ratio_wrong_0 below 5% and minimize ratio_wrong_1 at the same time. Does anyone have an idea how I could construct such a metric for evaluation during parameter tuning?
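To make this concrete, here is a rough sketch of the kind of combined score I have in mind (the function name and the penalty scheme are just placeholders):

import numpy as np

def ratio_metric(y_true, y_pred, max_wrong_0=0.05):
    # Sketch: minimize ratio_wrong_1 while keeping ratio_wrong_0 below 5%.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ratio_wrong_0 = np.mean(y_pred[y_true == 0] != 0)
    ratio_wrong_1 = np.mean(y_pred[y_true == 1] != 1)
    if ratio_wrong_0 > max_wrong_0:
        return 1.0 + ratio_wrong_0  # hard penalty when the constraint is violated
    return ratio_wrong_1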
Upvotes: 1
Views: 70
Reputation: 33
First, if you keep the dataset as is, you will most likely always have a tendency to under-predict the [outlier = 1] class, since it is better performance-wise to predict [outlier = 0] when unsure, which you seem to understand.
There are a few simple things you can do (both are sketched in code below):
Under-sampling of the over-represented class: given you have 200 [outlier = 1] points, you could take roughly 200 [outlier = 0] points at random. However, the resulting dataset would probably be too small. It is easy to implement though, so you might want to give it a try.
Over-sampling of the under-represented class: the exact opposite, where you basically copy/paste data from [outlier = 1] to get roughly the same number of occurrences.
These methods are usually considered equivalent, but in your case I think over-sampling would lead to overfitting: the two classes don't cover the same range of values, and 200 data points spread over [700, 40 000] are not enough for proper generalization.
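A minimal sketch of both samplings with pandas (a DataFrame df with an "outlier" column is assumed):

import pandas as pd

majority = df[df["outlier"] == 0]
minority = df[df["outlier"] == 1]

# Under-sampling: draw as many majority rows as there are minority rows.
under = pd.concat([majority.sample(len(minority), random_state=42), minority])

# Over-sampling: repeat minority rows (with replacement) up to the majority size.
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=42)])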
Now, to get into more advanced techniques, you could try bootstrapping. For the methodology, see "Bootstrap re-sampling for unbalanced data in supervised learning" by Georges Dupret and Masato Koda. This could work here, and you could use sklearn.utils.resample for it. I find this tutorial pretty good.
Bootstrapping is a resampling method that lets you train on multiple balanced datasets. You have to be careful about overfitting though.
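A rough illustration of the idea (a feature matrix X and labels y are assumed): train one model per balanced bootstrap sample and average the predicted probabilities.

import numpy as np
from sklearn.utils import resample
from xgboost import XGBClassifier

def bootstrap_ensemble(X, y, n_models=10, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]
    models = []
    for i in range(n_models):
        # Bootstrap the majority class down to the minority class size.
        X0_sample = resample(X0, n_samples=len(X1), random_state=seed + i)
        X_bal = np.vstack([X0_sample, X1])
        y_bal = np.concatenate([np.zeros(len(X0_sample)), np.ones(len(X1))])
        model = XGBClassifier()
        model.fit(X_bal, y_bal)
        models.append(model)
    return models

def ensemble_proba(models, X):
    # Average the predicted probability of the [outlier = 1] class.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)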
About the metrics, you want to use either ROC AUC or Precision/Recall (PR curves tend to be more informative on unbalanced data). You can read a nice article on what metrics to use for unbalanced datasets.
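For example, with scikit-learn (predicted probabilities y_proba for the [outlier = 1] class are assumed):

from sklearn.metrics import roc_auc_score, average_precision_score

print("ROC AUC:", roc_auc_score(y_true, y_proba))
print("PR AUC :", average_precision_score(y_true, y_proba))  # area under the Precision/Recall curve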
Finally, you could use penalized algorithms (cost-sensitive learning), which essentially make a mistake on the least represented class (here [outlier = 1]) more costly. This is sometimes used in medical applications, where you would rather have a healthy patient diagnosed as sick by mistake than the opposite.
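In xgboost this is built in via the scale_pos_weight parameter; a common starting point is the ratio of negative to positive examples, here roughly 3200 / 200 = 16 (X_train and y_train are assumed):

from xgboost import XGBClassifier

# Make each mistake on the rare [outlier = 1] class count ~16x more.
model = XGBClassifier(scale_pos_weight=3200 / 200)
model.fit(X_train, y_train)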
This great article that sums it all up is a must read.
Upvotes: 1