Reputation: 898
I have a multi-class problem with highly imbalanced data.
There is one large majority class with a few thousand members, some classes with 100–1000 members, and 10–30 classes with only one member.
Sampling isn't possible because it could distort the weighting of the classes.
To evaluate my model I want to use cross-validation. I tried cross_val_predict(x,y, cv=10),
which led to the warning:
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=10.
I tried to build my own cross-validation, which is pretty straightforward.
I split my data via StratifiedKFold and then did the following:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

clf = DecisionTreeClassifier()
cnf_matrix, classRepo = {}, {}
for fold, (ta, te) in enumerate(splits):  # splits comes from StratifiedKFold.split(x, y)
    xTrain, xTest = x.iloc[ta], x.iloc[te]
    yTrain, yTest = y.iloc[ta], y.iloc[te]
    clf.fit(xTrain, yTrain)
    prediction = clf.predict(xTest)
    # index by fold number, not by the index array, and score against yTest, not all of y
    cnf_matrix[fold] = confusion_matrix(yTest, prediction)
    classRepo[fold] = classification_report(yTest, prediction)
Because I am working in a Jupyter notebook, I have to print every entry of cnf_matrix and classRepo by hand and go through them myself.
Is there a more elegant solution, such as fusing classRepo and cnf_matrix across the folds, so that I get the same result that cross_val_predict(x, y, cv=10) offers?
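(To illustrate what I mean by fusing, here is a rough sketch based on my loop above: pool the test-fold labels and predictions, then compute a single matrix/report over all folds.)

import numpy as np

yTrue, yPred = [], []
for ta, te in splits:
    clf.fit(x.iloc[ta], y.iloc[ta])
    yTrue.append(y.iloc[te])
    yPred.append(clf.predict(x.iloc[te]))
# one confusion matrix / report over all folds, like cross_val_predict would give
yTrue, yPred = np.concatenate(yTrue), np.concatenate(yPred)
print(confusion_matrix(yTrue, yPred))
print(classification_report(yTrue, yPred))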
Is there a better metric to tackle my problem?
Upvotes: 0
Views: 347
Reputation: 16109
"Sampling isn't possible because it could lead to a wrong weight of the classes."
That is a strong assertion, as you are assuming that your training data is a perfect representation of all remaining and future observable data. If I were on your team, I would challenge you to support that hypothesis with experimental data.
There are in fact many approaches developed specifically for dealing with minority class imbalance, for example SMOTE and ADASYN. I would point you towards imbalanced-learn, a Python package that implements these and other techniques within the sklearn framework.
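As a minimal sketch of how that could look (assuming oversampling turns out to be acceptable after all): imbalanced-learn's Pipeline applies the sampler to the training folds only, so it plugs straight into cross_val_predict. Note that SMOTE interpolates between same-class neighbours, so it cannot synthesise from your one-member classes; RandomOverSampler can handle them.

from imblearn.over_sampling import RandomOverSampler  # SMOTE needs more than one sample per class
from imblearn.pipeline import Pipeline                # resamples the training folds only, never the test fold
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ('oversample', RandomOverSampler(random_state=0)),
    ('tree', DecisionTreeClassifier()),
])
# the one-member classes will still trigger the stratification warning,
# but you get a single set of out-of-fold predictions to score
pred = cross_val_predict(pipe, x, y, cv=10)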
Upvotes: 1