codlix

Reputation: 898

Custom Cross-Validation and Validation with Extremely Imbalanced Classes

I have a multi-class problem with highly imbalanced data.

There is one large majority class with a few thousand members, some classes with 100-1000 members, and 10-30 classes with only 1 member.

Sampling isn't possible because it could lead to a wrong weight of the classes.

To evaluate my model I want to use cross-validation. I tried cross_val_predict(clf, x, y, cv=10), which led to this warning:

Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=10.

I tried to build my own cross-validation, which is pretty straightforward.

I split my data via StratifiedKFold and then did the following:

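from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
cnf_matrix, classRepo = [], []  # one entry per fold

# splits: the (train, test) index pairs produced by StratifiedKFold
for ta, te in splits:
    xTrain, xTest = x.iloc[ta], x.iloc[te]
    yTrain, yTest = y.iloc[ta], y.iloc[te]
    clf.fit(xTrain, yTrain)
    prediction = clf.predict(xTest)
    cnf_matrix.append(confusion_matrix(yTest, prediction))
    # compare against yTest, not the full y, so the lengths match
    classRepo.append(classification_report(yTest, prediction))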

Because I am working in a Jupyter notebook, I have to print every entry of cnf_matrix and classRepo by hand and go through them myself.

Is there a more elegant solution, such as merging the per-fold classRepo and cnf_matrix entries, so that I get a single result like the one cross_val_predict(clf, x, y, cv=k) offers?

Is there a better metric to tackle my problem?
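One possible way to merge the per-fold results, sketched under assumptions: the data below is synthetic (make_classification plus two hand-added singleton classes), and plain KFold stands in for the StratifiedKFold splits only so the sketch runs without the stratification warning. The idea is to pass a fixed labels list so every fold's confusion matrix has the same shape and can be summed, and to collect out-of-fold predictions so one report covers the whole data set. The macro-averaged F1 is shown as one candidate metric that weights every class equally; whether it suits the real problem is a judgment call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): three imbalanced classes.
x, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0,
)
# Add two classes with a single member each, mimicking the question.
rng = np.random.default_rng(0)
x = np.vstack([x, rng.normal(size=(2, 8))])
y = np.concatenate([y, [3, 4]])

labels = np.unique(y)
oof_pred = np.empty_like(y)  # out-of-fold prediction for every sample
total_cm = np.zeros((len(labels), len(labels)), dtype=int)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(x):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(x[train_idx], y[train_idx])
    fold_pred = clf.predict(x[test_idx])
    oof_pred[test_idx] = fold_pred
    # labels= keeps each fold's matrix the same shape, so they can be summed.
    total_cm += confusion_matrix(y[test_idx], fold_pred, labels=labels)

# One report over all out-of-fold predictions instead of one per fold.
report = classification_report(y, oof_pred, labels=labels, zero_division=0)
macro_f1 = f1_score(y, oof_pred, average="macro", zero_division=0)

print(total_cm)
print(report)
print("macro F1:", macro_f1)
```

Note that a class with a single member can never be in the training and test fold at the same time, so its recall under this scheme is 0 by construction, whatever the classifier.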

Upvotes: 0

Views: 347

Answers (1)

Grr

Reputation: 16109

"Sampling isn't possible because it could lead to a wrong weight of the classes."

That is a strong assertion, as you are assuming that your training data is a perfect representation of all remaining and future observable data. If I were on your team, I would challenge you to support that hypothesis with experimental data.

There are in fact many approaches developed specifically for dealing with class imbalance, for example SMOTE and ADASYN. I would point you towards imbalanced-learn, a Python package that implements these and other techniques within the sklearn framework.

Upvotes: 1
