Reputation: 412
What is the best technique for an unbalanced dataset?
I have a dataset of 11967 instances where the number of positive labels is 139 and the number of negative labels is 11828.
How should I split the dataset for testing (before or after applying the technique)?
Upvotes: 1
Views: 6447
Reputation: 216
The standard method for splitting the dataset using sklearn is given below:
# splitting the dataset into training and validation sets (60% training)
from sklearn.model_selection import train_test_split
xTrain, xVald, yTrain, yVald = train_test_split(Xs, y, train_size=0.60, random_state=2)
where Xs and y are the predictors and response variables.
As you mentioned, your dataset has an imbalanced class distribution. This distribution makes it hard to build a useful predictive model, because the model treats your rare event (the positive label) as random noise and will not predict well on new data.
You may have to upsample the rare event to balance the distribution before building any predictive model. If you want to stick with the original distribution, you can instead run a random forest model, which works reasonably well on imbalanced data too (see the sketch below). For more information, please see the following link: https://elitedatascience.com/imbalanced-classes
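As a minimal sketch of the random forest route (reusing the xTrain/xVald split from above; n_estimators=200 is just an illustrative choice, not a tuned value), you can let the forest reweight the classes instead of resampling. Plain accuracy is misleading at this level of imbalance, so the sketch reports ROC AUC instead:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
# class_weight='balanced' weights each class inversely to its frequency,
# so the 139 positives are not drowned out by the 11828 negatives
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=2)
rf.fit(xTrain, yTrain)
print(roc_auc_score(yVald, rf.predict_proba(xVald)[:, 1]))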
If you instead want to upsample your data, then you can try this:
import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df[df.pos_neg==0] # I classified the negative class as '0'
df_minority = df[df.pos_neg==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=11828,  # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts (count the label column, not unique rows)
df_upsampled.pos_neg.value_counts()
# 1    11828
# 0    11828
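On the splitting part of your question: it is usually safest to split first and upsample only the training portion, otherwise duplicated minority rows leak into the validation set and inflate your scores. A sketch, reusing train_test_split and the df/pos_neg names from above:
# Split first (stratified on the label), then upsample only train_df
train_df, vald_df = train_test_split(df, train_size=0.60, stratify=df.pos_neg, random_state=2)
# ...then apply the resample() step above to train_df instead of df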
Upvotes: 1
Reputation: 171
I would suggest that you use the stratify option in sklearn.model_selection.train_test_split. If you set stratify=y (where y holds the labels of your dataset), the data is divided so that the train and test sets contain equal percentages of positive and negative samples. This is highly useful for unbalanced datasets: instead of a purely random division, it considers the labels while splitting the dataset into two parts.
Here is the sample code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
Refer to the documentation for more information: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
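As a quick sanity check (a sketch assuming y is a 0/1 NumPy array), both splits should keep roughly the same positive rate:
import numpy as np
print(y_train.mean(), y_test.mean()) # both close to 139/11967 ≈ 0.0116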
Upvotes: 1
Reputation: 8699
There are a few good ways to handle an imbalanced dataset:
Undersampling: take fewer samples from the majority class (in your case the negative labels) so that the new dataset is balanced.
Oversampling: replicate the data of the minority class (the positive labels) in order to balance the dataset.
There is also a third way of handling an imbalanced dataset, SMOTE, which synthesizes new minority samples rather than copying existing ones; see the sketch below. Feel free to check out this link: https://www.analyticsvidhya.com/blog/2016/09/this-machine-learning-project-on-imbalanced-data-can-add-value-to-your-resume/
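A minimal SMOTE sketch, assuming the imbalanced-learn package is installed and that you already have an X_train/y_train split (resampling should be applied to the training data only):
from imblearn.over_sampling import SMOTE
# SMOTE creates new minority samples by interpolating between existing
# minority neighbours instead of copying rows verbatim
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)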
Upvotes: 4