Reputation: 412
What is the best technique for an unbalanced dataset?
I have a dataset of 11967 instances where the number of positive labels is 139 and the number of negative labels is 11828.
How should I split the dataset for testing (before or after applying the technique)?
Upvotes: 1
Views: 6447
Reputation: 216
The standard method for splitting the dataset using sklearn is given below:
# splitting the dataset into training and validation sets (60% training)
from sklearn.model_selection import train_test_split
xTrain, xVald, yTrain, yVald = train_test_split(Xs, y, train_size=0.60, random_state=2)
where Xs and y are the predictors and response variables.
As you mentioned, your dataset has an imbalanced class distribution. This distribution makes it hard to build a useful predictive model, because the model treats your rare event (the positive label) as random noise and will not predict well on new data.
You may have to upsample the rare event to balance the distribution before building any predictive model. If you want to stick with the original distribution, you can instead run a random forest model, which works reasonably well on imbalanced data too (see the sketch below). For more information, please see the following link: https://elitedatascience.com/imbalanced-classes
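As a minimal sketch of the random forest route (reusing the xTrain/xVald split from above; n_estimators=200 is just an illustrative choice, not a tuned value), you can let the forest reweight the classes instead of resampling. Plain accuracy is misleading at this level of imbalance, so the sketch reports ROC AUC instead:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
# class_weight='balanced' weights each class inversely to its frequency,
# so the 139 positives are not drowned out by the 11828 negatives
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=2)
rf.fit(xTrain, yTrain)
print(roc_auc_score(yVald, rf.predict_proba(xVald)[:, 1]))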
If you instead want to upsample your data, then you can try this:
import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df[df.pos_neg==0] # I classified the negative class as '0'
df_minority = df[df.pos_neg==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=11828,  # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts (count the label column, not unique rows)
df_upsampled.pos_neg.value_counts()
# 1    11828
# 0    11828
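On the splitting part of your question: it is usually safest to split first and upsample only the training portion, otherwise duplicated minority rows leak into the validation set and inflate your scores. A sketch, reusing train_test_split and the df/pos_neg names from above:
# Split first (stratified on the label), then upsample only train_df
train_df, vald_df = train_test_split(df, train_size=0.60, stratify=df.pos_neg, random_state=2)
# ...then apply the resample() step above to train_df instead of df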
Upvotes: 1
Reputation: 171
I would suggest that you use the stratify option in sklearn.model_selection.train_test_split. If you set stratify=y (where y holds the labels of your dataset), the data is divided so that the train and test sets contain equal percentages of positive and negative samples. This is highly useful for unbalanced datasets: instead of a purely random division, it considers the labels while splitting the dataset into two parts.
Here is the sample code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
Refer to the documentation for more information: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
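As a quick sanity check (a sketch assuming y is a 0/1 NumPy array), both splits should keep roughly the same positive rate:
import numpy as np
print(y_train.mean(), y_test.mean()) # both close to 139/11967 ≈ 0.0116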
Upvotes: 1
Reputation: 8699
There are a few good ways to handle an imbalanced dataset:
Undersampling: take fewer samples from the majority class (in your case the negative labels) so that the new dataset is balanced.
Oversampling: replicate the data of the minority class (the positive labels) in order to balance the dataset.
There is also a third way of handling an imbalanced dataset, SMOTE, which synthesizes new minority samples rather than copying existing ones; see the sketch below. Feel free to check out this link: https://www.analyticsvidhya.com/blog/2016/09/this-machine-learning-project-on-imbalanced-data-can-add-value-to-your-resume/
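A minimal SMOTE sketch, assuming the imbalanced-learn package is installed and that you already have an X_train/y_train split (resampling should be applied to the training data only):
from imblearn.over_sampling import SMOTE
# SMOTE creates new minority samples by interpolating between existing
# minority neighbours instead of copying rows verbatim
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)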
Upvotes: 4