Manoj

Reputation: 99

Splitting training data with an equal number of rows for each class

I have a very large dataset of about 314554097 rows and 3 columns. The third column is the class. The dataset has two classes, 0 and 1. I need to split the data into training and test sets. To split the data I can use:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=0)

However, the dataset is about 99 percent class 0 and only 1 percent class 1. In the training set, I need an equal number of rows from class 0 and class 1, say 30000 rows of each. How can I do this?

Upvotes: 1

Views: 3192

Answers (2)

Jamie

Reputation: 47

When I tried the following:

X_train_sample_class_1 = X_train[X_train['third_column_name'] == 1][:30000]
X_train_sample_class_0 = X_train[X_train['third_column_name'] == 0][:30000]

the resulting data frames are empty. How can I do the split so that they contain values?

Upvotes: 0

Kalsi

Reputation: 587

You may be looking for ways to handle imbalanced data. Here are some methods you can use:

  1. Resampling: oversample the minority class or undersample the majority class.

    In your case, class 1 is the minority class.

  2. Give more weight to the minority class, in proportion to the class imbalance.
  3. Choose the right performance metric. Accuracy is misleading here, since always predicting class 0 already scores about 99 percent.
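As a rough illustration of option 2, the sketch below computes per-class weights that are inversely proportional to class frequency. The helper name `balanced_class_weights` and the toy label array are hypothetical, chosen only to mimic the 99:1 imbalance described in the question.

```python
import numpy as np

def balanced_class_weights(y):
    """Return {class_label: weight}, with weights inversely
    proportional to how often each class occurs in y."""
    classes, counts = np.unique(y, return_counts=True)
    # weight_c = n_samples / (n_classes * count_c)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels (assumption): 990 rows of class 0, 10 rows of class 1
y = np.array([0] * 990 + [1] * 10)
class_weight = balanced_class_weights(y)
```

A dictionary like this can be passed as the `class_weight` argument that many scikit-learn classifiers accept. Passing `class_weight="balanced"` applies the same formula automatically.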

If you still need 30,000 data points each of class 1 and class 0, try this:

X_train_sample_class_1 = X_train[X_train['third_column_name'] == 1][:30000]
X_train_sample_class_0 = X_train[X_train['third_column_name'] == 0][:30000]

Now you can combine X_train_sample_class_1 and X_train_sample_class_0 (for example with pandas.concat) to form a new, balanced dataset.
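Note that slicing with `[:30000]` takes the first 30000 matching rows rather than a random subset, and it only works if the class column is actually present in `X_train`. A minimal alternative sketch, assuming the labels live in a separate array such as `y_train`, is to pick row positions per class from the labels and then index both the features and the labels with them. The helper name `balanced_indices` and the toy arrays are hypothetical.

```python
import numpy as np

def balanced_indices(y, n_per_class, seed=0):
    """Return shuffled row positions containing exactly
    n_per_class randomly chosen rows of every class in y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    picked = []
    for label in np.unique(y):
        candidates = np.flatnonzero(y == label)
        if len(candidates) < n_per_class:
            raise ValueError(f"class {label} has only {len(candidates)} rows")
        # Sample without replacement so no row is picked twice
        picked.append(rng.choice(candidates, size=n_per_class, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)  # mix the classes instead of leaving them in blocks
    return idx

# Toy data (assumption): 1000 rows, 2 feature columns, 99:1 imbalance
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 990 + [1] * 10)

idx = balanced_indices(y, n_per_class=5)
X_bal, y_bal = X[idx], y[idx]
```

With pandas objects the same positions can be applied with `X_train.iloc[idx]` and `y_train.iloc[idx]`. For the question's case, `n_per_class` would be 30000.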

Upvotes: 4
