Reputation: 3041
I have a highly unbalanced dataset.
My dataset contains 1450 records and the output is binary, 0 or 1. Class 0 has 1200 records and class 1 has 250 records.
I am using this piece of code to build the training and testing sets for the model.
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X = Actual_DataFrame
y = Actual_DataFrame.pop('Attrition')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
What I would like instead is a function that lets me specify the number of records in the training set and what share of them comes from class '0' and what share from class '1'.
So I need a function that takes two inputs to create the training data:
Total number of records for the training data,
Number of records that belong to class '1'
This would be a huge help in dealing with biased sampling on imbalanced datasets.
Upvotes: 0
Views: 545
Reputation: 9018
You can simply write a function that is very similar to train_test_split from sklearn. The idea is that, from the input parameters train_size and pos_class_size, you can calculate how many positive-class and negative-class samples you need: the negative-class count is train_size minus pos_class_size.
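import pandas as pd

def custom_split(X, y, train_size, pos_class_size, random_state=42):
    # The negative-class count is whatever remains after the positives.
    neg_class_size = train_size - pos_class_size
    pos_df = X[y == 1]
    neg_df = X[y == 0]
    # Sample each class without replacement; random_state makes the draw reproducible.
    pos_train = pos_df.sample(pos_class_size, random_state=random_state)
    pos_test = pos_df[~pos_df.index.isin(pos_train.index)]
    neg_train = neg_df.sample(neg_class_size, random_state=random_state)
    neg_test = neg_df[~neg_df.index.isin(neg_train.index)]
    # Stack the rows of both classes (axis=0), not their columns.
    X_train = pd.concat([pos_train, neg_train], axis=0)
    X_test = pd.concat([pos_test, neg_test], axis=0)
    # Look the labels up by index label so they stay aligned with X.
    y_train = y.loc[X_train.index]
    y_test = y.loc[X_test.index]
    return X_train, X_test, y_train, y_test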
There may be more memory-efficient or faster ways to do this, and I haven't tested this code, but it should work.
At the very least, it should convey the idea.
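For anyone who wants to try the idea end to end, here is a minimal self-contained sketch. It is not the exact code from the answer: the helper name split_with_class_counts is hypothetical, the data is synthetic (1200 zeros and 250 ones, mirroring the counts in the question), and it assumes the index of y is unique and shared with X.

```python
import numpy as np
import pandas as pd


def split_with_class_counts(X, y, train_size, pos_class_size, random_state=42):
    """Hypothetical helper: build a training set with exact per-class counts.

    Assumes y is a binary 0/1 Series whose unique index matches X's index.
    """
    neg_class_size = train_size - pos_class_size
    pos_idx = y.index[y == 1].to_numpy()
    neg_idx = y.index[y == 0].to_numpy()
    if neg_class_size < 0 or pos_class_size > len(pos_idx) or neg_class_size > len(neg_idx):
        raise ValueError("requested class counts do not fit the data")

    # Draw index labels from each class without replacement.
    rng = np.random.default_rng(random_state)
    train_idx = np.concatenate([
        rng.choice(pos_idx, size=pos_class_size, replace=False),
        rng.choice(neg_idx, size=neg_class_size, replace=False),
    ])
    # Everything not drawn for training becomes the test set.
    test_idx = y.index.difference(train_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]


# Synthetic stand-in for the asker's data: 1200 records of class 0, 250 of class 1.
y = pd.Series([0] * 1200 + [1] * 250, name="Attrition")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = split_with_class_counts(
    X, y, train_size=400, pos_class_size=200
)
print(len(X_train), int(y_train.sum()))  # 400 200
```

To work with a percentage instead of a count, pass something like pos_class_size=int(train_size * 0.5). Note that the test set keeps whatever is left over, so its class ratio will differ from the training set's.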
Upvotes: 1