Reputation: 2544
I have training data with more than 20,000 instances, split into 3 classes with a distribution like A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while respecting the class distribution? I need to run a grid search, but the original dataset is too large (20,000 instances x 12,000 features).
train_test_split can keep the distribution, but it only splits the entire dataset into two sets, which are still too large.
Thanks
Upvotes: 3
Views: 4776
Reputation: 7828
You should use StratifiedKFold. The folds are made by preserving the percentage of samples for each class. See the documentation for how to use it.
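A rough sketch of how this could work for a 10% sample: with n_splits=10, the held-out indices of any single fold cover about 10% of the rows in the original class proportions. The data below is a synthetic stand-in (an assumption, not the asker's dataset), and the names sample_idx, X_sample and y_sample are only illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (assumption): 2,000 rows, 5 features,
# classes 0/1/2 in a 10% / 20% / 70% ratio.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [200, 400, 1400])
X = rng.normal(size=(y.size, 5))

# Ten stratified folds: each fold's held-out part is ~10% of the data.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Take the held-out indices of the first fold as the sample.
_, sample_idx = next(skf.split(X, y))
X_sample, y_sample = X[sample_idx], y[sample_idx]

# Per-class counts in the sample, to compare against the full data.
sample_counts = np.bincount(y_sample)
print(sample_counts / sample_counts.sum())
```

The grid search can then be run on X_sample and y_sample instead of the full arrays.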
Upvotes: 5
Reputation: 409
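The train_test_split function lets you set the size of each split, so you can keep the 10% test split as your sample. Pass stratify=y so that the class distribution is preserved:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)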
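To check that such a split really gives a representative 10% sample, something like the following might help. It is only a sketch on synthetic data (an assumption, not the asker's dataset); note that the split is stratified only when the stratify argument is passed, since a plain random split matches the class ratios only approximately.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 2,000 rows, 5 features,
# classes 0/1/2 in a 10% / 20% / 70% ratio.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [200, 400, 1400])
X = rng.normal(size=(y.size, 5))

# stratify=y keeps the class ratios in both parts of the split.
X_rest, X_small, y_rest, y_small = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Class proportions in the full data and in the 10% sample.
full_ratio = np.bincount(y) / y.size
small_ratio = np.bincount(y_small) / y_small.size
print(full_ratio, small_ratio)
```

Here X_small and y_small are the 10% sample, and X_rest and y_rest can simply be ignored during the grid search.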
Upvotes: 1