python: taking random sample from data but keeping the same distribution

Question

I have training data with more than 20,000 instances, split into 3 classes with a distribution of roughly A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while preserving the class distribution? I need to run a grid search, but the original dataset is too high-dimensional (20,000 x 12,000 features).

train_test_split will keep the distribution, but it only splits the entire dataset into two sets, which are still too large.

Thanks

shivsn · Accepted Answer

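You should use StratifiedKFold. The folds are made by preserving the percentage of samples for each class, so with n_splits=10 each test fold holds roughly 10% of the data with the original class proportions, and any single fold can serve as the sample. See the documentation for how to use it.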
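
A minimal sketch of that idea, assuming scikit-learn's `sklearn.model_selection` module and a small synthetic dataset standing in for the real one (the array shapes, class counts, and variable names here are illustrative, not from the question). It takes one test fold from a 10-fold StratifiedKFold as the 10% sample. As an alternative that the answer does not mention, train_test_split also accepts `train_size` and `stratify`, which gives a stratified 10% subset in one call.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the real data (assumption): 2,000 rows,
# classes A/B/C at 10% / 20% / 70%, 5 features instead of 12,000.
rng = np.random.default_rng(0)
y = np.array(["A"] * 200 + ["B"] * 400 + ["C"] * 1400)
X = rng.normal(size=(len(y), 5))

# Option 1: 10 stratified folds; each test fold is about 10% of the rows
# and keeps the per-class proportions of the full dataset.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
_, sample_idx = next(skf.split(X, y))  # indices of the first test fold
X_sample, y_sample = X[sample_idx], y[sample_idx]

# Check the class proportions in the sample.
labels, counts = np.unique(y_sample, return_counts=True)
proportions = dict(zip(labels, counts / counts.sum()))
print(len(y_sample), proportions)

# Option 2 (alternative): a stratified 10% subset via train_test_split.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)
print(len(y_small))
```

Either subset can then be passed to the grid search in place of the full dataset. The remaining rows are not used in the search, so it may be worth refitting the best configuration on the full data afterwards.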
