Ziqi

Reputation: 2544

python: taking random sample from data but keeping the same distribution

I have training data with more than 20,000 instances, split into 3 classes with a distribution of roughly A=10%, B=20%, C=70%. Is there a way in sklearn, pandas, or anything else to take a 10% sample of this data while preserving the class distribution? I need to run a grid search, but the original dataset is too large and high-dimensional (20,000 instances x 12,000 features).

train_test_split will keep the distribution, but it only splits the entire dataset into two sets, which are still too large.

Thanks

Upvotes: 3

Views: 4776

Answers (2)

shivsn

Reputation: 7828

You should use StratifiedKFold. Its folds are built so that each one preserves the percentage of samples for each class. See the documentation for details on using it.
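As a rough sketch of how this could be used for sampling: with n_splits=10, the test indices of any single fold cover about 10% of the rows with the class proportions preserved, so one fold can act as the sample. The data below is synthetic and the variable names (X, y, sample_idx) are placeholders for illustration, not anything from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data: 2,000 rows, classes at 10% / 20% / 70%.
rng = np.random.default_rng(0)
y = np.array(["A"] * 200 + ["B"] * 400 + ["C"] * 1400)
X = rng.normal(size=(len(y), 5))

# Ten stratified folds; the test part of each fold is roughly 10% of the data.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Take only the first fold and keep its test indices as the sample.
_, sample_idx = next(skf.split(X, y))
X_sample, y_sample = X[sample_idx], y[sample_idx]

# Check that the class proportions in the sample match the full data.
classes, counts = np.unique(y_sample, return_counts=True)
proportions = {str(c): n / len(y_sample) for c, n in zip(classes, counts)}
print(proportions)
```

The remaining folds are simply ignored here; if several disjoint 10% samples are needed, iterating over all the folds would give them.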

Upvotes: 5

John Damen

Reputation: 409

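The train_test_split function lets you set the size of each split, so the 10% test split can serve as your sample. Note that the class distribution is only preserved if you also pass stratify=y:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)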

See the docs
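For illustration, here is a minimal self-contained version that draws a 10% stratified sample directly via train_size and checks the resulting class proportions. The data is synthetic, and the array shapes and random_state value are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 2,000 rows, classes at 10% / 20% / 70%.
rng = np.random.default_rng(0)
y = np.array(["A"] * 200 + ["B"] * 400 + ["C"] * 1400)
X = rng.normal(size=(len(y), 5))

# Keep the 10% "train" part as the sample and discard the other 90%.
# stratify=y is what makes the split respect the class proportions.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)

# Compare class proportions in the sample against the original 10/20/70 split.
classes, counts = np.unique(y_sample, return_counts=True)
proportions = {str(c): n / len(y_sample) for c, n in zip(classes, counts)}
print(len(y_sample), proportions)
```

The sample (X_sample, y_sample) can then be passed to the grid search in place of the full dataset.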

Upvotes: 1
