Reputation: 751
Suppose X, Y = load_mnist()
where X and Y are the tensors that contain the whole MNIST dataset. Now I want a smaller portion of the data to make my code run faster, but I need to keep all 10 classes in it, and in a balanced manner. Is there an easy way to do this?
Upvotes: 0
Views: 2083
Reputation: 486
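from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)
Passing the labels as stratify keeps the class proportions the same in both splits.
If you want several stratified random splits instead of a single one, then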
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # plain indexing works for NumPy arrays; use .iloc instead if X and y are pandas objects
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
See the scikit-learn documentation for details.
Upvotes: 0
Reputation: 61325
If you want to do this with more control, you could use numpy.random.choice
to generate as many random indices as the subset size and use them to index the original arrays, as in the following piece of code. Note that uniform sampling keeps the classes only approximately balanced.
In [76]: import numpy as np
# input data: assume that you have 10K samples
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)
# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500
# draw `subset_size` distinct indices uniformly at random (replace=False avoids duplicates)
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)
# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]
In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
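If the subset has to be exactly balanced rather than approximately, one option is to sample the same number of indices from each class separately. The following is only a sketch under stated assumptions: the helper name `balanced_subset_indices` and the `per_class` parameter are made up for illustration, and the random arrays stand in for the real MNIST data.

```python
import numpy as np

def balanced_subset_indices(labels, per_class, seed=0):
    """Return shuffled indices holding exactly `per_class` samples of each label.

    Hypothetical helper for illustration; not part of NumPy or scikit-learn.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for cls in np.unique(labels):
        # positions of all samples that belong to this class
        cls_idx = np.flatnonzero(labels == cls)
        if len(cls_idx) < per_class:
            raise ValueError(f"class {cls} has only {len(cls_idx)} samples")
        # sample without replacement so no index is repeated
        picked.append(rng.choice(cls_idx, size=per_class, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)  # avoid returning the samples grouped by class
    return idx

# synthetic stand-in for MNIST (assumption: 10K samples, 784 features, 10 classes)
data_rng = np.random.default_rng(42)
X = data_rng.random((10000, 784))
Y = data_rng.integers(0, 10, size=10000)

subset_idx = balanced_subset_indices(Y, per_class=50)
X_subset, Y_subset = X[subset_idx], Y[subset_idx]
class_counts = np.bincount(Y_subset, minlength=10)
```

Here `class_counts` holds the number of picked samples per digit, so it can be used to confirm that every class contributes the same amount.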
Upvotes: 0
Reputation: 965
scikit-learn's train_test_split
is meant to split the data into train and test sets, but you can also use it to create a "balanced" subset of your dataset using the stratify
argument. Just specify the train/test size proportion you want and you obtain a smaller, stratified sample of your data. In your case:
from sklearn.model_selection import train_test_split
X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
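To check that the smaller sample really keeps the class proportions, you can count the labels afterwards. This is a minimal sketch that assumes NumPy arrays; the synthetic data (deliberately built with 1000 samples per class) and the names `X_small`, `Y_small` and `counts` are placeholders, not part of the original question. Keep in mind that stratify preserves the proportions of the original data, so the subset is exactly balanced only if the full dataset is.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in for MNIST (assumption: 10 classes with 1000 samples each)
rng = np.random.default_rng(0)
X = rng.random((10000, 784))
Y = np.repeat(np.arange(10), 1000)

# keep 10% of the data as the subset and discard the rest
X_small, _, Y_small, _ = train_test_split(
    X, Y, train_size=0.1, stratify=Y, random_state=0
)

# number of samples per class in the subset
counts = np.bincount(Y_small, minlength=10)
print(X_small.shape, counts)
```

Using train_size instead of test_size just makes it explicit which part is the subset you intend to keep.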
Upvotes: 1