sunxd

Reputation: 751

How to create a subset of samples from the full MNIST dataset, while keeping all 10 classes

Suppose X, Y = load_mnist(), where X and Y are the tensors that contain the whole MNIST dataset. Now I want a smaller proportion of the data to make my code run faster, but I need to keep all 10 classes, and in a balanced manner. Is there an easy way to do this?

Upvotes: 0

Views: 2083

Answers (3)

Veera Srikanth

Reputation: 486

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

Passing the labels to stratify ensures the class proportions are preserved in both parts of the split.
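As a quick sanity check (a minimal sketch, assuming y is a NumPy array of integer labels), you can compare the class counts before and after the split:

import numpy as np

# class counts in the full label array vs. the two stratified parts
print(np.unique(y, return_counts=True))
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))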

If you want to draw several stratified splits (in the spirit of K-Fold), then

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    # plain indexing works for NumPy arrays; use .iloc instead if X and y are pandas objects
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Check here for the sklearn documentation.

Upvotes: 0

kmario23

Reputation: 61325

If you want to do this with more control, you could use numpy.random.choice to generate indices of size `subset_size` and then index into the original arrays, as in the following piece of code:

# input data, assume that you've 10K samples
In [77]: total_samples = 10000
In [78]: X, Y = np.random.random_sample((total_samples, 784)), np.random.randint(0, 10, total_samples)

# out of these 10K, we want to pick only 500 samples as a subset
In [79]: subset_size = 500

# generate uniformly distributed indices of size `subset_size`
# (`replace=False` ensures the indices are unique, i.e. no sample is picked twice)
In [80]: subset_idx = np.random.choice(total_samples, subset_size, replace=False)

# simply index into the original arrays to obtain the subsets
In [81]: X_subset, Y_subset = X[subset_idx], Y[subset_idx]

In [82]: X_subset.shape, Y_subset.shape
Out[82]: ((500, 784), (500,))
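If you need the subset to be exactly balanced across the 10 classes, a minimal sketch along the same lines (assuming Y is a NumPy array of integer labels 0-9 and subset_size is divisible by 10) is to sample an equal number of indices per class:

import numpy as np

per_class = subset_size // 10  # e.g. 50 samples per digit for a 500-sample subset

balanced_idx = np.concatenate([
    # pick `per_class` unique indices among the samples of class `c`
    np.random.choice(np.where(Y == c)[0], per_class, replace=False)
    for c in range(10)
])
np.random.shuffle(balanced_idx)  # avoid having the classes in contiguous blocks

X_balanced, Y_balanced = X[balanced_idx], Y[balanced_idx]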

Upvotes: 0

JimmyOnThePage

Reputation: 965

scikit-learn's train_test_split is meant to split the data into train and test sets, but you can use it to create a balanced subset of your dataset via the stratify argument. Just specify the split proportion you want and keep one of the two resulting stratified parts as your smaller sample. In your case:

from sklearn.model_selection import train_test_split

X_1, X_2, Y_1, Y_2 = train_test_split(X, Y, stratify=Y, test_size=0.5)
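Either half (X_1, Y_1 or X_2, Y_2) can then be used as the smaller, balanced dataset; with test_size=0.5 each half holds 50% of the data with the original class proportions. As a quick check (assuming Y_2 is a NumPy array of integer labels):

import numpy as np

# counts per digit in the stratified half; they should mirror the full dataset
print(np.bincount(Y_2))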

Upvotes: 1
