Reputation: 117
I would like to train a deep neural network on fewer training samples to reduce the time it takes to test my code. How can I take a subset of the CIFAR-10 dataset using Keras/TensorFlow? The following code trains on the complete CIFAR-10 dataset.
import numpy as np
import tensorflow

# load and prepare data
if WhichDataSet == 'CIFAR10':
    (x_train, y_train), (x_test, y_test) = tensorflow.keras.datasets.cifar10.load_data()
else:
    (x_train, y_train), (x_test, y_test) = tensorflow.keras.datasets.cifar100.load_data()
num_classes = np.unique(y_train).shape[0]
K_train = x_train.shape[0]
input_shape = x_train.shape[1:]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
y_train = tensorflow.keras.utils.to_categorical(y_train, num_classes)
y_test = tensorflow.keras.utils.to_categorical(y_test, num_classes)
Upvotes: 4
Views: 4502
Reputation: 16856
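Create a subset based on labels
To create a subset of the dataset that keeps only some labels, filter on the label array. For example, to build a new training set containing only the first five class labels, you can use the code below. Apply it to the integer labels returned by load_data(), before calling to_categorical.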
subset_x_train = x_train[np.isin(y_train, [0,1,2,3,4]).flatten()]
subset_y_train = y_train[np.isin(y_train, [0,1,2,3,4]).flatten()]
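The label filter above can be tried without downloading anything. The sketch below is only an illustration: it uses synthetic arrays as an assumed stand-in for what load_data() returns (images of shape (N, 32, 32, 3) and integer labels of shape (N, 1)), and the name keep_classes is hypothetical. It also builds the mask once and reuses it.

```python
import numpy as np

# Synthetic stand-in for CIFAR-10 (assumption: same shapes as load_data() output).
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
y_train = rng.integers(0, 10, size=(1000, 1))

keep_classes = [0, 1, 2, 3, 4]  # hypothetical name for the labels to keep

# Build the boolean mask once, then apply it to both images and labels
# so the two arrays stay aligned.
mask = np.isin(y_train, keep_classes).flatten()
subset_x_train = x_train[mask]
subset_y_train = y_train[mask]

print(subset_x_train.shape, subset_y_train.shape)
print(np.unique(subset_y_train))
```

If the subset is later passed to to_categorical, note that num_classes should be recomputed from the subset labels.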
Create a subset irrespective of labels
To create a 10% subset of the training data, you can use the code below.
# Shuffle the indices first (optional)
idx = np.arange(len(x_train))
np.random.shuffle(idx)
# Take the first 10% of the shuffled indices
n_subset = int(.10*len(idx))
subset_x_train = x_train[idx[:n_subset]]
subset_y_train = y_train[idx[:n_subset]]
Repeat the same for x_test and y_test to get a subset of the test data.
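Since the same steps are repeated for the training and test arrays, they can be wrapped in a small helper. This is a minimal sketch, not part of Keras: random_subset and its seed parameter are hypothetical names, and the arrays are synthetic placeholders with CIFAR-10-like shapes.

```python
import numpy as np

def random_subset(x, y, fraction, seed=None):
    """Return a random `fraction` of the paired arrays x and y."""
    rng = np.random.default_rng(seed)
    n_keep = int(fraction * len(x))
    # One permutation indexes both arrays, so images and labels stay paired.
    idx = rng.permutation(len(x))[:n_keep]
    return x[idx], y[idx]

# Placeholder data (assumption: shapes mimic CIFAR-10, sizes are reduced).
x_train = np.zeros((1000, 32, 32, 3), dtype=np.uint8)
y_train = (np.arange(1000) % 10).reshape(-1, 1)
x_test = np.zeros((200, 32, 32, 3), dtype=np.uint8)
y_test = (np.arange(200) % 10).reshape(-1, 1)

subset_x_train, subset_y_train = random_subset(x_train, y_train, 0.10, seed=0)
subset_x_test, subset_y_test = random_subset(x_test, y_test, 0.10, seed=0)
print(subset_x_train.shape, subset_x_test.shape)
```

Passing a fixed seed makes the subset reproducible across runs, which helps when comparing quick experiments.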
Upvotes: 6
Reputation: 2689
Use the pandas module to create a data frame and sample it accordingly.
import numpy as np
import pandas as pd
from tensorflow.keras import datasets
(train_images1, train_labels), (test_images1, test_labels) = datasets.cifar10.load_data()
# Normalize pixel values to be between 0 and 1
train_images, test_images = train_images1 / 255.0, test_images1 / 255.0
# Draw a random sample from the training set
df = pd.DataFrame(list(zip(train_images, train_labels)), columns =['Image', 'label'])
val = df.sample(frac=0.2)
X_train = np.array([ i for i in list(val['Image'])])
y_train = np.array([ [i[0]] for i in list(val['label'])])
The line val = df.sample(frac=0.2) samples 20 percent of the total data (frac is a fraction, so 0.2 means 20%).
You can use val = df.sample(n=5000) if you want a specific number of records, setting the value of n accordingly.
You can pass random_state=0 if you want the same results every time you run the code, e.g.:
val = df.sample(n=5000, random_state=0)
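The effect of frac, n and random_state can be checked on a small example. The sketch below is a variation on the approach above rather than the same code: it assumes synthetic arrays in place of CIFAR-10 and keeps only the labels in the DataFrame, using the sampled index to select images from the NumPy array instead of storing every image in a column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CIFAR-10 training arrays (assumption).
train_images = np.zeros((1000, 32, 32, 3), dtype=np.float32)
train_labels = (np.arange(1000) % 10).reshape(-1, 1)

# Store only the labels; the DataFrame index identifies each image.
df = pd.DataFrame({"label": train_labels.flatten()})

by_fraction = df.sample(frac=0.2, random_state=0)  # 20% of the rows
by_count = df.sample(n=50, random_state=0)         # exactly 50 rows
repeat = df.sample(n=50, random_state=0)           # same seed, same rows

# Use the sampled row positions to index the original arrays.
picked = by_fraction.index.to_numpy()
X_sub = train_images[picked]
y_sub = train_labels[picked]

print(len(by_fraction), len(by_count), by_count.index.equals(repeat.index))
```

Indexing the arrays this way avoids copying each image into a Python list, which can matter for the full 50,000-image training set.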
Upvotes: 3