Rebecca Schley

Reputation: 41

Key error in shuffled mini-batches for CNN implementation

I am trying to modify a CNN trained on MNIST data to work with my data. I'm not very familiar with Python's data structures and have hit a roadblock. Any suggestions for resolving this issue are appreciated.

This is the batching code that raises the KeyError:

from __future__ import division, print_function, absolute_import
import numpy as np
import pandas as pd

# training data modified to add column labels; label, pixel0, pixel1, ...
train = pd.read_csv("/Users/rebeccaschley/data/train-data3.txt")
# extract left-most columns containing labels from each data set
a = train.label
# convert to np arrays in order to use reshape method
train_labels = a.to_frame()
# remove label column and extract pixel values only from training and testing data
train_pixels = train.drop('label', 1).values

IMG_SIZE=16
x_trainr = np.array(train_pixels).reshape(-1, IMG_SIZE,IMG_SIZE,1)

batch_size = 250  
    
# for training
def batch_data(source, target, batch_size):

    # Shuffle data
    shuffle_indices = np.random.permutation(np.arange(len(target)))
    source = source[shuffle_indices]
    target = target[shuffle_indices]

    for batch_i in range(0, len(source)//batch_size):
        start_i = batch_i * batch_size
        source_batch = source[start_i:start_i + batch_size]
        target_batch = target[start_i:start_i + batch_size]

        yield np.array(source_batch), np.array(target_batch)
                   
batch_x, batch_y = batch_data(x_trainr, train_labels, batch_size)

My training data consists of 16 x 16 images, one per row, with the label in the first column followed by 256 columns of pixel values. Here is a link to the data.
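
For reference, the frame loaded from that file should look like this (a quick check, assuming the header row described in the code comment above):

print(train.shape)        # (6292, 257): one label column plus 256 pixel columns
print(train.columns[:3])  # Index(['label', 'pixel0', 'pixel1'], dtype='object')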

This is the error I get:

Traceback (most recent call last):

  File "/Users/rebeccaschley/.spyder-py3/test.py", line 34, in <module>
    batch_x, batch_y = batch_data(x_trainr, train_labels, batch_size)

  File "/Users/rebeccaschley/.spyder-py3/test.py", line 25, in batch_data
    target = target[shuffle_indices]

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 2908, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)

  File "/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")

KeyError: "None of [Int64Index([5912, 2709, 3697, 3110, 3859, 5057, 1497, 5910, 5715, 1578,\n            ...\n             216,  171, 6031,  785, 4474, 3166, 2547, 5418, 3759, 4723],\n           dtype='int64', length=6292)] are in the [columns]"

Upvotes: 1

Views: 423

Answers (1)

n1colas.m

Reputation: 3989

Convert train.label and train_pixels to NumPy arrays with to_numpy; this lets you call reshape directly on train_pixels. In your batch_data function, change np.arange(len(target)) to np.arange(target.shape[0]) so the shuffle indices cover every row in the file. There is no need to wrap the slices in np.array in the yield statement, since they are already NumPy arrays. Finally, iterate over the generator to obtain batches of images with shape (250, 16, 16, 1) and labels with shape (250,).
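
The KeyError in your traceback comes from train_labels still being a DataFrame: indexing a DataFrame with an array of integers is treated as a column-label lookup, not a row selection, so pandas searches for columns named 5912, 2709, ... and fails. A minimal sketch of the difference, using a small throwaway DataFrame rather than your data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"label": [3, 1, 4, 1]})
idx = np.random.permutation(np.arange(len(df)))

# df[idx]                           # raises KeyError: the integers are read as column names
print(df.iloc[idx])                 # positional row selection on the DataFrame works
print(df["label"].to_numpy()[idx])  # plain NumPy indexing works after to_numpy

With that in mind, the full corrected script looks like this: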

from __future__ import division, print_function, absolute_import
import numpy as np
import pandas as pd

# reading CSV file
train = pd.read_csv("train-data3.txt")
print(train.shape)
# (6292, 257)

# convert to numpy array
train_labels = train.label.to_numpy()
train_pixels = train.drop('label', axis=1).to_numpy()

IMG_SIZE = 16
x_trainr = train_pixels.reshape(-1, IMG_SIZE, IMG_SIZE, 1)

batch_size = 250

def batch_data(source, target, batch_size):

    # use target shape[0] to get number of lines
    shuffle_indices = np.random.permutation(np.arange(target.shape[0]))
    source = source[shuffle_indices]
    target = target[shuffle_indices]

    for batch_i in range(0, len(source)//batch_size):
        start_i = batch_i * batch_size
        source_batch = source[start_i:start_i + batch_size]
        target_batch = target[start_i:start_i + batch_size]

        yield source_batch, target_batch

batch_output = batch_data(x_trainr, train_labels, batch_size)

for img, label in batch_output:
    print(img.shape)
    print(label.shape)

Output

(250, 16, 16, 1)
(250,)
(250, 16, 16, 1)
(250,)
...
...

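If you would rather keep the labels as a pandas Series instead of converting them, positional indexing with .iloc avoids the same KeyError. A sketch of that variant (only the shuffling and slicing lines change):

train_labels = train.label  # keep as a pandas Series

def batch_data(source, target, batch_size):

    # shuffle rows; .iloc does positional selection on the Series
    shuffle_indices = np.random.permutation(np.arange(target.shape[0]))
    source = source[shuffle_indices]
    target = target.iloc[shuffle_indices]

    for batch_i in range(0, len(source)//batch_size):
        start_i = batch_i * batch_size
        source_batch = source[start_i:start_i + batch_size]
        target_batch = target.iloc[start_i:start_i + batch_size].to_numpy()

        yield source_batch, target_batch
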
Upvotes: 0
