Alexy

Reputation: 1100

Keras custom generator when batch_size doesn't match with amount of data

I'm using Keras with Python 2.7. I'm writing my own data generator to compute the batches for training. I have some questions about the data generator, based on this model seen here:

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):

    def __init__(self, list_IDs, ...):
        #init: store list_IDs, batch_size, shuffle flag, etc.

    def __len__(self):
        # number of batches per epoch
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        # indexes of the samples that belong to batch number `index`
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        #generate data (load/build the arrays for the selected IDs)
        return X, y

Okay, so here are my questions:

Can you confirm my thinking about the order in which the functions are called? Here it is:

- __init__
- loop for each epoch:
    - loop for each batch:
        - __len__
        - __getitem__ (+ data generation)
    - on_epoch_end

If you know a way to debug the generator I would like to know it; breakpoints and prints aren't working with this..
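
(For reference, a keras.utils.Sequence is an ordinary Python object, so it can be exercised by hand outside of training and inspected with prints or a debugger. A minimal sketch, assuming the DataGenerator above; the constructor arguments here are placeholders:)

gen = DataGenerator(list_IDs, batch_size=64)  # placeholder constructor arguments

print(len(gen))          # number of batches per epoch (__len__)
X, y = gen[0]            # first batch (__getitem__ with index 0)
print(X.shape, y.shape)  # check the shapes the model will receive
gen.on_epoch_end()       # reshuffle the indexes, as Keras does after each epoch

If prints seem to be swallowed during fit_generator, check whether use_multiprocessing=True with workers is being used, since the generator then runs in child processes and its output will not show up in the main debugger session.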

Moreover, I have a problematic situation, but I think everybody has this problem:

For example, say I have 200 data samples (and 200 matching labels) and I want a batch size of 64. If I understand correctly, __len__ will give 200/64 = 3 (instead of 3.125). So one epoch will be done with 3 batches? What about the rest of the data? I get an error because my amount of data is not a multiple of the batch size...

Second example: I have 200 data samples and I want a batch size of 256. What do I have to do in this case to adapt my generator? I thought about checking whether the batch_size is larger than my amount of data and, if so, feeding the CNN with a single batch, but that batch will not have the expected size, so I think it will raise an error?
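
(For concreteness, a small arithmetic sketch of the two cases, assuming a ceil-based __len__ that keeps the leftover samples as one final, smaller batch; the function name is only for illustration:)

import numpy as np

def n_batches(n_samples, batch_size):
    # ceil so the leftover samples form one last, smaller batch
    return int(np.ceil(n_samples / float(batch_size)))

print(n_batches(200, 64))   # 4 -> batches of 64, 64, 64 and 8
print(n_batches(200, 256))  # 1 -> a single batch of 200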

Thank you for reading. I prefer to give pseudo-code because my questions are more about theory than coding errors!

Upvotes: 2

Views: 3322

Answers (4)

Archil K Srivastava

Reputation: 31

You might be able to debug the generator if you pass run_eagerly=True to the model.compile function. The documentation says:

Running eagerly means that your model will be run step by step, like Python code. Your model might run slower, but it should become easier for you to debug it by stepping into individual layer calls.
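
A minimal sketch of where that flag goes; this assumes the TensorFlow 2.x tf.keras API (the argument does not exist in older standalone Keras, which the question's Python 2.7 setup suggests):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(
    optimizer='adam',
    loss='mse',
    run_eagerly=True,  # execute step by step so breakpoints/prints inside the model work
)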

Upvotes: 0

J B

Reputation: 440

The debugging aspect of this question sounds like the same question I tried posting recently and didn't get an answer to. I eventually figured it out, and I think it's a simple principle that is easily missed by beginners. You cannot break/debug at the level of the Keras source code if it's tensorflow-gpu underneath: that Keras code gets 'translated' to run on the GPU. I thought it might be possible to break if running TensorFlow on the CPU, but no, that's not possible either. There are ways to debug/break on the GPU at the TensorFlow level, but that goes beyond the simplicity of high-level Keras.

Upvotes: 1

mujjiga

Reputation: 16916

  • __len__ : returns the number of batches
  • __getitem__ : returns the i-th batch

Normally you never specify the batch size in the model architecture, because it is a training parameter, not a model parameter. So it is OK to have different batch sizes while training.

Example

from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
from keras.utils import to_categorical
import keras
import numpy as np

#create model
model = Sequential()
#add model layers
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(10,10,1)))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
class DataGenerator(keras.utils.Sequence):
    def __init__(self, X, y, batch_size):
        self.X = X
        self.y = y
        self.batch_size = batch_size

    def __len__(self):
        # integer division, plus one extra batch if there is a remainder (i.e. ceil)
        l = int(len(self.X) / self.batch_size)
        if l*self.batch_size < len(self.X):
            l += 1
        return l

    def __getitem__(self, index):
        # slicing past the end of the array simply yields a shorter final batch
        X = self.X[index*self.batch_size:(index+1)*self.batch_size]
        y = self.y[index*self.batch_size:(index+1)*self.batch_size]
        return X, y

X = np.random.rand(200,10,10,1)
y = to_categorical(np.random.randint(0,2,200))
model.fit_generator(DataGenerator(X,y,13), epochs=10)

Output:

Epoch 1/10 16/16 [==============================] - 0s 2ms/step - loss: 0.6774 - acc: 0.6097

As you can see, it has run 16 batches in one epoch, i.e. 15 full batches of 13 plus a final batch of 5 (15*13 + 5 = 200).

Upvotes: 3

mxdbld

Reputation: 17755

Your generator runs in your Python environment and is called by Keras; if you cannot debug it, the reason is elsewhere.

cf : https://keras.io/utils/#sequence

__len__ : gives you the number of minibatches

__getitem__ : gives the i-th minibatch

You don't have to know when or where they are called; the order is more like this:

- __init__
- __len__
- loop for each epoch:
    - loop for each batch:
        - __getitem__
    - on_epoch_end

As for the minibatch size, you have two (classic) choices: either truncate the last batch, or fill it by picking additional entries from your set again. If you randomize your training set every epoch, as you should, there will be no over- or under-exposure of some items over time.
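
A rough sketch of the "fill by picking again" option; the class name and the resampling strategy below are illustrative, not prescribed by Keras (X and y are assumed to be NumPy arrays):

import numpy as np
import keras

class PaddedGenerator(keras.utils.Sequence):
    # pads the last, smaller batch by re-drawing random samples from the set

    def __init__(self, X, y, batch_size, shuffle=True):
        self.X, self.y = X, y
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.indexes = np.arange(len(self.X))

    def __len__(self):
        # ceil: the leftover samples still get a batch of their own
        return int(np.ceil(len(self.X) / float(self.batch_size)))

    def __getitem__(self, index):
        idx = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        missing = self.batch_size - len(idx)
        if missing > 0:
            # last batch is short: fill it with randomly re-picked indexes
            idx = np.concatenate([idx, np.random.choice(self.indexes, missing)])
        return self.X[idx], self.y[idx]

    def on_epoch_end(self):
        # reshuffle every epoch so no item is systematically over- or under-exposed
        if self.shuffle:
            np.random.shuffle(self.indexes)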

Upvotes: 0
