Keras' predict_generator not returning correct number of samples

Question

I'm trying to implement a custom data generator that reads data from csv file(s) in chunks using pandas.read_csv. I tested it with model.predict_generator but the number of predictions returned is less than expected (in my case, 248192 out of 253457).

Custom generator

class TestDataGenerator:

def __init__(self, directory, batch_size=1024):
    self.directory = directory
    self.batch_size = batch_size
    self.chunk_size=10000
    self.samples = 0

def _to_movie_id(self, ids):
    ids = ast.literal_eval(ids)
    if ids == []:
        return [EMB_MATRIX_SIZE-1]
    else:
        return [movie2idx[str(movie_id)] for movie_id in ids]

def generate(self):
    csv_files = glob.glob(self.directory + '/*.csv')
    while True:
        for file in csv_files:
            df = pd.read_csv(file, chunksize=self.chunk_size)
            for df_chunk in df:
                chunk_steps = math.ceil(len(df_chunk) / self.batch_size)
                for i in range(chunk_steps):
                    batch = df_chunk[i * self.batch_size:(i + 1) * self.batch_size]
                    X_batch, y_batch = self.preprocess(batch)
                    self.samples += len(batch)
                    yield X_batch, y_batch


def preprocess(self, df):
    X_user = df['user'].apply(lambda x: user2idx[str(x)]).values
    X_watched = df['watched'].apply(self._to_movie_id).values
    X_watched_padded = pad_sequences(X_watched, maxlen=SEQ_LENGTH, value=0)

    ohe = df['movie'].apply(lambda x: to_categorical(movie2idx[x], num_classes=len(movie2idx)))
    X = [X_user, X_watched_padded]
    y = np.array([o.tolist() for o in ohe])

    return X, y

Run model.predict_generator

batch_size=1024
n_samples_test = 253457
test_dir = 'folder/'
test_gen = TestDataGenerator(test_dir, batch_size=batch_size)
next_test_gen = test_gen.generate()
preds = model.predict_generator(next_test_gen, steps=math.ceil(n_samples_test/batch_size))

After running model.predict_generator, the number of rows for preds is 248192 which is less than the actual 253457. It looks like it's missing a few number of epochs. I also tested generate individually without interacting with Keras and it behaved as expected returning the correct number of samples in csv file. Also, before the generate yields a value, I keep track of the number of samples processed with samples. Surprisingly, the value for samples is 250000. So, I'm pretty sure I might have done something with Keras.

Note that I also tried setting max_queue_size=1, and making generate thread-safe but got no luck. I placed only 1 csv file under test_dir for simplicity. I'm using Keras 2.1.2-tf embedded in Tensorflow 1.5.0.

I did some research on how this can be done but haven't come across a useful example yet. What is wrong with this implementation?

Thanks

Peeranat F.

Keras' predict_generator not returning correct number of samples

Answers (1)

Related Questions

Keras&#39; predict_generator not returning correct number of samples

Answers (1)

Related Questions

Keras' predict_generator not returning correct number of samples