Reputation: 165
I'm trying to implement a custom data generator that reads data from csv file(s) in chunks using pandas.read_csv. I tested it with model.predict_generator, but the number of predictions returned is less than expected (in my case, 248192 out of 253457).
Custom generator
import ast
import glob
import math

import numpy as np
import pandas as pd

# pad_sequences and to_categorical come from Keras; user2idx, movie2idx,
# EMB_MATRIX_SIZE and SEQ_LENGTH are defined elsewhere.

class TestDataGenerator:
    def __init__(self, directory, batch_size=1024):
        self.directory = directory
        self.batch_size = batch_size
        self.chunk_size = 10000
        self.samples = 0

    def _to_movie_id(self, ids):
        # 'watched' is stored as the string form of a list of movie ids
        ids = ast.literal_eval(ids)
        if ids == []:
            return [EMB_MATRIX_SIZE - 1]
        else:
            return [movie2idx[str(movie_id)] for movie_id in ids]

    def generate(self):
        csv_files = glob.glob(self.directory + '/*.csv')
        while True:
            for file in csv_files:
                df = pd.read_csv(file, chunksize=self.chunk_size)
                for df_chunk in df:
                    # batches are cut per chunk, so the last one of each chunk may be short
                    chunk_steps = math.ceil(len(df_chunk) / self.batch_size)
                    for i in range(chunk_steps):
                        batch = df_chunk[i * self.batch_size:(i + 1) * self.batch_size]
                        X_batch, y_batch = self.preprocess(batch)
                        self.samples += len(batch)
                        yield X_batch, y_batch

    def preprocess(self, df):
        X_user = df['user'].apply(lambda x: user2idx[str(x)]).values
        X_watched = df['watched'].apply(self._to_movie_id).values
        X_watched_padded = pad_sequences(X_watched, maxlen=SEQ_LENGTH, value=0)
        ohe = df['movie'].apply(lambda x: to_categorical(movie2idx[x], num_classes=len(movie2idx)))
        X = [X_user, X_watched_padded]
        y = np.array([o.tolist() for o in ohe])
        return X, y
Run model.predict_generator
batch_size=1024
n_samples_test = 253457
test_dir = 'folder/'
test_gen = TestDataGenerator(test_dir, batch_size=batch_size)
next_test_gen = test_gen.generate()
preds = model.predict_generator(next_test_gen, steps=math.ceil(n_samples_test/batch_size))
After running model.predict_generator, preds has 248192 rows, which is fewer than the actual 253457. It looks like a few batches are missing. I also tested generate on its own, without Keras, and it behaved as expected, returning the correct number of samples in the csv file. Also, before generate yields a value, I keep track of the number of samples processed in samples. Surprisingly, the value of samples is 250000. So I'm fairly sure I have done something wrong on the Keras side.
Note that I also tried setting max_queue_size=1 and making generate thread-safe, but had no luck. I placed only 1 csv file under test_dir for simplicity. I'm using Keras 2.1.2-tf embedded in TensorFlow 1.5.0.
I did some research on how this can be done but haven't come across a useful example yet. What is wrong with this implementation?
Thanks
Peeranat F.
Upvotes: 5
Views: 2925
Reputation: 40516
Well, this is tricky. So let's dive into the problem:
How predict_generator (like fit_generator) works when the batch provided is smaller than batch_size: As you may see, many of the batches you provide to predict_generator are smaller than batch_size. This happens every time you take the last batch of a chunk, and therefore of every file. Usually the number of texts is not divisible by the batch size, so there are not enough texts to fill the last batch. This ends up feeding fewer examples to the model.
And here is the tricky part: keras ignores the smaller size, treats this as a valid generator step and returns values for the incomplete batch.
So why are texts missing? Let me show you by example. Let's assume that you have 2 files with 5 texts each and your batch_size is 4. This is what your batches would look like:
[1t1, 1t2, 1t3, 1t4], [1t5], [2t1, 2t2, 2t3, 2t4], [2t5]
As you may see, the actual number of steps needed is 4, not the 3 obtained from math.ceil(10 / 4). That formula would be appropriate for these batches:
[1t1, 1t2, 1t3, 1t4], [1t5, 2t1, 2t2, 2t3], [2t4, 2t5]
But the batches returned from your generator are not like these.
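If you would rather keep steps = math.ceil(n_samples / batch_size), one option (not what the original generator does, just a sketch) is to carry leftover rows across chunk and file boundaries so that only the very last batch can be short. The helper name rebatch is made up for illustration, and plain lists stand in for DataFrame chunks. Unlike generate, this sketch is finite rather than wrapped in while True.

```python
from typing import Iterable, Iterator, List, Sequence


def rebatch(chunks: Iterable[Sequence], batch_size: int) -> Iterator[List]:
    """Yield full batches across chunk boundaries; only the last may be short."""
    buffer: List = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    if buffer:
        # Leftover rows form the single, final short batch.
        yield buffer


# Two "files" with 5 texts each, as in the example above.
files = [["1t1", "1t2", "1t3", "1t4", "1t5"],
         ["2t1", "2t2", "2t3", "2t4", "2t5"]]
batches = list(rebatch(files, batch_size=4))
print(batches)
```

This produces the second layout shown above, 3 batches for 10 texts, so the ceil formula and the generator agree.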
How to solve the problem? You need to make your generator compute the actual number of steps needed:
def steps_needed(self):
    steps = 0
    csv_files = glob.glob(self.directory + '/*.csv')
    for file in csv_files:
        df = pd.read_csv(file, chunksize=self.chunk_size)
        for df_chunk in df:
            chunk_steps = math.ceil(len(df_chunk) / self.batch_size)
            steps += chunk_steps
    return steps
This function computes exactly how many batches your generator will return.
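To check this per-chunk counting against the numbers in the question, here is a small sketch in plain Python (no pandas or Keras needed). The helper name batch_sizes is made up for illustration, and it assumes a single file of 253457 rows read with chunk_size=10000 and batch_size=1024, as described in the question.

```python
import math


def batch_sizes(n_rows: int, chunk_size: int, batch_size: int) -> list:
    """Return the batch sizes in the order a per-chunk generator yields them."""
    sizes = []
    for start in range(0, n_rows, chunk_size):
        chunk_len = min(chunk_size, n_rows - start)
        # Batches are cut inside each chunk, so the last one may be short.
        for offset in range(0, chunk_len, batch_size):
            sizes.append(min(batch_size, chunk_len - offset))
    return sizes


sizes = batch_sizes(253457, 10000, 1024)
naive_steps = math.ceil(253457 / 1024)

print(len(sizes))                 # 254 batches actually produced
print(naive_steps)                # 248 steps requested in the question
print(sum(sizes[:naive_steps]))   # 248192 rows covered by those 248 steps
```

So steps should be 254 rather than 248; with 248 steps only 248192 rows reach the model, which matches the count reported in the question. The samples counter of 250000 equals the first 250 batches, which would fit the Keras queue having pulled two batches ahead of what was consumed, though that last part is an inference.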
Cheers :)
Upvotes: 8