csz-carrot

Reputation: 275

TensorFlow data loading randomly slows down during training

I have loaded all of the training data into memory, which only consumes 7% of total memory, and the following framework is used to train the model:

# build graph
......
# data producer
class DataProducer(object):
  # a single feature has multiple labels and needs to be trained separately for each label
  # to avoid copying the features multiple times, self.ft_idxs indexes the relationship between features and labels
  def yield_trn_batch(self, batch_size):
    for i in xrange(0, self.num_data, batch_size):
      fts = self.fts[self.ft_idxs[self.shuffled_idxs[i: i+batch_size]]]
      labels = self.labels[self.shuffled_idxs[i: i+batch_size]]
      yield fts, labels

# training
for feature, label in data.yield_trn_batch(batch_size):
  sess.run(model.train_op, feed_dict={model.feature: feature, model.label: label})

However, the training process randomly slows down when the feature dimensionality is high. My diagnosis so far is as follows:

  1. The graph is predefined, so this is not the situation described in tensorflow slow performance.
  2. The actual running time of sess.run() is stable; the timeline of a training batch (below) looks normal. [timeline of a training batch]
  3. The slow part is data.yield_trn_batch(). At the beginning it took 0.01 s to load one minibatch, but after several epochs it became unstable and sometimes took over 1 s to load a minibatch. However, when I commented out the sess.run() and only ran data.yield_trn_batch(), it was as fast as normal. I don't use queues, so it is probably not the situation in dequeue many operation very slow.

I suspect that running the graph somehow affects the data loading, but I don't know why or how to solve it (maybe by using another thread to load data, as sketched below?). Can anybody help?
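A minimal sketch of that last idea, assuming the DataProducer above (only the names are borrowed from my code; the queue-based prefetcher itself is hypothetical): a background thread keeps a small bounded queue of minibatches filled while the main thread runs sess.run(), so batch preparation and graph execution overlap.

import threading
import Queue  # the `queue` module on Python 3

def threaded_batches(producer, batch_size, capacity=8):
  # Hypothetical prefetcher: `producer` is assumed to be the DataProducer above.
  q = Queue.Queue(maxsize=capacity)
  sentinel = object()

  def worker():
    for batch in producer.yield_trn_batch(batch_size):
      q.put(batch)
    q.put(sentinel)  # signal the consumer that the epoch is finished

  t = threading.Thread(target=worker)
  t.daemon = True
  t.start()
  while True:
    batch = q.get()
    if batch is sentinel:
      return
    yield batch

# usage: same training loop, but batches are prepared ahead of time
# for feature, label in threaded_batches(data, batch_size):
#   sess.run(model.train_op, feed_dict={model.feature: feature, model.label: label})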

Upvotes: 0

Views: 863

Answers (1)

mrry

Reputation: 126154

Based on our conversation in the comments, it appears that the slowdown is due to memory pressure caused by allocating a large number of NumPy arrays. Although the NumPy arrays are properly garbage collected when they are no longer used, the default malloc() implementation will not reuse the freed memory, and will gradually increase the size of the heap (and the virtual size of the process) by calling the brk() system call.
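One allocator-agnostic way to reduce that churn (a sketch of my own, not something from the discussion above; it reuses the attribute names from the question's DataProducer) is to preallocate the batch buffers once and fill them in place with np.take(..., out=...), so roughly the same memory is recycled on every iteration instead of a fresh pair of large arrays being allocated and freed:

import numpy as np

class ReusingDataProducer(object):
  # Hypothetical variant of the question's DataProducer: it fills two
  # preallocated buffers instead of building fresh arrays every minibatch.
  def yield_trn_batch(self, batch_size):
    ft_buf = np.empty((batch_size,) + self.fts.shape[1:], dtype=self.fts.dtype)
    label_buf = np.empty((batch_size,) + self.labels.shape[1:], dtype=self.labels.dtype)
    for i in xrange(0, self.num_data, batch_size):
      idxs = self.shuffled_idxs[i: i+batch_size]
      n = len(idxs)
      # Write the gathered rows into the existing buffers; feed_dict copies
      # the values at sess.run() time, so reusing the buffers is safe.
      np.take(self.fts, self.ft_idxs[idxs], axis=0, out=ft_buf[:n])
      np.take(self.labels, idxs, axis=0, out=label_buf[:n])
      yield ft_buf[:n], label_buf[:n]

This does not change what is fed to TensorFlow; it only changes how often large temporary arrays are allocated and freed.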

One workaround is to switch allocator libraries, which can fix the address-space leak: use the tcmalloc allocator instead of the default malloc() for your TensorFlow process. tcmalloc's allocation policy is better suited to repeatedly allocating and recycling buffers of the same size, and it will not need to grow the heap over time, which should lead to better performance.
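As a rough illustration of that workaround (the script name and library path below are assumptions, and the exact path depends on your distribution; on Ubuntu, tcmalloc typically comes from the google-perftools packages), the allocator can be swapped in by preloading the shared library when the training process is launched:

import os
import subprocess

# Hypothetical launcher: start the (assumed) training script train.py with
# tcmalloc preloaded. Equivalently, from a shell:
#   LD_PRELOAD=/usr/lib/libtcmalloc.so.4 python train.py
env = dict(os.environ)
env['LD_PRELOAD'] = '/usr/lib/libtcmalloc.so.4'  # adjust the path for your system
subprocess.check_call(['python', 'train.py'], env=env)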

Upvotes: 1
