Max Shek-wai Chu

Reputation: 21

How to do distributed training in TensorFlow, distributing only the input pipeline?

Currently I have four 1080 GPUs in my machine and a fairly powerful CPU for my image classification project. However, since my model is very small but my training data is very large (the whole dataset cannot fit in memory), I have to read and preprocess batches of samples on the fly. I have found that the GPUs are only utilised at around 50% while all my CPU cores are fully utilised.
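To give an idea, the pipeline is along these lines (the file pattern, augmentation steps and batch size below are just placeholders, not my actual code):

```python
import tensorflow as tf

# Placeholder file pattern -- the real dataset is too large for memory.
files = tf.data.Dataset.list_files("/data/train/*.jpg")

def load_and_augment(path):
    # Decoding and augmentation are the CPU-bound part.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize(image, [224, 224])
    return image

dataset = (files
           .map(load_and_augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.experimental.AUTOTUNE))  # overlap CPU work with GPU steps
```

Even with parallel map calls and prefetching, one machine's CPU cannot keep up with four GPUs.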

So one solution would be to offload the input pipeline (I am using tf.data.Dataset as my input pipeline) to one or more machines without GPUs, to speed it up and get more utilisation out of the GPUs. I see two options for distributing the input pipeline:

1) Distribute only the data augmentation: one machine reads all the raw images and sends them to other machines, which process them and send the results back to the machine with the GPUs for training.

2) Copy the whole dataset (or a part of it) to each CPU-only machine; each machine independently runs its own input pipeline and sends the processed batches back to the machine with the GPUs for training.

I think option 2) will be much easier to implement. I have no experience with writing distributed training across different machines, and all the examples I have read online are about distributed training on multiple machines that each have their own GPUs. In my case, since I only want to distribute the input pipeline, is there any simpler implementation example for that purpose?
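For what it's worth, recent TensorFlow releases (2.4+) ship a tf.data service that seems to implement essentially option 2): CPU-only worker processes execute the input pipeline and stream finished batches to the trainer over gRPC. Below is only a single-process sketch with a toy pipeline and made-up port numbers; in a real deployment the dispatcher and workers would run as standalone processes on the CPU-only machines:

```python
import tensorflow as tf

# Dispatcher: coordinates the workers (any machine can host it).
dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=5000))

# Worker: the process that actually executes the input pipeline.
# On a separate machine, dispatcher_address would be "<dispatcher-host>:5000".
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(dispatcher_address="localhost:5000"))

def expensive_preprocess(x):
    # Stand-in for the real image decoding + augmentation.
    return tf.cast(x, tf.float32) * 2.0

dataset = (tf.data.Dataset.range(1000)
           .map(expensive_preprocess)
           .batch(32))

# Everything defined above now runs on the workers; the GPU machine
# just pulls already-processed batches over the network.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service=dispatcher.target))

for batch in dataset.take(2):
    print(batch.shape)
```

With processing_mode="parallel_epochs", each worker processes its own copy of the dataset, which matches the copy-the-dataset idea in option 2); "distributed_epoch" would instead split one epoch across the workers.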

Upvotes: 1

Views: 58

Answers (0)
