Reputation: 664
Theoretical question here. I understand that when dealing with datasets that cannot fit into memory on a single machine, spark + EMR is a great way to go.
However, I would also like to use tensorflow instead of spark's ml lib algorithms to perform deep learning on these large datasets.
From my research I see that I could potentially use a combination of pyspark, elephas and EMR to achieve this. Alternatively there is BigDL and sparkdl.
Am I going about this the wrong way? What is best practice for deep learning on data that cannot fit into memory? Should I use online learning or batch training instead? This post seems to say that "most high-performance deep learning implementations are single-node only"
Any help to point me in the right direction would be greatly appreciated.
Upvotes: 1
Views: 302
Reputation: 37
In TensorFlow, you can use tf.data.Dataset.from_generator so you can generate your dataset at runtime without any storage hassles. See link for example https://www.codespeedy.com/what-is-tf-data-dataset-from_generator-in-tensorflow/
Upvotes: 1
Reputation: 2194
As you mention "fitting massive dataset to memory", I understand that you are trying to load all data to memory at once and start training. Hence, I give the reply based on this assumption.
General mentality is that if you cannot fit the data to your resources, divide data into smaller chunks and train in an iterative way.
1- Load data one by one instead of trying to load all at once. If you create an execution workflow as "Load Data -> Train -> Release Data (This can be done automatically by garbage collectors) -> Restart" , you can understand how much resource is needed to train single data.
2- Use mini-batches. As soon as you get the resource information from #1, you can make an easy calculation to estimate the mini-batch size. For example, if training single data consumes 1.5 GB of RAM, and your GPU has 8 GB of RAM, theoretically you may train mini-batches with size 5 at once.
3- If the resources are not enough to train even 1-sized single batch, in this case, you may think about increasing your PC capacity or decreasing your model capacity / layers / features. Alternatively, you can go for cloud computing solutions.
Upvotes: 0