Reputation: 22254
How does a TensorFlow dataset handle large data that cannot fit into the memory of a single server?
Spark RDD can handle large data across multiple nodes. For the question Tensorflow Transform: How to find the mean of a variable over the entire dataset, the answer is to use TensorFlow Transform, which uses Apache Beam and therefore requires a distributed computation cluster such as Spark.
If we have a large dataset, say a CSV file that is 50 GB, how do we calculate the mean or other similar statistics?
Hence I suppose TensorFlow requires a multi-node cluster, but it is not clear whether TensorFlow has its own cluster implementation or re-uses existing technologies. Since TensorFlow pre-processing (e.g. getting the mean or std of a column) requires Apache Beam, I guess it is Apache Beam based too, but I am not sure.
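For reference, this is roughly how such a full-pass statistic is expressed with TensorFlow Transform, as far as I understand it (a minimal sketch; the feature name x is hypothetical, and the actual computation is executed by the Apache Beam pipeline that runs the transform):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # "x" is a hypothetical feature name. tft.mean is a full-pass analyzer:
    # the mean is computed over the entire dataset by the Apache Beam
    # pipeline running this preprocessing_fn, not in memory here.
    x = inputs['x']
    return {'x_centered': x - tft.mean(x)}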
A Google paper, Large-Scale Machine Learning on Heterogeneous Distributed Systems, shows multiple workers.
The article TensorFlow: A new paradigm for large scale ML in distributed systems describes the system components:
In terms of system components, TensorFlow consists of Master, Worker and Client for distributed coordination and execution.
The GitHub repo TensorFlow2-tutorial/05-distributed-training/ shows TF_CONFIG specifying each node's IP and port:
TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python worker.py
The TensorFlow GitHub example Distributed TensorFlow has the section below, but I do not see the node setup details:
Create a tf.train.ClusterSpec to describe the cluster
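Presumably that looks something like the following minimal sketch (re-using the worker addresses from the TF_CONFIG example above):

import tensorflow as tf

# Describe the cluster: a "worker" job with two tasks, one per node.
cluster = tf.train.ClusterSpec({
    "worker": ["10.1.10.58:12345", "10.1.10.250:12345"]
})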
Hence there is apparently a way to set up a TensorFlow cluster, which I suppose handles loading a large dataset into a TF dataset.
However, Install TensorFlow 2 only shows:
# Current stable release for CPU and GPU
pip install tensorflow
Please point me to step-by-step documentation on how to set up a TensorFlow multi-node cluster, and to resources that explain how large data loading is handled in TF (similar to the Spark RDD/DataFrame explanations and internals).
Upvotes: 1
Views: 481
Reputation: 653
You need to use generator functions that pull in chunked data. Each chunk you want to send is returned through a **yield** statement. TensorFlow allows you to create a Dataset that returns the tensors yielded by a generator function. This dataset can then be consumed by the .fit method (or your own training loop) as follows:
import itertools
import tensorflow as tf

def gen():
    for i in itertools.count(1):
        yield (i, [1] * i)

# Build the Dataset from the generator; elements are produced lazily,
# so nothing has to be materialised in memory up front.
dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

# Peek at the first three generated elements.
list(dataset.take(3).as_numpy_iterator())

# `train` is a placeholder for your own training loop or model.fit(dataset, ...).
train(dataset, max_steps=100)
This approach has several benefits: the data is generated lazily, so only the current chunk needs to be in memory, and the generator can stream from an arbitrarily large file.
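Applied to the 50 GB CSV from your question, the same pattern might look roughly like this (a sketch only: the file name, column names and chunk size are hypothetical):

import pandas as pd
import tensorflow as tf

def csv_gen(path="big.csv", chunksize=100_000):
    # Hypothetical path and columns; pandas reads the CSV in chunks,
    # so only one chunk is ever held in memory at a time.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk["feature"].to_numpy(), chunk["label"].to_numpy()

dataset = tf.data.Dataset.from_generator(
    csv_gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.float64),
        tf.TensorSpec(shape=(None,), dtype=tf.float64),
    ),
)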
Upvotes: 1