mon

Reputation: 22254

How does the TensorFlow dataset handle large data that cannot fit into the memory in a server?

Question


Spark RDDs can handle large data across multiple nodes. For the question Tensorflow Transform: How to find the mean of a variable over the entire dataset, the answer is to use TensorFlow Transform, which uses Apache Beam and therefore requires a distributed computation cluster such as Spark.

If we have a large dataset, say a CSV file that is 50 GB, how do we calculate the mean or other similar statistics?
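
As I understand it, such a full-pass statistic is expressed in TensorFlow Transform as a preprocessing_fn, roughly like this (the price column is just a hypothetical example):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # 'price' is a hypothetical numeric column; tft.mean requires a full
  # pass over the entire dataset, which the Beam pipeline performs.
  price = inputs['price']
  return {'price_centered': price - tft.mean(price)}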

Hence I suppose TensorFlow requires a multi-node cluster, but it is not clear whether TensorFlow has its own cluster implementation or re-uses existing technologies. Since TensorFlow pre-processing (e.g. getting the mean or std of a column) requires Apache Beam, I guess it is Apache Beam based too, but I am not sure.

A Google paper, Large-Scale Machine Learning on Heterogeneous Distributed Systems, shows multiple workers.

This article, TensorFlow: A new paradigm for large scale ML in distributed systems, describes the system components:

In terms of system components, TensorFlow consists of Master, Worker and Client for distributed coordination and execution.

This GitHub tutorial, TensorFlow2-tutorial/05-distributed-training/, shows TF_CONFIG specifying the node IP/port:

TF_CONFIG='{"cluster": {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]}, "task": {"index": 0, "type": "worker"}}' python worker.py
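
From what I can tell, the worker script then picks the cluster layout up from TF_CONFIG through a distribution strategy. A minimal sketch of what such a worker.py could contain (not the repository's actual file; the tiny model is made up):

import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout from the
# TF_CONFIG environment variable set on each node.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
  model.compile(optimizer='adam', loss='mse')

# model.fit(dataset) then runs synchronized training across both workers,
# with tf.data sharding the input between them.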

The TensorFlow example on GitHub, Distributed TensorFlow, has the section below, but I do not see the node setup details.

Create a tf.train.ClusterSpec to describe the cluster
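
Following that guide, describing the cluster would presumably look something like this (re-using the worker addresses from the TF_CONFIG example above):

import tensorflow as tf

# Describe the same two-worker cluster as in the TF_CONFIG example.
cluster = tf.train.ClusterSpec(
    {"worker": ["10.1.10.58:12345", "10.1.10.250:12345"]})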

Hence apparently there is a way to set up a TensorFlow cluster, which I suppose handles loading a large dataset into a TF dataset.

However, the Install TensorFlow 2 page only shows:

# Current stable release for CPU and GPU
pip install tensorflow

Please point to step-by-step documentation on how to set up a TensorFlow multi-node cluster, and to resources that explain how large data loading is handled in TF (similar to the Spark RDD/DataFrame explanation and internals).

Upvotes: 1

Views: 481

Answers (1)

Isaak Eriksson

Reputation: 653

You need to use generator functions that pull in chunked data. Each chunk you want to send is returned through a yield statement. TensorFlow allows you to create a Dataset that returns Tensors yielded by a generator function, and that dataset is finally consumed by the .fit method as follows:

import itertools
import tensorflow as tf

def gen():
  # Yield one element at a time (a scalar and a variable-length vector),
  # so the full dataset never has to be materialised in memory.
  for i in itertools.count(1):
    yield (i, [1] * i)

dataset = tf.data.Dataset.from_generator(
    gen,
    (tf.int64, tf.int64),
    (tf.TensorShape([]), tf.TensorShape([None])))

# Peek at the first three elements produced by the generator.
list(dataset.take(3).as_numpy_iterator())

# `train` stands in for your training entry point, e.g. model.fit(dataset).
train(dataset, max_steps=100)

This approach has several benefits:

  • it limits RAM usage during training (to the size of the chunks)
  • it allows one to stream data asynchronously (e.g. from a large file, a remote database, a web-scraping bot, etc.)
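
For example, applied to the 50 GB CSV from the question, the generator would read the file in chunks; here is a rough sketch (the file name, chunk size, and the feature/label column names are hypothetical):

import pandas as pd
import tensorflow as tf

def csv_chunks():
  # Stream the CSV in chunks so only `chunksize` rows sit in memory at once.
  for chunk in pd.read_csv("big_file.csv", chunksize=10_000):
    for row in chunk.itertuples(index=False):
      yield (row.feature, row.label)  # hypothetical column names

dataset = tf.data.Dataset.from_generator(
    csv_chunks,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    )).batch(256)

# The batched dataset can now be passed to model.fit without ever
# loading the whole 50 GB file into memory.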

Upvotes: 1
