ira

Reputation: 755

When to use the TensorFlow Dataset API versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in TensorFlow, but I am still unsure about the current best practices for reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.

In my situation I have a file data.csv with my features, and would like to do the following two tasks:

  1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,

    labels[i] = features[i + h, -1] / features[i, -1] - 1
    

    I would like h to be a parameter here, so I can experiment with different horizons.

  2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:

    train_features[i] = features[i: i + window]
    

I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
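For concreteness, here is roughly what I mean in numpy (names and the choice of pairing each window with the label at its last timestep are just illustrative):

    import numpy as np

    def make_examples(features, h, window):
        # features: 2-D array of shape (time, num_columns); the target is
        # computed from the last column
        n = len(features) - h
        # Task 1: percent change of the last column at horizon h
        labels = features[h:, -1] / features[:n, -1] - 1
        # Task 2: rolling windows of length `window`, each paired with the
        # label at its final timestep
        train_features = np.stack(
            [features[i:i + window] for i in range(n - window + 1)]
        )
        train_labels = labels[window - 1:]
        return train_features, train_labels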

Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or whether I'm better off using other libraries to deal with them?

Upvotes: 20

Views: 11180

Answers (1)

Maxim

Reputation: 53768

First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices()
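For example, if train_features and train_labels are numpy arrays you built beforehand with pandas/numpy, a minimal sketch (TF 1.x style, with dummy arrays standing in for your real data) could look like this:

    import numpy as np
    import tensorflow as tf

    # Dummy arrays standing in for the precomputed windows and labels
    train_features = np.random.rand(1000, 20, 5).astype(np.float32)
    train_labels = np.random.rand(1000).astype(np.float32)

    # Everything fits in memory, so slice the arrays directly into a Dataset
    dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels))
    dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat()

    # TF 1.x style iterator; these tensors feed the model directly,
    # no feed_dict needed
    features_batch, labels_batch = dataset.make_one_shot_iterator().get_next()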

A more interesting question is whether you should organize your data pipeline with a session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":

While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:

# feed_dict often results in suboptimal performance when using large inputs  
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there's no real difference; use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.

The Dataset API also works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
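For your data.csv specifically, a rough TF 1.x sketch of the text pipeline might be the following (the five float columns are just placeholders for your actual schema):

    import tensorflow as tf

    record_defaults = [[0.0]] * 5  # placeholder: one default per CSV column

    def parse_line(line):
        # Decode one CSV row into a 1-D float tensor of features
        fields = tf.decode_csv(line, record_defaults=record_defaults)
        return tf.stack(fields)

    dataset = (tf.data.TextLineDataset("data.csv")
               .skip(1)        # skip the header row, if there is one
               .map(parse_line))

From there, the horizon targets and rolling windows can either be precomputed with pandas/numpy and fed through from_tensor_slices as above, or, in recent releases, built inside the pipeline with a sliding-window transform (e.g. dataset.apply(tf.contrib.data.sliding_window_batch(window))), whichever you find clearer.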

Upvotes: 14
