ira

Reputation: 755

When to use the TensorFlow Dataset API versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in TensorFlow, but I am still unsure about the current best practices for reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.

In my situation I have a file data.csv with my features, and would like to do the following two tasks:

  1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,

    labels[i] = features[i + h, -1] / features[i, -1] - 1
    

    I would like h to be a parameter here, so I can experiment with different horizons.

  2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:

    train_features[i] = features[i: i + window]
    

I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
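For concreteness, here is roughly what I mean in numpy (names and the choice of pairing each window with the label at its last timestep are just illustrative):

    import numpy as np

    def make_examples(features, h, window):
        # features: 2-D array of shape (time, num_columns); the target is
        # computed from the last column
        n = len(features) - h
        # Task 1: percent change of the last column at horizon h
        labels = features[h:, -1] / features[:n, -1] - 1
        # Task 2: rolling windows of length `window`, each paired with the
        # label at its final timestep
        train_features = np.stack(
            [features[i:i + window] for i in range(n - window + 1)]
        )
        train_labels = labels[window - 1:]
        return train_features, train_labels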

Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or whether I'm better off using other libraries to deal with them?

Upvotes: 20

Views: 11180

Answers (1)

Maxim

Reputation: 53768

First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices()
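For example, if train_features and train_labels are numpy arrays you built beforehand with pandas/numpy, a minimal sketch (TF 1.x style, with dummy arrays standing in for your real data) could look like this:

    import numpy as np
    import tensorflow as tf

    # Dummy arrays standing in for the precomputed windows and labels
    train_features = np.random.rand(1000, 20, 5).astype(np.float32)
    train_labels = np.random.rand(1000).astype(np.float32)

    # Everything fits in memory, so slice the arrays directly into a Dataset
    dataset = tf.data.Dataset.from_tensor_slices((train_features, train_labels))
    dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat()

    # TF 1.x style iterator; these tensors feed the model directly,
    # no feed_dict needed
    features_batch, labels_batch = dataset.make_one_shot_iterator().get_next()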

A more interesting question is whether you should organize your data pipeline with a session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":

While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:

# feed_dict often results in suboptimal performance when using large inputs  
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there's no real difference; use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan distributed computation.

The Dataset API also works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
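For your data.csv specifically, a rough TF 1.x sketch of the text pipeline might be the following (the five float columns are just placeholders for your actual schema):

    import tensorflow as tf

    record_defaults = [[0.0]] * 5  # placeholder: one default per CSV column

    def parse_line(line):
        # Decode one CSV row into a 1-D float tensor of features
        fields = tf.decode_csv(line, record_defaults=record_defaults)
        return tf.stack(fields)

    dataset = (tf.data.TextLineDataset("data.csv")
               .skip(1)        # skip the header row, if there is one
               .map(parse_line))

From there, the horizon targets and rolling windows can either be precomputed with pandas/numpy and fed through from_tensor_slices as above, or, in recent releases, built inside the pipeline with a sliding-window transform (e.g. dataset.apply(tf.contrib.data.sliding_window_batch(window))), whichever you find clearer.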

Upvotes: 14
