Reputation: 755
I've seen a number of guides on using LSTMs for time series in tensorflow, but I'm still unsure about the current best practices for reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:

1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,

       labels[i] = features[i + h, -1] / features[i, -1] - 1

   I would like h to be a parameter here, so I can experiment with different horizons.

2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:

       train_features[i] = features[i : i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess I'd also like to know whether the two tasks I listed are suited to the Dataset API, or whether I'm better off using other libraries to deal with them.
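For reference, here is roughly what I mean in numpy (the array sizes are just made-up examples):

    import numpy as np

    # toy stand-in for the contents of data.csv: (num_timesteps, num_columns)
    features = np.random.rand(1000, 4) + 1.0

    h, window = 5, 30  # horizon and window length, both parameters

    # task 1: labels[i] = features[i + h, -1] / features[i, -1] - 1
    labels = features[h:, -1] / features[:-h, -1] - 1

    # task 2: train_features[i] = features[i : i + window]
    train_features = np.stack([features[i:i + window]
                               for i in range(len(features) - window + 1)])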
Upvotes: 20
Views: 11180
Reputation: 53768
First off, note that you can use the Dataset API with pandas or numpy arrays, as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
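In your case that would look roughly like this (a minimal sketch, TF 1.x style; it assumes the windows and labels from the question have already been built with numpy or pandas):

    import numpy as np
    import tensorflow as tf

    # stand-ins for the arrays built in numpy/pandas as in the question:
    # train_features has shape (N, window, num_columns), labels has shape (N,)
    train_features = np.random.rand(100, 30, 4).astype(np.float32)
    labels = np.random.rand(100).astype(np.float32)

    dataset = (tf.data.Dataset
               .from_tensor_slices((train_features, labels))
               .shuffle(buffer_size=100)
               .batch(32)
               .repeat())

    # TF 1.x: pull tensors out of the dataset instead of feeding them
    x, y = dataset.make_one_shot_iterator().get_next()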
A more interesting question is whether you should organize the data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the "Performance Guide":
While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:

    # feed_dict often results in suboptimal performance when using large inputs
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
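For comparison, here is a rough TF 1.x-style sketch of the Dataset counterpart, using a made-up linear model just to show that no feed_dict appears in the training loop (`dataset` is the object built above):

    # tensors come straight from the dataset iterator - no per-step copy
    # from the client, and no feed_dict in the training loop
    x, y = dataset.make_one_shot_iterator().get_next()

    # made-up model: flatten the window and regress the label linearly
    pred = tf.layers.dense(tf.layers.flatten(x), 1)
    loss = tf.losses.mean_squared_error(y[:, None], pred)
    train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            sess.run(train_step)   # compare: no feed_dict argument here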
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there's no real difference - use whichever pipeline you feel comfortable with. When speed is important and you have a large training set, the Dataset API seems the better choice, especially if you plan on distributed computation.
The Dataset API also works nicely with text data, such as CSV files; check out this section of the dataset tutorial.
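For instance, a minimal sketch of parsing data.csv directly with the Dataset API (TF 1.x style; the number of columns and the record_defaults are assumptions about your file):

    import tensorflow as tf

    # one float default per column - must match the actual columns of data.csv
    record_defaults = [[0.0]] * 4

    def parse_line(line):
        fields = tf.decode_csv(line, record_defaults=record_defaults)
        return tf.stack(fields)            # one feature vector per row

    dataset = (tf.data.TextLineDataset("data.csv")
               .skip(1)                    # drop the header row
               .map(parse_line)
               .batch(32))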
Upvotes: 14