Batching in tf.data.dataset in time-series analysis

I'm looking at creating a pipeline for a time-series LSTM model. I have two input feeds; let's call them series1 and series2.

I initialize the tf.data object by calling from_tensor_slices:

ds = tf.data.Dataset.from_tensor_slices((series1, series2))

I then group them into windows of a fixed window size, with a shift of 1 between windows:

ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

At this point I want to play around with how they are batched together. I want to produce a certain input like the following as an example:

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]

batch 1: [1, 2, 100, 200]
batch 2: [2, 3, 200, 300]
batch 3: [3, 4, 300, 400]

So each batch will return two elements of series1 and then two elements of series2. This code snippet does not work to batch them separately:

ds = ds.map(lambda s1, s2: (s1.batch(window_size + 1), s2.batch(window_size + 1)))

Because it returns a pair of dataset objects rather than tensors. Since they are dataset objects, they are not subscriptable, so this does not work either:

ds = ds.map(lambda s1, s2: (s1[:2], s2[:2]))
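To make the nesting concrete, here is a minimal sketch (not part of the original post) that inspects what window() yields for the two example series and then flattens it. It assumes window_size = 2 to match the example above; the names nested, flat and rows are only illustrative.

```python
import tensorflow as tf

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2  # assumed value, chosen to match the example batches

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size, shift=1, drop_remainder=True)

# Each element of `ds` is a pair of small datasets (one per series),
# which is why slicing such as s1[:2] is not possible at this stage.
nested = [
    ([int(x.numpy()) for x in w1], [int(x.numpy()) for x in w2])
    for w1, w2 in ds
]

# Batching each sub-dataset and zipping the results inside flat_map
# turns every window into a pair of ordinary tensors.
flat = ds.flat_map(
    lambda w1, w2: tf.data.Dataset.zip(
        (w1.batch(window_size), w2.batch(window_size))
    )
)

# Once the elements are tensors, they can be concatenated.
rows = [tf.concat([a, b], axis=0).numpy().tolist() for a, b in flat]
print(rows)
```

The key point is that flat_map, unlike map, expects the function to return a dataset, so the nested windows can be batched into tensors before any slicing or concatenation takes place.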

I'm sure the solution is some utilization of .apply with a custom lambda function. Any help is much appreciated.

Edit

I am also looking at producing a label that represents the next element of the series. So for example, the batches will produce the following:

batch 1: (tf.Tensor([1, 2, 100, 200]), tf.Tensor([3]))
batch 2: (tf.Tensor([2, 3, 200, 300]), tf.Tensor([4]))
batch 3: (tf.Tensor([3, 4, 300, 400]), tf.Tensor([5]))

Where [3], [4] and [5] represent the next elements of series1 to be predicted.

Upvotes: 6

Answers (3)

Brown Owl

Reputation: 1

Here is my solution when dealing with time series data.

dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
dataset = dataset.batch(batch_size).prefetch(1)

The following line is the important one; it splits each window into xs and ys:

dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))

The shuffle is optional; the map call alone is enough to split each window into xs and ys.
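As a rough illustration of the pipeline above on concrete data, the sketch below runs it without the shuffle, batch and prefetch steps so the result is deterministic. The series (0 to 9) and window_size = 3 are assumed values, not taken from the answer.

```python
import tensorflow as tf

series = tf.range(10)  # hypothetical data: 0, 1, ..., 9
window_size = 3        # assumed value

dataset = tf.data.Dataset.from_tensor_slices(series)
# Windows hold window_size inputs plus one extra element for the label.
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
# Convert each nested window dataset into a single tensor.
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
# Split each window: all but the last element are xs, the last is y.
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

pairs = [(x.numpy().tolist(), int(y.numpy())) for x, y in dataset]
for xs, y in pairs:
    print(xs, "->", y)
```

With ten values and windows of four, this yields seven (xs, y) pairs, the first being [0, 1, 2] with label 3.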

Upvotes: 0

Jamie Dimon

Reputation: 477

The solution was to window the two datasets separately, zip them together with tf.data.Dataset.zip, then use tf.concat to join the inputs while carrying the label along.

ds = tf.data.Dataset.from_tensor_slices(series1)
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda window: window.batch(window_size + 1))
ds = ds.map(lambda window: (window[:-1], window[-1]))

ds2 = tf.data.Dataset.from_tensor_slices(series2)
ds2 = ds2.window(window_size, shift=1, drop_remainder=True)
ds2 = ds2.flat_map(lambda window: window.batch(window_size))

ds = tf.data.Dataset.zip((ds, ds2))
ds = ds.map(lambda i, j: (tf.concat([i[0], j], axis=0), i[-1]))

Returns:

(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  1,   2,   3, 100, 200, 300])>, <tf.Tensor: shape=(), dtype=int32, numpy=4>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  2,   3,   4, 200, 300, 400])>, <tf.Tensor: shape=(), dtype=int32, numpy=5>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  3,   4,   5, 300, 400, 500])>, <tf.Tensor: shape=(), dtype=int32, numpy=6>)
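For reference, here is a self-contained sketch of the same approach applied to the five-element series from the question. It assumes window_size = 2, so each input holds two values per series; the variable name examples is only illustrative.

```python
import tensorflow as tf

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2  # assumed value, matching the question's example

# series1: windows of window_size + 1, split into inputs and a label.
ds = tf.data.Dataset.from_tensor_slices(series1)
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda window: window.batch(window_size + 1))
ds = ds.map(lambda window: (window[:-1], window[-1]))

# series2: windows of window_size, inputs only.
ds2 = tf.data.Dataset.from_tensor_slices(series2)
ds2 = ds2.window(window_size, shift=1, drop_remainder=True)
ds2 = ds2.flat_map(lambda window: window.batch(window_size))

# zip stops at the shorter dataset, so the extra series2 window is dropped.
ds = tf.data.Dataset.zip((ds, ds2))
ds = ds.map(lambda xy, w2: (tf.concat([xy[0], w2], axis=0), xy[1]))

examples = [(x.numpy().tolist(), int(y.numpy())) for x, y in ds]
for x, y in examples:
    print(x, "->", y)
```

Under these assumptions the three resulting pairs match the batches requested in the question's edit: [1, 2, 100, 200] with label 3, [2, 3, 200, 300] with label 4, and [3, 4, 300, 400] with label 5.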

Upvotes: 3

Nicolas Gervais

Reputation: 36714

I think this is the line you're missing:

ds = ds.batch(2).map(lambda x, y: (tf.concat([x, y], axis=0)))

Full example:

import tensorflow as tf

series1 = tf.range(1, 16)
series2 = tf.range(100, 1600, 100)

ds = tf.data.Dataset.from_tensor_slices((series1, series2))

ds = ds.batch(2).map(lambda x, y: (tf.concat([x, y], axis=0)))

for row in ds:
    print(row)

tf.Tensor([  1   2 100 200], shape=(4,), dtype=int32)
tf.Tensor([  3   4 300 400], shape=(4,), dtype=int32)
tf.Tensor([  5   6 500 600], shape=(4,), dtype=int32)
tf.Tensor([  7   8 700 800], shape=(4,), dtype=int32)
tf.Tensor([   9   10  900 1000], shape=(4,), dtype=int32)
tf.Tensor([  11   12 1100 1200], shape=(4,), dtype=int32)
tf.Tensor([  13   14 1300 1400], shape=(4,), dtype=int32)
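Note that batch(2) produces non-overlapping pairs. If overlapping windows shifted by one element are needed, along with a next-step label from series1 as in the question's edit, one possible variant is sketched below. It keeps both series in a single dataset and assumes window_size = 2; it is an illustration, not the answer author's code.

```python
import tensorflow as tf

series1 = tf.range(1, 16)
series2 = tf.range(100, 1600, 100)
window_size = 2  # assumed value

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
# One extra element per window so the label can be taken from series1.
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
# Turn each pair of nested window datasets into a pair of tensors.
ds = ds.flat_map(
    lambda w1, w2: tf.data.Dataset.zip(
        (w1.batch(window_size + 1), w2.batch(window_size + 1))
    )
)
# Inputs: first window_size values of each series; label: next series1 value.
ds = ds.map(lambda w1, w2: (tf.concat([w1[:-1], w2[:-1]], axis=0), w1[-1]))

examples = [(x.numpy().tolist(), int(y.numpy())) for x, y in ds]
for x, y in examples[:3]:
    print(x, "->", y)
```

The design trade-off compared with windowing the two series separately is that a single window call keeps the series aligned automatically, at the cost of reading one extra series2 value per window that is then discarded.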

Upvotes: 1
