Reputation: 477
I'm looking at creating a pipeline for a time-series LSTM model. I have two feeds of inputs, let's call them series1 and series2.
I initialize the tf.data object by calling from_tensor_slices:
ds = tf.data.Dataset.from_tensor_slices((series1, series2))
I then group them into windows of a set window size, with a shift of 1 between windows:
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
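To illustrate what window produces at this point (a toy sketch with window_size = 1 so each window is a pair), each element is a tuple of two nested Dataset objects rather than tensors:

```python
import tensorflow as tf

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 1  # illustrative value; window(window_size + 1) yields pairs

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Each element is a tuple of two nested Dataset objects, not tensors:
for w1, w2 in ds.take(1):
    print(type(w1))                       # a nested Dataset, not a tf.Tensor
    print(list(w1.as_numpy_iterator()))   # [1, 2]
    print(list(w2.as_numpy_iterator()))   # [100, 200]
```

This nesting is why ordinary tensor operations cannot be applied to the windows directly.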
At this point I want to play around with how they are batched together. I want to produce a certain input like the following as an example:
series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
batch 1: [1, 2, 100, 200]
batch 2: [2, 3, 200, 300]
batch 3: [3, 4, 300, 400]
So each batch will return two elements of series1 and then two elements of series2. This code snippet does not work to batch them separately:
ds = ds.map(lambda s1, s2: (s1.batch(window_size + 1), s2.batch(window_size + 1)))
Because it returns a mapping over two nested dataset objects. Since they are Dataset objects rather than tensors, they are not subscriptable, so this does not work either:
ds = ds.map(lambda s1, s2: (s1[:2], s2[:2]))
I'm sure the solution is some use of .apply with a custom lambda function. Any help is much appreciated.
I am also looking at producing a label that represents the next element of the series. So for example, the batches will produce the following:
batch 1: (tf.tensor([1, 2, 100, 200]), tf.tensor([3]))
batch 2: (tf.tensor([2, 3, 200, 300]), tf.tensor([4]))
batch 3: (tf.tensor([3, 4, 300, 400]), tf.tensor([5]))
Where [3], [4] and [5] represent the next elements of series1 to be predicted.
Upvotes: 6
Views: 1908
Reputation: 1
Here is my solution when dealing with time series data.
dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
dataset = dataset.batch(batch_size).prefetch(1)
The following line is the important one for splitting each window into xs and ys:
dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
The shuffle is optional; the map call alone is what splits each window into xs and ys.
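As a quick sanity check of that split (shuffle and batching omitted here so the order stays deterministic; the series values are just for illustration):

```python
import tensorflow as tf

series = [1, 2, 3, 4, 5, 6]
window_size = 2

dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
# flat_map collapses each nested window Dataset into a single tensor
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
# split each window into inputs (all but the last element) and label (the last)
dataset = dataset.map(lambda window: (window[:-1], window[-1]))

for x, y in dataset:
    print(x.numpy(), y.numpy())
# [1 2] 3
# [2 3] 4
# [3 4] 5
# [4 5] 6
```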
Upvotes: 0
Reputation: 477
The solution was to window the two datasets separately, .zip() them together, then concatenate the elements with tf.concat() to include the label.
ds = tf.data.Dataset.from_tensor_slices(series1)
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda window: window.batch(window_size + 1))
ds = ds.map(lambda window: (window[:-1], window[-1]))
ds2 = tf.data.Dataset.from_tensor_slices(series2)
ds2 = ds2.window(window_size, shift=1, drop_remainder=True)
ds2 = ds2.flat_map(lambda window: window.batch(window_size))
ds = tf.data.Dataset.zip((ds, ds2))
ds = ds.map(lambda i, j: (tf.concat([i[0], j], axis=0), i[-1]))
With window_size = 3, iterating over ds yields:
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  1,   2,   3, 100, 200, 300])>, <tf.Tensor: shape=(), dtype=int32, numpy=4>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  2,   3,   4, 200, 300, 400])>, <tf.Tensor: shape=(), dtype=int32, numpy=5>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  3,   4,   5, 300, 400, 500])>, <tf.Tensor: shape=(), dtype=int32, numpy=6>)
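From here the dataset can still be batched and prefetched before training, mirroring the pattern in the other answer. A self-contained sketch of the full pipeline (window_size, batch size, and series values chosen here only for illustration):

```python
import tensorflow as tf

series1 = tf.range(1, 11)           # [1, ..., 10]
series2 = tf.range(100, 1100, 100)  # [100, ..., 1000]
window_size = 3

ds = tf.data.Dataset.from_tensor_slices(series1)
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda window: window.batch(window_size + 1))
ds = ds.map(lambda window: (window[:-1], window[-1]))

ds2 = tf.data.Dataset.from_tensor_slices(series2)
ds2 = ds2.window(window_size, shift=1, drop_remainder=True)
ds2 = ds2.flat_map(lambda window: window.batch(window_size))

ds = tf.data.Dataset.zip((ds, ds2))
ds = ds.map(lambda i, j: (tf.concat([i[0], j], axis=0), i[1]))

# batch and prefetch before handing the dataset to model.fit
ds = ds.batch(2).prefetch(1)
for x, y in ds.take(1):
    print(x.shape, y.shape)  # (2, 6) (2,)
```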
Upvotes: 3
Reputation: 36714
I think this is the line you're missing:
ds = ds.batch(2).map(lambda x, y: tf.concat([x, y], axis=0))
Full example:
import tensorflow as tf
series1 = tf.range(1, 16)
series2 = tf.range(100, 1600, 100)
ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.batch(2).map(lambda x, y: tf.concat([x, y], axis=0))
for row in ds:
    print(row)
tf.Tensor([ 1 2 100 200], shape=(4,), dtype=int32)
tf.Tensor([ 3 4 300 400], shape=(4,), dtype=int32)
tf.Tensor([ 5 6 500 600], shape=(4,), dtype=int32)
tf.Tensor([ 7 8 700 800], shape=(4,), dtype=int32)
tf.Tensor([ 9 10 900 1000], shape=(4,), dtype=int32)
tf.Tensor([ 11 12 1100 1200], shape=(4,), dtype=int32)
tf.Tensor([ 13 14 1300 1400], shape=(4,), dtype=int32)
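Note that batch(2) steps through the series in non-overlapping pairs; the question asks for a shift of 1 between windows. A window-based variant (my sketch, not part of the original answer) that keeps the shift-1 overlap could look like:

```python
import tensorflow as tf

series1 = tf.range(1, 6)           # [1, 2, 3, 4, 5]
series2 = tf.range(100, 600, 100)  # [100, 200, 300, 400, 500]

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
# overlapping pairs: (1,2), (2,3), (3,4), (4,5) and likewise for series2
ds = ds.window(2, shift=1, drop_remainder=True)
# flatten each pair of nested window datasets into two length-2 tensors
ds = ds.flat_map(lambda w1, w2: tf.data.Dataset.zip((w1.batch(2), w2.batch(2))))
ds = ds.map(lambda x, y: tf.concat([x, y], axis=0))

for row in ds:
    print(row)
# tf.Tensor([  1   2 100 200], shape=(4,), dtype=int32)
# tf.Tensor([  2   3 200 300], shape=(4,), dtype=int32)
# ... and so on, one row per shift-1 window
```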
Upvotes: 1