Reputation: 13
I'm doing a time series classification task and I'm having trouble finding the correct way to use tf.data.Dataset. I can create a working model using numpy arrays as in the code below (the data has already been pre-padded with zeros to a max length of 180):
tf_data.shape, labels.shape
> ((225970, 180, 1), (225970,))
So I have 225970 instances of 180 time steps with 1 feature. And I can fit the model as follows. This works fine and creates the appropriate output/predictions:
model = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1)),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy']
)
model.fit(tf_data, labels, epochs=5)
However, when I try the following using tf.data.Dataset.from_tensors:
tf_dataset = tf.data.Dataset.from_tensors((tf_data, labels.astype(int)))
tf_dataset = tf_dataset.shuffle(buffer_size=1024).batch(64)
model_v1 = keras.Sequential(
    [
        layers.Input(shape=(180, 1), batch_size=64),
        layers.Masking(mask_value=0),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model_v2 = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1), batch_size=64),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
<model_v1 or model_v2>.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy']
)
<model_v1 or model_v2>.fit(tf_dataset, epochs=10, steps_per_epoch=30)
I'm met with the following errors:
model_v1 - ValueError: Error when checking input: expected input_48 to have 3 dimensions, but got array with shape (None, 225970, 180, 1)
model_v2 - ValueError: Error when checking input: expected masking_70_input to have 3 dimensions, but got array with shape (None, 225970, 180, 1)
Can anyone explain what I need to do to either my tensorflow Dataset or to my model to ensure it will work with the tf.data.Dataset and not just the numpy array?
Upvotes: 1
Views: 557
Reputation: 4299
You should use tf.data.Dataset.from_tensor_slices instead:
tf_dataset = tf.data.Dataset.from_tensor_slices((tf_data, labels.astype(int)))
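For reference, a minimal sketch of the full corrected pipeline, assuming the same tf_data and labels arrays from the question. Note also that since the model ends in a sigmoid, BinaryCrossentropy() without from_logits=True is the mathematically consistent choice; from_logits=True expects raw, unactivated outputs.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# slice along the first dimension: 225970 elements, each of shape (180, 1)
tf_dataset = tf.data.Dataset.from_tensor_slices((tf_data, labels.astype(int)))
tf_dataset = tf_dataset.shuffle(buffer_size=1024).batch(64)

model = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1)),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(),  # sigmoid output yields probabilities, not logits
    metrics=['accuracy']
)
model.fit(tf_dataset, epochs=5)  # the dataset is already batched, so no batch_size here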
As per the official documentation of the method tf.data.Dataset.from_tensors,
from_tensors produces a dataset containing only a single element. To slice the input tensor into multiple elements, use from_tensor_slices instead.
While the official docs for tf.data.Dataset.from_tensor_slices say,
The given tensors are sliced along their first dimension. This operation preserves the structure of the input tensors, removing the first dimension of each tensor and using it as the dataset dimension. All input tensors must have the same size in their first dimensions.
The *.from_tensor_slices method slices the input tensor along its first dimension and treats the given array (the tensor in our case) as 225970 data instances. This allows you to call methods like *.shuffle() and *.batch(), which shuffle and batch individual data instances respectively (see the batching check after the examples below).
We can understand this with an example. First let's generate some dummy data instances.
import numpy as np
import tensorflow as tf

x = np.random.rand(2500, 180, 1)  # 2500 dummy instances of 180 time steps, 1 feature
y = np.random.rand(2500, 1)       # 2500 dummy labels
With *.from_tensors,
tensor_ds = tf.data.Dataset.from_tensors((x, y))
for sample in tensor_ds.take(1):
    print(sample[0].shape)
    print(sample[1].shape)
The output is,
(2500, 180, 1)
(2500, 1)
And with *.from_tensor_slices,
tensor_ds = tf.data.Dataset.from_tensor_slices((x, y))
for sample in tensor_ds.take(1):
    print(sample[0].shape)
    print(sample[1].shape)
The output is,
(180, 1)
(1,)
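Once the dataset is sliced per instance, *.batch() stacks elements back along a new leading batch dimension, which is exactly the (batch, timesteps, features) layout the LSTM expects. A quick check, continuing the dummy example above:
tensor_ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64)
for batch in tensor_ds.take(1):
    print(batch[0].shape)  # (64, 180, 1)
    print(batch[1].shape)  # (64, 1)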
Upvotes: 2