Reputation: 13
I'm doing a time series classification task and I'm having trouble finding the correct way to use tf.data.Dataset. I can create a working model using numpy arrays as in the code below (the data has already been pre-padded with zeros to a max length of 180):
tf_data.shape, labels.shape
> ((225970, 180, 1), (225970,))
So I have 225970 instances of 180 time steps with 1 feature. And I can fit the model as follows. This works fine and creates the appropriate output/predictions:
model = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1)),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy']
)
model.fit(tf_data, labels, epochs=5)
However, when I try the following using tf.data.Dataset.from_tensors:
tf_dataset = tf.data.Dataset.from_tensors((tf_data, labels.astype(int)))
tf_dataset = tf_dataset.shuffle(buffer_size=1024).batch(64)
model_v1 = keras.Sequential(
    [
        layers.Input(shape=(180, 1), batch_size=64),
        layers.Masking(mask_value=0),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model_v2 = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1), batch_size=64),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
<model_v1 or model_v2>.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy']
)
<model_v1 or model_v2>.fit(tf_dataset, epochs=10, steps_per_epoch=30)
I'm met with the following errors:
model_v1 - ValueError: Error when checking input: expected input_48 to have 3 dimensions, but got array with shape (None, 225970, 180, 1)
model_v2 - ValueError: Error when checking input: expected masking_70_input to have 3 dimensions, but got array with shape (None, 225970, 180, 1)
Can anyone explain what I need to do to either my tensorflow Dataset or to my model to ensure it will work with the tf.data.Dataset and not just the numpy array?
Upvotes: 1
Views: 557
Reputation: 4299
You should use tf.data.Dataset.from_tensor_slices instead:
tf_dataset = tf.data.Dataset.from_tensor_slices((tf_data, labels.astype(int)))
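For reference, a minimal sketch of the full corrected pipeline, assuming the same tf_data and labels arrays from the question. Note also that since the model ends in a sigmoid, BinaryCrossentropy() without from_logits=True is the mathematically consistent choice; from_logits=True expects raw, unactivated outputs.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# slice along the first dimension: 225970 elements, each of shape (180, 1)
tf_dataset = tf.data.Dataset.from_tensor_slices((tf_data, labels.astype(int)))
tf_dataset = tf_dataset.shuffle(buffer_size=1024).batch(64)

model = keras.Sequential(
    [
        layers.Masking(mask_value=0, input_shape=(180, 1)),
        layers.LSTM(16),
        layers.Dense(1, activation='sigmoid')
    ]
)
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(),  # sigmoid output yields probabilities, not logits
    metrics=['accuracy']
)
model.fit(tf_dataset, epochs=5)  # the dataset is already batched, so no batch_size here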
As per the official documentation of the method tf.data.Dataset.from_tensors,
from_tensors produces a dataset containing only a single element. To slice the input tensor into multiple elements, use from_tensor_slices instead.
While the official docs for tf.data.Dataset.from_tensor_slices say,
The given tensors are sliced along their first dimension. This operation preserves the structure of the input tensors, removing the first dimension of each tensor and using it as the dataset dimension. All input tensors must have the same size in their first dimensions.
The *.from_tensor_slices method slices the input tensor along its first dimension and treats the given array (the tensor in our case) as 225970 data instances. This allows you to call methods like *.shuffle() and *.batch(), which shuffle and batch individual data instances respectively (see the batching check after the examples below).
We can understand this with an example. First let's generate some dummy data instances.
import numpy as np
import tensorflow as tf

x = np.random.rand(2500, 180, 1)  # 2500 dummy instances of 180 time steps, 1 feature
y = np.random.rand(2500, 1)       # 2500 dummy labels
With *.from_tensors,
tensor_ds = tf.data.Dataset.from_tensors((x, y))
for sample in tensor_ds.take(1):
    print(sample[0].shape)
    print(sample[1].shape)
The output is,
(2500, 180, 1)
(2500, 1)
And with *.from_tensor_slices,
tensor_ds = tf.data.Dataset.from_tensor_slices((x, y))
for sample in tensor_ds.take(1):
    print(sample[0].shape)
    print(sample[1].shape)
The output is,
(180, 1)
(1,)
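Once the dataset is sliced per instance, *.batch() stacks elements back along a new leading batch dimension, which is exactly the (batch, timesteps, features) layout the LSTM expects. A quick check, continuing the dummy example above:
tensor_ds = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1024).batch(64)
for batch in tensor_ds.take(1):
    print(batch[0].shape)  # (64, 180, 1)
    print(batch[1].shape)  # (64, 1)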
Upvotes: 2