Mehdi Zare

Reputation: 1381

TensorFlow returns ValueError with tf.data.Dataset object, but works fine with np.array

I'm working on a digit classifier model using this Kaggle dataset: https://www.kaggle.com/c/digit-recognizer/data?select=test.csv

When I fit the model with np.array objects it works fine, but it fails when I pass tf.data.Dataset objects. Here's my code using Dataset objects for the train/validation data:

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn import model_selection as ms
from functools import partial


train_df = pd.read_csv('train.csv')

def prepare_data(features_df, labels_df, test_ratio=0.1, val_ratio=0.1):
    features = features_df.to_numpy().reshape(features_df.shape[0], 28, 28)
    features = features[..., np.newaxis]

    labels = labels_df.to_numpy()

    X_train, X_test, y_train, y_test = ms.train_test_split(
        features,
        labels,
        test_size=test_ratio
    )

    X_train, X_valid, y_train, y_valid = ms.train_test_split(
        X_train,
        y_train,
        test_size=val_ratio
    )

    train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    train_ds = train_ds.shuffle(2048).repeat()

    valid_ds = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
    valid_ds = valid_ds.shuffle(512).repeat()

    test_ds = tf.data.Dataset.from_tensor_slices((
        X_test,
        y_test
    ))

    return train_ds, valid_ds, test_ds


DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=4, activation='relu', padding="SAME")

model = keras.models.Sequential([
    DefaultConv2D(filters=128, kernel_size=7, input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation='softmax'),
])

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    verbose=1,
    patience=20,
    mode='max',
    restore_best_weights=True
)

model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(
    train_ds,
    epochs=100,
    validation_data=valid_ds,
    callbacks=[early_stopping,],
    steps_per_epoch=64
)

I get this error message:

    ValueError: Input 0 of layer sequential_2 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: [28, 28, 1]

But if I change the code to use np.array objects instead, it works just fine:

test_ratio=0.1
val_ratio=0.1

features = features_df.to_numpy().reshape(features_df.shape[0], 28, 28)
features = features[..., np.newaxis]

labels = labels_df.to_numpy()

X_train, X_test, y_train, y_test = ms.train_test_split(
    features,
    labels,
    test_size=test_ratio
)

X_train, X_valid, y_train, y_valid = ms.train_test_split(
    X_train,
    y_train,
    test_size=val_ratio
)


history = model.fit(
    X_train,
    y_train,
    epochs=100,
    validation_data=(X_valid, y_valid),
    callbacks=[early_stopping,],
    steps_per_epoch=64
)

I've checked several similar questions, but nothing has worked so far.

Upvotes: 1

Views: 1507

Answers (1)

Richard X

Reputation: 1134

It seems that you forgot to add the .batch() method at the end of your tf.data.Dataset pipelines, since your error refers to the missing batch dimension. from_tensor_slices yields individual samples, so without batching the model receives single examples of shape [28, 28, 1] instead of batches of shape [batch_size, 28, 28, 1], which is exactly why it reports that it expected min_ndim=4 but found ndim=3.

A tf.data.Dataset behaves more like a Python generator than an array held fully in memory. When you pass a number to steps_per_epoch together with a tf.data.Dataset, the model simply draws that many elements from the dataset per epoch; since you haven't batched your data, each of those elements is an individual sample rather than a batch. When you pass NumPy arrays instead, Keras knows the total number of data points and batches them for you automatically (with a default batch_size of 32), which is why that version works.
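A minimal sketch of the fix, assuming a batch size of 32 (BATCH_SIZE is a placeholder I'm introducing, pick whatever fits your setup); only the dataset construction inside prepare_data changes, the model and the rest of the code stay the same:

BATCH_SIZE = 32  # assumed value, not from the original post

train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(2048).repeat().batch(BATCH_SIZE)

valid_ds = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
valid_ds = valid_ds.shuffle(512).repeat().batch(BATCH_SIZE)

test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(BATCH_SIZE)

Note that because valid_ds repeats indefinitely, you will also need to pass validation_steps to model.fit; otherwise Keras cannot tell when a validation pass is finished.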

Upvotes: 2
