Reputation: 45
I have a saved model that I trained on a small text (messaging) corpus, and I'm trying to use that same model to predict positive or negative sentiment (i.e. binary classification) on another corpus. I based the NLP model on the Google developers machine learning guide for text classification, which you can review here if you think it useful (I used option A throughout).
I keep getting an input shape error. I know the error means I have to reshape the input to fit the expected shape, but the data I want to predict on is not of that size. The error statement is:
ValueError: Error when checking input: expected dropout_8_input to have shape (519,) but got array with shape (184,)
The reason the model expects the shape (519,) is that during training the corpus fed into the first dropout layer (in TF-IDF vectorized form) has that width:
print(x_train.shape)  # (454, 519)
I'm new to ML, but it doesn't make sense to me that all the data I predict on after optimizing a model has to be the same shape as the data that was used to train it. Has anyone experienced an issue similar to this? Is there something I'm missing in how to train the model so that a different-sized input can be predicted on? Or am I misunderstanding how models are to be used for class prediction?
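To show concretely what I mean, here is a minimal sketch of how I load the saved model (written to MCTR2.h5 by the training code below) and hit the mismatch; the zero arrays are just placeholders for real vectorized text:

import numpy as np
import tensorflow as tf

# Load the model saved at the end of train_ngram_model() below.
model = tf.keras.models.load_model('MCTR2.h5')
print(model.input_shape)  # (None, 519) -- only the batch dimension is free

model.predict(np.zeros((1, 519)))    # accepted
# model.predict(np.zeros((1, 184)))  # raises the ValueError quoted above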
I am basing the model training on the following functions:
import tensorflow as tf
from tensorflow.python.keras import models
from tensorflow.python.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.python.keras.layers import Convolution2D, MaxPooling2D
def mlp_model(layers, units, dropout_rate, input_shape, num_classes):
    """Creates an instance of a multi-layer perceptron model.

    # Arguments
        layers: int, number of `Dense` layers in the model.
        units: int, output dimension of the layers.
        dropout_rate: float, percentage of input to drop at Dropout layers.
        input_shape: tuple, shape of input to the model.
        num_classes: int, number of output classes.

    # Returns
        An MLP model instance.
    """
    op_units, op_activation = _get_last_layer_units_and_activation(num_classes)
    model = models.Sequential()
    model.add(Dropout(rate=dropout_rate, input_shape=input_shape))
    # print(input_shape)
    for _ in range(layers - 1):
        model.add(Dense(units=units, activation='relu'))
        model.add(Dropout(rate=dropout_rate))
    model.add(Dense(units=op_units, activation=op_activation))
    return model
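For reference, this is roughly how the function ends up being called for my data (an illustrative call using the defaults below and the 519-column TF-IDF matrix), which produces the architecture shown further down:

model = mlp_model(layers=2,
                  units=64,
                  dropout_rate=0.2,
                  input_shape=(519,),  # x_train.shape[1:] from the TF-IDF matrix
                  num_classes=2)
model.summary()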
def train_ngram_model(data,
                      learning_rate=1e-3,
                      epochs=1000,
                      batch_size=128,
                      layers=2,
                      units=64,
                      dropout_rate=0.2):
    """Trains n-gram model on the given dataset.

    # Arguments
        data: tuples of training and test texts and labels.
        learning_rate: float, learning rate for training model.
        epochs: int, number of epochs.
        batch_size: int, number of samples per batch.
        layers: int, number of `Dense` layers in the model.
        units: int, output dimension of Dense layers in the model.
        dropout_rate: float, percentage of input to drop at Dropout layers.

    # Raises
        ValueError: If validation data has label values which were not seen
            in the training data.

    # Reference
        For tuning hyperparameters, please visit the following page for
        further explanation of each argument:
        https://developers.google.com/machine-learning/guides/text-classification/step-5
    """
    # Get the data.
    (train_texts, train_labels), (val_texts, val_labels) = data

    # Verify that validation labels are in the same range as training labels.
    num_classes = get_num_classes(train_labels)
    unexpected_labels = [v for v in val_labels if v not in range(num_classes)]
    if len(unexpected_labels):
        raise ValueError('Unexpected label values found in the validation set:'
                         ' {unexpected_labels}. Please make sure that the '
                         'labels in the validation set are in the same range '
                         'as training labels.'.format(
                             unexpected_labels=unexpected_labels))

    # Vectorize texts.
    x_train, x_val = ngram_vectorize(
        train_texts, train_labels, val_texts)

    # Create model instance.
    model = mlp_model(layers=layers,
                      units=units,
                      dropout_rate=dropout_rate,
                      input_shape=x_train.shape[1:],
                      num_classes=num_classes)

    # num_classes determines which loss and activation to use.
    # Compile model with learning parameters.
    if num_classes == 2:
        loss = 'binary_crossentropy'
    else:
        loss = 'sparse_categorical_crossentropy'
    optimizer = tf.keras.optimizers.Adam(lr=learning_rate)
    model.compile(optimizer=optimizer, loss=loss, metrics=['acc'])

    # Create callback for early stopping on validation loss. If the loss does
    # not decrease in two consecutive tries, stop training.
    callbacks = [tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=2)]

    # Train and validate model.
    history = model.fit(
        x_train,
        train_labels,
        epochs=epochs,
        callbacks=callbacks,
        validation_data=(x_val, val_labels),
        verbose=2,  # Logs once per epoch.
        batch_size=batch_size)

    # Print results.
    history = history.history
    print('Validation accuracy: {acc}, loss: {loss}'.format(
        acc=history['val_acc'][-1], loss=history['val_loss'][-1]))

    # Save model.
    model.save('MCTR2.h5')
    return history['val_acc'][-1], history['val_loss'][-1]
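For completeness, ngram_vectorize comes from the same guide. As I understand it, it fits a TfidfVectorizer (and a feature selector) on the training texts and only transforms the validation texts, which is what fixes the feature width at 519. A simplified sketch of that behaviour (not the exact guide code; the helper name and top_k value here are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

def ngram_vectorize_sketch(train_texts, train_labels, val_texts, top_k=20000):
    """Simplified stand-in for the guide's ngram_vectorize()."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), analyzer='word')
    # Fit the vocabulary on the training corpus only.
    x_train = vectorizer.fit_transform(train_texts)
    # Validation (and any future prediction) texts are only transformed,
    # so they get exactly the same number of columns as x_train.
    x_val = vectorizer.transform(val_texts)

    # Keep the most informative features.
    selector = SelectKBest(f_classif, k=min(top_k, x_train.shape[1]))
    selector.fit(x_train, train_labels)
    x_train = selector.transform(x_train).astype('float32')
    x_val = selector.transform(x_val).astype('float32')
    return x_train, x_val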
From this I get the architecture of the model to be:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dropout (Dropout) (None, 519) 0
_________________________________________________________________
dense (Dense) (None, 64) 33280
_________________________________________________________________
dropout_1 (Dropout) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 33,345
Trainable params: 33,345
Non-trainable params: 0
_________________________________________________________________
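The parameter counts already bake the 519-column input into the first Dense layer's weights (just the arithmetic, for reference):

# dense (Dense):   (input_dim + 1) * units = (519 + 1) * 64 = 33,280 params
# dense_1 (Dense): (64 + 1) * 1 = 65 params
# Total: 33,345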
Upvotes: 0
Views: 1198
Reputation: 301
For dimensions to be variable in TensorFlow, they need to be specified as None.
The first dimension is the batch_size, which is why that's generally always None, but typically a batch of sequence data will have the shape (batch_size, sequence_length, num_features). So a single sequence is usually 2D, with the length being variable but the number of features of each "token" being fixed.
It appears you are feeding your model 1D vectors, and Dense layers have a fixed input shape. If you want to model variable-length sequences, you have to build your model using layers which accommodate that (e.g., convolution, LSTM).
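For example (a generic sketch, not tied to the asker's data), a sequence model can leave the length dimension as None, while a Dense-on-vector model bakes the feature count into its weights:

import tensorflow as tf

num_features = 8  # per-token feature size (fixed)

# Variable-length input: the sequence_length dimension is None.
seq_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(None, num_features)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
print(seq_model.input_shape)  # (None, None, 8)

# Fixed-length input: the Dense weights have one row per input feature.
vec_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(519,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
print(vec_model.input_shape)  # (None, 519)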
Upvotes: 1