Injarapu Sri Sharanya

Reputation: 25

ValueError: Shapes (426530, 2) and (1930, 2) are incompatible for y_pred and y_test

I am working on a DistilBERT project for binary classification. I am running the following code on the spam SMS dataset (the IMDB dataset reproduces the same issue) to compute the recall, precision, and AUC scores. However, I am getting a ValueError.

Here I am using the BinaryCrossentropy loss function and Adam optimizer.

Dataset: The spam SMS dataset, which has binary labels: 0 for a regular SMS and 1 for spam. The same error can be reproduced with the IMDB dataset using this code.

Error:

ValueError: Shapes (426530, 2) and (1930, 2) are incompatible

I get this error as soon as I run the cell containing this code:

m = tf.keras.metrics.Recall()
m.update_state(y_test_encoded, y_pred)
m.result().numpy()

Here y_pred contains the predicted labels and y_test_encoded the one-hot encoded ground-truth labels. The test_dataset used for prediction is the tokenized test data converted to a TensorFlow dataset with from_tensor_slices. I assume the problem is caused by the different shapes of the predicted and the ground-truth labels.
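
A quick way to inspect the shapes that go into the metric (a minimal sketch, using the variable names from the code below):

print(y_pred.shape)           # predicted labels after sigmoid + rounding
print(y_test_encoded.shape)   # one-hot encoded ground-truth labels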

Code:

import pandas as pd
import tensorflow as tf
import transformers
from transformers import DistilBertTokenizer
from transformers import TFAutoModelForSequenceClassification
pd.set_option('display.max_colwidth', None)
MODEL_NAME = 'distilbert-base-uncased'
BATCH_SIZE = 8
N_EPOCHS = 3

train = pd.read_csv("train_set.csv", error_bad_lines=False)
test = pd.read_csv("test_set.csv", error_bad_lines=False)

X_train = train.text
X_test = test.text
y_train = train.label
y_test = test.label

#One-hot encoding of labels
y_train_encoded = tf.one_hot(y_train.values, 2)
y_test_encoded = tf.one_hot(y_test.values, 2)

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

train_encodings = tokenizer(list(X_train.values),
                        truncation=True, 
                        padding=True)
test_encodings = tokenizer(list(X_test.values),
                       truncation=True, 
                       padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings), list(y_train_encoded)))

test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings), list(y_test_encoded)))
test_dataset2 = test_dataset.shuffle(buffer_size=1024).take(1000).batch(16)

model = TFAutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

optimizerr = tf.keras.optimizers.Adam(learning_rate=5e-5)

losss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

model.compile(optimizer=optimizerr,
          loss=losss,
          metrics=['accuracy'])

print("Evaluate Base model on test data")
results = model.evaluate(test_dataset2)
print("test loss, test acc:", results)

model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE),
      epochs=N_EPOCHS,
      batch_size=BATCH_SIZE)

predictions = model.predict(test_dataset)


y_pred = tf.round(tf.nn.sigmoid(predictions.logits))

m = tf.keras.metrics.Recall()
m.update_state(y_test_encoded, y_pred)
m.result().numpy()
# This is where I get the above-mentioned error.

How can I fix the error and get the recall, precision and AUC scores?

Upvotes: 0

Views: 88

Answers (1)

Djinn

Reputation: 856

OP is predicting on the test set, but comparing the predictions with the original, larger dataset.

predictions = model.predict(test_dataset)  # this data needs to be used for comparison below

Change:

m.update_state(y_test_encoded, y_pred)

To use the labels from the dataset that was passed to .predict():

true_labels = ...  # labels from test_dataset
m.update_state(true_labels, y_pred)
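
One way to get those labels is to iterate over the same dataset that was passed to .predict(). A minimal sketch, assuming test_dataset yields (features, one-hot label) pairs as built in the question:

import numpy as np

# Rebuild the label array in the order the dataset feeds examples to predict()
true_labels = np.stack([y.numpy() for _, y in test_dataset])

y_pred = tf.round(tf.nn.sigmoid(predictions.logits))

# Recall, precision and AUC computed against the matching labels
recall = tf.keras.metrics.Recall()
recall.update_state(true_labels, y_pred)

precision = tf.keras.metrics.Precision()
precision.update_state(true_labels, y_pred)

auc = tf.keras.metrics.AUC()
auc.update_state(true_labels, tf.nn.sigmoid(predictions.logits))

print(recall.result().numpy(), precision.result().numpy(), auc.result().numpy())

For AUC it is usually better to pass the probabilities rather than the rounded predictions, which is why the sigmoid output is used there directly.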

Upvotes: 1
