Reputation: 129
TL;DR: My model always predicts the same label and I don't know why. Below is my entire fine-tuning code in the hope that someone can point out where I am going wrong.
I am using Huggingface's TFBertForSequenceClassification for a sequence classification task, predicting one of 4 labels for sentences in German text.
I use the bert-base-german-cased model since my text is not all lower case (casing carries more meaning in German than in English).
I get my input from a csv file that I construct from an annotated corpus I received. Here's a sample of that:
0 Hier kommen wir ins Spiel Die App Cognitive At...
1 Doch wenn Athlet Lebron James jede einzelne Mu...
2 Wie kann ein Gehirn auf Hochleistung getrimmt ...
3 Wie schafft es Warren Buffett knapp 1000 Wörte...
4 Entfalte dein mentales Potenzial und werde ein...
Name: sentence_clean, Length: 3094, dtype: object
And those are my labels, from the same csv file:
0 e_1
1 e_4
2 e_4
3 e_4
4 e_4
The distinct labels are: e_1, e_2, e_3, and e_4
This is the code I am using to fine-tune my model:
import pandas as pd
import numpy as np
import os
# read in data
# sentences_df = pd.read_csv('path/file.csv')
X = sentences_df.sentence_clean
Y = sentences_df.classId
# =============================================================================
# One hot encode labels
# =============================================================================
# integer encode labels
from numpy import array
from numpy import argmax
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
Y_integer_encoded = label_encoder.fit_transform(list(Y))
# one hot encode labels
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder(sparse=False)
Y_integer_encoded_reshaped = Y_integer_encoded.reshape(len(Y_integer_encoded), 1)
Y_one_hot_encoded = onehot_encoder.fit_transform(Y_integer_encoded_reshaped)
# train test split
from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, Y_one_hot_encoded, test_size=0.20, random_state=42)
# =============================================================================
# Prepare datasets for finetuning
# =============================================================================
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
from transformers import BertTokenizer, TFBertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased') # initialize tokenizer
# tokenize train and test sets
max_seq_length = 128
X_train_tokens = tokenizer(list(X_train_raw),
                           truncation=True,
                           padding=True)
X_test_tokens = tokenizer(list(X_test_raw),
                          truncation=True,
                          padding=True)
# create TF datasets as input for BERT model
bert_train_ds = tf.data.Dataset.from_tensor_slices((
dict(X_train_tokens),
y_train
))
bert_test_ds = tf.data.Dataset.from_tensor_slices((
dict(X_test_tokens),
y_test
))
# =============================================================================
# setup model and finetune
# =============================================================================
# define hyperparams
num_labels = 4
learning_rate = 2e-5
epochs = 3
batch_size = 16
# create BERT model
bert_categorical_partial = TFBertForSequenceClassification.from_pretrained('bert-base-german-cased', num_labels=num_labels)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
bert_categorical_partial.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = bert_categorical_partial.fit(bert_train_ds.shuffle(100).batch(batch_size),
epochs=epochs,
# batch_size=batch_size,
validation_data=bert_test_ds.shuffle(100).batch(batch_size))
And here is the output from fine-tuning:
Epoch 1/3
155/155 [==============================] - 31s 198ms/step - loss: 8.3038 - accuracy: 0.2990 - val_loss: 8.7751 - val_accuracy: 0.2811
Epoch 2/3
155/155 [==============================] - 30s 196ms/step - loss: 8.2451 - accuracy: 0.2913 - val_loss: 8.9314 - val_accuracy: 0.2779
Epoch 3/3
155/155 [==============================] - 30s 196ms/step - loss: 8.3101 - accuracy: 0.2913 - val_loss: 9.0355 - val_accuracy: 0.2746
Lastly, I try to predict the labels of the test set and validate the results with a confusion matrix:
# convert tokenizer output to arrays for model.predict
X_test_tokens_new = {'input_ids': np.asarray(X_test_tokens['input_ids']),
                     'token_type_ids': np.asarray(X_test_tokens['token_type_ids']),
                     'attention_mask': np.asarray(X_test_tokens['attention_mask']),
                     }
pred_raw = bert_categorical_partial.predict(X_test_tokens_new)
pred_proba = tf.nn.softmax(pred_raw).numpy()
pred = pred_proba[0].argmax(axis=1)
y_true = y_test.argmax(axis=1)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, pred)
Output of print(cm):
array([[  0,   0,   0,  41],
       [  2,   0,   0, 253],
       [  2,   0,   0, 219],
       [  6,   0,   0,  96]], dtype=int64)
As you can see, my accuracy is really bad, and the confusion matrix shows that my model pretty much predicts one single label. I've tried everything and run the model multiple times, but I always get the same results. I know the data I am working with isn't great and I am only training on about 2k labeled sentences. But I have a feeling the accuracy should still be higher and, more importantly, the model shouldn't predict one single label 98% of the time, right?
I posted everything I am using to run the model in the hope that someone can point me to where I am going wrong. Thank you very much in advance for your help!
Upvotes: 4
Views: 4305
Reputation: 6377
You trained for only a couple of minutes. That is not enough, even for a pretrained BERT.
Try decreasing the learning rate so that your accuracy increases after every epoch (at least for the first 10 epochs), and train for more epochs (until the validation accuracy has stopped improving for 10 epochs).
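A minimal sketch of that suggestion, reusing the variable names from the question; the lower learning rate of 1e-5, the epoch cap of 50, and the patience of 10 are illustrative values to tune, not prescribed ones:
import tensorflow as tf
# recompile with a smaller learning rate than the original 2e-5
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
bert_categorical_partial.compile(optimizer=optimizer,
                                 loss='categorical_crossentropy',
                                 metrics=['accuracy'])
# stop once validation accuracy has not improved for 10 epochs
# and roll back to the best weights seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy',
                                                  patience=10,
                                                  restore_best_weights=True)
history = bert_categorical_partial.fit(
    bert_train_ds.shuffle(100).batch(batch_size),
    validation_data=bert_test_ds.shuffle(100).batch(batch_size),
    epochs=50,  # an upper bound; early stopping usually ends training earlier
    callbacks=[early_stopping])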
Upvotes: 5