Reputation: 89
I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.
First I define a BERT-tokenizer and then tokenize my text:
from transformers import DistilBertTokenizer, RobertaTokenizer
distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
max_length=128, pad_to_max_length=True)
def tokenize(sentences, tokenizer):
input_ids, input_masks, input_segments = [],[],[]
for sentence in tqdm(sentences):
inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True,
return_attention_mask=True, return_token_type_ids=True)
input_ids.append(inputs['input_ids'])
input_masks.append(inputs['attention_mask'])
input_segments.append(inputs['token_type_ids'])
return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)
Next, I load my integer labels which are: 0, 1, 2, 3.
d_y = train['label']
0 0
1 1
2 0
3 2
4 0
5 0
6 0
7 0
8 3
9 1
Name: label, dtype: int64
Then I load the pretrained Transformer model and put layers on top of it. I use SparseCategoricalCrossEntropy Loss when compiling the model:
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel
distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)
config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)
input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
for layer in model.layers[:3]:
layer.trainable = False
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'],
)
Finally, I run the model using previously tokenized input_ids and input_masks as inputs to the model and get a NAN Loss after the first epoch:
model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)
Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>
EDIT: The model computes losses on the first epoch but it starts returning NaNs at the second epoch. What could be causing that problem???
Does anyone has any ideas about what I am doing wrong? All suggestions are welcomed!
Upvotes: 3
Views: 6522
Reputation: 660
I had a similar problem where my model produced NaN losses only during the last batch of an epoch. All the other batches resulted in typical loss values. In my case, the problem was that the size of the batches was not always equal. Thus, the model produced NaN losses. After I made all batches equally sized, the NaN's were gone. It might be also worth investigating if this is also true in your case.
Upvotes: 1
Reputation: 75
The problem would have occurred because of not specifying the num_labels
At the final output layer, by default K = 1 (number of labels), and as mentioned
\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}
so while fine tuning we need to provide num_labels when going for multi class classification.
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
Upvotes: 4
Reputation: 278
I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.
train = pd.read_csv('train_dataset.csv')
d = train['text']
d = d.dropna()
Upvotes: 1
Reputation: 11240
The problem is here:
X = tf.keras.layers.Dense(1, activation='softmax')(X)
At the end of the network, you only have a single neuron, corresponding to a single class. The output probability is always 100% for class 0. If you have classes 0, 1, 2, 3, you need to have 4 outputs at the end.
Upvotes: 2