Tensorflow BERT for token-classification - exclude pad-tokens from accuracy while training and testing

Question

I'm doing token-based classification with the pre-trained BERT model for TensorFlow to automatically label causes and effects in sentences.

To access BERT, I'm using the TFBertForTokenClassification interface from Hugging Face: https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification

The sentences I use for training are all converted to tokens (basically a mapping of words to numbers) by the BERT tokenizer and then padded to a fixed length before training. With a universal input length of 100, a sentence with only 50 tokens is filled up with 50 pad-tokens, and a sentence with only 30 tokens is filled up with 70 of them.
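To make the padding step concrete, here is a minimal pure-Python sketch of it. The helper name `pad_to_max_len` and the dummy token ids are made up for illustration; `PAD_ID = 0` assumes the `[PAD]` id of `bert-base-uncased`. In practice the tokenizer would normally produce both lists.

```python
PAD_ID = 0      # assumption: id of the [PAD] token in bert-base-uncased
MAX_LEN = 100   # fixed input length used in this question

def pad_to_max_len(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_len."""
    token_ids = list(token_ids)[:max_len]      # truncate overly long inputs
    n_pad = max_len - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a pad position
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

# Two dummy "sentences" with 50 and 30 tokens (the ids themselves are made up).
ids_50, mask_50 = pad_to_max_len(range(1000, 1050))
ids_30, mask_30 = pad_to_max_len(range(2000, 2030))
```

The attention mask is the only place where the difference between real tokens and pad-tokens is recorded, which is why it comes up again further down.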

I then train my model to predict, for every token, which label it belongs to: whether it is part of the cause, part of the effect, or neither.

However, during training and evaluation, my model makes predictions on the PAD-tokens as well, and they are also included in the accuracy of the model. As PAD-tokens are very easy for the model to predict (they always have the same token id and they all have the "none" label, which means they belong neither to the cause nor to the effect of the sentence), they really distort my model's accuracy.

For example, if a sentence has 30 words -> 30 tokens and all sentences are padded to a length of 100, then this sentence gets a score of 70% even if the model predicts none of the "real" tokens correctly. This way I'm getting training and validation accuracy of 90+% very quickly, although the model performs poorly on the real (non-pad) tokens.

I thought that the attention mask was there to solve this problem, but this doesn't seem to be the case.
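The arithmetic behind that 70% can be reproduced with a small pure-Python sketch. The label ids and the `accuracy` helper are hypothetical and only serve to illustrate the effect of counting pad positions.

```python
NONE, CAUSE, EFFECT = 0, 1, 2   # hypothetical label ids

def accuracy(y_true, y_pred, mask=None):
    """Fraction of correct predictions, restricted to positions where mask == 1."""
    if mask is None:
        mask = [1] * len(y_true)
    kept = [(t, p) for t, p, m in zip(y_true, y_pred, mask) if m]
    if not kept:
        return 0.0
    return sum(t == p for t, p in kept) / len(kept)

# 30 real tokens (15 cause, 15 effect) followed by 70 pad positions labelled "none".
y_true = [CAUSE] * 15 + [EFFECT] * 15 + [NONE] * 70
# A model that predicts "none" everywhere: wrong on every real token, right on every pad.
y_pred = [NONE] * 100
mask = [1] * 30 + [0] * 70

acc_all = accuracy(y_true, y_pred)           # pads included: 70 of 100 correct
acc_real = accuracy(y_true, y_pred, mask)    # pads excluded: 0 of 30 correct
```

The second number is the one I would actually like to see reported during training and evaluation.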

The input-datasets are created as follows:

import tensorflow as tf

def example_to_features(input_ids, attention_masks, token_type_ids, label_ids):
  # token_type_ids is accepted but not forwarded to the model
  return {"input_ids": input_ids,
          "attention_mask": attention_masks}, label_ids

train_ds = tf.data.Dataset.from_tensor_slices(
    (input_ids_train, attention_masks_train, token_ids_train, label_ids_train)
).map(example_to_features).shuffle(buffer_size=1000).batch(32)

Model creation:

from transformers import TFBertForTokenClassification

num_epochs = 30

model = TFBertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)

model.layers[-1].activation = tf.keras.activations.softmax

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6)
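# The classification layer above already applies softmax, so the loss receives probabilities, not logits
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)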
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

model.summary()

And then I train it like this:

history = model.fit(train_ds, epochs=num_epochs, validation_data=validate_ds)

Has anyone encountered this problem before, or does anyone know how to exclude the predictions on pad-tokens from the model's accuracy during training and evaluation?
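One direction that may be worth trying, offered as a sketch rather than a confirmed fix: Keras metrics accept a `sample_weight` argument in `update_state`, so the attention mask could double as a per-token weight that zeroes out the pad positions. The toy tensors and the helper name `example_to_features_weighted` below are made up for illustration, and whether `fit` accepts per-token weights for this particular model may depend on the installed TensorFlow and transformers versions.

```python
import tensorflow as tf

# Toy batch: 1 sentence, 5 positions, 3 labels; the last 2 positions are padding.
y_true = tf.constant([[1, 2, 0, 0, 0]])
y_pred = tf.constant([[[0.90, 0.05, 0.05],   # real token, predicts 0, truth 1 -> wrong
                       [0.10, 0.10, 0.80],   # real token, predicts 2, truth 2 -> right
                       [0.10, 0.80, 0.10],   # real token, predicts 1, truth 0 -> wrong
                       [0.90, 0.05, 0.05],   # pad, predicts 0, truth 0 -> right
                       [0.90, 0.05, 0.05]]]) # pad, predicts 0, truth 0 -> right
attention_mask = tf.constant([[1, 1, 1, 0, 0]])

# Unweighted metric: pad positions count like any other position.
plain = tf.keras.metrics.SparseCategoricalAccuracy()
plain.update_state(y_true, y_pred)
plain_acc = float(plain.result())     # 3 of 5 positions correct

# Weighted metric: positions with weight 0 (the pads) are ignored.
masked = tf.keras.metrics.SparseCategoricalAccuracy()
masked.update_state(y_true, y_pred,
                    sample_weight=tf.cast(attention_mask, tf.float32))
masked_acc = float(masked.result())   # 1 of 3 real positions correct

def example_to_features_weighted(input_ids, attention_masks, token_type_ids, label_ids):
    """Hypothetical variant of example_to_features that also yields per-token weights."""
    features = {"input_ids": input_ids, "attention_mask": attention_masks}
    sample_weight = tf.cast(attention_masks, tf.float32)
    return features, label_ids, sample_weight
```

For training, the dataset would then be mapped with the weighted helper so that it yields `(features, labels, sample_weight)`, and the metric would be passed to `model.compile` through `weighted_metrics=[metric]` instead of `metrics=[metric]`. Under that assumption the same weights would also scale the loss, so pad positions would stop contributing to it as well.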

