Reputation: 91
I am using the Hugging Face TFBertModel
to do a classification task (from here: ). I am using the bare TFBertModel
with an added dense head layer, and not TFBertForSequenceClassification,
since I didn't see how I could use the latter with pretrained weights to fine-tune only the head.
As far as I know, fine-tuning should give me about 80% accuracy or more with both BERT and ALBERT, but I am not coming anywhere near that number:
Train on 3600 samples, validate on 400 samples
Epoch 1/2
3600/3600 [==============================] - 177s 49ms/sample - loss: 0.6531 - accuracy: 0.5792 - val_loss: 0.5296 - val_accuracy: 0.7675
Epoch 2/2
3600/3600 [==============================] - 172s 48ms/sample - loss: 0.6288 - accuracy: 0.6119 - val_loss: 0.5020 - val_accuracy: 0.7850
More epochs don't make much difference.
I am using the public CoLA dataset for fine-tuning; this is what the data looks like:
gj04 1 Our friends won't buy this analysis, let alone the next one we propose.
gj04 1 One more pseudo generalization and I'm giving up.
gj04 1 One more pseudo generalization or I'm giving up.
gj04 1 The more we study verbs, the crazier they get.
...
And this is the code that loads the data into Python:
import csv

def get_cola_data(max_items=None):
    # Each row of the raw CoLA TSV is: source, label, author annotation, sentence
    with open('cola_public/raw/in_domain_train.tsv') as csv_file:
        reader = csv.reader(csv_file, delimiter='\t')
        x = []
        y = []
        for row in reader:
            x.append(row[3])         # the sentence text
            y.append(float(row[1]))  # the acceptability label (0 or 1)
    if max_items is not None:
        x = x[:max_items]
        y = y[:max_items]
    return x, y
I verified that the data ends up in the lists in the format I want, and this is the code of the model itself:
#!/usr/bin/env python
import tensorflow as tf
from tensorflow import keras
from transformers import BertTokenizer, TFBertModel
import numpy as np
from cola_public import get_cola_data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
bert_model.trainable = False  # freeze the pretrained weights; only the head below is trained
x_input = keras.Input(shape=(512,), dtype=tf.int64)  # token ids, padded to BERT's 512-token maximum
x_mask = keras.Input(shape=(512,), dtype=tf.int64)   # attention mask (1 for real tokens, 0 for padding)
_, output = bert_model([x_input, x_mask])  # the second output is the pooled [CLS] representation
output = keras.layers.Dense(1)(output)  # single-logit classification head
model = keras.Model(
    inputs=[x_input, x_mask],
    outputs=output,
    name='bert_classifier',
)
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'],
)
train_data_x, train_data_y = get_cola_data(max_items=4000)
# pad_to_max_length pads every sequence to the model's maximum length (512), matching the Input shapes above
encoded_data = [tokenizer.encode_plus(data, add_special_tokens=True, pad_to_max_length=True) for data in train_data_x]
train_data_x = np.array([data['input_ids'] for data in encoded_data])
mask_data_x = np.array([data['attention_mask'] for data in encoded_data])
train_data_y = np.array(train_data_y)
model.fit(
    [train_data_x, mask_data_x],
    train_data_y,
    epochs=2,
    validation_split=0.1,
)
# Simple interactive loop to try the trained classifier
cmd_input = ''
while True:
    print("Type an opinion: ")
    cmd_input = input()
    # print('Your opinion is: %s' % cmd_input)
    if cmd_input == 'exit':
        break
    cmd_input_tokens = tokenizer.encode_plus(cmd_input, add_special_tokens=True, pad_to_max_length=True)
    cmd_input_ids = np.array([cmd_input_tokens['input_ids']])
    cmd_mask = np.array([cmd_input_tokens['attention_mask']])
    model.reset_states()
    result = model.predict([cmd_input_ids, cmd_mask])
    print(result)
Now, no matter whether I use a different dataset, a different number of items from the dataset, a dropout layer before the last dense layer, an extra dense layer with more units before the last one, or ALBERT instead of BERT, I always get low accuracy and high loss, and often the validation accuracy is higher than the training accuracy.
I get the same results if I try to use BERT/ALBERT for an NER task: always the same outcome, which makes me believe I am systematically making some fundamental mistake in fine-tuning.
I know that I have bert_model.trainable = False,
and that is intentional: I want to train only the head and not the pretrained weights, and I know that people train successfully that way. Even when I do train the pretrained weights, the results are much worse.
I can see the model is severely underfitting, but I just can't put my finger on where I could improve, especially since people tend to have good results with just a single dense layer on top of the model.
Upvotes: 4
Views: 4720
Reputation: 91
The default learning rate is too high for BERT. Try setting it to one of the learning rates recommended in Appendix A.3 of the original paper: 5e-5, 3e-5, or 2e-5.
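Concretely, that is a one-line change to the compile call in your script: pass the learning rate to the optimizer explicitly. A minimal sketch, using 2e-5 as an example (any of the three values above is worth trying):

    model.compile(
        loss=keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer=keras.optimizers.Adam(learning_rate=2e-5),  # the Keras default is 1e-3, far too high here
        metrics=['accuracy'],
    )

Everything else in the script can stay as it is.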
Upvotes: 8