Reputation: 91
I am using the Hugging Face TFBertModel
to do a classification task (from here: ). I am using the bare TFBertModel
with an added dense head layer, and not TFBertForSequenceClassification,
since I didn't see how I could use the latter with pretrained weights to fine-tune only the head.
As far as I know, fine-tuning should give me about 80% accuracy or more with both BERT and ALBERT, but I am not coming anywhere near that number:
Train on 3600 samples, validate on 400 samples
Epoch 1/2
3600/3600 [==============================] - 177s 49ms/sample - loss: 0.6531 - accuracy: 0.5792 - val_loss: 0.5296 - val_accuracy: 0.7675
Epoch 2/2
3600/3600 [==============================] - 172s 48ms/sample - loss: 0.6288 - accuracy: 0.6119 - val_loss: 0.5020 - val_accuracy: 0.7850
More epochs don't make much difference.
I am using the public CoLA dataset for fine-tuning; this is what the data looks like:
gj04 1 Our friends won't buy this analysis, let alone the next one we propose.
gj04 1 One more pseudo generalization and I'm giving up.
gj04 1 One more pseudo generalization or I'm giving up.
gj04 1 The more we study verbs, the crazier they get.
...
And this is the code that loads the data into Python:
import csv

def get_cola_data(max_items=None):
    # Each row of the raw CoLA TSV is: source, label, author annotation, sentence
    with open('cola_public/raw/in_domain_train.tsv') as csv_file:
        reader = csv.reader(csv_file, delimiter='\t')
        x = []
        y = []
        for row in reader:
            x.append(row[3])         # the sentence text
            y.append(float(row[1]))  # the acceptability label (0 or 1)
    if max_items is not None:
        x = x[:max_items]
        y = y[:max_items]
    return x, y
I verified that the data ends up in the lists in the format I want, and this is the code of the model itself:
#!/usr/bin/env python
import tensorflow as tf
from tensorflow import keras
from transformers import BertTokenizer, TFBertModel
import numpy as np
from cola_public import get_cola_data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased')
bert_model.trainable = False  # freeze the pretrained weights; only the head below is trained
x_input = keras.Input(shape=(512,), dtype=tf.int64)  # token ids, padded to BERT's 512-token maximum
x_mask = keras.Input(shape=(512,), dtype=tf.int64)   # attention mask (1 for real tokens, 0 for padding)
_, output = bert_model([x_input, x_mask])  # the second output is the pooled [CLS] representation
output = keras.layers.Dense(1)(output)  # single-logit classification head
model = keras.Model(
    inputs=[x_input, x_mask],
    outputs=output,
    name='bert_classifier',
)
model.compile(
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(),
    metrics=['accuracy'],
)
train_data_x, train_data_y = get_cola_data(max_items=4000)
# pad_to_max_length pads every sequence to the model's maximum length (512), matching the Input shapes above
encoded_data = [tokenizer.encode_plus(data, add_special_tokens=True, pad_to_max_length=True) for data in train_data_x]
train_data_x = np.array([data['input_ids'] for data in encoded_data])
mask_data_x = np.array([data['attention_mask'] for data in encoded_data])
train_data_y = np.array(train_data_y)
model.fit(
    [train_data_x, mask_data_x],
    train_data_y,
    epochs=2,
    validation_split=0.1,
)
# Simple interactive loop to try the trained classifier
cmd_input = ''
while True:
    print("Type an opinion: ")
    cmd_input = input()
    # print('Your opinion is: %s' % cmd_input)
    if cmd_input == 'exit':
        break
    cmd_input_tokens = tokenizer.encode_plus(cmd_input, add_special_tokens=True, pad_to_max_length=True)
    cmd_input_ids = np.array([cmd_input_tokens['input_ids']])
    cmd_mask = np.array([cmd_input_tokens['attention_mask']])
    model.reset_states()
    result = model.predict([cmd_input_ids, cmd_mask])
    print(result)
Now, no matter whether I use a different dataset, a different number of items from the dataset, a dropout layer before the last dense layer, an extra dense layer with more units before the last one, or ALBERT instead of BERT, I always get low accuracy and high loss, and often the validation accuracy is higher than the training accuracy.
I get the same results if I try to use BERT/ALBERT for an NER task: always the same outcome, which makes me believe I am systematically making some fundamental mistake in fine-tuning.
I know that I have bert_model.trainable = False,
and that is intentional: I want to train only the head and not the pretrained weights, and I know that people train successfully that way. Even when I do train the pretrained weights, the results are much worse.
I can see the model is severely underfitting, but I just can't put my finger on where I could improve, especially since people tend to have good results with just a single dense layer on top of the model.
Upvotes: 4
Views: 4720
Reputation: 91
The default learning rate is too high for BERT. Try setting it to one of the learning rates recommended in Appendix A.3 of the original paper: 5e-5, 3e-5, or 2e-5.
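Concretely, that is a one-line change to the compile call in your script: pass the learning rate to the optimizer explicitly. A minimal sketch, using 2e-5 as an example (any of the three values above is worth trying):

    model.compile(
        loss=keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer=keras.optimizers.Adam(learning_rate=2e-5),  # the Keras default is 1e-3, far too high here
        metrics=['accuracy'],
    )

Everything else in the script can stay as it is.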
Upvotes: 8