Reputation: 101
I have been using HuggingFace's PyTorch implementation of Google's BERT on the MADE 1.0 dataset for quite some time now. Up until my last run (11-Feb), I was getting an F-score of 0.81 on my Named Entity Recognition task by fine-tuning the model. But this week, when I ran the exact same code that had run without errors before, it threw an error while executing this statement:
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this BERT model (632 > 512). Running this sequence through BERT will result in indexing errors
The full code is available in this colab notebook.
To get around this error, I modified the statement above into the one below, taking only the first 512 tokens of each sequence, and made the necessary changes to append the index of [SEP] to the end of each truncated/padded sequence, as BERT requires.
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
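For reference, here is a minimal sketch of the [SEP] handling described above (treat it as illustrative: reserving the last slot for [SEP] is one way to do what I describe, and the variable names follow the snippet):

sep_id = tokenizer.convert_tokens_to_ids(["[SEP]"])[0]  # id of the [SEP] token
truncated_ids = [tokenizer.convert_tokens_to_ids(txt[:511]) + [sep_id]  # keep 511 tokens, append [SEP]
                 for txt in tokenized_texts]
input_ids = pad_sequences(truncated_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")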
The result shouldn't have changed, because I am only considering the first 512 tokens of each sequence and later truncating to 75 anyway (MAX_LEN=75), but my F-score has dropped to 0.40 and my precision to 0.27, while the recall remains the same (0.85). I am unable to share the dataset, as I have signed a confidentiality clause, but I can assure you that all the preprocessing BERT requires has been done, and all extended tokens (e.g. Johanson --> Johan ##son) have been tagged with X and replaced after prediction, as described in the BERT paper.
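For context, a minimal sketch of the X-tagging scheme I mean (the function name is hypothetical; the first wordpiece keeps the word's label, and continuation pieces are tagged X and dropped after prediction):

def tokenize_and_preserve_labels(words, labels, tokenizer):
    tokens, tags = [], []
    for word, label in zip(words, labels):
        pieces = tokenizer.tokenize(word)   # e.g. "Johanson" -> ["johan", "##son"]
        tokens.extend(pieces)
        tags.extend([label] + ["X"] * (len(pieces) - 1))  # continuation pieces get X
    return tokens, tags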
Has anyone else faced a similar issue? Can anyone elaborate on what the issue might be, or on what changes the PyTorch (HuggingFace) people have made on their end recently?
Upvotes: 6
Views: 2842
Reputation: 728
I think you should use batch_encode_plus and use the mask output as well as the token encoding.
Please see batch_encode_plus in https://huggingface.co/transformers/main_classes/tokenizer.html
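A minimal sketch of what I mean (keyword names follow the linked docs but vary slightly across transformers versions; texts and MAX_LEN are placeholders):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer.batch_encode_plus(
    texts,                       # list of input sentences
    max_length=MAX_LEN,          # truncate longer sequences to a fixed length
    pad_to_max_length=True,      # pad shorter sequences up to MAX_LEN
    return_attention_mask=True,  # return the mask marking real vs. padded positions
)
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]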
Upvotes: 1
Reputation: 101
I've found a fix to get around this: running the same code with pytorch-pretrained-bert==0.4.0 solves the issue and restores performance to normal. Something in the BERT tokenizer or BertForTokenClassification in the newer update is hurting the model's performance. Hoping that HuggingFace clears this up soon. :)
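For anyone wanting to reproduce this, the downgrade is a single pip install:

pip install pytorch-pretrained-bert==0.4.0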
pytorch-pretrained-bert==0.4.0, Test F1-Score: 0.82
pytorch-pretrained-bert==0.6.1, Test F1-Score: 0.41
Thanks.
Upvotes: 4