Reputation: 3414
I've fine-tuned a Hugging Face BERT model for Named Entity Recognition. Everything is working as it should. Now I've set up a pipeline for token classification in order to predict entities out of the text I provide. Even this is working fine.
I know that BERT models are supposed to be fed sentences less than 512 tokens long. Since I have texts longer than that, I split the sentences into shorter chunks and store the chunks in a list chunked_sentences. To keep it brief, my tokenizer call for training looks like this:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenized_inputs = tokenizer(chunked_sentences, is_split_into_words=True, padding='longest')
I pad everything to the longest sequence and avoid truncation, so that if a sentence tokenizes to more than 512 tokens I get a warning and know I won't be able to train on it. That way I know I have to split those sentences into smaller chunks.
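To illustrate, the check this amounts to is roughly the following (a sketch, not my actual training code; chunked_sentences and tokenizer are the ones from the snippet above):
# Flag any chunk whose tokenization exceeds the model limit
for i, chunk in enumerate(chunked_sentences):
    n_tokens = len(tokenizer(chunk, is_split_into_words=True)['input_ids'])
    if n_tokens > tokenizer.model_max_length:  # 512 for bert-base-uncased
        print(f'Chunk {i} is {n_tokens} tokens long and must be split further')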
During inference I wanted to achieve the same thing, but I haven't found a way to pass arguments to the pipeline's tokenizer. The code looks like this:
from transformers import pipeline
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
out = ner_pipeline(text, aggregation_strategy='simple')
I'm pretty sure that if a sentence tokenizes to more than 512 tokens, the extra tokens will be silently truncated and I'll get no warning. I want to avoid this.
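What I mean is roughly this (a sketch; long_text stands for any input whose tokenization exceeds the limit):
# With truncation on (what the pipeline does internally), the extra
# tokens are dropped with no warning; with it off, the tokenizer warns.
capped = tokenizer(long_text, truncation=True)   # silently capped at 512 ids
full = tokenizer(long_text, truncation=False)    # full length, logs a warning
print(len(capped['input_ids']), len(full['input_ids']))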
I tried passing arguments to the tokenizer like this:
tokenizer_kwargs = {'padding': 'longest'}
out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)
I got that idea from this answer, but it doesn't seem to work, since I get the following error:
Traceback (most recent call last):
File "...\inference.py", line 42, in <module>
out = ner_pipeline(text, aggregation_strategy='simple', **tokenizer_kwargs)
File "...\venv\lib\site-packages\transformers\pipelines\token_classification.py", line 191, in __call__
return super().__call__(inputs, **kwargs)
File "...\venv\lib\site-packages\transformers\pipelines\base.py", line 1027, in __call__
preprocess_params, forward_params, postprocess_params = self._sanitize_parameters(**kwargs)
TypeError: TokenClassificationPipeline._sanitize_parameters() got an unexpected keyword argument 'padding'
Process finished with exit code 1
Any ideas? Thanks.
Upvotes: 5
Views: 3693
Reputation: 96
I took a closer look at https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/pipelines/token_classification.py#L86. The pipeline's _sanitize_parameters() only accepts its own known keyword arguments (hence the TypeError for padding), but you can subclass the pipeline and override preprocess() to disable truncation and pad to the longest sequence:
from transformers import TokenClassificationPipeline

class MyTokenClassificationPipeline(TokenClassificationPipeline):
    def preprocess(self, sentence, offset_mapping=None):
        # Override the pipeline defaults: never truncate, pad to longest
        truncation = False
        padding = 'longest'
        model_inputs = self.tokenizer(
            sentence,
            return_tensors=self.framework,
            truncation=truncation,
            padding=padding,
            return_special_tokens_mask=True,
            return_offsets_mapping=self.tokenizer.is_fast,
        )
        if offset_mapping:
            model_inputs["offset_mapping"] = offset_mapping
        model_inputs["sentence"] = sentence
        return model_inputs
Since instantiating the pipeline class directly with folder paths won't load the model and tokenizer for you, pass the subclass to the pipeline() factory instead:
ner_pipeline = pipeline('token-classification', model=model_folder, tokenizer=model_folder, pipeline_class=MyTokenClassificationPipeline)
out = ner_pipeline(text, aggregation_strategy='simple')
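With truncation disabled this way, a chunk that still tokenizes past the 512-position limit will raise an indexing error in the model instead of being silently cut, so over-long input no longer goes unnoticed, matching the behaviour you get during training.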
Upvotes: 2