Reputation: 3414
I've built an NER (named entity recognition) model, based on an existing HuggingFace model that I fine-tuned to recognize my custom entities. The text I want to run my model on is in a txt file.
This is how I use the model:
from transformers import pipeline

# loading the fine-tuned model
ner_pipeline = pipeline('token-classification', model="./my-model.model/", tokenizer="./my-model.model/", ignore_labels=[])

with open(my_file, 'r', encoding="utf8") as f:
    lines = f.readlines()

joined_lines = ' '.join(lines)
result = ner_pipeline(joined_lines, aggregation_strategy='first')

text = ""
for group in result:
    if group["entity_group"] != 'O':
        # substitute the entity with its tag
        text += group["entity_group"] + " "
    else:
        text += group["word"] + " "
Basically, I substitute each recognized entity with its entity tag and leave the rest of the text as is.
With my code, the final text is filled with the content exactly as I want it, but the structure is lost. By doing ' '.join(lines) I'm basically throwing away the \n characters inside the text, which I would like to keep in my reconstructed text.
I've tried feeding the pipeline single lines (each element of f.readlines()) and not the full joined text, but the results are far worse. The model works a lot better when predicting on the whole text.
Does anyone know how I could keep or retrieve the structure of the original text? Thanks.
Upvotes: 1
Views: 624
Reputation: 5802
The groups have a start and end index that tell you which part of the input string each label corresponds to. That is, you can pass the text as a whole, with the newlines intact (ner_pipeline(f.read(), ...)), and subsequently replace substrings.
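To make the offsets concrete without loading a model, here is a small sketch. The text and the dicts are hand-written (hypothetical) and only mimic the shape of the aggregated pipeline output; `fake_group` is a helper I made up for the illustration, not part of transformers.

```python
# Hypothetical input with a newline in the middle.
text = "Alice moved to Paris .\nShe works for Acme ."

def fake_group(label, word):
    """Build a dict shaped like an aggregated pipeline result (hand-made, not model output)."""
    start = text.index(word)
    return {"entity_group": label, "word": word, "start": start, "end": start + len(word)}

groups = [
    fake_group("PER", "Alice"),
    fake_group("LOC", "Paris"),
    fake_group("ORG", "Acme"),
]

# Slicing the ORIGINAL string with start/end gives back each entity's text.
spans = [text[g["start"]:g["end"]] for g in groups]
print(spans)

# The newline sits outside every span, so replacing spans leaves it untouched.
print(any("\n" in s for s in spans))
```

Because the offsets index into the string you passed in, nothing about the line structure has to be reconstructed afterwards.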
Here's a working, minimal reproducible example. The only thing to note here is that we replace from right to left (result[::-1]) so we don't mess up the indices of subsequent labels by changing the length of the string when replacing.
from nltk.corpus import brown  # for example data; needs nltk.download('brown') once
from transformers import pipeline

ner_pipeline = pipeline('token-classification')

# equivalent to f.read()
text = '\n'.join(' '.join(sent) for sent in brown.sents()[:100])
result = ner_pipeline(text, aggregation_strategy='first')

def replace_at(label, start, end, txt):
    """Replace substring of txt from start to end with label"""
    return ''.join((txt[:start], label, txt[end:]))

# Substitution
for group in result[::-1]:
    ent = group["entity_group"]
    if ent != 'ORG':  # for testing since there's no 'O' in the default model
        text = replace_at(ent, group['start'], group['end'], text)

# the newlines survive, so the original line structure can be recovered
sentences = text.split('\n')
Example input/output (first line):
"The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."
After processing:
"The Fulton County Grand Jury said Friday an investigation of LOC's recent primary election produced `` no evidence '' that any irregularities took place ."
                                                              ^^^
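If the right-to-left order seems arbitrary, the following sketch shows what goes wrong otherwise. It runs without a model: the text and groups are hand-written (hypothetical) stand-ins for the pipeline output, and `fake_group` is only an illustration helper.

```python
def replace_at(label, start, end, txt):
    """Replace txt[start:end] with label."""
    return txt[:start] + label + txt[end:]

text = "Alice moved to Paris .\nShe works for Acme ."

def fake_group(label, word):
    """Hand-made stand-in for one aggregated pipeline result."""
    start = text.index(word)
    return {"entity_group": label, "start": start, "end": start + len(word)}

groups = [fake_group("PER", "Alice"), fake_group("LOC", "Paris"), fake_group("ORG", "Acme")]

# Left to right: "Alice" -> "PER" shortens the string by two characters,
# so every later start/end now points at the wrong place.
broken = text
for g in groups:
    broken = replace_at(g["entity_group"], g["start"], g["end"], broken)

# Right to left: each edit only changes text AFTER the spans still to be
# processed, so their offsets stay valid.
fixed = text
for g in groups[::-1]:
    fixed = replace_at(g["entity_group"], g["start"], g["end"], fixed)

print(repr(broken))
print(repr(fixed))  # 'PER moved to LOC .\nShe works for ORG .'
```

With your fine-tuned model you would presumably skip the groups labelled 'O' in the same loop, since those are the parts you want to leave as they are.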
Upvotes: 3