Reputation: 548
I need to identify all the establishments
and facilities
from a given text using natural language processing and NER.
Example text:
The government planned to build new parks, swimming pool and commercial complex for our town and improve existing housing complex, schools and townhouse.
Expected entities to be identified:
parks, swimming pool, commercial complex, housing complex, school and townhouse
I did explore some Python libraries like spaCy and NLTK, but the results were not great: only 2 entities were identified. I reckon the data needs to be pre-processed properly.
What should I do to improve the results? Are there other libraries/frameworks better suited for this use case? Is there any way to train our own model using the existing DB?
Upvotes: 4
Views: 2591
Reputation: 1139
In 2022 you don't necessarily need to train a new model for that.
Instead you can use a large language model like GPT-3, GPT-J, or GPT-NeoX, and perform entity extraction on any sort of complex entity, without even training a new model for it!
See how to install GPT-J and use it in Python here: https://github.com/kingoflolz/mesh-transformer-jax . If this model is too big for your machine, you can also use a smaller one, like this small version of OPT (by Facebook): https://huggingface.co/facebook/opt-125m
In order to understand how to use these models for NER, see this article about few-shot learning: https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html
And also see this TDS article about few-shot learning and NER: https://towardsdatascience.com/advanced-ner-with-gpt-3-and-gpt-j-ce43dc6cdb9c
Lastly, you might be interested in this video about NER with GPT-NeoX vs spaCy: https://www.youtube.com/watch?v=E-qZDwXpeY0
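The few-shot approach in the articles above boils down to building a prompt that shows the model a couple of solved examples followed by the new sentence, then parsing the model's completion. A minimal sketch of such a prompt builder (the example sentences and the "Facilities:" prompt format are my own illustration, not taken from the articles):

```python
def build_few_shot_prompt(sentence, examples):
    """Assemble a few-shot entity-extraction prompt for a text-completion LLM."""
    parts = []
    for text, entities in examples:
        parts.append(f"Text: {text}\nFacilities: {', '.join(entities)}")
    # The new sentence ends with an open "Facilities:" line for the model to complete
    parts.append(f"Text: {sentence}\nFacilities:")
    return "\n\n".join(parts)

examples = [
    ("The city opened a new library and a stadium.", ["library", "stadium"]),
    ("They renovated the hospital near the train station.", ["hospital", "train station"]),
]
prompt = build_few_shot_prompt(
    "The government planned to build new parks, a swimming pool and a commercial complex.",
    examples,
)
print(prompt)
```

You would then feed `prompt` to GPT-J / OPT / GPT-3 and parse the text the model generates after the final "Facilities:".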
Upvotes: 1
Reputation: 750
Very good answer from @anat-kumar. Indeed you have to train a model, unless you find a tool able to focus on organizations with fine-grained entities. Note that some entity typologies include facilities as a subtype of organization/location; it may be the same for establishment.
From a technological perspective, you can certainly use spaCy and train it. Rasa (https://rasa.com) is also a nice solution; in that case you only have to configure it for entities (don't bother with intents/stories).
My contribution would be to include pretrained embeddings to facilitate training. Depending on the chosen tool and language, they may already be included, or you may have to specify/implement them. The state of the art is BERT-style models, which can be integrated via Hugging Face (https://huggingface.co), where you'll find many pretrained models. This lets the system recognize entities even if they have not been annotated in your dedicated training corpus. It is now quite standard and should not be a problem from an implementation perspective.
And, most importantly, don't forget to evaluate on a separate (held-out) dataset!
Upvotes: 0
Reputation: 641
As @Sergey mentioned, you'd need a custom NER model, and spaCy really comes in handy for custom NER, given you have the training data. Here's a straightforward way to do it, using your example:
import spacy
from tqdm import tqdm
import random

# Character offsets: (0, 10) -> "Government", (21, 26) -> "parks"
train_data = [
    ('Government built new parks', {
        'entities': [(0, 10, 'ORG'), (21, 26, 'FAC')]
    }),
]
Create a Blank Model & Add 'NER' pipe
# Note: this is the spaCy v2 API; in spaCy v3 use nlp.add_pipe('ner') instead
nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner, last=True)
Training Step
n_iter = 100

# Register every entity label found in the training data
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):  # only train NER
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in tqdm(train_data):
            nlp.update(
                [text],
                [annotations],
                drop=0.25,     # dropout, to avoid memorizing the training data
                sgd=optimizer,
                losses=losses)
        print(losses)
# Test
for text, _ in train_data:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
Tune the hyper-parameters and check which settings work best for you.
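When comparing settings, it helps to score the predictions explicitly rather than eyeballing the printout. A minimal entity-level precision/recall/F1 computation in plain Python, treating entities as exact (start, end, label) tuples (the helper name and sample spans are my own illustration):

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # exact span + label matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 10, 'ORG'), (21, 26, 'FAC')]
pred = [(0, 10, 'ORG'), (17, 26, 'FAC')]   # boundary error on the second entity
p, r, f = entity_prf(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Run this over your held-out examples after each training configuration to pick the best one.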
Cheers !
Upvotes: 4