PradhanKamal

Reputation: 548

How to properly extract entities like facilities and establishments from text using NLP and entity recognition?

I need to identify all the establishments and facilities from a given text using natural language processing and NER.

Example text:

The government planned to build new parks, a swimming pool and a commercial complex for our town, and to improve the existing housing complex, schools and townhouses.

Expected entities to be identified:

parks, swimming pool, commercial complex, housing complex, schools and townhouses

I explored some Python libraries like spaCy and NLTK, but the results were not great: only 2 entities were identified. I reckon the data needs to be pre-processed properly.

What should I do to improve the results? Are there other libraries/frameworks better suited to this use case? Is there any way to train a model using our existing DB?

Upvotes: 4

Views: 2591

Answers (3)

Julien Salinas

Reputation: 1139

In 2022 you don't necessarily need to train a new model for that.

Instead, you can use a large language model like GPT-3, GPT-J, or GPT-NeoX and perform entity extraction on any sort of complex entity, without training a new model at all!

See how to install GPT-J and use it in Python here: https://github.com/kingoflolz/mesh-transformer-jax . If this model is too big for your machine, you can also use a smaller one, like this small version of OPT (by Facebook): https://huggingface.co/facebook/opt-125m

In order to understand how to use these models for NER, see this article about few-shot learning: https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html

Also see this Towards Data Science article about few-shot learning and NER: https://towardsdatascience.com/advanced-ner-with-gpt-3-and-gpt-j-ce43dc6cdb9c

Last of all, you might be interested in this video about NER with GPT-NeoX vs spaCy: https://www.youtube.com/watch?v=E-qZDwXpeY0
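To make this concrete, here is a minimal sketch of the few-shot prompting approach with the small facebook/opt-125m checkpoint mentioned above, via Hugging Face transformers. The prompt wording and the two demonstration examples are my own illustrative assumptions; larger models like GPT-J or GPT-NeoX follow the exact same pattern and extract far more reliably.

from transformers import pipeline

# Small model for demonstration only; GPT-J / GPT-NeoX work the same way.
generator = pipeline('text-generation', model='facebook/opt-125m')

# Few-shot prompt: two worked examples, then the sentence to analyze.
# The prompt format below is an illustrative assumption, not a fixed API.
prompt = """Extract the facilities mentioned in each sentence.

Sentence: The city opened a new library and a stadium last year.
Facilities: library, stadium

Sentence: Residents asked for a gym and a community hall.
Facilities: gym, community hall

Sentence: The government planned to build new parks, a swimming pool and a commercial complex for our town.
Facilities:"""

output = generator(prompt, max_new_tokens=20, do_sample=False)
print(output[0]['generated_text'][len(prompt):])  # model's continuation only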

Upvotes: 1

eldams

Reputation: 750

Very good answer from @anant-kumar: indeed, you have to train a model, unless you find a tool that can focus on organizations with fine-grained entity types. By the way, some typologies include facility as a subtype of organization/location; the same may be true for establishment.

From a technological perspective, you can certainly use spaCy and train it; Rasa (https://rasa.com) is also a nice solution, in which case you only have to configure it for entities (don't bother with intents/stories).

My contribution would be to include pretrained embeddings to facilitate training. Depending on the chosen tool and language, they may already be included, or you may have to specify/implement them. State-of-the-art approaches use BERT-style models, which can be integrated via Hugging Face (https://huggingface.co), where you'll find many pretrained models. This allows the system to recognize entities even if they were not annotated in your dedicated training corpus. It is now quite standard and should not be a problem from an implementation perspective.
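As a quick illustration, a pretrained transformer NER model can be pulled straight from the Hugging Face hub. The sketch below uses the dslim/bert-base-NER checkpoint as one example among many (it predicts the CoNLL-2003 labels PER/ORG/LOC/MISC; models trained on OntoNotes expose a FAC facility label closer to this question's needs).

from transformers import pipeline

# 'dslim/bert-base-NER' is one example checkpoint from the hub; an
# OntoNotes-trained model would add a FAC (facility) label.
ner = pipeline('ner', model='dslim/bert-base-NER', aggregation_strategy='simple')

text = ('The government planned to build new parks, a swimming pool '
        'and a commercial complex for our town.')

for ent in ner(text):
    print(ent['word'], ent['entity_group'], round(float(ent['score']), 3))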

And, most importantly, don't forget to evaluate on a separate (held-out) split of your dataset!
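For instance, a minimal sketch of such a split; the 80/20 ratio, the fixed seed and the variable names are just illustrative assumptions:

import random

# Placeholder annotated corpus; in practice, (text, annotations) pairs
# in the same format as the spaCy training data in the other answer.
annotated_data = [(f'example sentence {i}', {'entities': []}) for i in range(10)]

random.seed(42)                        # fixed seed for a reproducible split
random.shuffle(annotated_data)
cut = int(0.8 * len(annotated_data))   # 80/20 split ratio is an assumption
train_data, test_data = annotated_data[:cut], annotated_data[cut:]
# Train only on train_data; report precision/recall/F1 on test_data.
print(len(train_data), 'train /', len(test_data), 'test')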

Upvotes: 0

Anant Kumar

Reputation: 641

As @Sergey mentioned, you'd need a custom NER model, and spaCy really comes in handy for custom NER, given that you have the training data. Here's a straightforward way to do it, using your example:

import spacy
import random
from tqdm import tqdm

# Training data: (text, annotations) pairs with character-offset entity spans.
# (0, 10, 'ORG') covers "Government"; (21, 26, 'FAC') covers "parks".
train_data = [
    ('Government built new parks', {
        'entities': [(0, 10, 'ORG'), (21, 26, 'FAC')]
    }),
]

Create a blank model and add the NER pipe (note this is the spaCy v2 API; in v3, nlp.add_pipe("ner") takes the component name directly):

nlp = spacy.blank('en')        # blank English pipeline, no pretrained components
ner = nlp.create_pipe('ner')   # spaCy v2-style component creation
nlp.add_pipe(ner, last=True)

Training Step

n_iter = 100

# Register every entity label that appears in the training data.
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

# Disable all other pipes so that only the NER component is trained.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in tqdm(train_data):
            # spaCy v2 signature; in v3, nlp.update takes Example objects.
            nlp.update(
                [text],
                [annotations],
                drop=0.25,  # dropout to reduce memorization
                sgd=optimizer,
                losses=losses)
        print(losses)
 
# Test: run the trained model back over the training texts.
for text, _ in train_data:
    doc = nlp(text)
    print('Entities', [(ent.text, ent.label_) for ent in doc.ents])

Tune the hyper-parameters (e.g. the dropout rate and number of iterations) and check which settings work best for you.

Other ways to explore -

  1. Train a seq2seq model for custom NER (the Hugging Face transformers library might come in handy).
  2. Use unsupervised NER with BERT or other transformer models.
  3. More recently, large language models have delivered state-of-the-art results for NER.

Cheers!

Upvotes: 4
