How to merge entities of the same type when spaCy splits one entity into several because of ',', '\n', or some other reason

I have to extract organization names from company letters. When I extract entities, spaCy sometimes splits a single organization name into several entities because of a ',' or '\n', and occasionally for other reasons.

spacy_data = nlp(text)
# keep only the entities labelled as organizations
orgs = [ent.text for ent in spacy_data.ents if ent.label_ == 'ORG']

expected output: capital international partners vi
actual output:   capital 
                   international partners vi 

It shows up as two different organizations. I want my final output to be capital_international_partners_vi so that I can use it later to create a single-word vector.

Upvotes: 0

Views: 901

Answers (1)

Yooper

Reputation: 11

I use textacy to normalize the text after spaCy has extracted the named entities and before inserting it into my database.

from textacy.preprocess import normalize_whitespace, preprocess_text

def text_cleaner(text):
    # Lowercase the text and strip currency symbols, numbers, accents,
    # contractions and punctuation, then drop newlines.
    cleaned_text = preprocess_text(text, no_currency_symbols=True, no_numbers=True,
                                   lowercase=True, no_accents=True, no_contractions=True,
                                   no_punct=True).replace('\n', '')
    # Collapse any remaining runs of whitespace.
    return normalize_whitespace(cleaned_text)
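
Cleaning the text does not by itself join two entities that spaCy has already split. As a rough sketch of one way to do that (not something built into spaCy or textacy), you could merge neighbouring ORG spans whenever the text between them is only commas and whitespace, then join the words with underscores. The names merge_adjacent_entities and to_token are hypothetical helpers, and the spans are assumed to be (start_char, end_char, label) tuples, which you could build from ent.start_char, ent.end_char and ent.label_.

```python
import re


def merge_adjacent_entities(text, spans, label="ORG", gap_pattern=r"[\s,]*"):
    """Merge neighbouring spans of `label` separated only by commas/whitespace.

    `spans` is a list of (start_char, end_char, label) tuples. With spaCy these
    could be built as [(e.start_char, e.end_char, e.label_) for e in doc.ents].
    Both helper names are hypothetical, not part of any library.
    """
    merged = []
    for start, end, ent_label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = text[prev_end:start]
            same_label = ent_label == label and prev_label == label
            if same_label and re.fullmatch(gap_pattern, gap):
                # Extend the previous span instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end), label)
                continue
        merged.append((start, end, ent_label))
    return merged


def to_token(text, span):
    """Lowercase the span text and join its words with underscores."""
    start, end, _ = span
    return "_".join(re.findall(r"\w+", text[start:end].lower()))


text = "capital,\ninternational partners vi"
spans = [(0, 7, "ORG"), (9, 34, "ORG")]  # the two pieces from the question
merged = merge_adjacent_entities(text, spans)
print([to_token(text, s) for s in merged])
```

Be aware that this rule also merges a genuine list such as "Apple, Google" into one entity, so you may want to restrict the gap pattern (for example to newlines only) depending on how your letters are laid out.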

Upvotes: 1
