Reputation: 1
I have to extract organization names from company letters. When extracting entities, spaCy sometimes splits an organization name in two, usually at a ',' or a '\n', and occasionally for other reasons.
spacy_data = nlp(text)
orgs = [ent.text for ent in spacy_data.ents if ent.label_ == 'ORG']
expected output: capital international partners vi
actual output: capital
international partners vi
It shows up as two different organizations. I want my final output to be capital_international_partners_vi
so that I can later treat it as a single token when creating word vectors.
Upvotes: 0
Views: 901
Reputation: 11
I use textacy to normalize the data after spaCy has extracted the named entities and before inserting it into my database.
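from textacy.preprocess import normalize_whitespace, preprocess_text

def text_cleaner(text):
    # Lowercase and strip accents, contractions, punctuation, numbers and currency symbols.
    cleaned_text = preprocess_text(text, no_currency_symbols=True, no_numbers=True,
                                   lowercase=True, no_accents=True, no_contractions=True,
                                   no_punct=True).replace('\n', ' ')
    # Line breaks become a space (so adjacent words are not fused); then collapse repeated whitespace.
    return normalize_whitespace(cleaned_text)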
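If the goal is the underscore-joined form from the question, here is a minimal sketch of that last step in plain Python. It assumes the pieces of the name have already been put back into one string (for example, by cleaning the text before running the entity recognizer). `to_single_token` is a hypothetical helper name of my own, not part of spaCy or textacy.

```python
import re

def to_single_token(entity_text):
    """Collapse an entity string into one lowercase, underscore-joined token.

    Hypothetical helper for illustration; not a spaCy or textacy function.
    """
    # Treat any run of non-word characters (commas, newlines, spaces)
    # or underscores as a single separator.
    parts = re.split(r"[\W_]+", entity_text.lower())
    # Drop empty pieces produced by leading or trailing separators.
    return "_".join(part for part in parts if part)

print(to_single_token("Capital\nInternational Partners, VI"))
# prints: capital_international_partners_vi
```

Because the whole name ends up as one token, it can be used as a single vocabulary entry when training word vectors.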
Upvotes: 1