Reputation: 51
I am attempting to extract names with the NLTK Python module.
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
#!pip install svgling
# One-time downloads of the tokenizer, tagger, and chunker data used below
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
text = "Elon Musk 889-888-8888 [email protected] Jeff Bezos (345)123-1234 [email protected] Reshma Saujani [email protected] 888-888-8888 Barkevious Mingo"
# Tokenize, POS-tag, then chunk named entities
nltk_results = ne_chunk(pos_tag(word_tokenize(text)))
for nltk_result in nltk_results:
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    if isinstance(nltk_result, Tree):
        name = ''
        for nltk_result_leaf in nltk_result.leaves():
            name += nltk_result_leaf[0] + ' '
        print('Type: ', nltk_result.label(), 'Name: ', name)
The output I get from the code above is:
Type: PERSON Name: Elon
Type: GPE Name: Musk
Type: PERSON Name: Jeff Bezos
Type: ORGANIZATION Name: Barkevious Mingo
This is not correct. First, some names are split up, even fairly common ones like Elon Musk. Second, not all names are identified. The desired output would be:
Type: PERSON Name: Elon Musk
Type: PERSON Name: Jeff Bezos
Type: PERSON Name: Reshma Saujani
Type: PERSON Name: Barkevious Mingo
Is there a better option in Python?
Upvotes: 1
Views: 1484
Reputation:
You could give spaCy a try:
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_lg
NER = spacy.load("en_core_web_lg")
raw_text = "Elon Musk 889-888-8888 [email protected] Jeff Bezos (345)123-1234 [email protected] Reshma Saujani [email protected] 888-888-8888 Barkevious Mingo"
text = NER(raw_text)
# Print each detected entity with its label
for word in text.ents:
    print(word.text, word.label_)
Output:
Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON
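Since the question only asks for people, the CARDINAL hits can be dropped by filtering on the label. The following is a minimal sketch, not part of the original answer: `person_names` is a hypothetical helper, and the sample pairs are copied from the output above rather than produced by running the model. Note that the output above contains no entity for Barkevious Mingo, so filtering alone would not reach all four desired names.

```python
from typing import Iterable, List, Tuple


def person_names(entities: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the text of every entity labelled PERSON, in input order.

    Hypothetical helper for illustration; it only filters (text, label) pairs.
    """
    return [text.strip() for text, label in entities if label == "PERSON"]


# Sample pairs copied from the output above. With spaCy they could be built as
# [(ent.text, ent.label_) for ent in text.ents]
sample = [
    ("Elon Musk", "PERSON"),
    ("889-888", "CARDINAL"),
    ("Jeff Bezos", "PERSON"),
    ("345)123", "CARDINAL"),
    ("Reshma Saujani", "PERSON"),
]

for name in person_names(sample):
    print("Type: PERSON Name:", name)
```

Keeping the filter separate from the model call means the same helper could also be fed pairs collected from the NLTK loop in the question.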
Upvotes: 2