Reputation: 33
I want to extract people's names from a text file. I use NLTK, but it returns names as well as phrases that are not names:
import nltk
from nltk import ne_chunk, pos_tag

def extract_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = pos_tag(tokens)
    sentt = ne_chunk(pos, binary=False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1:  # avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            # remove_useless_name is my own helper (not shown here)
            name = remove_useless_name(name)
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return person_list
I want to remove the phrases that are not names. Which method should I use for that? Sample input:
"Sunder Pichai"
"View Profile"
"Risk Management"
Sample output:
"Sunder Pichai"
Upvotes: 1
Views: 1949
Reputation: 57085
NLTK provides a corpus of English words (nltk.corpus.words.words('en')) and a corpus of common English names (nltk.corpus.names.words()). Unfortunately, the latter does not contain Sunder or Pichai, so you have to rely on the former. Unfortunately again, some names are also ordinary English words (e.g., Hope), which makes the task even more challenging. You can still automate it to some extent:
import nltk

words = set(nltk.corpus.words.words('en'))

def isname1(string):
    # A name if at least one word is NOT in the English word list
    return any(w not in words for w in string.lower().split())

def isname2(string):
    # A name only if NONE of the words is in the English word list
    return all(w not in words for w in string.lower().split())

list(map(isname1, ["Sunder Pichai", "View Profile", "Risk Management"]))
#[True, False, False]
list(map(isname2, ["Sunder Pichai", "View Profile", "Risk Management"]))
#[False, False, False]
As you can see, the second function is more aggressive and does not recognize "Sunder Pichai" as a name (because "sunder" is actually an English word).
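To apply the first check to a list of candidates, such as the one returned by extract_names, something like the following might work. It is only a sketch: keep_probable_names and demo_vocabulary are hypothetical names, and the tiny vocabulary is a stand-in so the snippet runs without downloading any corpus. In practice you would pass set(nltk.corpus.words.words('en')) instead.

```python
def keep_probable_names(candidates, vocabulary):
    """Keep phrases that contain at least one token missing from vocabulary."""
    kept = []
    for phrase in candidates:
        tokens = phrase.lower().split()
        # Same rule as isname1: one out-of-vocabulary token is enough
        if any(token not in vocabulary for token in tokens):
            kept.append(phrase)
    return kept


# Stand-in vocabulary for illustration only (assumption, not the NLTK corpus)
demo_vocabulary = {"sunder", "view", "profile", "risk", "management"}

candidates = ["Sunder Pichai", "View Profile", "Risk Management"]
filtered = keep_probable_names(candidates, demo_vocabulary)
print(filtered)  # ['Sunder Pichai']
```

Passing the vocabulary as an argument keeps the filter easy to test and lets you swap in a larger or domain-specific word list later.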
Upvotes: 2
Reputation: 125
Maybe use a dictionary, and check whether every part of the name is a real word and/or whether the surname is a known name.
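One possible reading of this suggestion, as a rough sketch: treat a phrase as a name if any of its words is a known name or is missing from the dictionary. The function name is_probable_name and both small sets below are assumptions made up for illustration. With NLTK you could build them from nltk.corpus.words.words('en') and nltk.corpus.names.words(), lowercased.

```python
def is_probable_name(phrase, vocabulary, known_names):
    """Return True if some token is a known name or is not a dictionary word."""
    tokens = phrase.lower().split()
    return any(t in known_names or t not in vocabulary for t in tokens)


# Hypothetical stand-in data, not taken from any real corpus
vocabulary = {"hope", "view", "profile", "risk", "management", "sunder"}
known_names = {"hope", "john"}

for phrase in ["Sunder Pichai", "Hope", "View Profile", "Risk Management"]:
    print(phrase, is_probable_name(phrase, vocabulary, known_names))
```

Compared with a dictionary-only check, the known-names set lets a phrase like "Hope" pass even though it is also an ordinary word. The trade-off is more false positives for phrases that merely contain such a word.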
Upvotes: 0