Todd

Reputation: 439

How to remove names of people in a corpus using Python

I've been searching for this for a long time, and most of the material I've found was about named entity recognition. I'm running topic modeling, but my data contains too many people's names in the texts.
Is there any Python library which contains (English) names of people? Or if not, what would be a good way to remove people's names from each document in a corpus? Here's a simple example:

texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]

Upvotes: 3

Views: 9103

Answers (2)

Yuri Khristich

Reputation: 14537

Not sure if this solution is efficient and robust, but it's simple to understand (to me, at the very least):

import re

# load a list of existing first names (over 18,000) from a file
with open('names.txt', 'r') as f:
    NAMES = set(f.read().splitlines())

# your list of texts
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
"Kevin was nice and Kevin's home had a huge parking spaces."]

# join the texts into one string
texts = ' | '.join(texts)

# find all the words that look like names
pattern = r"(\b[A-Z][a-z]+('s)?\b)"
found_names = re.findall(pattern, texts)

# strip the possessive suffixes and remove duplicates
# (findall returns tuples; index 0 is the full match)
found_names = {name[0].replace("'s", "") for name in found_names}

# keep only the candidates that actually appear in NAMES
found_names = [name for name in found_names if name in NAMES]

# loop through the found names and remove each one from the texts
for name in found_names:
    # word boundaries keep the match from hitting substrings of longer words;
    # ('s)? also removes the possessive form
    texts = re.sub(r"\b" + name + r"('s)?\b", "", texts)

# split the texts back to the list
texts = texts.split(' | ')

print(texts) 

Output:

[' home was clean and spacious. I would love to visit again soon.',
' was nice and  home had a huge parking spaces.']

The list of names was obtained here: https://www.usna.edu/Users/cs/roche/courses/s15si335/proj1/files.php?f=names.txt
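As an aside on the "is there any Python library which contains names" part of the question: NLTK ships a bundled names corpus (roughly 8,000 English first names) that could stand in for the downloaded file. A minimal sketch, assuming nltk is installed and can fetch the corpus once:

import nltk
from nltk.corpus import names

# one-time download of NLTK's bundled corpus of English first names
nltk.download('names', quiet=True)

# combine the male and female name lists into one lookup set
NAMES = set(names.words())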

And I completely endorse @James_SO's recommendation to use smarter tools.

Upvotes: 0

James_SO

Reputation: 1387

I would suggest using a tokenizer with some capability to recognize and differentiate proper nouns. spacy is quite versatile, and its default tokenizer does a decent job of this.

There are hazards to using a list of names as if they're stop words - let me illustrate:

import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm")
texts=["Melissa's home was clean and spacious. I would love to visit again soon.",
       "Kevin was nice and Kevin's home had a huge parking spaces."
      "Bill sold a work of art to Art and gave him a bill"]
tokenList = []
for i, sentence in enumerate(texts):
    doc = nlp(sentence)
    for token in doc:
        tokenList.append([i, token.text, token.lemma_, token.pos_, token.tag_, token.dep_])
tokenDF = pd.DataFrame(tokenList, columns=["i", "text", "lemma", "POS", "tag", "dep"]).set_index("i")
print(tokenDF)  # or just `tokenDF` in a notebook

So the first two sentences are easy, and spacy identifies the proper nouns (PROPN):

(screenshot: token table for the first two sentences, with Melissa and Kevin tagged PROPN)

Now, the third sentence has been constructed to show the issue - lots of people have names that are also things. spacy's default tokenizer isn't perfect, but it does a respectable job with the two sides of the task: don't remove names when they are being used as regular words (e.g. bill of goods, work of art), and do identify them when they are being used as names. (You can see that it messed up one of the references to Art, the person.)

(screenshot: token table for the third sentence)
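For completeness, here is a minimal sketch of going from tagging to actual removal, assuming the PERSON entity labels from en_core_web_sm are acceptable for your corpus (as the Art example shows, they will occasionally miss or mislabel a name): drop every token inside a PERSON entity, plus any possessive 's that trails one, then rebuild each text.

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Melissa's home was clean and spacious. I would love to visit again soon.",
         "Kevin was nice and Kevin's home had a huge parking spaces."]

cleaned = []
for doc in nlp.pipe(texts):
    kept = []
    for token in doc:
        # drop tokens inside a PERSON entity
        if token.ent_type_ == "PERSON":
            continue
        # drop a possessive 's that immediately follows a PERSON token
        if token.tag_ == "POS" and token.i > 0 and doc[token.i - 1].ent_type_ == "PERSON":
            continue
        kept.append(token.text_with_ws)
    cleaned.append("".join(kept).strip())

print(cleaned)  # the person names (and their possessive 's) are gone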

Upvotes: 6
