Reputation: 21
I am trying to extract all the first and last names from a magazine article (I called it example.txt). I am doing it in two parts.
In the first part, I use a regex to extract every string that consists of two words separated by a space, each starting with a capital letter. I collect these strings in a list called all_names. This gives me all possible names like "Barack Obama", but also "The President".
In the second part, I split each string and take its first word, say "Barack" from "Barack Obama", and check whether it is in the list of first names that I prepared ahead of time (I called it first_names.txt). If, and only if, there is a match, I add the full name to a new array, which is supposed to contain only names whose first word appears in first_names.txt.
So in theory, "Barack Obama" gets into the array and "The President" does not. Unfortunately, the substring "The" from "The President" is found in first names such as "Matthew" and "Katherine" and so "The President" also gets into the array even though I don't want it to. My code is below. Any suggestions on how to resolve this?
import re
text = open('example.txt').read()
first_names = open('first_names.txt').read()
regex = re.compile("[A-Z][a-z]+\s[A-Z][\w]*")
all_names = regex.findall(text)
array = []
for name in all_names:
    first = name.split(" ")[0]
    if first in first_names:
        if name not in array:
            array.append(name)
print(array)
Upvotes: 1
Views: 55
Reputation: 12990
You could split first_names and create a set of those names (assuming the first names in your file are separated by whitespace):
first_names = set(open('first_names.txt').read().split())
Then if first in first_names will check whether the exact first name is in that set, in O(1) time on average. This also solves your problem of excluding "The President", because "The" in first_names will return False.
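To make the difference concrete, here is a minimal sketch of the two membership tests side by side. The name list is hypothetical (it is not the asker's actual first_names.txt), and it includes "Theodore" on purpose: the `in` test on a string is a case-sensitive substring search, so a capitalized "The" is found inside a name like that.

```python
# Hypothetical contents of first_names.txt, used only for illustration.
first_names_text = "Barack Theodore Katherine"

# `in` on a str is a substring test: "The" is found inside "Theodore".
substring_match = "The" in first_names_text

# `in` on a set is an exact-element test: "The" is not one of the names.
first_names = set(first_names_text.split())
exact_match = "The" in first_names

print(substring_match, exact_match)
```

With the string, any fragment of any name counts as a hit; with the set, only whole names do.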
Here's what this looks like with a simple example:
first_names_text = "Barack Matthew Katherine"
first_names = set(first_names_text.split())
all_names = ['Barack Obama', 'The President', 'Katherine Swift']
array = []
for name in all_names:
    first = name.split(" ")[0]
    if first in first_names:
        if name not in array:
            array.append(name)
print(array)
# ['Barack Obama', 'Katherine Swift']
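Putting the regex step and the set lookup together might look roughly like the sketch below. The function name `extract_names`, the `seen` set, and the sample strings are my own assumptions for illustration. The pattern is the one from the question, only written as a raw string. With real files you would pass in open('example.txt').read() and open('first_names.txt').read() instead of the sample strings.

```python
import re

# Same pattern as in the question, written as a raw string.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+\s[A-Z]\w*")


def extract_names(text, first_names_text):
    """Return candidate names whose first word is a known first name."""
    # split() with no argument handles spaces and newlines alike.
    first_names = set(first_names_text.split())
    seen = set()   # names already added, for fast duplicate checks
    result = []    # keeps the order of first appearance
    for name in NAME_PATTERN.findall(text):
        first = name.split()[0]
        if first in first_names and name not in seen:
            seen.add(name)
            result.append(name)
    return result


# Hypothetical sample data standing in for the two files.
sample_text = ("The President spoke. Barack Obama met Katherine Swift, "
               "and then Barack Obama left.")
sample_first_names = "Barack\nMatthew\nKatherine"

names = extract_names(sample_text, sample_first_names)
print(names)
# ['Barack Obama', 'Katherine Swift']
```

The extra `seen` set is optional. It just avoids the linear scan that `name not in array` does on a list, which only starts to matter for long articles.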
Upvotes: 1