Reputation: 1031
I'm working on a personal project that counts how often names are mentioned in text. I know I can do the counting with collections.Counter(), but I'm not sure how to account for aliases efficiently.
For example, say one of the names I want to count is "Tim", but I would also like to count any nicknames he has, like "Timmy" and "Timster".
I have some strings: "Oh Tim is going to the party?", "Yeah, my boy Timmy, wouldn't miss it, he loves to party!", "Whoa, the Timster himself is going? Count me in!" I'd like all of these to count under a single name like "Tim". I know I could count each alias individually and then add the counts together, but I feel like there is a better way to do it.
I.e., I want my code to look more like this:
names = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
    # ... other names here.
}
# Add any occurrence of a Tim alias to Tim, and occurrences of other aliases to their main name.
As opposed to something like
total_tim = Counter(tim) + Counter(timmy) + Counter(timster)
and so on for each name. Does anyone have any idea how I would go about doing this?
Upvotes: 1
Views: 115
Reputation: 1439
Here's a really simple solution using regex.
The advantage of this solution is that you don't have to list every variation explicitly. As long as you know how the variations of a person's first name begin, a prefix pattern will catch them (along with any other word that happens to start the same way).
from collections import Counter
import re
TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''
tim_search = r'(Tim([a-z]*)?(?=\ ?))'
bill_search = r'((B|W)ill([a-z]*)?(?=\ ?))'

def name_counter(regex_string):
    # findall returns one tuple of groups per match; the first item is the full name.
    return Counter([i for i, *j in re.findall(regex_string, TEXT)])

name_counter(tim_search)
# Counter({'Tim': 2, 'Timmy': 1, 'Timster': 1})
name_counter(bill_search)
# Counter({'Bill': 3, 'Billy': 1, 'William': 1})
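The patterns above give one count per variant, while the question asks for a single total per person. A minimal sketch of rolling the variants up, assuming each person can be described by one prefix pattern: `PATTERNS` and `total_counts` are names made up for this example, and the `\b` anchors are an added assumption so that a pattern cannot match in the middle of a longer word.

```python
import re
from collections import Counter

TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''

# Hypothetical mapping: base name -> prefix pattern covering its variants.
PATTERNS = {
    'Tim': r'\bTim[a-z]*\b',
    'Bill': r'\b[BW]ill[a-z]*\b',
}

def total_counts(text):
    """Return one total per base name, summing every variant that matches."""
    return Counter({base: len(re.findall(pattern, text))
                    for base, pattern in PATTERNS.items()})

totals = total_counts(TEXT)
print(totals)  # Counter({'Bill': 5, 'Tim': 4})
```

Keep in mind that a prefix pattern cannot tell a nickname from an unrelated word: "Timber" or "Willow" would be counted too, so this only suits text where such collisions are unlikely.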
Upvotes: 0
Reputation: 5012
Using regex will help solve this.
import re
your_dict = {"Tim":["Tim","Timmy","Timster"]}
s = "Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, he loves to party! Whoa, the Timster himself is going? Count me in!"
for each in your_dict:
    print(each, "count = ", len(re.findall("|".join(sorted(your_dict[each], reverse=True)), s)))
If you want to ignore case, pass the re.IGNORECASE flag to re.findall.
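As a rough sketch of what the case-insensitive version might look like: `count_aliases` is a hypothetical helper, and the `\b` word boundaries and `re.escape` call are additions not in the answer above, included on the assumption that "Tim" should not be counted inside words like "Time" and that an alias might contain regex metacharacters.

```python
import re

your_dict = {"Tim": ["Tim", "Timmy", "Timster"]}
s = "oh tim is going? Yeah, my boy TIMMY wouldn't miss it. The Timster himself!"

def count_aliases(aliases, text):
    # Escape each alias so characters like "." are matched literally.
    # Sorting longest-first is only a precaution: it keeps long aliases
    # from being shadowed if the \b anchors are ever removed.
    alternation = "|".join(
        re.escape(a) for a in sorted(aliases, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return len(pattern.findall(text))

counts = {name: count_aliases(aliases, s) for name, aliases in your_dict.items()}
print(counts)  # {'Tim': 3}
```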
Upvotes: 1
Reputation: 42421
from collections import Counter
TEXT = '''
Blah Tim blah blah Timmy blah Timster blah Tim
Blah Bill blah blah William blah Billy blah Bill Bill
'''
words = TEXT.split()
# Base names and their aliases.
ALIASES = dict(
    Tim = {'Tim', 'Timmy', 'Timster'},
    Bill = {'Bill', 'William', 'Billy'},
)
# Given any name, find its base name.
BASE_NAMES = {a : nm for nm, aliases in ALIASES.items() for a in aliases}
# All names.
ALL_NAMES = set(nm for aliases in ALIASES.values() for nm in aliases)
# Count up all names.
detailed_tallies = Counter(w for w in words if w in ALL_NAMES)
# Then build the summary counts from those details.
summary_tallies = Counter()
for nm, n in detailed_tallies.items():
    summary_tallies[BASE_NAMES[nm]] += n
print(detailed_tallies)
print(summary_tallies)
# Counter({'Bill': 3, 'Tim': 2, 'Timmy': 1, 'Timster': 1, 'William': 1, 'Billy': 1})
# Counter({'Bill': 5, 'Tim': 4})
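One caveat with `TEXT.split()`: it leaves punctuation attached, so a token like "Timmy," from the question's example sentences would not be found in `ALL_NAMES`. A possible variation, assuming names are plain runs of word characters, is to tokenize with `re.findall(r"\w+", ...)` and map each alias straight to its base name in one pass. `count_base_names` is a name invented for this sketch.

```python
import re
from collections import Counter

ALIASES = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
    'Bill': {'Bill', 'William', 'Billy'},
}
# Reverse lookup: alias -> base name.
BASE_NAMES = {alias: base for base, aliases in ALIASES.items() for alias in aliases}

def count_base_names(text):
    # \w+ drops surrounding punctuation, so "Timmy," becomes "Timmy".
    tokens = re.findall(r"\w+", text)
    return Counter(BASE_NAMES[t] for t in tokens if t in BASE_NAMES)

sentences = [
    "Oh Tim is going to the party?",
    "Yeah, my boy Timmy, wouldn't miss it, he loves to party!",
    "Whoa, the Timster himself is going? Count me in!",
]
summary = count_base_names(" ".join(sentences))
print(summary)  # Counter({'Tim': 3})
```

This skips the intermediate per-alias tally; if the detailed breakdown is also needed, the two-step version above is the better fit.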
Upvotes: 2