Reputation: 1031

How to efficiently count string aliases?

I'm working on a personal project that counts instances of names mentioned in text. I know I can do it with collections.Counter() but I'm not sure how to account for aliases efficiently.

For example, say one of the names I want to count is "Tim", but I would also like to count any nicknames he has, like "Timmy" and "Timster".

I have some strings such as "Oh Tim is going to the party?", "Yeah, my boy Timmy, wouldn't miss it, he loves to party!", and "Whoa, the Timster himself is going? Count me in!"

I'd like all of these to count toward a single name, "Tim". I know I could count each alias individually and then add the counts together, but I feel like there is a better way to do it.

I.e., I want my code to look more like this:

names = {
    'Tim':{'Tim', 'Timmy', 'Timster'},
    ... other names here.}
# add any occurrence of Tim names to Tim and other occurrences of other names to their main name.

As opposed to something like

total_tim = Counter(tim) + Counter(timmy) + Counter(timster)  # etc.

for each name. Does anyone have any idea how I would go about doing this?

Upvotes: 1

Answers (3)

W Stokvis

Reputation: 1439

Here's a really simple solution using regex.

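What's good about this solution is that you don't have to explicitly name the variations. As long as you know the stem each variation starts with, the pattern picks up the rest. Keep in mind that it also matches any other word that begins with the same stem, such as "Timber".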

from collections import Counter
import re

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''

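# Raw strings, so the backslash reaches the regex engine unchanged.
tim_search = r'(Tim([a-z]*)?(?=\ ?))'
bill_search = r'((B|W)ill([a-z]*)?(?=\ ?))'

def name_counter(regex_string):
    # findall returns one tuple of groups per match; the first group is the full name.
    return Counter([i for i, *j in re.findall(regex_string, TEXT)])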

name_counter(tim_search)
Counter({'Tim': 2, 'Timmy': 1, 'Timster': 1})

name_counter(bill_search)
Counter({'Bill': 3, 'Billy': 1, 'William': 1})
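The counters above are still split by variation, while the question asks for one total per person. A minimal sketch of rolling them up, assuming a hypothetical PATTERNS dict and base_name_totals helper (neither is part of the answer above) and slightly tightened patterns that use \b word boundaries:

```python
from collections import Counter
import re

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''

# Hypothetical mapping: base name -> pattern covering its variations.
PATTERNS = {
    'Tim': r'\bTim[a-z]*\b',
    'Bill': r'\b[BW]ill[a-z]*\b',
}

def base_name_totals(text, patterns):
    """Return a Counter mapping each base name to the number of pattern matches."""
    totals = Counter()
    for base, pattern in patterns.items():
        totals[base] = len(re.findall(pattern, text))
    return totals

totals = base_name_totals(TEXT, PATTERNS)
print(totals)
```

With the sample TEXT this should give 4 for Tim and 5 for Bill. The same caveat applies as before: any word starting with the stem is counted.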

Upvotes: 0

rawwar

Reputation: 5012

Using regex will help solve this.

import re
your_dict = {"Tim":["Tim","Timmy","Timster"]}
s = "Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, he loves to party! Whoa, the Timster himself is going? Count me in!"
for each in your_dict:
    print(each,"count = ", len(re.findall("|".join(sorted(your_dict[each],reverse=True)),s)))

If you want to ignore case, pass the re.IGNORECASE flag to re.findall.
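A slightly more defensive variant, in case it helps: it wraps the alternation in \b word boundaries so "Tim" is not counted inside a longer word like "Timber", escapes each alias with re.escape, and makes case-insensitivity optional. The count_aliases helper and its ignore_case parameter are names made up for this sketch.

```python
import re

your_dict = {"Tim": ["Tim", "Timmy", "Timster"]}
s = ("Oh Tim is going to the party? Yeah, my boy Timmy, wouldn't miss it, "
     "he loves to party! Whoa, the Timster himself is going? Count me in!")

def count_aliases(text, aliases, ignore_case=False):
    """Count whole-word occurrences of any alias in text."""
    # Longest aliases first, each escaped so it is matched literally.
    alternatives = "|".join(
        re.escape(a) for a in sorted(aliases, key=len, reverse=True)
    )
    pattern = rf"\b(?:{alternatives})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))

counts = {name: count_aliases(s, aliases) for name, aliases in your_dict.items()}
print(counts)  # {'Tim': 3}
```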

Upvotes: 1

FMc

Reputation: 42421

from collections import Counter

TEXT = '''
    Blah Tim blah blah Timmy blah Timster blah Tim
    Blah Bill blah blah William blah Billy blah Bill Bill
'''
words = TEXT.split()

# Base names and their aliases.
ALIASES = dict(
    Tim = {'Tim', 'Timmy', 'Timster'},
    Bill = {'Bill', 'William', 'Billy'},
)

# Given any name, find its base name.
BASE_NAMES = {a : nm for nm, aliases in ALIASES.items() for a in aliases}

# All names.
ALL_NAMES = set(nm for aliases in ALIASES.values() for nm in aliases)

# Count up all names.
detailed_tallies = Counter(w for w in words if w in ALL_NAMES)

# Then build the summary counts from those details.
summary_tallies = Counter()
for nm, n in detailed_tallies.items():
    summary_tallies[BASE_NAMES[nm]] += n

print(detailed_tallies)
print(summary_tallies)

# Counter({'Bill': 3, 'Tim': 2, 'Timmy': 1, 'Timster': 1, 'William': 1, 'Billy': 1})
# Counter({'Bill': 5, 'Tim': 4})
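One thing to watch with TEXT.split() is punctuation: in the sentences from the question, "Timmy," keeps its trailing comma and would not be found in ALL_NAMES. A minimal sketch of the same lookup idea applied to those sentences, assuming it is acceptable to tokenize with re.findall(r"\w+", ...) instead of split() (the lines list and summary name are just for illustration):

```python
from collections import Counter
import re

ALIASES = {
    'Tim': {'Tim', 'Timmy', 'Timster'},
}

# Given any alias, find its base name.
BASE_NAMES = {alias: base for base, aliases in ALIASES.items() for alias in aliases}

lines = [
    "Oh Tim is going to the party?",
    "Yeah, my boy Timmy, wouldn't miss it, he loves to party!",
    "Whoa, the Timster himself is going? Count me in!",
]

summary = Counter()
for line in lines:
    # \w+ yields runs of word characters, so surrounding punctuation is dropped.
    for word in re.findall(r"\w+", line):
        if word in BASE_NAMES:
            summary[BASE_NAMES[word]] += 1

print(summary)  # Counter({'Tim': 3})
```

Each word costs one dictionary lookup, so the work grows with the length of the text rather than with the number of aliases.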

Upvotes: 2
