Sebastian Goslin
Sebastian Goslin

Reputation: 497

Matching partial values in DF column to specified list and retrieving the frequency

I have a data frame of the form:

              Room Location
0                 jc room g
1               merten 3005
2               merten 2500
3               merten 3005
4               merten 3005
5               merten 3005
6                 jc bistro

I'm currently trying to parse through this specific columns to extract all values within this stated list:

room_list = ['jc','sub', 'hub', 'merten', 'rsch', 'corner pocket', 'mix', 'fenwick']

Thus (and rather inefficiently) I use:

room_list = (MASTER_TABLE['Room Location'].astype(str).str.lower()).tolist()

room_string = ''.join(room_list)

room_freq = re.findall(r'|'.join(room_list), room_string)

To reiterate, after some prepocessing, I turn the column into a list, then a string, then execute this:

freqs = {}

for item in room_freq:
    freqs[item] = freqs.get(item, 0) + 1

num_sort_freqs = dict(sorted(freqs.items(), key=lambda x: x[1], reverse = True))

print('Sorted name occurences: ','\n')
print('===================================================================\n')

for k, v in num_sort_freqs.items():
    print(k, v)

This works smashingly however, when I return the "sorted" dictionary I get this:

merten 1204 39
jc cinema 35
merten 2500 31
jc gold rm 31
the hub corner pocket 30

Frustratingly it does exactly as its told, my question is as follows:

How do I parse through the column (or list -> string I should say) and partially match an element of the column with that of the specified list, EVEN if the element has excess "noise" attached to it or for example:

jc room g = jc
merten 3005 = merten

Ideally the return dict should only have 5 keys to return, those of room_list and their subsequent frequencies. I've also tried stripping all numbers but since some of the elements have excess string noise the problem persists.

I've scoured previous related questions and didn't find anything specific to my problem, but if you find a (RELEVANT) dupe please let me know so I don't waste your time. Thanks!

Upvotes: 1

Views: 46

Answers (1)

Frenchy
Frenchy

Reputation: 17027

my solution here:

room_list = ['jc','sub', 'hub', 'merten', 'rsch', 'corner pocket', 'mix', 'fenwick']
for tag in room_list:
    result = df['Room Location'].str.lower().str.contains(tag).sum()
    print(f'{tag} : {result}')

output:

jc : 2            
sub : 0          
hub : 0          
merten : 5       
rsch : 0         
corner pocket : 0
mix : 0          
fenwick : 0 

if you want a dict as output, just do that:

result= {}
for tag in room_list:
    result[tag] = df['Room Location'].str.lower().str.contains(tag).sum()

or the solution by Chris:

result = {tag : df['Room Location'].str.lower().str.contains(tag).sum() for tag in room_list}

Upvotes: 2

Related Questions