pandas string comparison by splitting using value counts

Question

I'm new to pandas, and I'm trying to match strings that have low value_counts and change them to similar strings that have higher value counts. I've tried splitting the strings, but I can't figure out the next part. What I want to do is check whether the split words that are longer than 2 characters appear in the strings with higher value counts, and then take the match with the highest value count, so that all matching strings in that column end up the same. There are about 18 distinct values, and my aim is to get this down to the lowest possible number, which should be about 10 or 11. This comes from a larger dataframe, and there are a lot of similar groups to repeat this on. This is what I've done so far.

vc = data['event_name'].value_counts()
str_arr = []
for v in vc[vc < 10].index:
    str_arr.append(v.split())

Then I can manually check for the strings:

data[data['event_name'].str.contains(str1, str2)]

I'm not sure how to match and update in the data frame using a loop, and also ensure that the low value_count strings aren't included in the strings to match.

EFT · Accepted Answer

If you start with

vc = data.merge(data['event_name'].value_counts().reset_index(),
                left_on='event_name', right_on='index', how='left')

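to get the value_counts associated with each row of the initial dataframe (on pandas before 2.0, where reset_index() names the count table's columns index and event_name, the merge leaves the original string in event_name_x and its count in event_name_y), and replace your setup step with*

vc['long words'] = vc['event_name_x'].str.replace(r'\s\S\S?\s|\A\S\S?\s|\s\S\S?\Z',
                                                  ' ', regex=True).str.strip()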

to create a field with just the longer words, then you can follow up with

vc_max = vc.sort_values('event_name_y', ascending=False).drop_duplicates('long words')

to identify the most frequent value for each set of matching longer words, and use

vc.merge(vc_max, on='long words', how='left')

to match these to each row, which, since the index has remained the same, can be assigned with

data['event_name'] = vc.merge(vc_max, on='long words', how='left')['event_name_x_y']
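The steps above can be sketched end to end on a small made-up frame. This is only an illustration under stated assumptions: the sample data and the column names name, n and canonical are hypothetical and chosen here so that the result does not depend on how a given pandas version labels the value_counts() output or on merge suffixes. The regex is written as a raw string with \Z as the end-of-string anchor, and the final assignment uses to_numpy() so it works by position even if data does not have the default 0..n-1 index.

```python
import pandas as pd

# Hypothetical sample data, not from the original question.
data = pd.DataFrame({'event_name': [
    'city marathon', 'city marathon', 'city marathon',
    'city marathon 5k', 'fun run', 'fun run', 'fun run at']})

# 1. Attach each row's frequency, using explicit (assumed) column names.
counts = (data['event_name'].value_counts()
          .rename_axis('name').reset_index(name='n'))
vc = data.merge(counts, left_on='event_name', right_on='name', how='left')

# 2. Build the matching key by stripping words of one or two characters.
pattern = r'\s\S\S?\s|\A\S\S?\s|\s\S\S?\Z'
vc['long words'] = (vc['event_name']
                    .str.replace(pattern, ' ', regex=True)
                    .str.strip())

# 3. For each key, keep the spelling with the highest count.
vc_max = (vc.sort_values('n', ascending=False)
            .drop_duplicates('long words')[['long words', 'event_name']]
            .rename(columns={'event_name': 'canonical'}))

# 4. Map every row to its canonical spelling; a left merge keeps row order.
merged = vc.merge(vc_max, on='long words', how='left')
data['event_name'] = merged['canonical'].to_numpy()

print(data['event_name'].value_counts())
```

With this sample, 'city marathon 5k' and 'fun run at' are folded into 'city marathon' and 'fun run'. If two spellings of the same key have equal counts, which one wins depends on the sort order, so a tie-breaking rule may be worth adding for real data.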

*If you want to stick with lists/don't like regex, the below would also work

    vc['long words'] = [' '.join([string for string in split if len(string) > 2])
                        for split in vc['event_name_x'].str.split().tolist()]
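As a rough comparison of the two ways of building the key, the sketch below runs the list-based version and the regex side by side on a few made-up strings. The helper long_words_only and its min_len parameter are hypothetical names introduced for illustration. One difference that shows up: when two short words sit next to each other, the regex consumes the space between them with the first match, so the second short word can survive, while the list-based version drops both.

```python
import pandas as pd


def long_words_only(name, min_len=3):
    """Keep only the words of `name` with at least `min_len` characters."""
    return ' '.join(word for word in name.split() if len(word) >= min_len)


# Hypothetical sample strings; the last one has two short words in a row.
names = pd.Series(['fun run at park', '5k city marathon', 'run at 5k park'])

# List-based key: split on whitespace, filter by length, join again.
by_list = names.map(long_words_only)

# Regex-based key, written as a raw string with \Z as the end anchor.
pattern = r'\s\S\S?\s|\A\S\S?\s|\s\S\S?\Z'
by_regex = names.str.replace(pattern, ' ', regex=True).str.strip()

print(pd.DataFrame({'name': names, 'by_list': by_list, 'by_regex': by_regex}))
```

For the first two strings both columns agree ('fun run park' and 'city marathon'). For 'run at 5k park' the list version gives 'run park' while the regex leaves 'run 5k park', so the list version is probably the safer choice if short words can be adjacent in the real data.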
