Calculating most frequently occuring row-specific combinations among dataframe in Python

Question

I have a dataframe that contains text separated by a comma

1  a,b,c,d
2  a,b,e,f
3  a,b,e,f

I am trying to have an output that prints the top 2 most common combinations of 2 letters + the # of occurrences among the entire dataframe. So based on the above dataframe the output would be

(a,b,3) (e,f,2)

The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes there are more combos that occur 2 times but we can just cut it off here to keep it simple) I am really stumped on just how to even start this. I was thinking of maybe looping through each row and somehow storing all combinations, and at the end we can print out the top n combinations and how many times they occurred in the dataframe. Below is what I have so far according to what I have in mind.

import pandas as pd   
from io import StringIO   

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""") 

df = pd.read_csv(StringData, sep =";")   

for index, row in df.iterrows():
    (somehow get and store all possible 2 word combos?)

David M. · Accepted Answer

You can do it this way:

import numpy as np
import pandas as pd
from io import StringIO

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")

df = pd.read_csv(StringData, sep =";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)

print(df.iloc[:2, :])

Output:

  combinations
0    [a, b, 3]
2    [b, e, 2]

Calculating most frequently occuring row-specific combinations among dataframe in Python

Answers (1)

Related Questions