Reputation: 345
I have a dataframe that contains text separated by a comma
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to have an output that prints the top 2 most common combinations of 2 letters + the # of occurrences among the entire dataframe. So based on the above dataframe the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes there are more combos that occur 2 times but we can just cut it off here to keep it simple) I am really stumped on just how to even start this. I was thinking of maybe looping through each row and somehow storing all combinations, and at the end we can print out the top n combinations and how many times they occurred in the dataframe. Below is what I have so far according to what I have in mind.
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
for index, row in df.iterrows():
(somehow get and store all possible 2 word combos?)
Upvotes: 2
Views: 57
Reputation: 4588
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
Upvotes: 1