Reputation: 5907
i have a dataframe like
customer genre
0 cust1 |BIOPIC|DRAMA|
1 cust2 |COMEDY|DRAMA|ROMANCE|
2 cust1 |DRAMA|THRILLER|
3 cust3 |COMEDY|HORROR|
4 cust4 |HISTORY|ROMANCE|WAR|
5 cust3 |ADVENTURE|COMEDY|
6 cust2 |ACTION|DRAMA|THRILLER|
7 cust1 |CRIME|DRAMA|THRILLER|
8 cust3 |HISTORY|ROMANCE|WAR|
9 cust2 |ADVENTURE|COMEDY|
10 cust4 |BIOPIC|DRAMA|HISTORY|THRILLER|
I need = how many times each customer did transaction(customer count) and their respective genre count.Eg. cust1 DRAMA = 3, cust1 THRILLER = 2,like wise for each customer's.
I did found the each customer count by
df = df.groupby(['cust']).size()
then i know how to filter out the genres and getting the count if it was within a LIST , but i am getting confused with how to proceed with each group of customer and getting the count for each customer's indivisual genre count.
filtering(|) from genre expression and getting the fields out.
please suggest.
Upvotes: 0
Views: 74
Reputation: 186
The feature str.get_dummies
is perfect for this sort of thing! It works just like the dataframe version but on strings and allows you to specify a delimiter. Assuming your dataframe is named df
, then the below code does what you're after:
import pandas as pd
import numpy as np
df = pd.concat([df, df.Genres.str.get_dummies(sep='|')], axis=1)
df = df.groupby("Customers").aggregate(np.sum)
print(df)
output:
ACTION ADVENTURE BIOPIC COMEDY CRIME DRAMA HISTORY HORROR \
Customers
cust1 0 0 1 0 1 3 0 0
cust2 1 1 0 2 0 2 0 0
cust3 0 1 0 2 0 0 1 1
cust4 0 0 1 0 0 1 2 0
To explain a bit, str.get_dummies
method makes a new column for every value it sees in the column specified and then marks a 1 for the values present and a 0 elsewhere. The GroupBy and Aggregate methods make clusters according to the customers and add up the columns. Aggregate will silently drop columns which it can't add, in this case the original Genres
column.
Upvotes: 1