Reputation: 23
I have a large DataFrame (100,000+ rows) that looks like this:
userId conversion revenue channel
000001 0 0 Direct
000001 1 15 Organic Search
000001 0 0 Direct
000002 0 0 Social
000002 0 0 Direct
...
and I need to group multiple column values (conversion, revenue and channel) into lists, one per userId:
userId conversion revenue channel
000001 [0, 1, 0] [0, 15, 0] [Direct, Organic Search, Direct]
000002 [0, 0] [0, 0] [Social, Direct]
...
I currently make use of the pandas groupby and aggregate the values to a list, but this is quite slow (execution time around 30 seconds):
grouped = df.groupby(['userId']).agg({
    'channel': lambda x: x.tolist(),
    'conversion': lambda x: x.tolist(),
    'revenue': lambda x: x.tolist()
})
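For reference, here is a minimal runnable version of the above on a tiny made-up sample (standing in for the real 100,000+ row DataFrame), which reproduces the expected output:

```python
import pandas as pd

# Small made-up sample standing in for the real data
df = pd.DataFrame({
    'userId': ['000001', '000001', '000001', '000002', '000002'],
    'conversion': [0, 1, 0, 0, 0],
    'revenue': [0, 15, 0, 0, 0],
    'channel': ['Direct', 'Organic Search', 'Direct', 'Social', 'Direct'],
})

# Per-column Python lambdas run once per group, which is what makes
# this slow on a large frame
grouped = df.groupby('userId').agg({
    'channel': lambda x: x.tolist(),
    'conversion': lambda x: x.tolist(),
    'revenue': lambda x: x.tolist(),
})
print(grouped.loc['000001', 'channel'])  # ['Direct', 'Organic Search', 'Direct']
```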
Is there a faster (and better?) way to achieve this?
Edit:
I think I've found a better way; this cuts the runtime from roughly 30 s down to roughly 5 s, but I feel it can still be optimized further.
channel = df.groupby(['userId']).agg({'channel': '>'.join})['channel'].str.split('>')
conversion = df.groupby(['userId']).agg({'conversion': lambda x: '>'.join(map(str, x))})['conversion'].str.split('>')
revenue = df.groupby(['userId']).agg({'revenue': lambda x: '>'.join(map(str, x))})['revenue'].str.split('>')
g_channel_conv = pd.merge(channel, conversion, how='left', left_index=True, right_index=True)
grouped = pd.merge(g_channel_conv, revenue, how='left', left_index=True, right_index=True)
Upvotes: 2
Views: 152
Reputation: 862601
You can try:
grouped = (df.groupby('userId').agg({'channel': '>'.join,
                                     'conversion': lambda x: '>'.join(map(str, x)),
                                     'revenue': lambda x: '>'.join(map(str, x))}))
f = lambda x: x.split('>')
for c in grouped.columns:
    grouped[c] = grouped[c].apply(f)
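A quick check of this join-then-split version against the sample data from the question. One caveat worth knowing: after the split, every element is a string, so the numeric columns need a cast if you want the original int lists back:

```python
import pandas as pd

# Same made-up sample as in the question
df = pd.DataFrame({
    'userId': ['000001', '000001', '000001', '000002', '000002'],
    'conversion': [0, 1, 0, 0, 0],
    'revenue': [0, 15, 0, 0, 0],
    'channel': ['Direct', 'Organic Search', 'Direct', 'Social', 'Direct'],
})

# Join each group's values into one delimited string, then split back to lists
grouped = (df.groupby('userId').agg({'channel': '>'.join,
                                     'conversion': lambda x: '>'.join(map(str, x)),
                                     'revenue': lambda x: '>'.join(map(str, x))}))
f = lambda x: x.split('>')
for c in grouped.columns:
    grouped[c] = grouped[c].apply(f)

# The split leaves strings everywhere; cast the numeric columns back to int
for c in ('conversion', 'revenue'):
    grouped[c] = grouped[c].apply(lambda xs: [int(v) for v in xs])

print(grouped.loc['000002', 'conversion'])  # [0, 0]
```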
Upvotes: 1