Reputation: 23
I have a large DataFrame (100,000+ rows) that looks like this:
userId conversion revenue channel
000001 0 0 Direct
000001 1 15 Organic Search
000001 0 0 Direct
000002 0 0 Social
000002 0 0 Direct
...
and I need to group multiple column values (conversion, revenue and channel) into lists, one per userId:
userId conversion revenue channel
000001 [0, 1, 0] [0, 15, 0] [Direct, Organic Search, Direct]
000002 [0, 0] [0, 0] [Social, Direct]
...
I currently make use of the pandas groupby and aggregate the values to a list, but this is quite slow (execution time around 30 seconds):
grouped = df.groupby(['userId']).agg({
    'channel': lambda x: x.tolist(),
    'conversion': lambda x: x.tolist(),
    'revenue': lambda x: x.tolist()
})
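For reference, here is a minimal runnable version of the above on a tiny made-up sample (standing in for the real 100,000+ row DataFrame), which reproduces the expected output:

```python
import pandas as pd

# Small made-up sample standing in for the real data
df = pd.DataFrame({
    'userId': ['000001', '000001', '000001', '000002', '000002'],
    'conversion': [0, 1, 0, 0, 0],
    'revenue': [0, 15, 0, 0, 0],
    'channel': ['Direct', 'Organic Search', 'Direct', 'Social', 'Direct'],
})

# Per-column Python lambdas run once per group, which is what makes
# this slow on a large frame
grouped = df.groupby('userId').agg({
    'channel': lambda x: x.tolist(),
    'conversion': lambda x: x.tolist(),
    'revenue': lambda x: x.tolist(),
})
print(grouped.loc['000001', 'channel'])  # ['Direct', 'Organic Search', 'Direct']
```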
Is there a faster (and better?) way to achieve this?
Edit:
I think I've found a better way; this cuts the runtime from roughly 30 s down to roughly 5 s, but I feel it can still be optimized further.
channel = df.groupby(['userId']).agg({'channel': '>'.join})['channel'].str.split('>')
conversion = df.groupby(['userId']).agg({'conversion': lambda x: '>'.join(map(str, x))})['conversion'].str.split('>')
revenue = df.groupby(['userId']).agg({'revenue': lambda x: '>'.join(map(str, x))})['revenue'].str.split('>')
g_channel_conv = pd.merge(channel, conversion, how='left', left_index=True, right_index=True)
grouped = pd.merge(g_channel_conv, revenue, how='left', left_index=True, right_index=True)
Upvotes: 2
Views: 152
Reputation: 862601
You can try:
grouped = (df.groupby('userId').agg({'channel': '>'.join,
                                     'conversion': lambda x: '>'.join(map(str, x)),
                                     'revenue': lambda x: '>'.join(map(str, x))}))
f = lambda x: x.split('>')
for c in grouped.columns:
    grouped[c] = grouped[c].apply(f)
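A quick check of this join-then-split version against the sample data from the question. One caveat worth knowing: after the split, every element is a string, so the numeric columns need a cast if you want the original int lists back:

```python
import pandas as pd

# Same made-up sample as in the question
df = pd.DataFrame({
    'userId': ['000001', '000001', '000001', '000002', '000002'],
    'conversion': [0, 1, 0, 0, 0],
    'revenue': [0, 15, 0, 0, 0],
    'channel': ['Direct', 'Organic Search', 'Direct', 'Social', 'Direct'],
})

# Join each group's values into one delimited string, then split back to lists
grouped = (df.groupby('userId').agg({'channel': '>'.join,
                                     'conversion': lambda x: '>'.join(map(str, x)),
                                     'revenue': lambda x: '>'.join(map(str, x))}))
f = lambda x: x.split('>')
for c in grouped.columns:
    grouped[c] = grouped[c].apply(f)

# The split leaves strings everywhere; cast the numeric columns back to int
for c in ('conversion', 'revenue'):
    grouped[c] = grouped[c].apply(lambda xs: [int(v) for v in xs])

print(grouped.loc['000002', 'conversion'])  # [0, 0]
```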
Upvotes: 1