gtomer

Reputation: 6564

How can I speed up groupby?

I have the following dataframe:

import pandas as pd
array = {'id': [1, 1, 1, 2, 2, 2, 3, 3], 'A': [False, False, True, False, False, False, True, True],
         'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)
df

I want to represent each id in one row: if all values for that id in a given column are False, the result should be False; if at least one is True, it should be True. I have started with:

df.groupby(['id']).sum()

After that I convert every value above 0 to 1. This works, but my original dataframe has 2,000,000 rows and 14,000 columns, so it takes days.
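For clarity, the full two-step version I'm describing looks roughly like this:

summed = df.groupby(['id']).sum()   # count of True values per id and column (the slow step)
result = (summed > 0).astype(int)   # convert every value above 0 to 1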

Is there a quicker way to do this?

Upvotes: 3

Views: 318

Answers (1)

Arne

Reputation: 10545

You can save the second step by taking the maximum instead of the sum over each group:

df.groupby(['id']).max()
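On the sample dataframe above, this already gives the desired result directly: True wherever at least one value in the group is True.

df.groupby(['id']).max()
#         A      B
# id
# 1    True   True
# 2   False   True
# 3    True  False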

You might expect that it should be faster to aggregate with any, because then each sub-series would only have to be evaluated until the first True is reached, but apparently the following is much slower (as per DYZ's comment):

df.groupby(['id']).agg(any)

So I would recommend testing it with NumPy's any function, which might handle this better:

import numpy as np
df.groupby(['id']).agg(np.any)

Or you could do without agg, as suggested by Henry Ecker:

df.groupby(['id']).any()

However, none of these methods is as fast as one might expect. So maybe your best bet is to convert the dataframe to a NumPy integer array and do the grouping in pure NumPy. See this question.
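As a starting point, a minimal sketch of that pure-NumPy route could look like the following (this assumes all value columns are boolean and that sorting by id is acceptable; the per-group maximum of 0/1 integers plays the role of a logical OR):

import numpy as np
import pandas as pd

ids = df['id'].to_numpy()
values = df.drop(columns='id').to_numpy(dtype=np.int8)   # boolean columns as 0/1 integers

# Sort rows so each id occupies a contiguous block
order = np.argsort(ids, kind='stable')
ids_sorted = ids[order]
values_sorted = values[order]

# Row indices where a new id block starts
group_starts = np.flatnonzero(np.r_[True, ids_sorted[1:] != ids_sorted[:-1]])

# Per-group maximum of 0/1 values == "any True in the group"
result = np.maximum.reduceat(values_sorted, group_starts, axis=0)

out = pd.DataFrame(result.astype(bool), index=ids_sorted[group_starts],
                   columns=df.columns.drop('id'))
out.index.name = 'id'

Whether this actually beats the built-in groupby is worth benchmarking on your real data.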

Upvotes: 3
