Reputation: 6564
I have the following dataframe:
import pandas as pd
array = {'id': [1, 1, 1, 2, 2, 2, 3, 3], 'A': [False, False, True, False, False, False, True, True],
'B': [False, True, True, False, True, False, False, False]}
df = pd.DataFrame(array)
df
I want to represent each id in one line. If all values of this id on a specific column are False, then its value should be False. If at least one is True - then it should be True. I have started with:
df.groupby(['id']).sum()
After which I will convert every value above 0 to 1. This works fine, but my original dataframe has 2,000,000 rows and 14,000 columns and therefore it takes days....
Any other quicker idea to do the task?
Upvotes: 3
Views: 318
Reputation: 10545
You can save the second step by taking the maximum instead of the sum over each group:
df.groupby(['id']).max()
You might expect that it should be faster to aggregate with any
, because then each sub-series would only have to be evaluated until the first True
is reached, but apparently the following is much slower (as per DYZ's comment):
df.groupby(['id']).agg(any)
So I would recommend testing it with NumPy's any
function, which might handle this better:
import numpy as np
df.groupby(['id']).agg(np.any)
Or you could do without agg
, as suggested by Henry Ecker:
df.groupby(['id']).any()
However, none of these methods is as fast as one might expect. So maybe your best bet is to convert the dataframe to a NumPy integer array and do the grouping in pure NumPy. See this question.
Upvotes: 3