Reputation: 382
The title is confusing. Say I have a dataframe with one column, id, which occurs multiple times throughout the dataframe, and another column, let's call it cumulativeOccurrences. How do I select all unique occurrences of id such that the other column fulfills a certain condition, say cumulativeOccurrences > 20, for each and every instance of that id?
The beginning of the code is probably something like this:
dataframe.groupby('id')
But I can't figure out the rest.
Here is a small sample dataset that should return zero ids:
id cumulativeOccurrences
5494178 136
5494178 71
5494178 18
5494178 83
5494178 57
5494178 181
5494178 13
5494178 10
5494178 90
5494178 4484
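For reference, the sample can be rebuilt like this (a minimal sketch; the name df matches the code below). Note that three of the values, 18, 13 and 10, are not greater than 20, which is why no id should survive the filter:
import pandas as pd

# sample from the question: a single id whose group contains values <= 20
df = pd.DataFrame({'id': [5494178] * 10,
                   'cumulativeOccurrences': [136, 71, 18, 83, 57,
                                             181, 13, 10, 90, 4484]},
                  columns=['id', 'cumulativeOccurrences'])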
Okay, here is the result I got after more muddling around:
# aggregate each id's group to a single True/False: are all values > 20?
res = df[['id','cumulativeOccurrences']].groupby(['id']).agg(
    {'cumulativeOccurrences': [lambda x: all([e > 20 for e in x])]})
ids = res[res.cumulativeOccurrences['<lambda>'] == True].index
This gives me the ids which fulfill the condition. There is probably a better way than the lambda with a list comprehension inside agg, though. Any ideas?
Upvotes: 1
Views: 73
Reputation: 863156
First filter and then use DataFrameGroupBy.all:
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')
(Output from the answerer's own test data; with the question's sample, three values are not above 20, so the resulting index is empty.)
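If you also need the matching rows rather than just the ids, a transform-based variant should work too (a sketch, not from the original answer; it assumes a pandas version where transform accepts the 'all' reduction):
# broadcast each group's all() result back to every row of that group
mask = (df['cumulativeOccurrences'] > 20).groupby(df['id']).transform('all')
rows = df[mask]                # every row of every qualifying id
ids = rows['id'].unique()      # same ids as res.index[res]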
EDIT1: The first timings are for a non-sorted id, the second for a sorted one.
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000
# benchmark frame: 1000 distinct ids, values drawn from [19, 5000)
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 5000, size=N)},
                  columns=['id', 'cumulativeOccurrences'])
print (df.head())
In [125]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 1.22 s per loop
In [126]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.69 s per loop
In [128]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.63 s per loop
np.random.seed(123)
N = 10000000
# same frame as above, but sorted by id with a fresh unique index
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 5000, size=N)},
                  columns=['id', 'cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())
In [130]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 795 ms per loop
In [131]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.23 s per loop
In [132]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.15 s per loop
Conclusion - sorting by id and having a unique index can improve performance. The data was tested with pandas 0.20.3 under Python 3.
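As a quick sanity check (my own sketch, not part of the timings), the fast and the slow variant can be compared on the same frame:
# both approaches should flag exactly the same ids
fast = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
slow = df['cumulativeOccurrences'].groupby(df['id']).agg(
    lambda x: all(e > 20 for e in x))
assert fast.equals(slow.astype(bool))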
Upvotes: 2