Jeremy Schutte

Reputation: 382

How to select row values in pandas from one column based on these row values satisfying some condition in another column everywhere they appear

The title is confusing.

So, say I have a dataframe with one column, id, whose values occur multiple times throughout the dataframe. Then I have another column, let's call it cumulativeOccurrences.

How do I select all unique values of id such that the other column satisfies a certain condition, say cumulativeOccurrences > 20, for each and every instance of that id?

The beginning of the code is probably something like this:

dataframe.groupby('id')

But I can't figure out the rest.

Here is a small sample dataset that should return no ids:

id            cumulativeOccurrences
5494178       136
5494178        71
5494178        18
5494178        83
5494178        57
5494178       181
5494178        13
5494178        10
5494178        90
5494178      4484
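
For reproducibility, the sample above can be built directly. A minimal sketch, assuming pandas is imported as pd:

import pandas as pd

# Sample data from the question: a single id whose rows include
# values below 20, so no id should satisfy the > 20 condition.
df = pd.DataFrame({'id': [5494178] * 10,
                   'cumulativeOccurrences': [136, 71, 18, 83, 57, 181, 13, 10, 90, 4484]},
                  columns=['id', 'cumulativeOccurrences'])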

Okay, here is the result I got after more muddling around:

res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
ids = res[res.cumulativeOccurrences['<lambda>']==True].index

This gives me an index of the ids that fulfill the condition. There is probably a better way than the list-comprehension lambda inside agg, though. Any ideas?
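
As an aside (a sketch, not from the original post): requiring every value in a group to exceed 20 is equivalent to requiring the group minimum to exceed 20, so the lambda can be replaced with a built-in reduction:

# Group minimum > 20 means every row in the group is > 20.
res = df.groupby('id')['cumulativeOccurrences'].min() > 20
ids = res.index[res]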

Upvotes: 1

Views: 73

Answers (1)

jezrael

Reputation: 863156

First build a boolean mask with the comparison, then use SeriesGroupBy.all:

# Per-row boolean mask, grouped by id; all() is True only for
# ids where every single row exceeds 20.
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')

(Note: with the question's sample data this index is empty, since id 5494178 has several rows below 20.)
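
If the matching rows themselves are needed rather than just the ids, a related sketch (assuming the same df) uses DataFrameGroupBy.filter:

# Keep only the groups in which every row exceeds the threshold.
rows = df.groupby('id').filter(lambda g: (g['cumulativeOccurrences'] > 20).all())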

EDIT1:

The first set of timings is for unsorted id, the second for sorted id.

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000

# 10M rows, 1000 distinct ids, values drawn from [19, 5000)
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
                   columns=['id','cumulativeOccurrences'])
print (df.head())
In [125]: %%timeit
     ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 1.22 s per loop

In [126]: %%timeit
     ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
     ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
     ...: 
1 loop, best of 3: 3.69 s per loop

In [128]: %%timeit
     ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 3.63 s per loop

np.random.seed(123)
N = 10000000

# Same data, but sorted by id with a fresh, unique RangeIndex
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
                   'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
                   columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())
In [130]: %%timeit
     ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 795 ms per loop

In [131]: %%timeit
     ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
     ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
     ...: 
1 loop, best of 3: 3.23 s per loop

In [132]: %%timeit
     ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
     ...: ids = res.index[res]
     ...: 
1 loop, best of 3: 3.15 s per loop

Conclusion: sorting by id (and resetting to a unique index) can improve performance. The data was tested on pandas 0.20.3 under Python 3.
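
A further variant that could be timed the same way (a sketch, not part of the original benchmarks, and assuming a recent pandas where transform accepts 'all'): broadcasting the group result back to the rows gives a row mask directly.

# Row mask: True for rows whose entire id group exceeds 20.
mask = (df['cumulativeOccurrences'] > 20).groupby(df['id']).transform('all')
rows = df[mask]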

Upvotes: 2
