Reputation: 717
I want to perform filtering on Sales column such that for any Make-Auction group there should be atleast one sales is >= 100. So for Acura, Copart has 101 sales so both rows for Acura would be expected in output. For BMW, sales are <100 for both Copart and IAA, so it will be filtered out.
Dataframe:
Make Auction Sales
Acura Copart 101
Acura IAA 88
BMW Copart 50
BMW IAA 60
Buick Copart 130
Buick IAA 140
Expected Output:
Make Auction Sales
Acura Copart 101
Acura IAA 88
Buick Copart 130
Buick IAA 140
I am able to apply a filter of >100 on entire Sales column, but it is not what I wanted. Any suggestion on how to perform this? Thanks!
Upvotes: 2
Views: 166
Reputation: 294348
Option 1
transform
Using 'any'
returns True
if any element is True
within the group and broadcasts it across all indices in the group.
df[df.Sales.ge(100).groupby([df.Make]).transform('any')]
Make Auction Sales
0 Acura Copart 101
1 Acura IAA 88
4 Buick Copart 130
5 Buick IAA 140
Option 2
pd.factorize
+ np.bincount
We use np.bincount
to increment the bins from pd.factorize
with the truth values determined by df.Sales.values >= 100
. If the bin is greater than 0
then we should take every element in the group defined by that bin. We can get the appropriate array by slicing again by f
.
This is very analogous to option 1.
f, u = pd.factorize(df.Make.values)
w = df.Sales.values >= 100
df[(np.bincount(f, w) > 0)[f]]
Make Auction Sales
0 Acura Copart 101
1 Acura IAA 88
4 Buick Copart 130
5 Buick IAA 140
Upvotes: 2
Reputation: 862841
Use filtration:
df = df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
print (df)
Make Auction Sales
0 Acura Copart 101
1 Acura IAA 88
4 Buick Copart 130
5 Buick IAA 140
Another solution with loc
and boolean indexing
for Make
values which are filtered by isin
:
print (df.loc[df['Sales'] >= 100, 'Make'])
0 Acura
4 Buick
5 Buick
Name: Make, dtype: object
print (df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make']))
0 True
1 True
2 False
3 False
4 True
5 True
Name: Make, dtype: bool
df = df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
print (df)
Make Auction Sales
0 Acura Copart 101
1 Acura IAA 88
4 Buick Copart 130
5 Buick IAA 140
Second solution is faster:
np.random.seed(123)
N = 1000000
L = list('abcdefghijklmno')
df = pd.DataFrame({'Make': np.random.choice(L, N),
'Sales':np.random.randint(110, size=N)})
print (df)
In [59]: %timeit df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
10 loops, best of 3: 55.6 ms per loop
#Alexander answer
In [60]: %timeit df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
10 loops, best of 3: 65 ms per loop
In [61]: %timeit df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
1 loop, best of 3: 217 ms per loop
#piRSquared solution 1
In [62]: %timeit df[df.Sales.ge(100).groupby([df.Make]).transform('any')]
1 loop, best of 3: 135 ms per loop
#piRSquared solution 2
In [63]: %%timeit
...: f, u = pd.factorize(df.Make.values)
...: w = df.Sales.values >= 100
...: df[(np.bincount(f, w) > 0)[f]]
...:
10 loops, best of 3: 67.2 ms per loop
Upvotes: 2
Reputation: 109546
Filter the dataframe for the records where sales >= 100, then take the unique Make
of cars. Finally, use boolean indexing if any make is in this filtered set.
>>> df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
Make Auction Sales
0 Acura Copart 101
1 Acura IAA 88
4 Buick Copart 130
5 Buick IAA 140
Upvotes: 2