Shaunak

Reputation: 717

Perform conditional filtering based on grouped column value in Pandas Python

I want to filter on the Sales column such that, within each Make group (across its Auction rows), at least one Sales value is >= 100. For Acura, Copart has 101 sales, so both Acura rows should appear in the output. For BMW, sales are < 100 for both Copart and IAA, so BMW is filtered out.

Dataframe:

Make    Auction Sales   
Acura   Copart  101
Acura   IAA     88  
BMW     Copart  50  
BMW     IAA     60
Buick   Copart  130 
Buick   IAA     140

Expected Output:

Make    Auction Sales   
Acura   Copart  101
Acura   IAA     88
Buick   Copart  130 
Buick   IAA     140

I am able to apply a filter of >= 100 to the entire Sales column, but that is not what I want, since it drops rows like Acura/IAA. Any suggestion on how to perform this? Thanks!
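For reference, the example frame from the question can be rebuilt like this (a minimal sketch assuming standard pandas, so the answers below are reproducible):

```python
import pandas as pd

# Reconstruct the example DataFrame shown in the question
df = pd.DataFrame({
    'Make':    ['Acura', 'Acura', 'BMW', 'BMW', 'Buick', 'Buick'],
    'Auction': ['Copart', 'IAA', 'Copart', 'IAA', 'Copart', 'IAA'],
    'Sales':   [101, 88, 50, 60, 130, 140],
})
```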

Upvotes: 2

Views: 166

Answers (3)

piRSquared

Reputation: 294348

Option 1
transform
Using 'any' returns True if any element within the group is True, and transform broadcasts that result back across all rows of the group.

df[df.Sales.ge(100).groupby([df.Make]).transform('any')]

    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140
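To see what transform('any') broadcasts, here is the intermediate boolean mask on its own (the frame is rebuilt inline so the snippet is self-contained):

```python
import pandas as pd

df = pd.DataFrame({
    'Make':    ['Acura', 'Acura', 'BMW', 'BMW', 'Buick', 'Buick'],
    'Auction': ['Copart', 'IAA', 'Copart', 'IAA', 'Copart', 'IAA'],
    'Sales':   [101, 88, 50, 60, 130, 140],
})

# Per-row test, then 'any' per Make group, broadcast back to every row
mask = df.Sales.ge(100).groupby(df.Make).transform('any')
print(mask.tolist())  # [True, True, False, False, True, True]
```

The BMW rows get False because no row in that group passes the >= 100 test; indexing df with this mask drops both of them at once.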

Option 2
pd.factorize + np.bincount
We use np.bincount to sum, per bin produced by pd.factorize, the truth values of df.Sales.values >= 100. If a bin's count is greater than 0, we keep every row in the group defined by that bin. We broadcast the per-group result back to rows by indexing with f.

This is very analogous to option 1.

f, u = pd.factorize(df.Make.values)
w = df.Sales.values >= 100
df[(np.bincount(f, w) > 0)[f]]

    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140
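The intermediate arrays make the mechanics concrete (frame rebuilt inline; the second positional argument of np.bincount is its weights parameter, so the True/False values are summed per bin):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Make':    ['Acura', 'Acura', 'BMW', 'BMW', 'Buick', 'Buick'],
    'Auction': ['Copart', 'IAA', 'Copart', 'IAA', 'Copart', 'IAA'],
    'Sales':   [101, 88, 50, 60, 130, 140],
})

f, u = pd.factorize(df.Make.values)  # f = [0, 0, 1, 1, 2, 2]; u = ['Acura', 'BMW', 'Buick']
w = df.Sales.values >= 100           # [True, False, False, False, True, True]
counts = np.bincount(f, w)           # per-group count of Sales >= 100: [1., 0., 2.]
keep = (counts > 0)[f]               # broadcast back to rows: [True, True, False, False, True, True]
print(df[keep])
```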

Upvotes: 2

jezrael

Reputation: 862841

Use filtering with GroupBy.filter:

df = df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
print (df)
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

Another solution uses loc and boolean indexing: select the Make values whose Sales are >= 100, then test membership with isin:

print (df.loc[df['Sales'] >= 100, 'Make'])
0    Acura
4    Buick
5    Buick
Name: Make, dtype: object

print (df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make']))
0     True
1     True
2    False
3    False
4     True
5     True
Name: Make, dtype: bool

df = df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
print (df)
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

The second solution is faster:

np.random.seed(123)
N = 1000000
L = list('abcdefghijklmno')
df = pd.DataFrame({'Make': np.random.choice(L, N),
                   'Sales':np.random.randint(110, size=N)})
print (df)

In [59]: %timeit df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
10 loops, best of 3: 55.6 ms per loop

#Alexander answer
In [60]: %timeit df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
10 loops, best of 3: 65 ms per loop

In [61]: %timeit df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
1 loop, best of 3: 217 ms per loop

#piRSquared solution 1
In [62]: %timeit df[df.Sales.ge(100).groupby([df.Make]).transform('any')]
1 loop, best of 3: 135 ms per loop

#piRSquared solution 2
In [63]: %%timeit
    ...: f, u = pd.factorize(df.Make.values)
    ...: w = df.Sales.values >= 100
    ...: df[(np.bincount(f, w) > 0)[f]]
    ...: 
10 loops, best of 3: 67.2 ms per loop

Upvotes: 2

Alexander

Reputation: 109546

Filter the dataframe for the records where Sales >= 100, then take the unique Make values. Finally, use boolean indexing to keep the rows whose Make is in that filtered set.

>>> df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

Upvotes: 2
