Reputation: 13
Can anyone tell me the pandas equivalent of the following SQL query? Consider sales_data as the table name / pd.DataFrame:
SELECT store_id, store_name, sales FROM sales_data WHERE sales = (SELECT max(sales) FROM sales_data WHERE store_location = 'Beijing') AND store_location = 'Beijing'
I know we can achieve this in two steps:
1) df = sales_data[sales_data['store_location'] == 'Beijing'][['store_id', 'store_name', 'sales']]
2) df[df['sales'] == df['sales'].max()]
But can we achieve it in a single step? Is there a way?
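For reference, a minimal runnable sketch of the two-step version, assuming a hypothetical sales_data DataFrame with the columns named in the query:

import pandas as pd

# hypothetical sample data; column names follow the SQL query above
sales_data = pd.DataFrame({
    'store_id': [1, 2, 3, 4],
    'store_name': ['A', 'B', 'C', 'D'],
    'store_location': ['Beijing', 'Beijing', 'Shanghai', 'Beijing'],
    'sales': [100, 250, 300, 250],
})

# step 1: filter to Beijing and keep the wanted columns
df = sales_data[sales_data['store_location'] == 'Beijing'][['store_id', 'store_name', 'sales']]
# step 2: keep the row(s) holding the maximum sales
print(df[df['sales'] == df['sales'].max()])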
Upvotes: 1
Views: 708
Reputation: 210832
I wouldn't try to do it in one step in this case, because it might be slower:
Demo:
Data:
In [103]: df
Out[103]:
a b c x
0 2 1 1 b
1 2 3 2 a
2 4 1 3 c
3 3 2 3 b
4 2 1 4 c
5 1 3 1 c
6 2 3 0 a
7 2 3 2 b
8 4 2 4 a
9 4 1 1 b
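The demo frame isn't constructed in the transcript; if you want to reproduce it, it can be rebuilt from the values shown above:

import pandas as pd

# rebuild the demo frame displayed in Out[103]
df = pd.DataFrame({
    'a': [2, 2, 4, 3, 2, 1, 2, 2, 4, 4],
    'b': [1, 3, 1, 2, 1, 3, 3, 3, 2, 1],
    'c': [1, 2, 3, 3, 4, 1, 0, 2, 4, 1],
    'x': list('bacbccabab'),
})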
One-step solution:
In [104]: df.loc[(df.x == 'a') & (df.b == df.loc[df.x == 'a', 'b'].max())]
Out[104]:
a b c x
1 2 3 2 a
6 2 3 0 a
Timing comparison against a 1M-row DataFrame:
In [105]: big = pd.concat([df] * 10**5, ignore_index=True)
In [106]: big.shape
Out[106]: (1000000, 4)
In [111]: %%timeit
.....: x = big.loc[big.x == 'a']
.....: x.loc[x.b == x.b.max()]
.....:
10 loops, best of 3: 189 ms per loop
In [112]: %timeit big.loc[(big.x == 'a') & (big.b == big.loc[big.x == 'a', 'b'].max())]
1 loop, best of 3: 321 ms per loop
Conclusion: your two-step method is almost twice as fast.
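If the goal is simply to have a single call, one option (a sketch; the function name is my own) is to wrap the faster two-step filter in a small helper:

def max_rows_in_group(frame, group_col, group_val, value_col):
    """Return the rows of one group that hold that group's maximum value."""
    subset = frame.loc[frame[group_col] == group_val]
    return subset.loc[subset[value_col] == subset[value_col].max()]

# e.g. max_rows_in_group(df, 'x', 'a', 'b') reproduces Out[104]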
OLD incorrect answer:
In [115]: df.loc[df.x == 'a'].nlargest(1, columns=['b'])
Out[115]:
a b c x
1 2 3 2 a
NOTE: this answer is incorrect because it will always return only one row, even if multiple rows satisfy the condition column == max(column).
Explanation:
In [114]: df.loc[df.x == 'a']
Out[114]:
a b c x
1 2 3 2 a
6 2 3 0 a
8 4 2 4 a
Correct answer:
In [116]: df.loc[(df.x == 'a') & (df.b == df.loc[df.x == 'a', 'b'].max())]
Out[116]:
a b c x
1 2 3 2 a
6 2 3 0 a
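For completeness, another single-expression variant (not from the original answer) computes the per-group maximum once with groupby/transform and then filters; the result for this data should match Out[116]:

# per-row maximum of b within each x group, aligned to the original index
mask = (df['x'] == 'a') & (df['b'] == df.groupby('x')['b'].transform('max'))
df[mask]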
Upvotes: 0