HaSH
HaSH

Reputation: 13

what would be the pandas equivalent of a complex SQL query (consisting of a subquery)

Can anyone tell me the pandas equivalent of the following SQL query. Consider sales_data as tablename/pd.DataFrame

SELECT store_id, store_name, sales FROM sales_data WHERE sales = (SELECT max(sales) FROM sales_data WHERE store_location = 'Beijing') and store_location = 'Beijing'

I know, we can achieve this in 2 steps:

1) df = sales_data[sales_data['store_location'] == 'Beijing'][['store_id', 'store_name', 'sales']]
2) df[df['sales'] == df['sales'].max()]

But can we achieve it in a single step?? Is there a way.

Upvotes: 1

Views: 708

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

I wouldn't try to do it in one step in this case, because it might be slower:

Demo:

Data:

In [103]: df
Out[103]:
   a  b  c  x
0  2  1  1  b
1  2  3  2  a
2  4  1  3  c
3  3  2  3  b
4  2  1  4  c
5  1  3  1  c
6  2  3  0  a
7  2  3  2  b
8  4  2  4  a
9  4  1  1  b

One-step solution:

In [104]: df.ix[(df.x == 'a') & (df.b == df.ix[df.x == 'a', 'b'].max())]
Out[104]:
   a  b  c  x
1  2  3  2  a
6  2  3  0  a

Timing comparison against 1M rows DF:

In [105]: big = pd.concat([df] * 10**5, ignore_index=True)

In [106]: big.shape
Out[106]: (1000000, 4)

In [111]: %%timeit
   .....: x = big.ix[big.x == 'a']
   .....: x.ix[x.b == x.b.max()]
   .....:
10 loops, best of 3: 189 ms per loop

In [112]: %timeit big.ix[(big.x == 'a') & (big.b == big.ix[big.x == 'a', 'b'].max())]
1 loop, best of 3: 321 ms per loop

Conclusion: your 2-step method is almost 2 times faster

OLD incorrect answer:

In [115]: df.ix[df.x == 'a'].nlargest(1, columns=['b'])
Out[115]:
   a  b  c  x
1  2  3  2  a

NOTE: this answer is incorrect because it will always return only one row, even if there will be multiple rows satisfying this condition: column = max(column)

Explanation:

In [114]: df.ix[df.x == 'a']
Out[114]:
   a  b  c  x
1  2  3  2  a
6  2  3  0  a
8  4  2  4  a

correct answer:

In [116]: df.ix[(df.x == 'a') & (df.b == df.ix[df.x == 'a', 'b'].max())]
Out[116]:
   a  b  c  x
1  2  3  2  a
6  2  3  0  a

Upvotes: 0

Related Questions