Mac
Mac

Reputation: 123

Pandas Equivalent of SQL's "Where In" with lists/tuples/iterables

I'm having a hell of a time trying to find an answer to this, which I feel like should be straight forward.

What is the Pandas equivalent of SELECT * FROM [TABLE] WHERE [COL1] IN (LIST)?

Basically, I have a large dataset of ID numbers, and I want to select a certain subset of these customers. I want the entire dataframe's worth of values. I tried using .isin(), but I do not want boolean values. Is there a simple way to use another Dataframe column/Series/list/tuple (whatever) to select rows in another Dataframe?

For example, I have two tables/DataFrames:

Let's call the first table cust_info:

cust_id metadata1
1234 y
5678 x
4321 x
2468 y
9513 x
1473 y

The second table will be called cust_orders

cust_id order_date order_total
1234 2021-07-10 15.10
5678 2021-07-11 480.52
4321 2021-07-11 72.71
2468 2021-07-12 50.96
9513 2021-07-12 69.20
1473 2021-07-13 75.58

I'd like to use customers who's metadata1 value == 'y' in cust_info. I've created a list (or Series, or Tuple, or Set, whatever I need) using DF1[DF1['metadata1'] == 'y'] and extracting the cust_id from that to create the list.

In this case, I want the entire row (including order_date and order_total) with cust_id == [1234, 2468, 1473]

So the SQL equivalent would be:

SELECT * FROM [cust_orders]
WHERE [cust_id] in 
(
SELECT * FROM [cust_info] where [metadata1] = 'y'
) 

I know I can slice using AND/OR and &/|, but I have hundreds of values so manually typing each ID is out of the question. How would you guys do this? I feel like I'm missing something super obvious...

Thanks!

Upvotes: 1

Views: 324

Answers (2)

ifly6
ifly6

Reputation: 5331

Make a list of the customer IDs where cust_info['metadata1'] == 'y':

valids = cust_info.loc[cust_info['metadata1'] == 'y', 'cust_id']

Then ask whether the ID is in that list of valid IDs:

cust_orders[cust_orders['custid'].isin(valids)]

Upvotes: 3

Andrej Kesely
Andrej Kesely

Reputation: 195418

Try:

x = cust_info[cust_info["metadata1"].eq("y")].merge(cust_orders, on="cust_id")
print(x)

Prints:

   cust_id metadata1  order_date  order_total
0     1234         y  2021-07-10        15.10
1     2468         y  2021-07-12        50.96
2     1473         y  2021-07-13        75.58

If you want only columns from cust_orders:

x = cust_info[cust_info["metadata1"].eq("y")].merge(cust_orders, on="cust_id")
print(x[cust_orders.columns])

   cust_id  order_date  order_total
0     1234  2021-07-10        15.10
1     2468  2021-07-12        50.96
2     1473  2021-07-13        75.58

Upvotes: 1

Related Questions