Reputation: 123
I'm having a hell of a time trying to find an answer to this, which I feel like should be straight forward.
What is the Pandas equivalent of SELECT * FROM [TABLE] WHERE [COL1] IN (LIST)
?
Basically, I have a large dataset of ID numbers, and I want to select a certain subset of these customers. I want the entire dataframe's worth of values. I tried using .isin()
, but I do not want boolean values. Is there a simple way to use another Dataframe column/Series/list/tuple (whatever) to select rows in another Dataframe?
For example, I have two tables/DataFrames:
Let's call the first table cust_info:
cust_id | metadata1 |
---|---|
1234 | y |
5678 | x |
4321 | x |
2468 | y |
9513 | x |
1473 | y |
The second table will be called cust_orders
cust_id | order_date | order_total |
---|---|---|
1234 | 2021-07-10 | 15.10 |
5678 | 2021-07-11 | 480.52 |
4321 | 2021-07-11 | 72.71 |
2468 | 2021-07-12 | 50.96 |
9513 | 2021-07-12 | 69.20 |
1473 | 2021-07-13 | 75.58 |
I'd like to use customers who's metadata1 value == 'y' in cust_info. I've created a list (or Series, or Tuple, or Set, whatever I need) using DF1[DF1['metadata1'] == 'y']
and extracting the cust_id from that to create the list.
In this case, I want the entire row (including order_date and order_total) with cust_id == [1234, 2468, 1473]
So the SQL equivalent would be:
SELECT * FROM [cust_orders]
WHERE [cust_id] in
(
SELECT * FROM [cust_info] where [metadata1] = 'y'
)
I know I can slice using AND/OR and &/|, but I have hundreds of values so manually typing each ID is out of the question. How would you guys do this? I feel like I'm missing something super obvious...
Thanks!
Upvotes: 1
Views: 324
Reputation: 5331
Make a list of the customer IDs where cust_info['metadata1'] == 'y'
:
valids = cust_info.loc[cust_info['metadata1'] == 'y', 'cust_id']
Then ask whether the ID is in that list of valid IDs:
cust_orders[cust_orders['custid'].isin(valids)]
Upvotes: 3
Reputation: 195418
Try:
x = cust_info[cust_info["metadata1"].eq("y")].merge(cust_orders, on="cust_id")
print(x)
Prints:
cust_id metadata1 order_date order_total
0 1234 y 2021-07-10 15.10
1 2468 y 2021-07-12 50.96
2 1473 y 2021-07-13 75.58
If you want only columns from cust_orders
:
x = cust_info[cust_info["metadata1"].eq("y")].merge(cust_orders, on="cust_id")
print(x[cust_orders.columns])
cust_id order_date order_total
0 1234 2021-07-10 15.10
1 2468 2021-07-12 50.96
2 1473 2021-07-13 75.58
Upvotes: 1