user3570615
user3570615

Reputation: 491

Join based on multiple complex conditions in Python

I am wondering if there is a way in Python (within or outside Pandas) to do the equivalent joining as we can do in SQL on two tables based on multiple complex conditions such as value in table 1 is more than 10 less than in table 2, or only on some field in table 1 satisfying some conditions, etc.

This is for combining some fundamental tables to achieve a joint table with more fields and information. I know in Pandas, we can merge two dataframes on some column names, but such a mechanism seems to be too simple to give the desired results.

For example, the equivalent SQL code could be like:

SELECT
a.*,
b.*
FROM Table1 AS a
JOIN Table 2 AS b
ON
a.id = b.id AND
a.sales - b.sales > 10 AND
a.country IN ('US', 'MX', 'GB', 'CA')

I would like an equivalent way to achieve the same joined table in Python on two data frames. Anyone can share insights?

Thanks!

Upvotes: 1

Views: 1246

Answers (1)

Ahmed Elsafty
Ahmed Elsafty

Reputation: 579

In principle, your query could be rewritten as a join and a filter where clause.

SELECT a.*, b.*
FROM Table1 AS a
JOIN Table2 AS b
ON a.id = b.id 
WHERE a.sales - b.sales > 10 AND a.country IN ('US', 'MX', 'GB', 'CA')

Assuming the dataframes are gigantic and you don't want a big intermediate table, we can filter Dataframe A first.

import pandas as pd

df_a, df_b = pd.Dataframe(...), pd.Dataframe(...)

# since A.country has nothing to do with the join, we can filter it first.
df_a = df_a[df_a["country"].isin(['US', 'MX', 'GB', 'CA'])]

# join 
merged = pd.merge(df_a, df_b, on='id', how='inner')

# filter
merged = merged[merged["sales_x"] - merged["sales_y"] > 10]

off-topic: depending on the use case, you may want to use abs() the difference.

Upvotes: 1

Related Questions