Jojo
Jojo

Reputation: 1002

How to join or merge two dataframes based on different condition?

I want to merge or join two DataFrames based on different date. Join Completed date with any earlier Start date. I have the following dataframes:

df1:
Complted_date
2015
2017
2020


df2:
Start_date
2001
2010
2012
2015
2016
2017
2018
2019
2020
2021


And desired output is:


Complted_date   Start_date
2015            2001
2015            2010
2015            2012
2015            2015
2017            2001
2017            2010
2017            2012
2017            2015
2017            2016
2017            2017
2020            2001
2020            2010
2020            2012
2020            2015
2020            2016
2020            2017
2020            2018
2020            2019
2020            2020

I've tried but I'm not getting the output I want.

Thank you for your help!!

Upvotes: 1

Views: 91

Answers (4)

sammywemmy
sammywemmy

Reputation: 28729

You could use conditional_join from pyjanitor to get rows where compltd_date is >= start_date:

# pip install pyjanitor
import pandas as pd
import janitor

df1.conditional_join(df2, ('Complted_date', 'Start_date', '>='))
Out[1163]: 
            left      right
   Complted_date Start_date
0           2015       2001
1           2015       2010
2           2015       2012
3           2015       2015
4           2017       2001
5           2017       2010
6           2017       2012
7           2017       2015
8           2017       2016
9           2017       2017
10          2020       2001
11          2020       2010
12          2020       2012
13          2020       2015
14          2020       2016
15          2020       2017
16          2020       2018
17          2020       2019
18          2020       2020

Under the hood, it is just binary search (searchsorted) - the aim is to avoid a cartesian join, and hopefully, reduce memory usage.

Upvotes: 0

rhug123
rhug123

Reputation: 8778

Here is another way using pd.Series() and explode()

df1['Start_date'] = pd.Series([df2['Start_date'].tolist()])
df1['Start_date'] = df1['Start_date'].fillna(method='ffill')
df1.explode('Start_date').loc[lambda x: x['Complted_date'].ge(x['Start_date'])].reset_index(drop=True)

Upvotes: 0

Mayank Porwal
Mayank Porwal

Reputation: 34086

You can do cross-join and pick records which have Completed_date > Start_date:

Use df.merge with df.query:

In [101]: df1['tmp'] = 1

In [102]: df2['tmp'] = 1

In [107]: res = df1.merge(df2, how='outer').query("Complted_date >= Start_date").drop('tmp', 1)

In [108]: res
Out[108]: 
    Complted_date  Start_date
0            2015        2001
1            2015        2010
2            2015        2012
3            2015        2015
10           2017        2001
11           2017        2010
12           2017        2012
13           2017        2015
14           2017        2016
15           2017        2017
20           2020        2001
21           2020        2010
22           2020        2012
23           2020        2015
24           2020        2016
25           2020        2017
26           2020        2018
27           2020        2019
28           2020        2020

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150805

Check out merge, which gives you the expected output:

(df1.assign(key=1)
   .merge(df2.assign(key=1), on='key')
   .query('Complted_date>=Start_date')
   .drop('key', axis=1)
)

Output:

    Complted_date  Start_date
0            2015        2001
1            2015        2010
2            2015        2012
3            2015        2015
10           2017        2001
11           2017        2010
12           2017        2012
13           2017        2015
14           2017        2016
15           2017        2017
20           2020        2001
21           2020        2010
22           2020        2012
23           2020        2015
24           2020        2016
25           2020        2017
26           2020        2018
27           2020        2019
28           2020        2020

However, you might want to check out merge_asof:

pd.merge_asof(df2, df1, 
              right_on='Complted_date', 
              left_on='Start_date', 
              direction='forward')

Output:

   Start_date  Complted_date
0        2001         2015.0
1        2010         2015.0
2        2012         2015.0
3        2015         2015.0
4        2016         2017.0
5        2017         2017.0
6        2018         2020.0
7        2019         2020.0
8        2020         2020.0
9        2021            NaN

Upvotes: 3

Related Questions