Mcgroger
Mcgroger

Reputation: 67

Merging two dataframes based on condition in a third column



I want to merge two dataframes (df1 & df2) based on the column PERMNO resp. LPERMNO:
df1.merge(df2, left_on='PERMNO', right_on='LPERMNO')

However, when merging I also need to consider that the column date of df1 lies in between the range of the two columns LINKDT & LINKENDDT of df2, since df2 holds different information for each row (see column PRC of df2).

For example, index [0,1] of df1 should merge with index 0 of df2
and index [2,3] of df1 should merge with index 2 of df2, since the value of date is within the two columns LINKDT & LINKENDDT.

Unfortunately, I am bit lost here, so any help is appreciated!

df1

    PERMNO  date
0   66325   2006-03-30
1   66325   2006-06-30
2   66325   2015-09-30
3   66325   2015-12-30

df2:

    LPERMNO LINKDT      LINKENDDT   PRC
0   66325   1992-07-01  2014-04-30  10
1   66325   1983-09-23  1992-06-30  12
2   66325   2014-05-01  2019-12-31  8.5

UPDATE:

expected output

Output dataframe has shape 2'000'000x180.

    PERMNO  date        LPERMNO   LINKDT      LINKENDDT    PRC   
0   66325   2006-03-30  66325     1992-07-01  2014-04-30   10
1   66325   2006-06-30  66325     1992-07-01  2014-04-30   10
2   66325   2015-09-30  66325     2014-05-01  2019-12-31   8.5
3   66325   2015-12-30  66325     2014-05-01  2019-12-31   8.5

Upvotes: 0

Views: 496

Answers (2)

santiagoNublado
santiagoNublado

Reputation: 322

Try:

df_merged = df1.merge(df2, left_on=['PERMNO'], right_on=['LPERMNO'])

# if ['date', 'LINKDT', 'LINKENDDT'] already are in datetime format, ignore this three lines
df_merged['date'] = pd.to_datetime(df_merged['date'])
df_merged['LINKDT'] = pd.to_datetime(df_merged['LINKDT'])
df_merged['LINKENDDT'] = pd.to_datetime(df_merged['LINKENDDT'])

# Assuming LINKDT < LINKENDDT
df_merged = df_merged[(df_merged['date'] >= df_merged['LINKDT']) & (df_merged['date'] <= df_merged['LINKENDDT'])]

Output:

    PERMNO       date  LPERMNO     LINKDT  LINKENDDT   PRC
0    66325 2006-03-30    66325 1992-07-01 2014-04-30  10.0
3    66325 2006-06-30    66325 1992-07-01 2014-04-30  10.0
8    66325 2015-09-30    66325 2014-05-01 2019-12-31   8.5
11   66325 2015-12-30    66325 2014-05-01 2019-12-31   8.5

Upvotes: 2

Kiwi
Kiwi

Reputation: 81

I don't know if there's any simpler/faster/better way to do it, but this is what I'm suggesting you. The merging you are trying to get seems a little too complicated for .merge(), tried using conditions on columns instead.

permno = []
date = []
linkdt = []
linkenddt = []
prc = []
for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if (df1['PERMNO'][i]==df2['LPERMNO'][j]) & (df1['date'][i] > df2['LINKDT'][j]) & (df1['date'][i] < df2['LINKENDDT'][j]):
            permno.append(df1['PERMNO'][i])
            date.append(df1['date'][i])
            linkdt.append(df2['LINKDT'][j])
            linkenddt.append(df2['LINKENDDT'][j])
            prc.append(df2['PRC'][j])

merged = pd.DataFrame(list(zip(permno, linkdt, date, linkenddt, prc)), 
                       columns = ['PERMNO', 'LINKDT', 'LINKENDDT', 'PRC'])
      

Upvotes: 0

Related Questions