Reputation: 67
I want to merge two dataframes (df1
& df2
) based on the column PERMNO
resp. LPERMNO
:
df1.merge(df2, left_on='PERMNO', right_on='LPERMNO')
However, when merging I also need to consider that the column date
of df1
lies in between the range of the two columns LINKDT
& LINKENDDT
of df2
, since df2
holds different information for each row (see column PRC
of df2
).
For example, index [0,1]
of df1
should merge with index 0
of df2
and index [2,3]
of df1
should merge with index 2
of df2
, since the value of date
is within the two columns LINKDT
& LINKENDDT
.
Unfortunately, I am bit lost here, so any help is appreciated!
df1
PERMNO date
0 66325 2006-03-30
1 66325 2006-06-30
2 66325 2015-09-30
3 66325 2015-12-30
df2:
LPERMNO LINKDT LINKENDDT PRC
0 66325 1992-07-01 2014-04-30 10
1 66325 1983-09-23 1992-06-30 12
2 66325 2014-05-01 2019-12-31 8.5
UPDATE:
expected output
Output dataframe has shape 2'000'000x180.
PERMNO date LPERMNO LINKDT LINKENDDT PRC
0 66325 2006-03-30 66325 1992-07-01 2014-04-30 10
1 66325 2006-06-30 66325 1992-07-01 2014-04-30 10
2 66325 2015-09-30 66325 2014-05-01 2019-12-31 8.5
3 66325 2015-12-30 66325 2014-05-01 2019-12-31 8.5
Upvotes: 0
Views: 496
Reputation: 322
Try:
df_merged = df1.merge(df2, left_on=['PERMNO'], right_on=['LPERMNO'])
# if ['date', 'LINKDT', 'LINKENDDT'] already are in datetime format, ignore this three lines
df_merged['date'] = pd.to_datetime(df_merged['date'])
df_merged['LINKDT'] = pd.to_datetime(df_merged['LINKDT'])
df_merged['LINKENDDT'] = pd.to_datetime(df_merged['LINKENDDT'])
# Assuming LINKDT < LINKENDDT
df_merged = df_merged[(df_merged['date'] >= df_merged['LINKDT']) & (df_merged['date'] <= df_merged['LINKENDDT'])]
Output:
PERMNO date LPERMNO LINKDT LINKENDDT PRC
0 66325 2006-03-30 66325 1992-07-01 2014-04-30 10.0
3 66325 2006-06-30 66325 1992-07-01 2014-04-30 10.0
8 66325 2015-09-30 66325 2014-05-01 2019-12-31 8.5
11 66325 2015-12-30 66325 2014-05-01 2019-12-31 8.5
Upvotes: 2
Reputation: 81
I don't know if there's any simpler/faster/better way to do it, but this is what I'm suggesting you. The merging you are trying to get seems a little too complicated for .merge(), tried using conditions on columns instead.
permno = []
date = []
linkdt = []
linkenddt = []
prc = []
for i in range(0, df1.shape[0]):
for j in range(0, df2.shape[0]):
if (df1['PERMNO'][i]==df2['LPERMNO'][j]) & (df1['date'][i] > df2['LINKDT'][j]) & (df1['date'][i] < df2['LINKENDDT'][j]):
permno.append(df1['PERMNO'][i])
date.append(df1['date'][i])
linkdt.append(df2['LINKDT'][j])
linkenddt.append(df2['LINKENDDT'][j])
prc.append(df2['PRC'][j])
merged = pd.DataFrame(list(zip(permno, linkdt, date, linkenddt, prc)),
columns = ['PERMNO', 'LINKDT', 'LINKENDDT', 'PRC'])
Upvotes: 0