Reputation: 4034
I have two dataframes, df1 and df2. df1 serves as a reference (lookup) table for df2: each row of df2 must be matched against the rows of df1, the matching value merged in, and a new df2 produced.
df1:
RB BeginDate EndDate Valindex0
0 00 19000100 19811231 45
1 00 19820100 19841299 47
2 00 19850100 20010699 50
3 00 20010700 99999999 39
df2:
RB IssueDate gs
0 L3 19990201 8
1 00 19820101 G
2 48 19820101 G
3 50 19820101 G
4 50 19820101 G
5 00 19860101 G
6 52 19820101 G
7 53 19820101 G
8 00 19500201 G
How can I merge these two dataframes based on this condition:
if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB'] == df2['RB']:
    merge the value of df1['Valindex0'] into df2
Note that the final output is df1 merged into df2, because df1 is just a reference (lookup) table for df2: each row of df2 is matched against the rows of df1, and the result is a new df2.
The output should look like:
df2:
RB IssueDate gs Valindex0
0 L3 19990201 8 None
1 00 19820101 G 47 # df2['RB']==df1['RB'] and df2['IssueDate'] falls between df1['BeginDate'] and df1['EndDate'] of this row
2 48 19820101 G None
3 50 19820101 G None
4 50 19820101 G None
5 00 19860101 G 50
6 52 19820101 G None
7 53 19820101 G None
8 00 19500201 G 45
I know one method to do this, but it is extremely slow, especially when df1 is long:
import numpy as np

conditions = []
for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))
df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)
Any faster solution?
Upvotes: 1
Views: 97
Reputation: 9081
Use IntervalIndex:
# One interval per df1 row; closed='both' makes both bounds inclusive
idx = pd.IntervalIndex.from_arrays(df1['BeginDate'], df1['EndDate'], closed='both')
for x in df1['RB'].unique():
    mask = df2['RB'] == x
    # get_indexer maps each IssueDate to the position of the interval containing it
    df2.loc[mask, 'Valindex0'] = df1.loc[idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'].values
Output
RB IssueDate gs Valindex0
0 L3 19990201 8 NaN
1 00 19820101 G 47.0
2 48 19820101 G NaN
3 50 19820101 G NaN
4 50 19820101 G NaN
5 00 19860101 G 50.0
6 52 19820101 G NaN
7 53 19820101 G NaN
8 00 19500201 G 45.0
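As a toy illustration of how the IntervalIndex lookup behaves (the arrays here are made up, not the question's data):

```python
import pandas as pd

# Two inclusive intervals: [0, 9] and [10, 19]
idx = pd.IntervalIndex.from_arrays([0, 10], [9, 19], closed='both')

# get_indexer returns the position of the interval containing each value,
# or -1 when a value falls inside no interval
positions = idx.get_indexer([5, 15, 25])  # positions 0, 1, and -1
```

Note that a -1 from get_indexer would break the df1.loc lookup above, so the approach assumes every IssueDate for a given RB falls inside one of that RB's intervals, as in the sample data.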
Upvotes: 2
Reputation: 789
If you don't care about the ordering of the records or the index, you could do this:
df1 = pd.read_clipboard()
df2 = pd.read_clipboard()
# pandas wants to cast this column to int, so force it back to string
df1['RB'] = '00'
new_df = df2.merge(df1, how='outer', on='RB')
mask = ((new_df['BeginDate'] <= new_df['IssueDate']) &
        (new_df['IssueDate'] <= new_df['EndDate'])) | new_df['Valindex0'].isnull()
new_df[['RB','IssueDate', 'gs', 'Valindex0']][mask]
The idea is to do a full outer join first and then filter the dataset afterwards.
Results:
RB IssueDate gs Valindex0
0 L3 19990201 8 NaN
2 00 19820101 G 47.0
7 00 19860101 G 50.0
9 00 19500201 G 45.0
13 48 19820101 G NaN
14 50 19820101 G NaN
15 50 19820101 G NaN
16 52 19820101 G NaN
17 53 19820101 G NaN
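For completeness, here is a self-contained sketch of the same merge-then-filter idea, building the sample frames directly (values copied from the question) instead of via read_clipboard:

```python
import pandas as pd

df1 = pd.DataFrame({'RB': ['00'] * 4,
                    'BeginDate': [19000100, 19820100, 19850100, 20010700],
                    'EndDate': [19811231, 19841299, 20010699, 99999999],
                    'Valindex0': [45, 47, 50, 39]})
df2 = pd.DataFrame({'RB': ['L3', '00', '48', '50', '50', '00', '52', '53', '00'],
                    'IssueDate': [19990201, 19820101, 19820101, 19820101, 19820101,
                                  19860101, 19820101, 19820101, 19500201],
                    'gs': ['8', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G']})

# Full outer join on RB, then keep rows whose date falls inside the interval,
# plus rows that matched no df1 record at all (Valindex0 is null for those)
new_df = df2.merge(df1, how='outer', on='RB')
mask = ((new_df['BeginDate'] <= new_df['IssueDate']) &
        (new_df['IssueDate'] <= new_df['EndDate'])) | new_df['Valindex0'].isnull()
out = new_df.loc[mask, ['RB', 'IssueDate', 'gs', 'Valindex0']]
```

Each '00' row of df2 ends up with the Valindex0 of the one interval its IssueDate falls into, and the other rows keep NaN.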
Upvotes: 1