William

Reputation: 4034

How to merge or combine one pandas DataFrame into another based on multiple conditions

I have 2 dataframes:

df1 and df2. df1 serves as a reference (lookup) table for df2: each row of df2 needs to be matched against the rows of df1, the matching value merged in, and a new df2 produced.

df1:

    RB  BeginDate   EndDate    Valindex0
0   00  19000100    19811231    45
1   00  19820100    19841299    47
2   00  19850100    20010699    50
3   00  20010700    99999999    39

df2:

    RB  IssueDate   gs
0   L3  19990201    8
1   00  19820101    G
2   48  19820101    G
3   50  19820101    G
4   50  19820101    G
5   00  19860101    G
6   52  19820101    G
7   53  19820101    G
8   00  19500201    G

How can I merge these 2 dataframes based on this condition:

if df1['BeginDate'] <= df2['IssueDate'] <= df1['EndDate'] and df1['RB']==df2['RB']:
    merge the value of df1['Valindex0'] to df2

Note that the final output merges df1 into df2, because df1 is just a reference/lookup file for df2: each row of df2 is matched against the rows of df1, and a new df2 is produced.

The output should look like:

df2:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   None
1   00  19820101    G   47    # df2['RB']==df1['RB'] and df2['IssueDate'] between df1['BeginDate'] and df1['EndDate'] of this row
2   48  19820101    G   None
3   50  19820101    G   None
4   50  19820101    G   None
5   00  19860101    G   50
6   52  19820101    G   None
7   53  19820101    G   None
8   00  19500201    G   45

I know one method to do this, but it is extremely slow, especially when df1 is long:

import numpy as np

conditions = []

for index, row in df1.iterrows():
    conditions.append((df2['IssueDate'] >= row['BeginDate']) &
                      (df2['IssueDate'] <= row['EndDate']) &
                      (df2['RB'] == row['RB']))

df2['Valindex0'] = np.select(conditions, df1['Valindex0'], default=None)

Any faster solution?

Upvotes: 1

Views: 97

Answers (2)

Vivek Kalyanarangan

Reputation: 9081

Use IntervalIndex -

idx = pd.IntervalIndex.from_arrays(df1['BeginDate'],df1['EndDate'],closed='both')
for x in df1['RB'].unique():
    mask = df2['RB']==x
    df2.loc[mask, 'Valindex0'] = df1.loc[idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'].values

Output

   RB  IssueDate gs  Valindex0
0  L3   19990201  8        NaN
1  00   19820101  G       47.0
2  48   19820101  G        NaN
3  50   19820101  G        NaN
4  50   19820101  G        NaN
5  00   19860101  G       50.0
6  52   19820101  G        NaN
7  53   19820101  G        NaN
8  00   19500201  G       45.0
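For reference, a self-contained sketch of this approach with the example frames built inline (dates kept as plain integers, as in the question):

```python
import pandas as pd

# Rebuild the example data from the question
df1 = pd.DataFrame({
    'RB': ['00', '00', '00', '00'],
    'BeginDate': [19000100, 19820100, 19850100, 20010700],
    'EndDate': [19811231, 19841299, 20010699, 99999999],
    'Valindex0': [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    'RB': ['L3', '00', '48', '50', '50', '00', '52', '53', '00'],
    'IssueDate': [19990201, 19820101, 19820101, 19820101,
                  19820101, 19860101, 19820101, 19820101, 19500201],
    'gs': ['8', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G'],
})

# One interval per df1 row; get_indexer maps each IssueDate to the
# interval (and hence df1 row) that contains it
idx = pd.IntervalIndex.from_arrays(df1['BeginDate'], df1['EndDate'], closed='both')
for x in df1['RB'].unique():
    mask = df2['RB'] == x
    df2.loc[mask, 'Valindex0'] = df1.loc[
        idx.get_indexer(df2.loc[mask, 'IssueDate']), 'Valindex0'
    ].values

print(df2)
```

Rows whose RB never appears in df1 are simply left unassigned and come out as NaN.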

Upvotes: 2

fthomson

Reputation: 789

If you don't care about the ordering of the records or the index, you could do this:

df1 = pd.read_clipboard()
df2 = pd.read_clipboard()

# read_clipboard parses '00' as the integer 0, so restore the string values
df1['RB'] = '00'

new_df = df2.merge(df1, how='outer', on='RB')

mask = ((new_df['BeginDate'] <= new_df['IssueDate']) & (new_df['IssueDate'] <= new_df['EndDate'])
       )| new_df['Valindex0'].isnull()

new_df.loc[mask, ['RB', 'IssueDate', 'gs', 'Valindex0']]

The idea is to do a full outer join first and then filter the dataset after.

Results:

    RB  IssueDate   gs  Valindex0
0   L3  19990201    8   NaN
2   00  19820101    G   47.0
7   00  19860101    G   50.0
9   00  19500201    G   45.0
13  48  19820101    G   NaN
14  50  19820101    G   NaN
15  50  19820101    G   NaN
16  52  19820101    G   NaN
17  53  19820101    G   NaN
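The same outer-join-then-filter idea as a self-contained sketch, without the clipboard step (the frames are rebuilt inline from the question's data):

```python
import pandas as pd

df1 = pd.DataFrame({
    'RB': ['00', '00', '00', '00'],
    'BeginDate': [19000100, 19820100, 19850100, 20010700],
    'EndDate': [19811231, 19841299, 20010699, 99999999],
    'Valindex0': [45, 47, 50, 39],
})
df2 = pd.DataFrame({
    'RB': ['L3', '00', '48', '50', '50', '00', '52', '53', '00'],
    'IssueDate': [19990201, 19820101, 19820101, 19820101,
                  19820101, 19860101, 19820101, 19820101, 19500201],
    'gs': ['8', 'G', 'G', 'G', 'G', 'G', 'G', 'G', 'G'],
})

# Full outer join on RB: every df2 row is paired with every matching df1 row;
# RBs absent from df1 get NaN in the df1 columns
new_df = df2.merge(df1, how='outer', on='RB')

# Keep pairs whose IssueDate falls inside the interval, plus the
# unmatched rows (Valindex0 is null there)
mask = ((new_df['BeginDate'] <= new_df['IssueDate'])
        & (new_df['IssueDate'] <= new_df['EndDate'])) | new_df['Valindex0'].isnull()

result = new_df.loc[mask, ['RB', 'IssueDate', 'gs', 'Valindex0']]
print(result)
```

Each '00' IssueDate falls in exactly one df1 interval, so the filter brings the row count back to that of df2.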

Upvotes: 1
