Danish

Reputation: 2871

Replace a column of one DataFrame with a column from another DataFrame based on a date-range condition in pandas

I have two dataframes, df1 and df2, as shown below.

df1:

Date                t_factor     category
2020-02-01             5         A   
2020-02-02             2         B       
2020-02-03             1         C       
2020-02-04             2         A
2020-02-05             3         B
2020-02-06             3         C 
2020-02-07             3         A    
2020-02-08             9         B     
2020-02-09             1         C
2020-02-10             8         A
2020-02-11             3         B         
2020-02-12             3         C               

df2:

Date                  beta     
2020-02-01             100             
2020-02-02             230              
2020-02-03             150           
2020-02-04             100
2020-02-05             200  
2020-02-06             180          
2020-02-07             190            
2020-02-08             290 

From the above, I would like to replace the t_factor column of df1 with the beta column of df2, but only for rows whose Date falls within an input date range.

The function could look like this.

def replace_column(df1, df2, start_date='2020-02-03', end_date='2020-02-06'):
     df1 = df1.copy()
     df2 = df2.copy()
     df1 = df1.sort_values(['Date'], ascending=True)
     df2 = df2.sort_values(['Date'], ascending=True)
     df1['t_factor'] = df1['beta']  # for that date range
     return df1

Expected output for start_date = 2020-02-03 and end_date = 2020-02-06:

df1:

Date                t_factor   category
2020-02-01             5         A   
2020-02-02             2         B       
2020-02-03             150       C       
2020-02-04             100       A
2020-02-05             200       B
2020-02-06             180       C 
2020-02-07             3         A    
2020-02-08             9         B     
2020-02-09             1         C
2020-02-10             8         A
2020-02-11             3         B         
2020-02-12             3         C

Note: df2 covers fewer dates than df1; the last date in df2 is 2020-02-08.

If start_date = '2020-02-07' and end_date = '2020-02-11', then the expected output is:

Date                t_factor     category
2020-02-01             5         A   
2020-02-02             2         B       
2020-02-03             1         C       
2020-02-04             2         A
2020-02-05             3         B
2020-02-06             3         C 
2020-02-07             190       A    
2020-02-08             290       B     
2020-02-09             1         C
2020-02-10             8         A
2020-02-11             3         B         
2020-02-12             3         C   

print("df2 doesn't have data after 2020-02-08")

Upvotes: 1

Views: 72

Answers (2)

Shubham Sharma

Reputation: 71689

Use pd.to_datetime to convert the Date columns of both dataframes to pandas datetime series.

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

Then use Series.between with the start date (left) and end date (right) to create a boolean mask m. Use boolean indexing with this mask, together with Series.map, to map the beta values from df2 onto the t_factor values in df1. Dates that are missing from df2 keep their original t_factor.

# Both endpoints are included by default; inclusive=True is no longer accepted by recent pandas
m = df1['Date'].between('2020-02-03', '2020-02-06')
df1.loc[m, 't_factor'] = df1['Date'].map(df2.set_index('Date')['beta']).fillna(df1['t_factor'])
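As a self-contained illustration of the mask-and-map idea above, here is a minimal sketch on a trimmed-down copy of the question's data. The names beta_by_date, in_range and mapped are only illustrative, and the final astype call is an assumption made here to keep t_factor as integers; it is not part of the original answer.

```python
import pandas as pd

# Trimmed sample based on the question's data (category column omitted).
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-02", "2020-02-03", "2020-02-04", "2020-02-09"]),
    "t_factor": [2, 1, 2, 1],
})
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-02-03", "2020-02-04"]),
    "beta": [150, 100],
})

# Lookup Series: Date -> beta.
beta_by_date = df2.set_index("Date")["beta"]

# Rows of df1 inside the requested range (both endpoints included by default).
in_range = df1["Date"].between("2020-02-03", "2020-02-09")

# Map each in-range date to its beta; dates missing from df2 become NaN.
mapped = df1.loc[in_range, "Date"].map(beta_by_date)

# Fall back to the existing t_factor where df2 has no value, then restore the dtype.
filled = mapped.fillna(df1.loc[in_range, "t_factor"]).astype(df1["t_factor"].dtype)
df1.loc[in_range, "t_factor"] = filled

print(df1)
```

Here 2020-02-09 lies inside the range but is absent from df2, so it keeps its original t_factor.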

Another idea using DataFrame.merge:

df1 = df1.merge(df2, on='Date', how='left')
m = df1['Date'].between('2020-02-03', '2020-02-06')  # both endpoints included by default
df1.loc[m, 't_factor'] = df1.pop('beta').fillna(df1['t_factor'])

Result:

# start=2020-02-03, end=2020-02-06
         Date  t_factor category
0  2020-02-01       5.0        A
1  2020-02-02       2.0        B
2  2020-02-03     150.0        C
3  2020-02-04     100.0        A
4  2020-02-05     200.0        B
5  2020-02-06     180.0        C
6  2020-02-07       3.0        A
7  2020-02-08       9.0        B
8  2020-02-09       1.0        C
9  2020-02-10       8.0        A
10 2020-02-11       3.0        B
11 2020-02-12       3.0        C

# start=2020-02-07, end=2020-02-11
         Date  t_factor category
0  2020-02-01       5.0        A
1  2020-02-02       2.0        B
2  2020-02-03       1.0        C
3  2020-02-04       2.0        A
4  2020-02-05       3.0        B
5  2020-02-06       3.0        C
6  2020-02-07     190.0        A
7  2020-02-08     290.0        B
8  2020-02-09       1.0        C
9  2020-02-10       8.0        A
10 2020-02-11       3.0        B
11 2020-02-12       3.0        C

Function that wraps the merging method (Method 2):

def fx(df1, df2, start, end):
    # Warn when the requested range extends past the last date in df2
    if df2['Date'].max() < pd.Timestamp(end):
        print(f"we don't have data beyond {df2['Date'].max()}")

    df1 = df1.merge(df2, on='Date', how='left')
    # Both endpoints are included by default
    m = df1['Date'].between(start, end)
    df1.loc[m, 't_factor'] = df1.pop('beta').fillna(df1['t_factor'])
    return df1
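To check the merge-based wrapper against both date ranges from the question, the following sketch rebuilds the question's data and runs a comparable helper. The name replace_in_range is hypothetical, and the astype call is an added assumption to keep t_factor as integers; otherwise it follows the same merge, mask and fill steps.

```python
import pandas as pd

def replace_in_range(df1, df2, start, end):
    """Hypothetical helper: copy beta into t_factor for dates in [start, end]."""
    last = df2["Date"].max()
    if last < pd.Timestamp(end):
        print(f"df2 has no data after {last.date()}")

    out = df1.merge(df2, on="Date", how="left")
    in_range = out["Date"].between(start, end)  # both endpoints included
    beta = out.pop("beta")

    # Use beta where available inside the range, else keep the old t_factor.
    new_vals = beta[in_range].fillna(out["t_factor"]).astype(out["t_factor"].dtype)
    out.loc[in_range, "t_factor"] = new_vals
    return out

# Data rebuilt from the question.
dates = pd.date_range("2020-02-01", periods=12, freq="D")
df1 = pd.DataFrame({
    "Date": dates,
    "t_factor": [5, 2, 1, 2, 3, 3, 3, 9, 1, 8, 3, 3],
    "category": list("ABC" * 4),
})
df2 = pd.DataFrame({
    "Date": dates[:8],
    "beta": [100, 230, 150, 100, 200, 180, 190, 290],
})

res1 = replace_in_range(df1, df2, "2020-02-03", "2020-02-06")
res2 = replace_in_range(df1, df2, "2020-02-07", "2020-02-11")  # prints the notice
print(res1)
print(res2)
```

Because the helper works on the merged copy, the df1 passed in is left unchanged.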

Upvotes: 1

Luke

Reputation: 56

My solution uses DataFrame.join and DataFrame.loc.

First initialize the data.

df1 = pd.DataFrame({'Date' : ['2020-02-01', '2020-02-05', '2020-02-06', '2020-02-12'],'t_factor' : [5, 3, 3, 3]})
df2 = pd.DataFrame({'Date' : ['2020-02-05', '2020-02-06'],'beta' : [200, 180]})

Then set Date as index.

df1d = df1.set_index('Date')
df2d = df2.set_index('Date')

Now the key steps.

dfres = df1d.join(df2d)
has_beta = dfres['beta'].notnull()  # rows where df2 supplied a value
dfres.loc[has_beta, 't_factor'] = dfres.loc[has_beta, 'beta']

One more step to match the expected output.

output=dfres.drop(columns='beta')
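The steps above replace every date that df2 covers. If only a date range should be replaced, as the question asks, one possible variation is sketched below. It assumes the Date columns are converted with pd.to_datetime, that dates are unique, and that the index is sorted; the extra 2020-02-07 row in df2 and the names start, end, window and common are added here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2020-02-01", "2020-02-05", "2020-02-06", "2020-02-12"],
    "t_factor": [5, 3, 3, 3],
})
df2 = pd.DataFrame({
    "Date": ["2020-02-05", "2020-02-06", "2020-02-07"],
    "beta": [200, 180, 190],
})

# Convert to datetimes and index by Date; sorting enables label slicing.
df1d = df1.assign(Date=pd.to_datetime(df1["Date"])).set_index("Date").sort_index()
df2d = df2.assign(Date=pd.to_datetime(df2["Date"])).set_index("Date").sort_index()

start, end = "2020-02-06", "2020-02-12"

# Label slicing on a sorted DatetimeIndex includes both endpoints.
window = df2d.loc[start:end, "beta"]

# Only dates present in both frames can be replaced.
common = df1d.index.intersection(window.index)

output = df1d.copy()
output.loc[common, "t_factor"] = window.loc[common]

print(output)
```

With this range, only 2020-02-06 is replaced; 2020-02-05 is in df2 but outside the range, so it keeps its original value.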

Upvotes: 1
