user8955502
user8955502

Reputation: 13

pandas dataframe update based on date values in 2 dataframes

I have two dataframes, a snippet looks like this:

year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

The dataframes are NOT indexed on date (index is some other column). I want to merge the dataframes with rows where Date Values match in these two dataframes and add a new column based on date match:

df_FINAL['AVG_TEMP'] = (df1['TEMP'] + df2['TEMP']) / 2

So the resultant DataFrame should be like:

   DAY  TEMP      DATE    AVG_TEMP
0  MON    15  01/01/20     13.5
1  MON    15  02/01/20     14.0
2  MON    15  03/01/20     14.5
3  TUE    15  06/01/20     15.0
4  TUE    14  07/01/20     14.5

How to achieve this?

Upvotes: 1

Views: 647

Answers (4)

Derek O
Derek O

Reputation: 19545

You can use pd.merge on the DATE and DAY columns since the same dates will have the same day. Take the average of the TEMP_x and TEMP_y columns created from the merge and call it AVG_TEMP, then drop the TEMP_x and TEMP_y columns.

import pandas as pd

year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

df_result = df1.merge(df2, on=["DATE","DAY"])
df_result['AVG_TEMP'] = (df_result['TEMP_x'] + df_result['TEMP_y']) / 2
df_result = df_result.drop(columns=['TEMP_x','TEMP_y'])

Output:

>>> df_result
   DAY      DATE  AVG_TEMP
0  MON  01/01/20      13.5
1  MON  02/01/20      14.0
2  MON  03/01/20      14.5
3  TUE  06/01/20      15.0
4  TUE  07/01/20      14.5

Upvotes: 2

deadshot
deadshot

Reputation: 9051

Using pd.concat() and df.groupby

df3 = pd.concat([df2, df1])
df3['AVG_TEMP'] = df3.groupby('DATE', as_index=False)['TEMP'].apply(lambda x: x.mean() if len(x) > 1 else None)
df3 = df3.groupby('DATE', as_index=False).first().dropna()

print(df3)

Output:

       DATE  DAY  TEMP  AVG_TEMP
0  01/01/20  MON    15      13.5
1  02/01/20  MON    15      14.0
2  03/01/20  MON    15      14.5
3  06/01/20  TUE    15      15.0
4  07/01/20  TUE    14      14.5

Upvotes: 0

Joe Ferndz
Joe Ferndz

Reputation: 8508

You can use merge command and accomplish all this using lambda function. I have also provided you a few alternate options so you know they are available for you.

import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[12, 13, 14, 15, 15, 18],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)

year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
    'TEMP':[15, 15, 15, 15, 14, 14],
    'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)

#merge on inner join based on your example
#you can either use rename or suffix. here i am using suffix
#first suffix is stripped, second is _y which will be later dropped
#kept .rename command in case you want to try that option

answer for your problem starts here:

df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))        
        #.rename(columns={'DAY_x':'DAY','TEMP_x':'TEMP'})
        .assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2))

#drop the _y columns as you don't need them
df_FINAL.drop(list(df_FINAL.filter(regex='_y$')), axis=1, inplace=True)

print(df_FINAL)

An alternate way to do this is to combine all these into a single command as follows:

#merge on inner join based on your example
#first suffix is stripped, second is _y which will be later dropped
#after the processing, filter out the column with _y

df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))        
        .assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2)
        .filter(regex='^(?!.*_y)'))

The final result looks like this:

   DAY  TEMP      DATE  AVG_TEMP
0  MON    15  01/01/20      13.5
1  MON    15  02/01/20      14.0
2  MON    15  03/01/20      14.5
3  TUE    15  06/01/20      15.0
4  TUE    14  07/01/20      14.5

Upvotes: 0

Jeff
Jeff

Reputation: 634

Call pd.merge() on your two columns using inner join (value must be present in both df to be present in result) to create an intermediate df. Then create a new column that calculates the average

df3 = df1.merge(df2,on=['DATE','DAY'],how='inner')
df3['AVG_TEMP'] = (df3.TEMP_x + df3.TEMP_y)/2
df3.drop(['TEMP_x','TEMP_y'],inplace=True,axis=1)

Upvotes: 0

Related Questions