Reputation: 13
I have two dataframes, a snippet looks like this:
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
The dataframes are NOT indexed on date (index is some other column). I want to merge the dataframes with rows where Date Values match in these two dataframes and add a new column based on date match:
df_FINAL['AVG_TEMP'] = (df1['TEMP'] + df2['TEMP']) / 2
So the resultant DataFrame should be like:
DAY TEMP DATE AVG_TEMP
0 MON 15 01/01/20 13.5
1 MON 15 02/01/20 14.0
2 MON 15 03/01/20 14.5
3 TUE 15 06/01/20 15.0
4 TUE 14 07/01/20 14.5
How to achieve this?
Upvotes: 1
Views: 647
Reputation: 19545
You can use pd.merge
on the DATE
and DAY
columns since the same dates will have the same day. Take the average of the TEMP_x
and TEMP_y
columns created from the merge and call it AVG_TEMP
, then drop the TEMP_x
and TEMP_y
columns.
import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
df_result = df1.merge(df2, on=["DATE","DAY"])
df_result['AVG_TEMP'] = (df_result['TEMP_x'] + df_result['TEMP_y']) / 2
df_result = df_result.drop(columns=['TEMP_x','TEMP_y'])
Output:
>>> df_result
DAY DATE AVG_TEMP
0 MON 01/01/20 13.5
1 MON 02/01/20 14.0
2 MON 03/01/20 14.5
3 TUE 06/01/20 15.0
4 TUE 07/01/20 14.5
Upvotes: 2
Reputation: 9051
Using pd.concat()
and df.groupby
df3 = pd.concat([df2, df1])
df3['AVG_TEMP'] = df3.groupby('DATE', as_index=False)['TEMP'].apply(lambda x: x.mean() if len(x) > 1 else None)
df3 = df3.groupby('DATE', as_index=False).first().dropna()
print(df3)
Output:
DATE DAY TEMP AVG_TEMP
0 01/01/20 MON 15 13.5
1 02/01/20 MON 15 14.0
2 03/01/20 MON 15 14.5
3 06/01/20 TUE 15 15.0
4 07/01/20 TUE 14 14.5
Upvotes: 0
Reputation: 8508
You can use merge command and accomplish all this using lambda function. I have also provided you a few alternate options so you know they are available for you.
import pandas as pd
year1 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[12, 13, 14, 15, 15, 18],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '08/01/20']}
df1 = pd.DataFrame(year1)
year2 = {'DAY':['MON', 'MON', 'MON', 'TUE', 'TUE', 'TUE'],
'TEMP':[15, 15, 15, 15, 14, 14],
'DATE':['01/01/20', '02/01/20', '03/01/20', '06/01/20', '07/01/20', '10/01/20']}
df2 = pd.DataFrame(year2)
#merge on inner join based on your example
#you can either use rename or suffix. here i am using suffix
#first suffix is stripped, second is _y which will be later dropped
#kept .rename command in case you want to try that option
answer for your problem starts here:
df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))
#.rename(columns={'DAY_x':'DAY','TEMP_x':'TEMP'})
.assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2))
#drop the _y columns as you don't need them
df_FINAL.drop(list(df_FINAL.filter(regex='_y$')), axis=1, inplace=True)
print(df_FINAL)
An alternate way to do this is to combine all these into a single command as follows:
#merge on inner join based on your example
#first suffix is stripped, second is _y which will be later dropped
#after the processing, filter out the column with _y
df_FINAL = (pd.merge(df2, df1, on = "DATE",how='inner',suffixes=('', '_y'))
.assign(AVG_TEMP = lambda x: (x['TEMP'] + x['TEMP_y'])/2)
.filter(regex='^(?!.*_y)'))
The final result looks like this:
DAY TEMP DATE AVG_TEMP
0 MON 15 01/01/20 13.5
1 MON 15 02/01/20 14.0
2 MON 15 03/01/20 14.5
3 TUE 15 06/01/20 15.0
4 TUE 14 07/01/20 14.5
Upvotes: 0
Reputation: 634
Call pd.merge()
on your two columns using inner join (value must be present in both df
to be present in result) to create an intermediate df
. Then create a new column that calculates the average
df3 = df1.merge(df2,on=['DATE','DAY'],how='inner')
df3['AVG_TEMP'] = (df3.TEMP_x + df3.TEMP_y)/2
df3.drop(['TEMP_x','TEMP_y'],inplace=True,axis=1)
Upvotes: 0