Reputation: 87
I am working with temperature data and I have created a file that has multi-year averages of few thousand cities and the format is as below(df1
)
Date City PRCP TMAX TMIN TAVG
01-Jan Zurich 0.94 3.54 0.36 1.95
01-Feb Zurich 4.12 9.14 3.04 6.09
01-Mar Zurich 4.1 5.9 0.3 3.1
01-Apr Zurich 0.32 13.78 4.22 9
01-May Zurich 9.42 11.32 5.34 8.33
.
.
.....
I have the above data for all 365 days with no nulls. Notice that the date
column only has day
and month
because year is irrelevant.
Based on the above data I am trying to clean yearly files, my second dataframe has data in the below format(df2
)
ID Date City PRCP TAVG TMAX TMIN
abcd1 2020-01-01 Zurich 0 -1.9 -0.9
abcd1 2020-01-02 Zurich 9.1 12.7 4.9
abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
abcd1 2020-01-04 Zurich 0 4.1 10.8 -2.6
.
.
.....
Each city has a unique ID
. The date column has the format %y-%m-%d
.
I am trying to replace the nulls in the second dataframe with the values in my first dataframe by matching day
and month
. This is what I tried
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce') ##date format change##
df1["Date"] = df1['Date'].dt.strftime('%d-%m')
df2 = df2.drop(columns='ID')
df2 = df2.fillna(df1) ##To replace nulls##
df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce')
df1["Date"] = df1['Date'].dt.strftime('%Y-%m-%d') ## Change data back to original format##
Even with this I end up with nulls in my yearly file i.e. df2
{Note: df1 has no nulls}
Please suggest a better way to replace only nulls or any corrections to the code if necessary.
Upvotes: 1
Views: 236
Reputation: 23227
We can approach by adding a column Date2
onto df2
with the same format as the Date
column on df1
. Then, while setting both dataframes with this date format and City as index, we perform an update on df2 using .update()
, as follows:
df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b') # dd-MMM (e.g. 01-JAN)
df2a = df2.set_index(['Date2', 'City']) # Create df2a from df2 with set index on Date2 and City
df2a.update(df1.set_index(['Date', 'City']), overwrite=False) # update only NaN values of df2a by corresponding values of df1
df2 = df2a.reset_index(level=1).reset_index(drop=True) # result put back to df2 throwing away the temp `Date2` row index
df2.insert(2, 'City', df2.pop('City')) # relocate column City back to its original position
.update()
is to modify in place using non-NA values from another DataFrame. The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated. Hence, we make both dataframe with same row index so that updates will be performed on corresponding columns with same column index/labels.
Note that we use the parameter overwrite=False
in .update()
to ensure we only update values that are NaN in the original DataFrame df2
.
Data Setup:
Added data onto df1
to showcase replacing values of df2
from df1
:
print(df1)
Date City PRCP TMAX TMIN TAVG
0 01-Jan Zurich 0.94 3.54 0.36 1.95
1 02-Jan Zurich 0.95 3.55 0.37 1.96 <=== Added this row
2 01-Feb Zurich 4.12 9.14 3.04 6.09
3 01-Mar Zurich 4.10 5.90 0.30 3.10
4 01-Apr Zurich 0.32 13.78 4.22 9.00
5 01-May Zurich 9.42 11.32 5.34 8.33
print(df2) # before processing
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 NaN <=== with NaN value
1 abcd1 2020-01-02 Zurich 9.1 NaN 12.7 4.9 <=== with NaN value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.9
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.6
Run new codes:
df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b') # dd-MMM (e.g. 01-JAN)
df2a = df2.set_index(['Date2', 'City']) # Create df2a from df2 with set index on Date2 and City
df2a.update(df1.set_index(['Date', 'City']), overwrite=False) # update only NaN values of df2a by corresponding values of df1
df2 = df2a.reset_index(level=1).reset_index(drop=True) # result put back to df2 throwing away the temp `Date2` row index
df2.insert(2, 'City', df2.pop('City')) # relocate column City back to its original position
Result:
print(df2)
ID Date City PRCP TAVG TMAX TMIN
0 abcd1 2020-01-01 Zurich 0.0 -1.90 -0.9 0.36 <== TMIN updated with df1 value
1 abcd1 2020-01-02 Zurich 9.1 1.96 12.7 4.90 <== TAVG updated with df1 value
2 abcd1 2020-01-03 Zurich 0.8 8.55 13.2 3.90
3 abcd1 2020-01-04 Zurich 0.0 4.10 10.8 -2.60
Upvotes: 1