RoshADM
RoshADM

Reputation: 87

Time Series: Fill NaNs from another dataframe

I am working with temperature data and I have created a file that has multi-year averages of few thousand cities and the format is as below(df1)

Date    City    PRCP    TMAX    TMIN    TAVG
01-Jan  Zurich  0.94    3.54    0.36    1.95
01-Feb  Zurich  4.12    9.14    3.04    6.09
01-Mar  Zurich  4.1     5.9     0.3     3.1
01-Apr  Zurich  0.32    13.78   4.22    9
01-May  Zurich  9.42    11.32   5.34    8.33
.
.
.....

I have the above data for all 365 days with no nulls. Notice that the date column only has day and month because year is irrelevant.

Based on the above data I am trying to clean yearly files, my second dataframe has data in the below format(df2)

ID      Date        City    PRCP    TAVG    TMAX    TMIN
abcd1   2020-01-01  Zurich  0       -1.9    -0.9    
abcd1   2020-01-02  Zurich  9.1             12.7    4.9
abcd1   2020-01-03  Zurich  0.8      8.55   13.2    3.9
abcd1   2020-01-04  Zurich  0        4.1    10.8    -2.6

.
.
.....

Each city has a unique ID. The date column has the format %y-%m-%d.

I am trying to replace the nulls in the second dataframe with the values in my first dataframe by matching day and month. This is what I tried

df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce')   ##date format change##
df1["Date"] = df1['Date'].dt.strftime('%d-%m')
df2 = df2.drop(columns='ID')

df2 = df2.fillna(df1)         ##To replace nulls##

df1["Date"] = pd.to_datetime(df1["Date"], errors = 'coerce')
df1["Date"] = df1['Date'].dt.strftime('%Y-%m-%d')      ## Change data back to original format##

Even with this I end up with nulls in my yearly file i.e. df2{Note: df1 has no nulls}

Please suggest a better way to replace only nulls or any corrections to the code if necessary.

Upvotes: 1

Views: 236

Answers (1)

SeaBean
SeaBean

Reputation: 23227

We can approach by adding a column Date2 onto df2 with the same format as the Date column on df1. Then, while setting both dataframes with this date format and City as index, we perform an update on df2 using .update(), as follows:

df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b')          #  dd-MMM (e.g. 01-JAN)

df2a = df2.set_index(['Date2', 'City'])        # Create df2a from df2 with set index on Date2 and City

df2a.update(df1.set_index(['Date', 'City']), overwrite=False)   # update only NaN values of df2a by corresponding values of df1

df2 = df2a.reset_index(level=1).reset_index(drop=True)    # result put back to df2 throwing away the temp `Date2` row index

df2.insert(2, 'City', df2.pop('City'))    # relocate column City back to its original position

.update() is to modify in place using non-NA values from another DataFrame. The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated. Hence, we make both dataframe with same row index so that updates will be performed on corresponding columns with same column index/labels.

Note that we use the parameter overwrite=False in .update() to ensure we only update values that are NaN in the original DataFrame df2.

Demo

Data Setup:

Added data onto df1 to showcase replacing values of df2 from df1:

print(df1)

     Date    City  PRCP   TMAX  TMIN  TAVG
0  01-Jan  Zurich  0.94   3.54  0.36  1.95
1  02-Jan  Zurich  0.95   3.55  0.37  1.96       <=== Added this row
2  01-Feb  Zurich  4.12   9.14  3.04  6.09
3  01-Mar  Zurich  4.10   5.90  0.30  3.10
4  01-Apr  Zurich  0.32  13.78  4.22  9.00
5  01-May  Zurich  9.42  11.32  5.34  8.33

print(df2)       #  before processing

      ID        Date    City  PRCP  TAVG  TMAX  TMIN
0  abcd1  2020-01-01  Zurich   0.0 -1.90  -0.9   NaN         <=== with NaN value
1  abcd1  2020-01-02  Zurich   9.1   NaN  12.7   4.9         <=== with NaN value
2  abcd1  2020-01-03  Zurich   0.8  8.55  13.2   3.9
3  abcd1  2020-01-04  Zurich   0.0  4.10  10.8  -2.6

Run new codes:

df2["Date2"] = pd.to_datetime(df2["Date"], errors = 'coerce').dt.strftime('%d-%b')          #  dd-MMM (e.g. 01-JAN)

df2a = df2.set_index(['Date2', 'City'])        # Create df2a from df2 with set index on Date2 and City

df2a.update(df1.set_index(['Date', 'City']), overwrite=False)   # update only NaN values of df2a by corresponding values of df1

df2 = df2a.reset_index(level=1).reset_index(drop=True)    # result put back to df2 throwing away the temp `Date2` row index

df2.insert(2, 'City', df2.pop('City'))    # relocate column City back to its original position

Result:

print(df2)


      ID       Date    City  PRCP  TAVG  TMAX  TMIN
0  abcd1 2020-01-01  Zurich   0.0 -1.90  -0.9  0.36     <== TMIN updated with df1 value
1  abcd1 2020-01-02  Zurich   9.1  1.96  12.7  4.90     <== TAVG updated with df1 value
2  abcd1 2020-01-03  Zurich   0.8  8.55  13.2  3.90
3  abcd1 2020-01-04  Zurich   0.0  4.10  10.8 -2.60

Upvotes: 1

Related Questions