Reputation: 523
I have data from two weather stations at the same location. One station was installed in the 1980s and the other in the mid-1990s. Because of instrument error, the readings from the old station became unreliable, and the new station also has several missing records.
I plan to build a complete weather profile by using the old station's data for the period before the new station was installed and appending the new station's data after that. I also want to fall back to the old station's data wherever the new station has an instrument failure (NaN).
import numpy as np
import pandas as pd
df_new = pd.DataFrame(
{'Date': {0: '01/01/1994', 1: '01/02/1994', 2: '01/03/1994', 3: '01/04/1994'},
'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
'TMAX': {0: -5.5, 1: np.nan, 2: -1.5, 3: np.nan},
'TMIN': {0: -11.64, 1: -10.55, 2: np.nan, 3: -11.41},
'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})
df_old = pd.DataFrame(
{'Date': {0: '01/01/1980', 1: '01/02/1980', 2: '01/03/1980', 3: '01/04/1980'},
'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
'TMAX': {0: -5.5, 1: -3.5, 2: -1.5, 3: -2.8},
'TMIN': {0: -11.64, 1: -10.55, 2: -14.33, 3: -11.41},
'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})
How can I combine the two DataFrames into a new one (df_complete) like this?
df_complete
Date Rain TMAX TMIN WIND
01/01/1980 0 -5.5 -11.64 4.1
01/02/1980 0 -3.5 -10.55 6.8
01/03/1980 0 -1.5 -14.33 5.4
.
.
01/01/1994 0 -5.5 -11.64 4.1
01/02/1994 0 5.7 -10.55 6.8
01/03/1994 0 -1.5 -10.58 5.4
.
.
12/31/2014 0 -4.9 -10.21 3.5
To be clear, the NaN values in df_new were replaced with the values from df_old.
Upvotes: 0
Views: 89
Reputation: 10349
First of all, mark your data so you know the source once merged:
df_old['source'] = 'old'
df_new['source'] = 'new'
Using a helper function you can achieve the result with just two statements:
df_combined = combine(df_new, df_old,
cols=['TMAX', 'TMIN', 'WIND'],
on='Date')
pd.concat([df_old, df_combined])
=>
Date Rain TMAX TMIN WIND source updated
0 2015-01-02 0 0 6 0 old NaN
1 2015-01-03 1 8 -4 9 old NaN
2 2015-01-04 1 -3 9 12 old NaN
3 2015-01-05 0 -4 5 16 old NaN
4 2015-01-06 0 -2 9 5 old NaN
5 2015-01-07 1 3 8 12 old NaN
6 2015-01-08 0 9 -2 6 old NaN
7 2015-01-09 1 7 -3 11 old NaN
...
19 2015-01-21 0 -5 -1 8 new
20 2015-01-22 0 2 -5 1 new
21 2015-01-23 1 7 8 17 new
22 2015-01-24 0 6 5 8 new TMAX
23 2015-01-25 1 -3 0 13 new TMAX
24 2015-01-26 1 4 -3 8 new
25 2015-01-27 0 1 -2 7 new TMIN
26 2015-01-28 0 -4 5 0 new
27 2015-01-29 1 9 -3 3 new TMIN
Note that combine also tells us which values were updated.
The combine function is as follows. It generalizes my original answer into a selective version of pandas' DataFrame.combine_first functionality. Selective means you can specify the merge keys and the columns to combine without fiddling with indexes:
import math

def combine(df_left, df_right, cols=None, on=None):
    def check(r):
        # Fill each NaN in the left-hand column from the right-hand column
        # and record which columns were filled.
        updated = []
        for c in cols:
            xc = '%s_x' % c
            yc = '%s_y' % c
            if math.isnan(r[xc]):
                r[xc] = r[yc]
                updated.append(c)
        r['updated'] = ','.join(updated)
        return r
    left_expanded = df_left.merge(df_right, on=on, how='left')
    left_expanded = left_expanded.apply(check, axis=1)
    # Strip the '_x' suffix so the left-hand columns get their original names back.
    columns = [('%s' % c, c.replace('_x', ''))
               for c in left_expanded.columns]
    left_expanded.rename(columns=dict(columns), inplace=True)
    return left_expanded[list(df_left.columns) + ['updated']]
My initial answer is still available in the history or with the full solution. Kudos to Dickster for bringing up combine_first, which inspired me to generalize my original approach.
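The row-wise apply in combine can get slow on decades of daily records. As a rough sketch only (not the function above; combine_vectorized and the '_right' suffix are names made up for illustration), the same idea can be written with column-wise operations. It assumes the key column is unique in df_right, and unlike combine it only flags a column as updated when a value was actually filled in:

```python
import numpy as np
import pandas as pd

def combine_vectorized(df_left, df_right, cols, on):
    """Fill NaNs in df_left[cols] from df_right rows sharing the `on` key.

    Hypothetical helper for illustration; assumes `on` is unique in df_right.
    """
    # Bring the right-hand values alongside the left-hand ones.
    merged = df_left.merge(df_right[[on] + cols], on=on, how='left',
                           suffixes=('', '_right'))
    updated = pd.Series('', index=merged.index)
    for c in cols:
        rc = c + '_right'
        # True where the left value is missing and the right one can replace it.
        filled = merged[c].isna() & merged[rc].notna()
        merged[c] = merged[c].fillna(merged[rc])
        updated = updated.where(~filled, updated + c + ',')
    merged['updated'] = updated.str.rstrip(',')
    return merged.drop(columns=[c + '_right' for c in cols])
```

It is called the same way as combine, for example combine_vectorized(df_new, df_old, cols=['TMAX', 'TMIN', 'WIND'], on='Date').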
Upvotes: 1
Reputation: 3009
Try combine_first().
import numpy as np
import pandas as pd
from pandas.tseries.offsets import DateOffset
df_new = pd.DataFrame(
{'Date': {0: '01/01/1994', 1: '01/02/1994', 2: '01/03/1994', 3: '01/04/1994'},
'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
'TMAX': {0: -5.5, 1: np.nan, 2: -1.5, 3: np.nan},
'TMIN': {0: -11.64, 1: -10.55, 2: np.nan, 3: -11.41},
'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})
df_old = pd.DataFrame(
{'Date': {0: '01/01/1980', 1: '01/02/1980', 2: '01/03/1980', 3: '01/04/1980'},
'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
'TMAX': {0: -5.5, 1: -3.5, 2: -1.5, 3: -2.8},
'TMIN': {0: -11.64, 1: -10.55, 2: -14.33, 3: -11.41},
'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})
# Date Type Cast
df_old['Date'] = pd.to_datetime(df_old['Date'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
# I'm assuming a 14 year roll forward.
df_old['DateForward'] = df_old['Date'] + pd.DateOffset(years=14)
df_old.set_index('DateForward',inplace=True)
df_new.set_index('Date',inplace=True)
df_new.combine_first(df_old).reset_index(drop=True)
Date Rain TMAX TMIN WIND
0 1980-01-01 0 -5.5 -11.64 4.1
1 1980-01-02 0 -3.5 -10.55 6.8
2 1980-01-03 0 -1.5 -14.33 5.4
3 1980-01-04 0 -2.8 -11.41 9.6
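If the two stations actually share calendar dates (an assumption; the sample frames in the question do not overlap), no roll forward is needed. Here is a minimal sketch with made-up overlapping records, both indexed by Date. combine_first then keeps the new station's values, fills its NaNs from the old station, and adds the dates that only the old station has:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the old station covers 1-4 Jan, the new one 3-5 Jan.
df_old = pd.DataFrame({
    'Date': pd.to_datetime(['1994-01-01', '1994-01-02', '1994-01-03', '1994-01-04']),
    'TMAX': [-5.5, -3.5, -1.5, -2.8],
    'TMIN': [-11.64, -10.55, -14.33, -11.41],
}).set_index('Date')

df_new = pd.DataFrame({
    'Date': pd.to_datetime(['1994-01-03', '1994-01-04', '1994-01-05']),
    'TMAX': [-1.0, np.nan, -4.9],
    'TMIN': [np.nan, -11.0, -10.21],
}).set_index('Date')

# Prefer df_new; fall back to df_old for NaNs and for dates df_new lacks.
df_complete = df_new.combine_first(df_old).sort_index().reset_index()
print(df_complete)
```

The rows for 1 and 2 Jan come entirely from the old station, while 3 and 4 Jan keep the new station's readings except where they were NaN.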
Upvotes: 2
Reputation: 4090
Try pandas concat:
together = pd.concat([df_new, df_old])
I'm not really sure what you're looking to do with the NaNs. Are you trying to fill the missing observations with the values for matching dates from the other DataFrame?
If they have the same number and order of observations, you might be able to do something like:
df_new['TMAX'].fillna(df_old['TMAX'])
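Two things to watch with that line: fillna with a Series matches on index labels rather than on row position, and it returns a new Series instead of changing df_new. A small sketch with made-up numbers shows both:

```python
import numpy as np
import pandas as pd

new = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 2])
old = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])  # labels shifted by one

# The NaN at label 1 is filled from old.loc[1] (10.0), not old.iloc[1] (20.0).
filled = new.fillna(old)
print(filled.tolist())

# `new` itself still holds the NaN; assign the result back to keep it.
print(new.isna().sum())
```

So for the two stations this only lines up if both frames are indexed the same way, for example by date.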
Upvotes: 0