pizza lover

Reputation: 523

Merging dataframes based on Time

I have data from two different weather stations for one location. One station was installed during the 80s and the other during the mid 90s. Due to instrument error, the readings from the old station became unreliable, and there are also several instances of missing records for the new station.

I plan to create a complete weather profile by using the data from the old station for the period before the new station was installed, then appending the data from the new station. I also want to use data from the old station wherever there is an instrument failure in the new station (NaN).

import numpy as np
import pandas as pd

df_new = pd.DataFrame(
    {'Date': {0: '01/01/1994', 1: '01/02/1994', 2: '01/03/1994', 3: '01/04/1994'},
     'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
     'TMAX': {0: -5.5, 1: np.nan, 2: -1.5, 3: np.nan},
     'TMIN': {0: -11.64, 1: -10.55, 2: np.nan, 3: -11.41},
     'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})

df_old = pd.DataFrame(
    {'Date': {0: '01/01/1980', 1: '01/02/1980', 2: '01/03/1980', 3: '01/04/1980'},
     'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
     'TMAX': {0: -5.5, 1: -3.5, 2: -1.5, 3: -2.8},
     'TMIN': {0: -11.64, 1: -10.55, 2: -14.33, 3: -11.41},
     'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})

How can I combine the two df and create a new df (df_complete) like this?

df_complete
Date       Rain TMAX  TMIN  WIND
01/01/1980  0   -5.5 -11.64 4.1
01/02/1980  0   -3.5 -10.55 6.8
01/03/1980  0   -1.5 -14.33 5.4
.
.
01/01/1994  0   -5.5 -11.64 4.1
01/02/1994  0    5.7 -10.55 6.8
01/03/1994  0   -1.5 -10.58 5.4
.
.
12/31/2014  0   -4.9 -10.21 3.5

To be clear, the NaN values in df_new should be replaced with the corresponding values from df_old.

Upvotes: 0

Views: 89

Answers (3)

miraculixx

Reputation: 10349

First of all, mark your data so you know the source once merged:

df_old['source'] = 'old' 
df_new['source'] = 'new'

Using a helper function (combine, defined further down) you can achieve the result with just two statements:

df_combined = combine(df_new, df_old, 
                      cols=['TMAX', 'TMIN', 'WIND'], 
                      on='Date')
pd.concat([df_old, df_combined])
=>
   Date        Rain  TMAX  TMIN  WIND source updated
0  2015-01-02     0     0     6     0    old     NaN
1  2015-01-03     1     8    -4     9    old     NaN
2  2015-01-04     1    -3     9    12    old     NaN
3  2015-01-05     0    -4     5    16    old     NaN
4  2015-01-06     0    -2     9     5    old     NaN
5  2015-01-07     1     3     8    12    old     NaN
6  2015-01-08     0     9    -2     6    old     NaN
7  2015-01-09     1     7    -3    11    old     NaN
...
19 2015-01-21     0    -5    -1     8    new        
20 2015-01-22     0     2    -5     1    new        
21 2015-01-23     1     7     8    17    new        
22 2015-01-24     0     6     5     8    new    TMAX
23 2015-01-25     1    -3     0    13    new    TMAX
24 2015-01-26     1     4    -3     8    new        
25 2015-01-27     0     1    -2     7    new    TMIN
26 2015-01-28     0    -4     5     0    new        
27 2015-01-29     1     9    -3     3    new    TMIN

Note that combine also tells us which values were updated.

The combine function is as follows. It generalizes my original answer into a selective version of the pandas DataFrame.combine_first functionality. "Selective" means you can specify the merge keys and the columns to combine without fiddling with indexes:

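import math

def combine(df_left, df_right, cols=None, on=None):
    # Fill NaNs in df_left[cols] with values from the rows of df_right
    # that match on the `on` key, recording which columns were filled.
    def check(r):
        updated = []
        for c in cols:
            xc = '%s_x' % c  # value that came from df_left
            yc = '%s_y' % c  # value that came from df_right
            if math.isnan(r[xc]):
                r[xc] = r[yc]
                updated.append(c)
        r['updated'] = ','.join(updated)
        return r
    # The left merge suffixes overlapping columns with _x (left) and _y (right).
    left_expanded = df_left.merge(df_right, on=on, how='left')
    left_expanded = left_expanded.apply(check, axis=1)
    # Strip the _x suffix so the left-hand columns regain their original names.
    columns = [(c, c.replace('_x', ''))
               for c in left_expanded.columns]
    left_expanded.rename(columns=dict(columns), inplace=True)
    return left_expanded[list(df_left.columns) + ['updated']]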

My initial answer is still available in the history or with the full solution. Kudos to Dickster for bringing up combine_first, which inspired me to generalize my original approach.
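For larger frames, the row-wise apply in combine can get slow. Below is a rough vectorized sketch of the same idea. The name combine_vectorized and the toy values are made up for illustration, and it assumes the key column is unique in df_right. Unlike combine above, it only flags a column as updated when a value was actually filled in.

```python
import numpy as np
import pandas as pd


def combine_vectorized(df_left, df_right, cols, on):
    """Fill NaNs in df_left[cols] from df_right rows matching on `on`.

    Hypothetical helper; assumes `on` is a single column name that is
    unique in df_right, so the left merge does not duplicate rows.
    """
    merged = df_left.merge(df_right[[on] + cols], on=on, how='left',
                           suffixes=('', '_right'))
    # True where the left value is missing and the right value is available.
    flags = pd.DataFrame(
        {c: merged[c].isna() & merged[c + '_right'].notna() for c in cols})
    for c in cols:
        merged[c] = merged[c].fillna(merged[c + '_right'])
    # Comma-separated list of the columns that were filled in each row.
    merged['updated'] = flags.apply(
        lambda row: ','.join(c for c in cols if row[c]), axis=1)
    return merged[list(df_left.columns) + ['updated']]


# Toy data (made-up values) to show the behaviour.
left = pd.DataFrame({'Date': ['a', 'b', 'c'],
                     'TMAX': [1.0, np.nan, np.nan],
                     'TMIN': [np.nan, 2.0, 3.0]})
right = pd.DataFrame({'Date': ['a', 'b'],
                      'TMAX': [9.0, 8.0],
                      'TMIN': [7.0, 6.0]})
out = combine_vectorized(left, right, cols=['TMAX', 'TMIN'], on='Date')
```

Row 'c' has no match in right, so its TMAX stays NaN and its updated field is empty.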

Upvotes: 1

Dickster

Reputation: 3009

Try combine_first().

import numpy as np
import pandas as pd

df_new = pd.DataFrame(
    {'Date': {0: '01/01/1994', 1: '01/02/1994', 2: '01/03/1994', 3: '01/04/1994'},
     'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
     'TMAX': {0: -5.5, 1: np.nan, 2: -1.5, 3: np.nan},
     'TMIN': {0: -11.64, 1: -10.55, 2: np.nan, 3: -11.41},
     'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})

df_old = pd.DataFrame(
    {'Date': {0: '01/01/1980', 1: '01/02/1980', 2: '01/03/1980', 3: '01/04/1980'},
     'Rain': {0: 0, 1: 0, 2: 0, 3: 0},
     'TMAX': {0: -5.5, 1: -3.5, 2: -1.5, 3: -2.8},
     'TMIN': {0: -11.64, 1: -10.55, 2: -14.33, 3: -11.41},
     'WIND': {0: 4.1, 1: 6.8, 2: 5.4, 3: 9.6}})

# Cast the Date columns from strings to datetimes.
df_old['Date'] = pd.to_datetime(df_old['Date'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
# I'm assuming a 14-year roll forward, so the 1980 dates line up with 1994.
df_old['DateForward'] = df_old['Date'] + pd.DateOffset(years=14)
df_old.set_index('DateForward', inplace=True)
df_new.set_index('Date', inplace=True)
# Values from df_new win; NaNs are filled from df_old at the same index label.
df_new.combine_first(df_old).reset_index(drop=True)

        Date  Rain  TMAX   TMIN  WIND
0 1980-01-01     0  -5.5 -11.64   4.1
1 1980-01-02     0  -3.5 -10.55   6.8
2 1980-01-03     0  -1.5 -14.33   5.4
3 1980-01-04     0  -2.8 -11.41   9.6
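If the real records from the two stations overlap in calendar time, the roll forward may not be needed at all. Here is a small sketch with made-up values (not the asker's data) where both frames are simply indexed by Date. combine_first then keeps the old rows from before the new station existed and fills the new station's NaNs on dates that both frames share.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration: the old station covers the earlier
# dates and overlaps with the new station on 1994-01-01 and 1994-01-02.
df_old = pd.DataFrame({
    'Date': pd.to_datetime(['1993-12-30', '1993-12-31',
                            '1994-01-01', '1994-01-02']),
    'TMAX': [-2.0, -3.0, -4.0, -5.0],
}).set_index('Date')

df_new = pd.DataFrame({
    'Date': pd.to_datetime(['1994-01-01', '1994-01-02', '1994-01-03']),
    'TMAX': [-4.5, np.nan, -1.5],
}).set_index('Date')

# New-station values take priority; old-station values fill the gaps
# and supply the rows that exist only in df_old.
df_complete = df_new.combine_first(df_old).reset_index()
```

The result has one row per date in the union of both indexes. On 1994-01-01 the new reading (-4.5) is kept, while on 1994-01-02 the NaN is replaced by the old reading (-5.0).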

Upvotes: 2

Sam

Reputation: 4090

Try pandas concat:

together = pd.concat([df_new, df_old])

I'm not really sure what you're looking to do with the NaNs. Are you trying to fill the null observations with values from matching dates in the other dataframe?

If they have the same number and order of observations, you might be able to do something like this (note that fillna returns a new Series rather than modifying df_new):

df_new['TMAX'].fillna(df_old['TMAX'])
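One caveat with the fillna call above: when given a Series, fillna lines the values up by index label, not by position. A quick sketch using the TMAX values from the question (the variable names are just for illustration):

```python
import numpy as np
import pandas as pd

new_tmax = pd.Series([-5.5, np.nan, -1.5, np.nan], index=[0, 1, 2, 3])
old_tmax = pd.Series([-5.5, -3.5, -1.5, -2.8], index=[0, 1, 2, 3])

# Same index labels: each NaN is filled from the matching label.
filled = new_tmax.fillna(old_tmax)

# Different index labels: nothing matches, so the NaNs remain.
old_shifted = pd.Series([-5.5, -3.5, -1.5, -2.8], index=[10, 11, 12, 13])
unfilled = new_tmax.fillna(old_shifted)
```

So this only works if both frames share an index that pairs up the right rows, for example the default 0..n-1 index with the same ordering, or a common date index.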

Upvotes: 0
