ooikiam
ooikiam

Reputation: 103

Get the difference betwen two dataframe with different time series

I have 2 dataframes (df1 and df2) with following format. df1 is a simulation results. Hence, df1 is more densely populated timesteps wise (beginning of each monthly). df2 is actual observed data. Hence less available data (whenever is collected). Both df1 and df2 have different time series (timesteps) and are compiled for each location basis.

Sample data

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1:

    Date    Location    Sim
0   2018-02-01  1   3253
1   2018-03-01  1   3078
2   2018-04-01  1   3222
3   2018-05-01  1   3940
4   2018-06-01  1   3665
5   2018-07-01  1   3856
6   2018-02-01  2   3775
7   2018-03-01  2   3658
8   2018-04-01  2   3056
9   2018-05-01  2   3993
10  2018-06-01  2   3240
11  2018-07-01  2   3054
12  2018-02-01  3   3162
13  2018-03-01  3   3193
14  2018-04-01  3   3627
15  2018-05-01  3   3740
16  2018-06-01  3   3042
17  2018-07-01  3   3569

df2:

    Date    Location    Observed
0   2018-02-10  1   3668
1   2018-03-18  1   3102
2   2018-04-15  1   3128
3   2018-05-11  1   3485
4   2018-06-12  1   3926
5   2018-07-11  1   3344
6   2018-02-22  2   3134
7   2018-03-31  2   3258
8   2018-04-02  2   3833
9   2018-05-06  2   3883
10  2018-06-30  2   3122
11  2018-07-21  2   3417
12  2018-02-03  3   3551
13  2018-03-04  3   3971
14  2018-04-01  3   3294
15  2018-05-03  3   3207
16  2018-06-05  3   3803
17  2018-07-25  3   3250

enter image description here

I am looking for end results as picture/plot above. For each 'Location', resample the dates in 'Sim' data to daily freq and then interpolate or extrapolate (if necessary) linearly. Calculate the Delta (Delta=Observed - Sim) only on dates when 'Observed' data is available. Again for each 'Location' a plot similar to one attached above.

My thinking is to use df.groupby method to group each 'Location', series.resample to daily for Sim column in df1. Interpolate linearly df1 daily freq. Calculate the Delta on dates of Observed. And then plot them up.

Upvotes: 4

Views: 699

Answers (2)

godot
godot

Reputation: 1570

For the 1st part of your problem, you could concatenate your 2 dataframes, then interpolate and then filter the result according to the 1st time series.

df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})

df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index('Date')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date')

Then, groupby, fill missing values & interpolate:

df1_daily = df1.groupby('Location').resample('D').mean()
df1_daily['Location'] = df1_daily.Location.fillna(method='pad')
df1_daily['Sim'] = df1_daily.Sim.interpolate(method='linear')

Prepare merge &... merge:

df2_grouped = df2.set_index(['Location',df2.index])
merge = df1_daily.merge(right=df2_grouped, left_index=True, right_index=True, how='left')#.sort_index()

Finally:

merge['Delta'] = merge.Observed - merge.Sim
merge[['Observed', 'Sim', 'Delta']].groupby('Location').plot.line(marker='o', ms=2)

enter image description here enter image description here enter image description here

Upvotes: 1

min2bro
min2bro

Reputation: 4638

I would suggest to construct a single dataframe using Series and then interpolate it

Observed= {0: 3668, 1: 3102, 2: 3128, 3: 3485, 4: 3926, 5: 3344, 6: 3134, 7: 3258, 8: 3833, 9: 3883, 10: 3122, 11: 3417, 12: 3551, 13: 3971, 14: 3294, 15: 3207, 16: 3803, 17: 3250}

y1 = pd.Series(Observed, index=Observed)

df = pd.DataFrame({'Date': {0: '2018-02-01', 1: '2018-03-01', 2: '2018-04-01', 3: '2018-05-01', 4: '2018-06-01', 5: '2018-07-01', 6: '2018-02-01', 7: '2018-03-01', 8: '2018-04-01', 9: '2018-05-01', 10: '2018-06-01', 11: '2018-07-01', 12: '2018-02-01', 13: '2018-03-01', 14: '2018-04-01', 15: '2018-05-01', 16: '2018-06-01', 17: '2018-07-01'}, 'Location': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3}, 
                   'Sim': {0: 3253, 1: 3078, 2: 3222, 3: 3940, 4: 3665, 5: 3856, 6: 3775, 7: 3658, 8: 3056, 9: 3993, 10: 3240, 11: 3054, 12: 3162, 13: 3193, 14: 3627, 15: 3740, 16: 3042, 17: 3569},
                   'Observed':Observed})


df.interpolate('index').reindex(Observed)

Upvotes: 1

Related Questions