Reputation: 103
I have 2 dataframes (df1 and df2) with following format. df1 is a simulation results. Hence, df1 is more densely populated timesteps wise (beginning of each monthly). df2 is actual observed data. Hence less available data (whenever is collected). Both df1 and df2 have different time series (timesteps) and are compiled for each location basis.
Sample data
df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})
df1:
Date Location Sim
0 2018-02-01 1 3253
1 2018-03-01 1 3078
2 2018-04-01 1 3222
3 2018-05-01 1 3940
4 2018-06-01 1 3665
5 2018-07-01 1 3856
6 2018-02-01 2 3775
7 2018-03-01 2 3658
8 2018-04-01 2 3056
9 2018-05-01 2 3993
10 2018-06-01 2 3240
11 2018-07-01 2 3054
12 2018-02-01 3 3162
13 2018-03-01 3 3193
14 2018-04-01 3 3627
15 2018-05-01 3 3740
16 2018-06-01 3 3042
17 2018-07-01 3 3569
df2:
Date Location Observed
0 2018-02-10 1 3668
1 2018-03-18 1 3102
2 2018-04-15 1 3128
3 2018-05-11 1 3485
4 2018-06-12 1 3926
5 2018-07-11 1 3344
6 2018-02-22 2 3134
7 2018-03-31 2 3258
8 2018-04-02 2 3833
9 2018-05-06 2 3883
10 2018-06-30 2 3122
11 2018-07-21 2 3417
12 2018-02-03 3 3551
13 2018-03-04 3 3971
14 2018-04-01 3 3294
15 2018-05-03 3 3207
16 2018-06-05 3 3803
17 2018-07-25 3 3250
I am looking for end results as picture/plot above. For each 'Location', resample the dates in 'Sim' data to daily freq and then interpolate or extrapolate (if necessary) linearly. Calculate the Delta (Delta=Observed - Sim) only on dates when 'Observed' data is available. Again for each 'Location' a plot similar to one attached above.
My thinking is to use df.groupby method to group each 'Location', series.resample to daily for Sim column in df1. Interpolate linearly df1 daily freq. Calculate the Delta on dates of Observed. And then plot them up.
Upvotes: 4
Views: 699
Reputation: 1570
For the 1st part of your problem, you could concatenate your 2 dataframes, then interpolate and then filter the result according to the 1st time series.
df1 = pd.DataFrame({'Date': ['2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', '2018-06-01', '2018-07-01'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Sim': [3253, 3078, 3222, 3940, 3665, 3856, 3775, 3658, 3056, 3993, 3240, 3054, 3162, 3193, 3627, 3740, 3042, 3569]})
df2 = pd.DataFrame({'Date': ['2018-02-10', '2018-03-18', '2018-04-15', '2018-05-11', '2018-06-12', '2018-07-11', '2018-02-22', '2018-03-31', '2018-04-02', '2018-05-06', '2018-06-30', '2018-07-21', '2018-02-03', '2018-03-04', '2018-04-01', '2018-05-03', '2018-06-05', '2018-07-25'], 'Location': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], 'Observed': [3668, 3102, 3128, 3485, 3926, 3344, 3134, 3258, 3833, 3883, 3122, 3417, 3551, 3971, 3294, 3207, 3803, 3250]})
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index('Date')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index('Date')
Then, groupby, fill missing values & interpolate:
df1_daily = df1.groupby('Location').resample('D').mean()
df1_daily['Location'] = df1_daily.Location.fillna(method='pad')
df1_daily['Sim'] = df1_daily.Sim.interpolate(method='linear')
Prepare merge &... merge:
df2_grouped = df2.set_index(['Location',df2.index])
merge = df1_daily.merge(right=df2_grouped, left_index=True, right_index=True, how='left')#.sort_index()
Finally:
merge['Delta'] = merge.Observed - merge.Sim
merge[['Observed', 'Sim', 'Delta']].groupby('Location').plot.line(marker='o', ms=2)
Upvotes: 1
Reputation: 4638
I would suggest to construct a single dataframe using Series and then interpolate it
Observed= {0: 3668, 1: 3102, 2: 3128, 3: 3485, 4: 3926, 5: 3344, 6: 3134, 7: 3258, 8: 3833, 9: 3883, 10: 3122, 11: 3417, 12: 3551, 13: 3971, 14: 3294, 15: 3207, 16: 3803, 17: 3250}
y1 = pd.Series(Observed, index=Observed)
df = pd.DataFrame({'Date': {0: '2018-02-01', 1: '2018-03-01', 2: '2018-04-01', 3: '2018-05-01', 4: '2018-06-01', 5: '2018-07-01', 6: '2018-02-01', 7: '2018-03-01', 8: '2018-04-01', 9: '2018-05-01', 10: '2018-06-01', 11: '2018-07-01', 12: '2018-02-01', 13: '2018-03-01', 14: '2018-04-01', 15: '2018-05-01', 16: '2018-06-01', 17: '2018-07-01'}, 'Location': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3},
'Sim': {0: 3253, 1: 3078, 2: 3222, 3: 3940, 4: 3665, 5: 3856, 6: 3775, 7: 3658, 8: 3056, 9: 3993, 10: 3240, 11: 3054, 12: 3162, 13: 3193, 14: 3627, 15: 3740, 16: 3042, 17: 3569},
'Observed':Observed})
df.interpolate('index').reindex(Observed)
Upvotes: 1