Reputation: 21
I have multivariate time series data with missing values. Is there any way I can impute the missing value with mean value of the same day of week and time? For example, value for account 1 on 2019-2-1 (Friday) at 2am shall be filled with mean value for account 1 on every Friday at 2am.
account              1     2     3
2019-2-1 01:00:00   15    12    10
2019-2-1 02:00:00  NaN    11     9
2019-2-1 03:00:00   10    11     3
...
2019-1-31 22:00:00  11   NaN     4
2019-1-31 23:00:00 NaN    12     4
2019-1-31 24:00:00  10    10   NaN
I have tried `interpolate` with a polynomial method, but that approach is not relevant in my case. Maybe `groupby` can help, but I am not sure how to use it with a time-series index.
Upvotes: 1
Views: 673
Reputation: 4233
Use groupby to get the average by day of week and hour, then map that average onto the rows where the value is NaN, using the (day-of-week, hour) pair as the key:
import math

import numpy as np
import pandas as pd

data = [
    ('2019-02-07 01:00:00', 15, 12, 10),
    ('2019-02-07 02:00:00', np.nan, 11, 9),
    ('2019-02-07 03:00:00', 10, 11, 3),
    ('2019-01-31 22:00:00', 11, np.nan, 4),
    ('2019-01-31 02:00:00', np.nan, 12, 4),
    ('2019-01-24 02:00:00', 10, 10, float('nan')),
]
df = pd.DataFrame(data, columns=['date', 'acct1', 'acct2', 'acct3'])
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d %H:%M:%S")
df['dow'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour

# mean per (day of week, hour) for each account
grouped_acct1 = df.groupby(['dow', 'hour'])['acct1'].mean()
grouped_acct2 = df.groupby(['dow', 'hour'])['acct2'].mean()
grouped_acct3 = df.groupby(['dow', 'hour'])['acct3'].mean()

# look up each row's group mean by its (dow, hour) key
for key, item in df.iterrows():
    df.loc[key, 'acct1_mean'] = grouped_acct1[(item.dow, item.hour)]
    df.loc[key, 'acct2_mean'] = grouped_acct2[(item.dow, item.hour)]
    df.loc[key, 'acct3_mean'] = grouped_acct3[(item.dow, item.hour)]

# replace NaNs with the corresponding group mean
for key, item in df.iterrows():
    if math.isnan(item['acct1']):
        df.loc[key, 'acct1'] = item['acct1_mean']
    if math.isnan(item['acct2']):
        df.loc[key, 'acct2'] = item['acct2_mean']
    if math.isnan(item['acct3']):
        df.loc[key, 'acct3'] = item['acct3_mean']

print(df)
Output:
date acct1 acct2 acct3 dow hour acct1_mean acct2_mean \
0 2019-02-07 01:00:00 15.0 12.0 10.0 3 1 15.0 12.0
1 2019-02-07 02:00:00 10.0 11.0 9.0 3 2 10.0 11.0
2 2019-02-07 03:00:00 10.0 11.0 3.0 3 3 10.0 11.0
3 2019-01-31 22:00:00 11.0 NaN 4.0 3 22 11.0 NaN
4 2019-01-31 02:00:00 10.0 12.0 4.0 3 2 10.0 11.0
5 2019-01-24 02:00:00 10.0 10.0 6.5 3 2 10.0 11.0
acct3_mean
0 10.0
1 6.5
2 3.0
3 4.0
4 6.5
5 6.5
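For larger frames, the same day-of-week/hour means can be filled in without the iterrows loops. A sketch of a vectorized alternative (not part of the original answer), using groupby plus transform so each NaN is replaced by its group's mean in one pass:

```python
import numpy as np
import pandas as pd

data = [
    ('2019-02-07 01:00:00', 15, 12, 10),
    ('2019-02-07 02:00:00', np.nan, 11, 9),
    ('2019-02-07 03:00:00', 10, 11, 3),
    ('2019-01-31 22:00:00', 11, np.nan, 4),
    ('2019-01-31 02:00:00', np.nan, 12, 4),
    ('2019-01-24 02:00:00', 10, 10, np.nan),
]
df = pd.DataFrame(data, columns=['date', 'acct1', 'acct2', 'acct3'])
df['date'] = pd.to_datetime(df['date'])

# group key: (day of week, hour); fill each NaN with its group's mean
key = [df['date'].dt.dayofweek, df['date'].dt.hour]
df[['acct1', 'acct2', 'acct3']] = (
    df.groupby(key)[['acct1', 'acct2', 'acct3']]
      .transform(lambda x: x.fillna(x.mean()))
)
print(df)
```

A NaN whose whole (dow, hour) group is NaN stays NaN, exactly as in the loop version above (the acct2 value at Thursday 22:00 here).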
Upvotes: 0
Reputation: 3011
If you are not concerned about using future values for imputation, you can use this for each column:
Sample pandas dataframe (with 'date_time' index of dtype='datetime64[ns]'):
acct_1 acct_2 acct_3
date_time
2019-02-07 01:00:00 15.0 12.0 10.0
2019-02-07 02:00:00 NaN 11.0 9.0
2019-02-07 03:00:00 10.0 11.0 3.0
2019-01-31 22:00:00 11.0 NaN 4.0
2019-01-31 02:00:00 NaN 12.0 4.0
2019-01-24 02:00:00 10.0 10.0 NaN
Code and result:
df['acct_1'] = (df['acct_1']
                .groupby((df.index.dayofweek * 24) + df.index.hour)
                .transform(lambda x: x.fillna(x.mean())))
df
acct_1 acct_2 acct_3
date_time
2019-02-07 01:00:00 15.0 12.0 10.0
2019-02-07 02:00:00 10.0 11.0 9.0
2019-02-07 03:00:00 10.0 11.0 3.0
2019-01-31 22:00:00 11.0 NaN 4.0
2019-01-31 02:00:00 10.0 12.0 4.0
2019-01-24 02:00:00 10.0 10.0 NaN
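The same transform extends to every account column in one statement. A sketch (the sample frame above rebuilt inline so it is self-contained):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    '2019-02-07 01:00:00', '2019-02-07 02:00:00', '2019-02-07 03:00:00',
    '2019-01-31 22:00:00', '2019-01-31 02:00:00', '2019-01-24 02:00:00',
])
df = pd.DataFrame({
    'acct_1': [15, np.nan, 10, 11, np.nan, 10],
    'acct_2': [12, 11, 11, np.nan, 12, 10],
    'acct_3': [10, 9, 3, 4, 4, np.nan],
}, index=idx)
df.index.name = 'date_time'

# one integer key per (day-of-week, hour) slot: dow * 24 + hour
slot = df.index.dayofweek * 24 + df.index.hour
df = df.groupby(slot).transform(lambda x: x.fillna(x.mean()))
print(df)
```

Because the whole frame is transformed, acct_3's NaN is now filled too; acct_2 at Thursday 22:00 stays NaN since its group contains no observed value.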
Upvotes: 1
Reputation: 788
You could try to use `groupby`, `interpolate`, or `rolling`, as follows:
import pandas as pd
import numpy as np

if __name__ == "__main__":
    df = pd.DataFrame({
        "account": pd.date_range(start="2019-02-01", periods=1000, freq="H"),
        "1": range(1000),
        "2": range(1000, 2000),
        "3": range(1000, 0, -1),
    })
    df.loc[2:3, "1"] = np.nan
    df.loc[6, "2"] = np.nan
    df.loc[7, "3"] = np.nan
    df["day"] = df["account"].dt.date
    print(df)
    # df_result = df.groupby("day").apply(lambda day: day.interpolate(method="linear"))
    df_groupby = df.groupby("day")
    df["1"] = df_groupby["1"].transform(lambda col: col.rolling(2, min_periods=1).mean())
    print(df)
Results:
account 1 2 3 day
0 2019-02-01 00:00:00 0.0 1000.0 1000.0 2019-02-01
1 2019-02-01 01:00:00 1.0 1001.0 999.0 2019-02-01
2 2019-02-01 02:00:00 NaN 1002.0 998.0 2019-02-01
3 2019-02-01 03:00:00 NaN 1003.0 997.0 2019-02-01
4 2019-02-01 04:00:00 4.0 1004.0 996.0 2019-02-01
.. ... ... ... ... ...
995 2019-03-14 11:00:00 995.0 1995.0 5.0 2019-03-14
996 2019-03-14 12:00:00 996.0 1996.0 4.0 2019-03-14
997 2019-03-14 13:00:00 997.0 1997.0 3.0 2019-03-14
998 2019-03-14 14:00:00 998.0 1998.0 2.0 2019-03-14
999 2019-03-14 15:00:00 999.0 1999.0 1.0 2019-03-14
[1000 rows x 5 columns]
account 1 2 3 day
0 2019-02-01 00:00:00 0.0 1000.0 1000.0 2019-02-01
1 2019-02-01 01:00:00 0.5 1001.0 999.0 2019-02-01
2 2019-02-01 02:00:00 1.0 1002.0 998.0 2019-02-01
3 2019-02-01 03:00:00 NaN 1003.0 997.0 2019-02-01
4 2019-02-01 04:00:00 4.0 1004.0 996.0 2019-02-01
.. ... ... ... ... ...
995 2019-03-14 11:00:00 994.5 1995.0 5.0 2019-03-14
996 2019-03-14 12:00:00 995.5 1996.0 4.0 2019-03-14
997 2019-03-14 13:00:00 996.5 1997.0 3.0 2019-03-14
998 2019-03-14 14:00:00 997.5 1998.0 2.0 2019-03-14
999 2019-03-14 15:00:00 998.5 1999.0 1.0 2019-03-14
[1000 rows x 5 columns]
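For completeness, the commented-out `interpolate` route written out on a small frame. A sketch; note that linear interpolation fills from neighboring rows within each day, not from the day-of-week/hour mean the question asks for:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "account": pd.date_range(start="2019-02-01", periods=8, freq="H"),
    "1": [0, 1, np.nan, np.nan, 4, 5, 6, 7],
})
df["day"] = df["account"].dt.date

# linear interpolation of column "1" within each calendar day
df["1"] = df.groupby("day")["1"].transform(lambda s: s.interpolate(method="linear"))
print(df)
```

The two NaNs between 1 and 4 come back as 2.0 and 3.0; unlike the rolling-mean version, consecutive gaps are bridged instead of leaving a NaN behind.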
Upvotes: 0