m_ana

Reputation: 21

How to impute missing values in time series data with the mean value of the same day and time in Python

I have multivariate time series data with missing values. Is there any way I can impute each missing value with the mean value for the same day of week and time? For example, the value for account 1 on 2019-02-01 (a Friday) at 2am should be filled with the mean of account 1's values on every Friday at 2am.

account              1     2     3
2019-02-01 01:00:00  15    12    10
2019-02-01 02:00:00  NaN   11    9
2019-02-01 03:00:00  10    11    3
...
2019-01-31 22:00:00  11    NaN   4
2019-01-31 23:00:00  NaN   12    4
2019-01-31 24:00:00  10    10    NaN

I have tried polynomial interpolation, but that method is not appropriate in my case. Maybe groupby can help, but I am not sure how to use it with a time-series index.

Upvotes: 1

Views: 673

Answers (3)

Use groupby to get the average by day of week and hour, then map that average onto the rows where the value is NaN, using the day of week and hour as the key.

import math

import numpy as np
import pandas as pd

data = [
    ('2019-02-07 01:00:00', 15, 12, 10),
    ('2019-02-07 02:00:00', np.nan, 11, 9),
    ('2019-02-07 03:00:00', 10, 11, 3),
    ('2019-01-31 22:00:00', 11, np.nan, 4),
    ('2019-01-31 02:00:00', np.nan, 12, 4),
    ('2019-01-24 02:00:00', 10, 10, np.nan),
]

df = pd.DataFrame(data, columns=['date', 'acct1', 'acct2', 'acct3'])
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%d %H:%M:%S")
df['dow'] = df['date'].dt.dayofweek
df['hour'] = df['date'].dt.hour

# Mean of each account for every (day of week, hour) combination
grouped_acct1 = df.groupby(['dow', 'hour'])['acct1'].mean()
grouped_acct2 = df.groupby(['dow', 'hour'])['acct2'].mean()
grouped_acct3 = df.groupby(['dow', 'hour'])['acct3'].mean()

# Attach the matching group mean to every row, keyed by (dow, hour)
for key, item in df.iterrows():
    df.loc[key, 'acct1_mean'] = grouped_acct1.loc[(item.dow, item.hour)]
    df.loc[key, 'acct2_mean'] = grouped_acct2.loc[(item.dow, item.hour)]
    df.loc[key, 'acct3_mean'] = grouped_acct3.loc[(item.dow, item.hour)]

# Replace NaNs with the precomputed group mean
for key, item in df.iterrows():
    if math.isnan(item['acct1']):
        df.loc[key, 'acct1'] = item['acct1_mean']
    if math.isnan(item['acct2']):
        df.loc[key, 'acct2'] = item['acct2_mean']
    if math.isnan(item['acct3']):
        df.loc[key, 'acct3'] = item['acct3_mean']

print(df)

Output:

  date                  acct1  acct2  acct3  dow  hour  acct1_mean  acct2_mean  \
0 2019-02-07 01:00:00   15.0   12.0   10.0    3     1        15.0        12.0   
1 2019-02-07 02:00:00   10.0   11.0    9.0    3     2        10.0        11.0   
2 2019-02-07 03:00:00   10.0   11.0    3.0    3     3        10.0        11.0   
3 2019-01-31 22:00:00   11.0    NaN    4.0    3    22        11.0         NaN   
4 2019-01-31 02:00:00   10.0   12.0    4.0    3     2        10.0        11.0   
5 2019-01-24 02:00:00   10.0   10.0    6.5    3     2        10.0        11.0    

   acct3_mean  
0        10.0  
1         6.5  
2         3.0  
3         4.0  
4         6.5  
5         6.5  
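
The two iterrows loops work but will be slow on a large frame. A minimal vectorized sketch of the same idea, reusing the grouped_* series from above (Series.map looks each (dow, hour) tuple up in the MultiIndex of group means):

# Build one (dow, hour) key per row, then pull the matching group mean
keys = pd.Series(list(zip(df['dow'], df['hour'])), index=df.index)
for col, grouped in [('acct1', grouped_acct1),
                     ('acct2', grouped_acct2),
                     ('acct3', grouped_acct3)]:
    df[col] = df[col].fillna(keys.map(grouped))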

Upvotes: 0

AlexK

Reputation: 3011

If you are not concerned about using future values for imputation, you can use this for each column:

Sample pandas dataframe (with 'date_time' index of dtype='datetime64[ns]'):

                    acct_1  acct_2  acct_3
date_time           
2019-02-07 01:00:00   15.0    12.0    10.0
2019-02-07 02:00:00    NaN    11.0     9.0
2019-02-07 03:00:00   10.0    11.0     3.0
2019-01-31 22:00:00   11.0     NaN     4.0
2019-01-31 02:00:00    NaN    12.0     4.0
2019-01-24 02:00:00   10.0    10.0     NaN

Code and result:

# The group key encodes day of week and hour as one integer (0-167)
df['acct_1'] = (df['acct_1']
    .groupby((df.index.dayofweek * 24) + df.index.hour)
    .transform(lambda x: x.fillna(x.mean()))
)
df
                    acct_1  acct_2  acct_3
date_time           
2019-02-07 01:00:00   15.0    12.0    10.0
2019-02-07 02:00:00   10.0    11.0     9.0
2019-02-07 03:00:00   10.0    11.0     3.0
2019-01-31 22:00:00   11.0     NaN     4.0
2019-01-31 02:00:00   10.0    12.0     4.0
2019-01-24 02:00:00   10.0    10.0     NaN
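
If you want to fill every account column in one pass, the same transform can be applied to the whole frame; a sketch under the same assumptions (DataFrameGroupBy.transform runs the lambda on each column within each group):

# Sketch: impute all columns at once, keyed by (day of week, hour)
key = df.index.dayofweek * 24 + df.index.hour
df = df.groupby(key).transform(lambda x: x.fillna(x.mean()))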

Upvotes: 1

abysslover

Reputation: 788

You could try to use groupby together with interpolate or rolling, as follows:

import pandas as pd
import numpy as np

if __name__ == "__main__":
    # Hourly data for three accounts across ~42 days
    df = pd.DataFrame({
        "account": pd.date_range(start="2019-02-01", periods=1000, freq="H"),
        "1": range(1000),
        "2": range(1000, 2000),
        "3": range(1000, 0, -1),
    })
    # Introduce a few missing values
    df.loc[2:3, "1"] = np.nan
    df.loc[6, "2"] = np.nan
    df.loc[7, "3"] = np.nan
    df["day"] = df["account"].dt.date
    print(df)
    # Alternative: interpolate linearly within each calendar day
    # df_result = df.groupby("day").apply(lambda day: day.interpolate(method="linear"))
    df_groupby = df.groupby("day")
    # Note: the rolling mean rewrites every value, not just the NaNs,
    # with the mean of the current and previous hour within the day
    df["1"] = df_groupby["1"].transform(lambda col: col.rolling(2, min_periods=1).mean())
    print(df)

Results:

                account      1       2       3         day
0   2019-02-01 00:00:00    0.0  1000.0  1000.0  2019-02-01
1   2019-02-01 01:00:00    1.0  1001.0   999.0  2019-02-01
2   2019-02-01 02:00:00    NaN  1002.0   998.0  2019-02-01
3   2019-02-01 03:00:00    NaN  1003.0   997.0  2019-02-01
4   2019-02-01 04:00:00    4.0  1004.0   996.0  2019-02-01
..                  ...    ...     ...     ...         ...
995 2019-03-14 11:00:00  995.0  1995.0     5.0  2019-03-14
996 2019-03-14 12:00:00  996.0  1996.0     4.0  2019-03-14
997 2019-03-14 13:00:00  997.0  1997.0     3.0  2019-03-14
998 2019-03-14 14:00:00  998.0  1998.0     2.0  2019-03-14
999 2019-03-14 15:00:00  999.0  1999.0     1.0  2019-03-14

[1000 rows x 5 columns]
                account      1       2       3         day
0   2019-02-01 00:00:00    0.0  1000.0  1000.0  2019-02-01
1   2019-02-01 01:00:00    0.5  1001.0   999.0  2019-02-01
2   2019-02-01 02:00:00    1.0  1002.0   998.0  2019-02-01
3   2019-02-01 03:00:00    NaN  1003.0   997.0  2019-02-01
4   2019-02-01 04:00:00    4.0  1004.0   996.0  2019-02-01
..                  ...    ...     ...     ...         ...
995 2019-03-14 11:00:00  994.5  1995.0     5.0  2019-03-14
996 2019-03-14 12:00:00  995.5  1996.0     4.0  2019-03-14
997 2019-03-14 13:00:00  996.5  1997.0     3.0  2019-03-14
998 2019-03-14 14:00:00  997.5  1998.0     2.0  2019-03-14
999 2019-03-14 15:00:00  998.5  1999.0     1.0  2019-03-14

[1000 rows x 5 columns]
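
As the second printout shows, the rolling approach smooths every value, and a NaN whose previous neighbor is also NaN (index 3) stays missing. To match the question's day-of-week-and-hour rule exactly, here is a hedged sketch of the same transform pattern with a different key (dow_hour is a made-up helper column):

# Sketch: impute with the mean over the same day of week and hour
df["dow_hour"] = df["account"].dt.dayofweek * 24 + df["account"].dt.hour
df["1"] = df.groupby("dow_hour")["1"].transform(lambda col: col.fillna(col.mean()))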

Upvotes: 0
