abudis
abudis

Reputation: 2881

Imputation using pandas

I have a multi-year timeseries with half-hourly resolution with some gaps and would like to impute them based on averages of the values of other years, but at the same time. E.g. if a value is missing at 2005-1-1 12:00, I'd like to take all the values at the same time, but from all other years and average them, then impute the missing value by the average. Here's what I got:

import pandas as pd
import numpy as np

idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = None

grouped = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]).mean()

Which gives me the averages I need, but I don't know how to plug them back into the original timeseries.

Upvotes: 2

Views: 1176

Answers (1)

Jianxun Li
Jianxun Li

Reputation: 24742

You are almost there. Just use .tranform to fill NaNs.

import pandas as pd
import numpy as np

# your data
# ==================================================
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan


                     somedata
2000-01-01 00:00:00    0.5488
2000-01-01 00:30:00       NaN
2000-01-01 01:00:00    0.6028
2000-01-01 01:30:00    0.5449
2000-01-01 02:00:00    0.4237
2000-01-01 02:30:00    0.6459
2000-01-01 03:00:00    0.4376
2000-01-01 03:30:00       NaN
...                       ...
2009-12-31 20:30:00    0.4983
2009-12-31 21:00:00    0.4282
2009-12-31 21:30:00       NaN
2009-12-31 22:00:00    0.3306
2009-12-31 22:30:00    0.3021
2009-12-31 23:00:00    0.2077
2009-12-31 23:30:00    0.2965
2010-01-01 00:00:00    0.5183

[175345 rows x 1 columns]

# processing
# ==================================================
result = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute], as_index=False).transform(lambda g: g.fillna(g.mean()))

                     somedata
2000-01-01 00:00:00    0.5488
2000-01-01 00:30:00    0.2671
2000-01-01 01:00:00    0.6028
2000-01-01 01:30:00    0.5449
2000-01-01 02:00:00    0.4237
2000-01-01 02:30:00    0.6459
2000-01-01 03:00:00    0.4376
2000-01-01 03:30:00    0.3957
...                       ...
2009-12-31 20:30:00    0.4983
2009-12-31 21:00:00    0.4282
2009-12-31 21:30:00    0.4784
2009-12-31 22:00:00    0.3306
2009-12-31 22:30:00    0.3021
2009-12-31 23:00:00    0.2077
2009-12-31 23:30:00    0.2965
2010-01-01 00:00:00    0.5183

[175345 rows x 1 columns]

# take a look at a particular sample
# ======================================
x = list(df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]

            somedata
2000-01-01    0.5488
2001-01-01    0.1637
2002-01-01    0.3245
2003-01-01       NaN
2004-01-01    0.5654
2005-01-01    0.5729
2006-01-01    0.4740
2007-01-01    0.1728
2008-01-01    0.2577
2009-01-01       NaN
2010-01-01    0.5183

x.mean()  # output: 0.3998

list(result.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]

            somedata
2000-01-01    0.5488
2001-01-01    0.1637
2002-01-01    0.3245
2003-01-01    0.3998
2004-01-01    0.5654
2005-01-01    0.5729
2006-01-01    0.4740
2007-01-01    0.1728
2008-01-01    0.2577
2009-01-01    0.3998
2010-01-01    0.5183

Upvotes: 4

Related Questions