Thanh Nguyen

Reputation: 912

Pandas Time series: Efficient operation on daily data

Generating the data

import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index= date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
                             columns=['data1', 'data2', 'data3'],
                             index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan

# Data column to calculate each day
day = [['data1', 'data2'],
       ['data1', 'data2'],
       ['data2', 'data3'],
       ['data1', 'data3'],
       ['data2', 'data3'],
       ['data1', 'data2'],
       ['data2', 'data3'],
       ['data1', 'data3']]

I want to select the data for each day and, for each day, calculate the mean over a subset of columns (different for each day) given by the day variable.

Expected outcome:

                     mean
2018-01-01 00:00:00   1.0
2018-01-01 01:00:00   5.0
2018-01-01 02:00:00   2.0
2018-01-01 03:00:00   6.0
2018-01-01 04:00:00   5.5
...                   ...
2018-01-07 20:00:00   4.0
2018-01-07 21:00:00   7.0
2018-01-07 22:00:00   5.0
2018-01-07 23:00:00   NaN
2018-01-08 00:00:00   2.0

I know I can loop through the days and calculate it like this:

i = 0
b = 0
for day_ in np.unique(df.index.date):
    # select all rows belonging to this day
    a = df[df.index.date == day_]
    print(day_)
    print(b, len(a))

    # row-wise mean over this day's columns
    daily_mean_df.iloc[b:b+len(a), 0] = np.mean(a[~np.isnan(a[day[i]])], axis=1)
    i += 1
    b += len(a)

However, for a large dataset this method takes a very long time. Please suggest a more efficient way to do this calculation.

Upvotes: 1

Views: 60

Answers (2)

jezrael

Reputation: 863146

The idea is to create a boolean mask with the same index as the first DataFrame, using get_dummies with max per duplicated columns, and then align it to the hourly index with DataFrame.reindex and method='ffill':

day = [['data1', 'data2'],
       ['data1', 'data2'],
       ['data2', 'data3'],
       ['data1', 'data3'],
       ['data2', 'data3'],
       ['data1', 'data2'],
       ['data2', 'data3'],
       ['data1', 'data3']]

df1 = pd.DataFrame(day, index=df.index.floor('d').unique())
df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=bool).max(axis=1, level=0)
df1 = df1.reindex(df.index, method='ffill')
print (df1.head())
                     data1  data2  data3
2018-01-01 00:00:00   True   True  False
2018-01-01 01:00:00   True   True  False
2018-01-01 02:00:00   True   True  False
2018-01-01 03:00:00   True   True  False
2018-01-01 04:00:00   True   True  False

So you can replace the non-matched values with NaN by DataFrame.where and calculate the mean:

df = df.where(df1).mean(axis=1).to_frame('mean')
print (df)
                     mean
2018-01-01 00:00:00   6.0
2018-01-01 01:00:00   NaN
2018-01-01 02:00:00   4.0
2018-01-01 03:00:00   4.0
2018-01-01 04:00:00   4.5
                  ...
2018-01-07 20:00:00   NaN
2018-01-07 21:00:00   6.0
2018-01-07 22:00:00   5.0
2018-01-07 23:00:00   3.0
2018-01-08 00:00:00   5.0

[169 rows x 1 columns]

Setup:

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 1]),
                             columns=['data1'],
                             index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan

Verifying solution:

i = 0
b = 0
for day_ in np.unique(df.index.date):
    a = df[df.index.date == day_]
#    print(day_)
#    print(b, len(a))

    daily_mean_df.iloc[b:b+len(a), 0] = np.mean(a[~np.isnan(a[day[i]])], axis = 1)
    i += 1
    b += len(a)

print (daily_mean_df)
                     data1
2018-01-01 00:00:00    6.0
2018-01-01 01:00:00    NaN
2018-01-01 02:00:00    4.0
2018-01-01 03:00:00    4.0
2018-01-01 04:00:00    4.5
                   ...
2018-01-07 20:00:00    NaN
2018-01-07 21:00:00    6.0
2018-01-07 22:00:00    5.0
2018-01-07 23:00:00    3.0
2018-01-08 00:00:00    5.0

[169 rows x 1 columns]
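On newer pandas versions, DataFrame.max no longer accepts a level= argument, so max(axis=1, level=0) from the answer raises an error. A self-contained sketch of the same idea, with the level-wise max replaced by a transposed groupby (an assumption about a suitable modern equivalent, not the original answer's code):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)).astype(float),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)
df[np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)] = np.nan

day = [['data1', 'data2'], ['data1', 'data2'], ['data2', 'data3'],
       ['data1', 'data3'], ['data2', 'data3'], ['data1', 'data2'],
       ['data2', 'data3'], ['data1', 'data3']]

# one row of wanted column names per calendar day
df1 = pd.DataFrame(day, index=df.index.floor('d').unique())

# dummy-encode, then collapse the duplicated column labels with a
# transposed groupby (replacement for the removed max(axis=1, level=0))
df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=bool)
df1 = df1.T.groupby(level=0).max().T

# repeat each day's row for every hour of that day
df1 = df1.reindex(df.index, method='ffill')

out = df.where(df1).mean(axis=1).to_frame('mean')
print(out.head())
```

The transposed groupby keeps the boolean mask per column group, so the rest of the answer (where plus mean) is unchanged.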

Upvotes: 1

stellasia

Reputation: 5622

You can extract the day from a datetime index like this:

df["day"] = df.index.floor('D')

And then group by this new column and compute the average per group with:

df.groupby("day").mean()

Result:

               data1     data2     data3
day                                     
2018-01-01  4.428571  3.714286  5.454545
2018-01-02  5.066667  5.444444  5.222222
2018-01-03  4.850000  3.705882  4.812500
2018-01-04  4.555556  4.818182  3.764706
2018-01-05  4.000000  4.058824  3.818182
2018-01-06  6.058824  4.588235  4.866667
2018-01-07  4.176471  4.071429  5.055556
2018-01-08  4.400000  4.400000  4.285714
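This gives one row per day, whereas the question's expected outcome keeps one row per hour. If that hourly shape is needed, groupby().transform('mean') broadcasts each day's mean back to every hour of that day; a sketch with random data, not part of the original answer:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0, 10, size=(len(date_rng), 3)).astype(float),
                  columns=['data1', 'data2', 'data3'],
                  index=date_rng)

# transform keeps the original hourly index and repeats
# each day's mean on every hour of that day
hourly = df.groupby(df.index.floor('D')).transform('mean')
print(hourly.shape)  # (169, 3)
```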

Note that, by default, pandas' mean skips NaN values; that can be changed with the skipna parameter.
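A tiny illustration of that parameter:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())              # 2.0 -- NaN skipped by default
print(s.mean(skipna=False))  # nan -- any NaN propagates
```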

Upvotes: 1
