Generating the data
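import numpy as np
import pandas as pd
# Seed NumPy's generator (random.seed would not affect np.random)
np.random.seed(42)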
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
columns=['data1', 'data2', 'data3'],
index= date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 3]),
columns=['data1', 'data2', 'data3'],
index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
# Data column to calculate each day
day = [['data1', 'data2'],
['data1', 'data2'],
['data2', 'data3'],
['data1', 'data3'],
['data2', 'data3'],
['data1', 'data2'],
['data2', 'data3'],
['data1', 'data3']]
I want to select the data for each day and, for each day, calculate the row-wise mean over a subset of columns (different for each day), given by the day variable.
Expected outcome:
mean
2018-01-01 00:00:00 1.0
2018-01-01 01:00:00 5.0
2018-01-01 02:00:00 2.0
2018-01-01 03:00:00 6.0
2018-01-01 04:00:00 5.5
... ...
2018-01-07 20:00:00 4.0
2018-01-07 21:00:00 7.0
2018-01-07 22:00:00 5.0
2018-01-07 23:00:00 NaN
2018-01-08 00:00:00 2.0
I know I can loop over the days and calculate it like this:
i = 0
b = 0
for day_ in np.unique(df.index.date):
    # Rows belonging to the current day
    a = df[df.index.date == day_]
    print(day_)
    print(b, len(a))
    # Keep only the columns listed for this day, then average per row
    daily_mean_df.iloc[b:b+len(a), 0] = np.mean(a[~np.isnan(a[day[i]])], axis=1)
    i += 1
    b += len(a)
However, for a large dataset this method takes a very long time. Please suggest an efficient way to do this calculation.
Upvotes: 1
Reputation: 863146
The idea is to build a boolean mask with the same index as the first DataFrame, using get_dummies and a row-wise max to get one row per day, then DataFrame.reindex with method='ffill' to expand it to the hourly index:
day = [['data1', 'data2'],
['data1', 'data2'],
['data2', 'data3'],
['data1', 'data3'],
['data2', 'data3'],
['data1', 'data2'],
['data2', 'data3'],
['data1', 'data3']]
df1 = pd.DataFrame(day, index=df.index.floor('d').unique())
df1 = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=bool).max(axis=1, level=0)
df1 = df1.reindex(df.index, method='ffill')
print (df1.head())
data1 data2 data3
2018-01-01 00:00:00 True True False
2018-01-01 01:00:00 True True False
2018-01-01 02:00:00 True True False
2018-01-01 03:00:00 True True False
2018-01-01 04:00:00 True True False
You can then replace the non-matching values with NaNs using DataFrame.where and calculate the row-wise mean:
df = df.where(df1).mean(axis=1).to_frame('mean')
print (df)
mean
2018-01-01 00:00:00 6.0
2018-01-01 01:00:00 NaN
2018-01-01 02:00:00 4.0
2018-01-01 03:00:00 4.0
2018-01-01 04:00:00 4.5
...
2018-01-07 20:00:00 NaN
2018-01-07 21:00:00 6.0
2018-01-07 22:00:00 5.0
2018-01-07 23:00:00 3.0
2018-01-08 00:00:00 5.0
[169 rows x 1 columns]
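As far as I know, the level argument of DataFrame.max was removed in pandas 2.0, so the get_dummies line above may fail on recent versions. Here is a minimal sketch of the same mask idea that avoids it by building the per-day mask directly from the day list. The function name daily_subset_mean and the tiny sample data are made up for illustration, and it assumes a sorted DatetimeIndex with one entry in day per calendar day:

```python
import numpy as np
import pandas as pd


def daily_subset_mean(df, day):
    """Row-wise mean using a different column subset for each calendar day.

    Hypothetical helper: assumes df has a sorted DatetimeIndex and that
    `day` holds one list of column names per calendar day, in order.
    """
    days = df.index.normalize().unique()  # midnight timestamp of each day
    if len(days) != len(day):
        raise ValueError("need one column list per calendar day")
    # One boolean row per day: True where the column is selected that day
    daily_mask = pd.DataFrame(
        [[col in cols for col in df.columns] for cols in day],
        index=days, columns=df.columns)
    # Forward-fill each day's row onto every timestamp of that day
    mask = daily_mask.reindex(df.index, method='ffill')
    # Unselected cells become NaN and are skipped by mean
    return df.where(mask).mean(axis=1).to_frame('mean')


# Tiny made-up example: two days, two rows per day
idx = pd.date_range('2018-01-01', periods=4, freq=pd.Timedelta(hours=12))
sample = pd.DataFrame({'data1': [1.0, np.nan, 3.0, 4.0],
                       'data2': [3.0, np.nan, 5.0, 6.0],
                       'data3': [10.0, 20.0, np.nan, 8.0]}, index=idx)
sample_day = [['data1', 'data2'], ['data2', 'data3']]
out = daily_subset_mean(sample, sample_day)
print(out)
```

With the question's data this would be called as daily_subset_mean(df, day). A row whose selected columns are all NaN comes out as NaN, as in the loop version.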
Setup:
np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
columns=['data1', 'data2', 'data3'],
index= date_rng)
daily_mean_df = pd.DataFrame(np.zeros([len(date_rng), 1]),
columns=['data1'],
index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan
Verifying solution:
i = 0
b = 0
for day_ in np.unique(df.index.date):
    a = df[df.index.date == day_]
    # print(day_)
    # print(b, len(a))
    daily_mean_df.iloc[b:b+len(a), 0] = np.mean(a[~np.isnan(a[day[i]])], axis=1)
    i += 1
    b += len(a)
print (daily_mean_df)
data1
2018-01-01 00:00:00 6.0
2018-01-01 01:00:00 NaN
2018-01-01 02:00:00 4.0
2018-01-01 03:00:00 4.0
2018-01-01 04:00:00 4.5
...
2018-01-07 20:00:00 NaN
2018-01-07 21:00:00 6.0
2018-01-07 22:00:00 5.0
2018-01-07 23:00:00 3.0
2018-01-08 00:00:00 5.0
[169 rows x 1 columns]
Upvotes: 1
Reputation: 5622
You can extract the day from a datetime index like this:
df["day"] = df.index.floor('D')
And then group by this new column and compute the average per group with:
df.groupby("day").mean()
Result:
data1 data2 data3
day
2018-01-01 4.428571 3.714286 5.454545
2018-01-02 5.066667 5.444444 5.222222
2018-01-03 4.850000 3.705882 4.812500
2018-01-04 4.555556 4.818182 3.764706
2018-01-05 4.000000 4.058824 3.818182
2018-01-06 6.058824 4.588235 4.866667
2018-01-07 4.176471 4.071429 5.055556
2018-01-08 4.400000 4.400000 4.285714
Note that pandas' mean skips NaN values by default; for DataFrame.mean this can be changed with the skipna parameter.
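To illustrate the NaN handling, here is a small sketch with made-up values (the frame and variable names are only for illustration) comparing the default row-wise mean with skipna=False:

```python
import numpy as np
import pandas as pd

# Made-up two-row frame; the second row has one missing value
demo = pd.DataFrame({'data1': [1.0, np.nan],
                     'data2': [3.0, 5.0]})

# Default: NaN values are ignored, so row 1 averages only the 5.0
default_mean = demo.mean(axis=1)

# skipna=False: any NaN in a row makes that row's mean NaN
strict_mean = demo.mean(axis=1, skipna=False)

print(default_mean)
print(strict_mean)
```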
Upvotes: 1