Reputation: 193
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column of consecutive_hours such that if the value in a particular timestamp is less than 1000, we give corresponding value in "consecutive-hours" of "3:00" hours and find consecutive such occurrence like 6:00 9:00 as above.
Lastly, I want to summarize the table counting consecutive hours occurrence and number of days such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc. as mentioned in:How to groupby consecutive values in pandas DataFrame
and more, spent several days but no luck yet.
I would highly appreciate help on this issue. Thanks!
Upvotes: 1
Views: 1084
Reputation: 120429
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset
function is adapted from this answer of @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
Upvotes: 2