Reputation: 1114
I have a dataframe with a timestamp as the index and a column of labels:
from datetime import datetime
import pandas as pd
from pandas import DataFrame
df=DataFrame({'time':[ datetime(2015,11,2,4,41,10), datetime(2015,11,2,4,41,39), datetime(2015,11,2,4,41,47),
datetime(2015,11,2,4,41,59), datetime(2015,11,2,4,42,4), datetime(2015,11,2,4,42,11),
datetime(2015,11,2,4,42,15), datetime(2015,11,2,4,42,30), datetime(2015,11,2,4,42,39),
datetime(2015,11,2,4,42,41),datetime(2015,11,2,5,2,9),datetime(2015,11,2, 5,2,10),
datetime(2015,11,2,5,2,16),datetime(2015,11,2,5,2,29),datetime(2015,11,2, 5,2,51),
datetime(2015,11,2,5,9,1),datetime(2015,11,2,5,9,21),datetime(2015,11,2,5,9,31),
datetime(2015,11,2,5,9,40),datetime(2015,11,2,5,9,55)],
'Label':[2,0,0,0,1,0,0,1,1,1,1,3,0,0,3,0,1,0,1,1]}).set_index(['time'])
I want to get the average number of times that a label appears in a distinct minute within a distinct hour.
For example, Label 0 appears 3 times in hour 4 in minute 41, 2 times in hour 4 in minute 42, 2 times in hour 5 in minute 2, and 2 times in hour 5 in minute 9, so its average count per minute in hour 4 is
(2+3)/2=2.5
and its average count per minute in hour 5 is
(2+2)/2=2
The output I am looking for is
Hour 1
Label avg
0 2.5
1 2
2 .5
3 0
Hour 2
Label avg
0 2
1 1.5
2 0
3 1
What I have so far is
df['hour']=df.index.hour
hour_grp=df.groupby(['hour'], as_index=False)
then I can do something like
res=[]
for key, value in hour_grp:
res.append(value)
then group by minute
res[0].groupby(pd.Grouper(freq='1Min'))['Label'].value_counts()  # pd.TimeGrouper was removed in later pandas versions
but this is where I'm stuck, not to mention it is not very efficient
Upvotes: 1
Views: 1144
Reputation: 29719
Access the minute of the DatetimeIndex:
mn = df.index.minute
Access the hour of the DatetimeIndex:
hr = df.index.hour
Perform a groupby using the two variables obtained above as keys. Compute value_counts of the contents of Label and unstack, filling missing values with 0. Finally, average over the index level that contains the hour values.
df.groupby([mn,hr])['Label'].value_counts().unstack(fill_value=0).groupby(level=1).mean()
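For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question and assumes a reasonably recent pandas. The names `stamps`, `hour`, `minute`, `per_minute`, and `avg` are only illustrative, and the grouping keys are renamed so that the hour level can be selected by name rather than by position.

```python
from datetime import datetime

import pandas as pd

# Sample data from the question, written as (hour, minute, second) on 2015-11-02.
stamps = [(4, 41, 10), (4, 41, 39), (4, 41, 47), (4, 41, 59), (4, 42, 4),
          (4, 42, 11), (4, 42, 15), (4, 42, 30), (4, 42, 39), (4, 42, 41),
          (5, 2, 9), (5, 2, 10), (5, 2, 16), (5, 2, 29), (5, 2, 51),
          (5, 9, 1), (5, 9, 21), (5, 9, 31), (5, 9, 40), (5, 9, 55)]
labels = [2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 3, 0, 0, 3, 0, 1, 0, 1, 1]

index = pd.DatetimeIndex(
    [datetime(2015, 11, 2, h, m, s) for h, m, s in stamps], name='time')
df = pd.DataFrame({'Label': labels}, index=index)

# Grouping keys taken from the index; the names become index level names.
hour = df.index.hour.rename('hour')
minute = df.index.minute.rename('minute')

# One row per (hour, minute), one column per label, 0 where a label is absent.
per_minute = (df.groupby([hour, minute])['Label']
                .value_counts()
                .unstack(fill_value=0))

# Average the per-minute counts within each hour.
avg = per_minute.groupby(level='hour').mean()
print(avg)
```

Because absent labels are filled with 0 before averaging, each label's count is divided by the number of minutes that contain any data in that hour, which is how the question arrives at 0.5 for label 2 in hour 4.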
Upvotes: 0
Reputation: 13923
Start by squeezing your DataFrame into a Series (after all, it only has one column):
s = df.squeeze()
Compute how many times each label occurs by minute:
counts_by_min = (s.resample('min')
                  .apply(lambda x: x.value_counts())
                  .unstack()
                  .fillna(0))
#                        0    1    2    3
# time
# 2015-11-02 04:41:00  3.0  0.0  1.0  0.0
# 2015-11-02 04:42:00  2.0  4.0  0.0  0.0
# 2015-11-02 05:02:00  2.0  1.0  0.0  2.0
# 2015-11-02 05:09:00  2.0  3.0  0.0  0.0
Resample counts_by_min by hour to obtain the number of times each label occurs by hour:
counts_by_hour = counts_by_min.resample('H').sum()
#                        0    1    2    3
# time
# 2015-11-02 04:00:00  5.0  4.0  1.0  0.0
# 2015-11-02 05:00:00  4.0  4.0  0.0  2.0
Count, for each hour, the number of minutes in which each label occurs:
minutes_by_hour = counts_by_min.astype(bool).resample('H').sum()
#                        0    1    2    3
# time
# 2015-11-02 04:00:00  2.0  1.0  1.0  0.0
# 2015-11-02 05:00:00  2.0  2.0  0.0  1.0
Divide the last two to get the result you want:
avg_per_hour = counts_by_hour.div(minutes_by_hour).fillna(0)
#                        0    1    2    3
# time
# 2015-11-02 04:00:00  2.5  4.0  1.0  0.0
# 2015-11-02 05:00:00  2.0  2.0  0.0  2.0
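Note that this divides by the number of minutes in which a label actually appears, whereas the question's expected output (0.5 for label 2 in the first hour) seems to divide by every minute that has data in that hour. The sketch below computes both variants side by side so you can pick the one you mean. It uses groupby instead of resample purely for illustration, and the names `stamps`, `minutes_present`, `minutes_observed`, `avg_present`, and `avg_observed` are my own, not from the posts above.

```python
from datetime import datetime

import pandas as pd

# Sample data from the question, written as (hour, minute, second) on 2015-11-02.
stamps = [(4, 41, 10), (4, 41, 39), (4, 41, 47), (4, 41, 59), (4, 42, 4),
          (4, 42, 11), (4, 42, 15), (4, 42, 30), (4, 42, 39), (4, 42, 41),
          (5, 2, 9), (5, 2, 10), (5, 2, 16), (5, 2, 29), (5, 2, 51),
          (5, 9, 1), (5, 9, 21), (5, 9, 31), (5, 9, 40), (5, 9, 55)]
labels = [2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 3, 0, 0, 3, 0, 1, 0, 1, 1]

index = pd.DatetimeIndex(
    [datetime(2015, 11, 2, h, m, s) for h, m, s in stamps], name='time')
df = pd.DataFrame({'Label': labels}, index=index)

# Counts per label for each minute that has at least one row.
counts_by_min = (df.groupby(df.index.floor('min'))['Label']
                   .value_counts()
                   .unstack(fill_value=0))
hour = counts_by_min.index.hour

# Total occurrences of each label per hour.
total = counts_by_min.groupby(hour).sum()

# Denominator A: minutes in which the label occurs at least once.
minutes_present = (counts_by_min > 0).groupby(hour).sum()
avg_present = total.div(minutes_present).fillna(0)

# Denominator B: all minutes with any data in that hour.
minutes_observed = counts_by_min.groupby(hour).size()
avg_observed = total.div(minutes_observed, axis=0)

print(avg_present)
print(avg_observed)
```

With the sample data, `avg_present` should reproduce the table at the end of this answer, while `avg_observed` follows the arithmetic spelled out in the question.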
Upvotes: 1