mikeL

Reputation: 1114

Get average counts per minute by hour

I have a DataFrame with a timestamp as the index and a column of labels:

from datetime import datetime
import pandas as pd
from pandas import DataFrame

df = DataFrame({'time': [
    datetime(2015,11,2,4,41,10), datetime(2015,11,2,4,41,39), datetime(2015,11,2,4,41,47),
    datetime(2015,11,2,4,41,59), datetime(2015,11,2,4,42,4), datetime(2015,11,2,4,42,11),
    datetime(2015,11,2,4,42,15), datetime(2015,11,2,4,42,30), datetime(2015,11,2,4,42,39),
    datetime(2015,11,2,4,42,41), datetime(2015,11,2,5,2,9), datetime(2015,11,2,5,2,10),
    datetime(2015,11,2,5,2,16), datetime(2015,11,2,5,2,29), datetime(2015,11,2,5,2,51),
    datetime(2015,11,2,5,9,1), datetime(2015,11,2,5,9,21), datetime(2015,11,2,5,9,31),
    datetime(2015,11,2,5,9,40), datetime(2015,11,2,5,9,55)],
    'Label': [2,0,0,0,1,0,0,1,1,1,1,3,0,0,3,0,1,0,1,1]}).set_index(['time'])

I want to get the average number of times that a label appears in a distinct minute of a distinct hour.

For example, Label 0 appears 3 times in minute 41 of hour 4, 2 times in minute 42 of hour 4,
2 times in minute 2 of hour 5, and 2 times in minute 9 of hour 5, so its average count per minute in hour 4 is

(2+3)/2=2.5 

and its average count per minute in hour 5 is

(2+2)/2=2

The output I am looking for is

Hour 1
Label  avg
0      2.5
1      2
2       .5
3       0


Hour 2
Label  avg
0      2
1      1.5
2      0
3      1

What I have so far is

df['hour']=df.index.hour

hour_grp=df.groupby(['hour'], as_index=False)

then I can do something like

res=[]
for key, value in hour_grp:
    res.append(value)

then group by minute

res[0].groupby(pd.Grouper(freq='1Min'))['Label'].value_counts()

but this is where I'm stuck, and the approach is not very efficient either.

Upvotes: 1

Views: 1144

Answers (2)

Nickil Maveli

Reputation: 29719

Access the minute of the DatetimeIndex:

mn = df.index.minute

Access the hour of the DatetimeIndex:

hr = df.index.hour

Group by these two arrays as keys. Compute the value_counts of Label within each group, then unstack, filling missing values with 0. Finally, average the counts across the index level that holds the hour values.

df.groupby([mn,hr])['Label'].value_counts().unstack(fill_value=0).groupby(level=1).mean()
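
As a self-contained check, the sketch below rebuilds the question's data and applies the same grouping. It assumes a reasonably recent pandas and uses groupby(level=1).mean() rather than mean(level=1), since newer pandas releases no longer accept the level argument on mean. The names times, labels, counts and avg, and the renaming of the keys, are introduced here for illustration only.

```python
import pandas as pd

# Sample data from the question: 20 timestamps on 2015-11-02, one label each.
times = pd.to_datetime([
    "2015-11-02 04:41:10", "2015-11-02 04:41:39", "2015-11-02 04:41:47",
    "2015-11-02 04:41:59", "2015-11-02 04:42:04", "2015-11-02 04:42:11",
    "2015-11-02 04:42:15", "2015-11-02 04:42:30", "2015-11-02 04:42:39",
    "2015-11-02 04:42:41", "2015-11-02 05:02:09", "2015-11-02 05:02:10",
    "2015-11-02 05:02:16", "2015-11-02 05:02:29", "2015-11-02 05:02:51",
    "2015-11-02 05:09:01", "2015-11-02 05:09:21", "2015-11-02 05:09:31",
    "2015-11-02 05:09:40", "2015-11-02 05:09:55",
])
labels = [2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 3, 0, 0, 3, 0, 1, 0, 1, 1]
df = pd.DataFrame({"Label": labels}, index=pd.Index(times, name="time"))

# Grouping keys; renamed (an illustrative choice) so the levels are distinguishable.
mn = df.index.minute.rename("minute")
hr = df.index.hour.rename("hour")

# Count each label per (minute, hour) pair, with 0 where a label is absent.
counts = df.groupby([mn, hr])["Label"].value_counts().unstack(fill_value=0)

# Average over the minutes observed in each hour (level 1 holds the hour).
avg = counts.groupby(level=1).mean()
print(avg)
```

On the question's data this averages over every minute that has any rows, zeros included, so hour 4 comes out as 2.5, 2.0, 0.5, 0.0 and hour 5 as 2.0, 2.0, 0.0, 1.0 for labels 0 to 3.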


Upvotes: 0

Alicia Garcia-Raboso

Reputation: 13923

Start by squeezing your DataFrame into a Series (after all, it only has one column):

s = df.squeeze()

Compute how many times each label occurs by minute:

counts_by_min = (s.resample('min')
                  .apply(lambda x: x.value_counts())
                  .unstack()
                  .fillna(0))

#                        0    1    2    3
# time                                   
# 2015-11-02 04:41:00  3.0  0.0  1.0  0.0
# 2015-11-02 04:42:00  2.0  4.0  0.0  0.0
# 2015-11-02 05:02:00  2.0  1.0  0.0  2.0
# 2015-11-02 05:09:00  2.0  3.0  0.0  0.0

Resample counts_by_min by hour to obtain the number of times each label occurs by hour:

counts_by_hour = counts_by_min.resample('H').sum()

#                        0    1    2    3
# time                                   
# 2015-11-02 04:00:00  5.0  4.0  1.0  0.0
# 2015-11-02 05:00:00  4.0  4.0  0.0  2.0

Count the number of minutes in which each label occurs, by hour:

minutes_by_hour = counts_by_min.astype(bool).resample('H').sum()

#                        0    1    2    3
# time                                   
# 2015-11-02 04:00:00  2.0  1.0  1.0  0.0
# 2015-11-02 05:00:00  2.0  2.0  0.0  1.0
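
To see why astype(bool) followed by a sum counts minutes rather than occurrences, here is a small sketch on the two hour-4 rows of counts_by_min shown earlier. The names hour4, present and minutes_present are made up for this illustration.

```python
import pandas as pd

# Per-minute counts for hour 4, copied from the counts_by_min output above.
hour4 = pd.DataFrame(
    [[3.0, 0.0, 1.0, 0.0],
     [2.0, 4.0, 0.0, 0.0]],
    index=pd.to_datetime(["2015-11-02 04:41:00", "2015-11-02 04:42:00"]),
    columns=[0, 1, 2, 3],
)

# Nonzero counts become True, zeros become False.
present = hour4.astype(bool)

# Summing booleans column-wise counts the minutes in which each label occurred.
minutes_present = present.sum()
print(minutes_present.tolist())
```

The result matches the 04:00 row of minutes_by_hour: label 0 occurs in two minutes, labels 1 and 2 in one minute each, and label 3 in none.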

Divide the last two to get the result you want:

avg_per_hour = counts_by_hour.div(minutes_by_hour).fillna(0)

#                        0    1    2    3
# time                                   
# 2015-11-02 04:00:00  2.5  4.0  1.0  0.0
# 2015-11-02 05:00:00  2.0  2.0  0.0  2.0
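
The same four steps can also be sketched end to end with groupby on floored timestamps instead of resample. This is an alternative formulation under the assumption of a recent pandas, not the code above; the names minute, hours, total, active and avg_per_hour_alt are illustrative.

```python
import pandas as pd

# Sample data from the question: 20 timestamps on 2015-11-02, one label each.
times = pd.to_datetime([
    "2015-11-02 04:41:10", "2015-11-02 04:41:39", "2015-11-02 04:41:47",
    "2015-11-02 04:41:59", "2015-11-02 04:42:04", "2015-11-02 04:42:11",
    "2015-11-02 04:42:15", "2015-11-02 04:42:30", "2015-11-02 04:42:39",
    "2015-11-02 04:42:41", "2015-11-02 05:02:09", "2015-11-02 05:02:10",
    "2015-11-02 05:02:16", "2015-11-02 05:02:29", "2015-11-02 05:02:51",
    "2015-11-02 05:09:01", "2015-11-02 05:09:21", "2015-11-02 05:09:31",
    "2015-11-02 05:09:40", "2015-11-02 05:09:55",
])
labels = [2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 3, 0, 0, 3, 0, 1, 0, 1, 1]
df = pd.DataFrame({"Label": labels}, index=pd.Index(times, name="time"))

# Step 1: occurrences of each label per minute (only minutes that have data).
minute = df.index.floor("min")
counts_by_min = df.groupby([minute, "Label"]).size().unstack(fill_value=0)

# Steps 2 and 3: per hour, total occurrences and number of minutes with the label.
hours = counts_by_min.index.floor("h")
total = counts_by_min.groupby(hours).sum()
active = (counts_by_min > 0).groupby(hours).sum()

# Step 4: divide; a label absent from an hour gives 0/0 = NaN, replaced by 0.
avg_per_hour_alt = total.div(active).fillna(0)
print(avg_per_hour_alt)
```

Note that this approach divides by the number of minutes in which the label actually occurs, which is why label 1 averages 4.0 in hour 4 here, whereas averaging over all observed minutes of the hour (zeros included) would give 2.0.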

Upvotes: 1
