saramarah

Reputation: 23

How to group a DataFrame with a DatetimeIndex by hour only

I've got a DataFrame called new_dh of web requests that looks like this (there are more columns):

                    s-sitename  sc-win32-status
date_time                                      
2006-11-01 00:00:00     W3SVC1              0.0
2006-11-01 00:00:00     W3SVC1              0.0
2006-11-01 01:00:00     W3SVC1              0.0
2006-11-01 01:00:00     W3SVC1              0.0
2006-11-01 02:00:00     W3SVC1              0.0
2007-02-28 02:00:00     W3SVC1              0.0
2007-02-28 10:00:00     W3SVC1              0.0
2007-02-28 23:00:00     W3SVC1              0.0
2007-02-28 23:00:00     W3SVC1              0.0
2007-02-28 23:00:00     W3SVC1              0.0

What I would like to do is group by the hour of the DatetimeIndex (the actual date of the request does not matter, just the hour, and all the times have already been rounded down so they do not include minutes) and instead return

      count
hour       
0         2
1         2
2         2
10        1
23        3

Any help would be much appreciated.

I have tried

new_dh.groupby([new_dh.index.hour]).count()

but I end up with many columns that all hold the same values, whereas I only want the output shown above.

Upvotes: 1

Views: 52

Answers (3)

PieCot

Reputation: 3629

new_dh['hour'] = new_dh.index.map(lambda x: x.hour)
new_dh.groupby('hour')['hour'].count()

Result

hour
0     2
1     2
2     2
10    1
23    3
Name: hour, dtype: int64

If you need a DataFrame as the result:

new_dh.groupby('hour')['hour'].count().rename('count').to_frame()

In this case, the result will be:

      count
hour       
0         2
1         2
2         2
10        1
23        3
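As a variation on the above, here is a minimal, self-contained sketch that gets the same table without adding an 'hour' column to new_dh. The sample frame is rebuilt from the rows shown in the question (only the two visible columns), and the names hours and counts are just illustrative. It uses size(), which counts rows per group regardless of missing values.

```python
import pandas as pd

# Sample data rebuilt from the question (only the two columns shown there).
stamps = [
    "2006-11-01 00:00:00", "2006-11-01 00:00:00",
    "2006-11-01 01:00:00", "2006-11-01 01:00:00",
    "2006-11-01 02:00:00", "2007-02-28 02:00:00",
    "2007-02-28 10:00:00", "2007-02-28 23:00:00",
    "2007-02-28 23:00:00", "2007-02-28 23:00:00",
]
new_dh = pd.DataFrame(
    {"s-sitename": "W3SVC1", "sc-win32-status": 0.0},
    index=pd.DatetimeIndex(stamps, name="date_time"),
)

# Build the grouping key on the fly instead of storing it as a column.
hours = new_dh.index.hour.rename("hour")

# size() counts rows per group; to_frame names the single result column.
counts = new_dh.groupby(hours).size().to_frame("count")
print(counts)
```

This leaves new_dh unchanged, which may matter if the frame is reused later.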

Upvotes: 1

Anurag Dabas

Reputation: 24314

You can also do this with the groupby() and assign() methods:

If 'date_time' column is not your index:

result=df.assign(hour=df['date_time'].dt.hour).groupby('hour').agg(count=('s-sitename','count'))

If it is your index, then use:

result=df.groupby(df.index.hour)['s-sitename'].count().to_frame('count')
result.index.name='hour'

Now if you print result, you will get your desired output:

      count
hour    
0       2
1       2
2       2
10      1
23      3

Upvotes: 0

jezrael

Reputation: 862511

If you need a DatetimeIndex in the output (one row per date and hour rather than per hour of day), use DataFrame.resample:

new_dh.resample('H')['s-sitename'].count()

Or group by DatetimeIndex.floor:

new_dh.groupby(new_dh.index.floor('H'))['s-sitename'].count()
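To make the difference between flooring to the hour and taking the hour of day concrete, here is a small sketch on a three-row frame (the name demo and its rows are made up for illustration, taken from timestamps in the question). It uses the lowercase "h" alias, which recent pandas releases prefer over "H". Note that resample would additionally produce a row for every empty hour between the first and last timestamp.

```python
import pandas as pd

# Three illustrative timestamps: hour 2 occurs on two different dates.
idx = pd.DatetimeIndex(
    ["2006-11-01 02:00:00", "2007-02-28 02:00:00", "2007-02-28 23:00:00"],
    name="date_time",
)
demo = pd.DataFrame({"s-sitename": "W3SVC1"}, index=idx)

# Flooring keeps the date, so each distinct date-and-hour is its own group.
per_timestamp = demo.groupby(demo.index.floor("h"))["s-sitename"].count()

# The .hour attribute drops the date, so both 02:00 rows share one group.
per_hour_of_day = demo.groupby(demo.index.hour)["s-sitename"].count()

print(per_timestamp)    # three groups, one per distinct date and hour
print(per_hour_of_day)  # two groups, for hours 2 and 23
```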

The problem with your solution is that GroupBy.count counts the non-missing values of every column per hour, so if nothing is missing you get several columns holding the same values. One solution is to select a column after groupby:

new_dh.groupby([new_dh.index.hour])['s-sitename'].count()

The data below was changed to show how count excludes missing values:

print (new_dh)
                    s-sitename  sc-win32-status
date_time                                      
2006-11-01 00:00:00     W3SVC1              0.0
2006-11-01 00:00:00     W3SVC1              0.0
2006-11-01 01:00:00     W3SVC1              0.0
2006-11-01 01:00:00     W3SVC1              0.0
2006-11-01 02:00:00        NaN              0.0
2007-02-28 02:00:00     W3SVC1              0.0
2007-02-28 10:00:00     W3SVC1              0.0
2007-02-28 23:00:00        NaN              0.0
2007-02-28 23:00:00        NaN              0.0
2007-02-28 23:00:00     W3SVC1              0.0


df = new_dh.groupby([new_dh.index.hour]).count()

print (df)
           s-sitename  sc-win32-status
date_time                             
0                   2                2
1                   2                2
2                   1                2
10                  1                1
23                  1                3

If a column is specified:

s = new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
print (s)
date_time
0     2
1     2
2     1
10    1
23    1
Name: s-sitename, dtype: int64

df = new_dh.groupby([new_dh.index.hour])['s-sitename'].count().to_frame()

print (df)
           s-sitename
date_time            
0                   2
1                   2
2                   1
10                  1
23                  1

If you also need to count missing values, use GroupBy.size:

s = new_dh.groupby([new_dh.index.hour])['s-sitename'].size()
print (s)
date_time
0     2
1     2
2     2
10    1
23    3
Name: s-sitename, dtype: int64

df = new_dh.groupby([new_dh.index.hour])['s-sitename'].size().to_frame()

print (df)
           s-sitename
date_time            
0                   2
1                   2
2                   2
10                  1
23                  3
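For a side-by-side view of the two, here is a minimal sketch on a cut-down frame (the name demo, its four rows, and the column labels non_missing and rows are illustrative assumptions, not part of the data above). It puts count and size next to each other so the effect of the NaN values is visible per hour.

```python
import numpy as np
import pandas as pd

# Hour 23 has three rows, two of them missing; hour 10 has one complete row.
idx = pd.DatetimeIndex(
    ["2007-02-28 23:00:00"] * 3 + ["2007-02-28 10:00:00"], name="date_time"
)
demo = pd.DataFrame(
    {"s-sitename": [np.nan, np.nan, "W3SVC1", "W3SVC1"]}, index=idx
)

grouped = demo.groupby(demo.index.hour)["s-sitename"]

# count() skips NaN values, while size() counts every row in the group.
summary = pd.DataFrame({"non_missing": grouped.count(), "rows": grouped.size()})
summary.index.name = "hour"
print(summary)
```

The two columns differ only for hours that contain missing values, so comparing them is also a quick way to spot where data is missing.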

Upvotes: 3
