Matthew

Reputation: 1289

Group by hour of day and factor in R

I have a dataset with journey information (departure time, departure station, etc.), and I am trying to group it by departure hour and station.

I've created a subset of the data with only hour of departure and departure station as follows:

> head(d)
  hour  id
1    8 101
2   18 101
3   17 102
4   18 101
5   13 101
6    9 101

In the subset above, there are only two stations (id): 101 and 102.

I am trying to aggregate this data to get the total count of departures per hour, and per station. Something like:

  hour  id  count
1    0 101  10
2    1 101   0
3    2 101   3
...
25   0 102   7
26   1 102   9

I can achieve something close to what I am looking for with:

aggregate(d$hour, by = list(d$hour, d$id), FUN = length)

   Group.1 Group.2  x
1        1     101  1
2        2     101  2
3        6     101  7
4        7     101 38

However, this output omits hour/station combinations that had no departures instead of reporting them with a count of zero. It is 41 rows long, but I would expect 48 (2 stations x 24 hours per station).

I have modified the hour field to be a factor:

d$hour <- factor(d$hour, 0:23)

so that the correct count is displayed when table is called:

> table(d$hour)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23 
  0   3   3   0   0   1  10  49  95  69  14  19  20  25  26  28  42 117  77  45  14   8   7   5 

I'm relatively new to R, and I'm guessing there may be a quick fix to this issue, but I've yet to find it. Please let me know if you need any more detail.

Any help would be greatly appreciated.

Upvotes: 1

Views: 204

Answers (2)

IRTFM

Reputation: 263362

Couldn't find a duplicate (though I'm pretty sure one exists). The xtabs function also counts the zero entries, and as.data.frame converts the resulting wide table to long format:

as.data.frame(xtabs(~ hour + id, data = d))
   hour  id Freq
1    13 101    1
2    17 101    0
3    18 101    2
4     8 101    1
5     9 101    1
6    13 102    0
7    17 102    1
8    18 102    0
9     8 102    0
10    9 102    0

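I suppose a small data object might not have every hour represented, as in the output above. Making hour a factor with levels 0:23, as in the question, gets xtabs to report all 24 hours for each station; alternatively, you could append a dummy set of zero-count hour entries if you wanted to be sure.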
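
As a minimal sketch of the xtabs approach with hour declared as a factor: the toy data below only mirrors head(d) from the question, and the column name count (set via responseName) is my own choice rather than anything from the original data.

```r
# Toy data mirroring head(d) from the question (illustrative values only)
d <- data.frame(hour = c(8, 18, 17, 18, 13, 9),
                id   = c(101, 101, 102, 101, 101, 101))

# Declare all 24 hours as levels so unobserved hours are still tabulated
d$hour <- factor(d$hour, levels = 0:23)

# xtabs() counts every hour x id cell, including empty ones;
# as.data.frame() reshapes the table to long format
counts <- as.data.frame(xtabs(~ hour + id, data = d), responseName = "count")

nrow(counts)   # 24 hours x 2 stations = 48 rows
head(counts)
```

With the full data from the question, the same two lines should give one row per hour and station, with zeros where there were no departures.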

Upvotes: 3

smanski

Reputation: 541

You can use

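library(dplyr)
library(tidyr)
d <- data.frame(hour = c('8', '18', '17', '18', '13', '9'),
                id = c('101', '101', '102', '101', '101', '101'))
# Declare all 24 hours as levels so complete() can add the unobserved ones
d$hour <- factor(d$hour, levels = 0:23)
# Count departures per hour and station, then fill missing combinations with 0
d %>%
  group_by(hour, id) %>%
  summarise(count = n()) %>%
  ungroup() %>%   # complete() cannot expand a column that is still a grouping column
  complete(hour, id, fill = list(count = 0L))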
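
A shorter variant, sketched under the assumption that dplyr 0.8.0 or later is installed (that is where the .drop argument was introduced): if both hour and id are factors, count() with .drop = FALSE keeps the empty combinations directly, without tidyr. The toy data and the column name count are illustrative only.

```r
library(dplyr)

# Toy data mirroring head(d) from the question (illustrative values only)
d <- data.frame(hour = c(8, 18, 17, 18, 13, 9),
                id   = c(101, 101, 102, 101, 101, 101))

# Both columns must be factors for .drop = FALSE to keep empty cells
d$hour <- factor(d$hour, levels = 0:23)
d$id   <- factor(d$id)

# One row per hour/station combination, with 0 where nothing departed
counts <- d %>% count(hour, id, name = "count", .drop = FALSE)

nrow(counts)
```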

I hope this helps!

Upvotes: 2
