Reputation: 1289
I have a dataset with journey information (departure time, departure station, etc.), and I am trying to group it by departure hour and station.
I've created a subset of the data with only hour of departure and departure station as follows:
> head(d)
hour id
1 8 101
2 18 101
3 17 102
4 18 101
5 13 101
6 9 101
In the subset above, there are only two stations (id): 101 and 102.
I am trying to aggregate this data to get the total count of departures per hour, and per station. Something like:
hour id count
1 0 101 10
2 1 101 0
3 2 101 3
...
25 0 102 7
26 1 102 9
I can achieve something close to what I am looking for with:
aggregate(d$hour, by = list(d$hour, d$id), FUN = length)
Group.1 Group.2 x
1 1 101 1
2 2 101 2
3 6 101 7
4 7 101 38
However, the output above does not provide a zero value for hours which had no departures. For example, the output above is 41 rows long, but I would expect 48 (two stations and 24 hour observations per station).
I have modified the hour field to be a factor:
d$hour <- factor(d$hour, 0:23)
so that the correct count is displayed when table is called:
> table(d$hour)
  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23
  0   3   3   0   0   1  10  49  95  69  14  19  20  25  26  28  42 117  77  45  14   8   7   5
I'm relatively new to R, and I'm guessing there may be a quick fix to this issue, but I've yet to find it. Please let me know if you need any more detail.
Any help would be greatly appreciated.
Upvotes: 1
Views: 204
Reputation: 263362
I couldn't find a duplicate (though I'm pretty sure one exists). The xtabs function includes combinations with a zero count, and the as.data.frame function converts the wide table format to long format:
as.data.frame(xtabs( ~hour+id, data=d) )
hour id Freq
1 13 101 1
2 17 101 0
3 18 101 2
4 8 101 1
5 9 101 1
6 13 102 0
7 17 102 1
8 18 102 0
9 8 102 0
10 9 102 0
A small data object might not have every hour represented, but you could always append a dummy set of zero hour entries if you wanted to be sure.
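An alternative to appending dummy rows is to reuse the factor conversion from the question, so that xtabs knows about all 24 hours up front. A minimal sketch, assuming toy data that mirrors the head(d) output above (the values are illustrative, and the column name count is my own choice, set through responseName):

```r
# Toy data mirroring head(d) from the question (illustrative values only)
d <- data.frame(hour = c(8, 18, 17, 18, 13, 9),
                id   = c(101, 101, 102, 101, 101, 101))

# Declare all 24 hours as factor levels so unobserved hours are still tabulated
d$hour <- factor(d$hour, levels = 0:23)

# Cross-tabulate hour by station, then reshape the table to long format
counts <- as.data.frame(xtabs(~ hour + id, data = d), responseName = "count")

nrow(counts)  # 48: 24 hours x 2 stations
```

Stations are still taken from the values present in the data, so a station with no departures at all would need the same factor treatment on id.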
Upvotes: 3
Reputation: 541
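You can use dplyr and tidyr: count first, then let complete() fill in the missing hour/station combinations (for a factor column, complete() expands over all levels, not only the observed ones):
library(dplyr)
library(tidyr)
d <- data.frame(hour = c('8', '18', '17', '18', '13', '9'),
                id = c('101', '101', '102', '101', '101', '101'))
# Declare all 24 hours as levels so that empty hours can be filled in
d$hour <- factor(d$hour, levels = 0:23)
# count() returns an ungrouped result, so complete() can expand across both columns
d %>% count(hour, id, name = "count") %>%
  complete(hour, id, fill = list(count = 0L))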
I hope this helps!
Upvotes: 2