Neil
Neil

Reputation: 8247

Aggregating hourly count in dplyr

I have following dataframe in R

Date                      ID      
01-01-2017 12:39:00       CDF
01-01-2017 01:39:00       WED
01-01-2017 02:39:00       QWE
01-01-2017 05:39:00       TYU
01-01-2017 17:39:00       ERT
02-01-2017 02:30:34       DEF   

I want to calculate hourly count of ID. My desired dataframe would be

Date           hours               Count
01-01-2017     00:00 - 01:00       1
01-01-2017     01:00 - 02:00       1
01-01-2017     02:00 - 03:00       1
01-01-2017     03:00 - 04:00       0
01-01-2017     04:00 - 05:00       0
01-01-2017     05:00 - 06:00       1
.
01-01-2017     23:00 - 00:00       0 
.
02-01-2017     12:00 - 01:00       0 
02-01-2017     01:00 - 02:00       0
02-01-2017     02:00 - 03:00       1

Where there is no id exists,I want hourly bucket to be zero. Every date will contain 24 hours movement.

How can I achieve this in R?

Upvotes: 1

Views: 1353

Answers (2)

konvas
konvas

Reputation: 14336

tidyverse provides some functions which are useful here, like count/tally and complete

library(tidyverse)
library(lubridate)

dat <- read_csv('Date, ID      
  01-01-2017 12:39:00, CDF
  01-01-2017 01:39:00, WED
  01-01-2017 02:39:00, QWE
  01-01-2017 05:39:00, TYU
  01-01-2017 17:39:00, ERT
  02-01-2017 02:30:34, DEF'
) 

dat %>% 
   mutate(
       Date = dmy_hms(Date),
       day = floor_date(Date, 'day'), 
       hour = hour(Date)
   ) %>%
   group_by(day, hour) %>%
   tally %>%
   complete(day, hour = 0:23, fill = list('n' = 0))


## A tibble: 48 x 3
## Groups:   day [2]
#          day  hour     n
#       <dttm> <int> <dbl>
# 1 2017-01-01     0     0
# 2 2017-01-01     1     1
# 3 2017-01-01     2     1
# 4 2017-01-01     3     0
# 5 2017-01-01     4     0
# 6 2017-01-01     5     1
# 7 2017-01-01     6     0
# 8 2017-01-01     7     0
# 9 2017-01-01     8     0
#10 2017-01-01     9     0
## ... with 38 more rows

Upvotes: 1

Daniel
Daniel

Reputation: 2229

here is one approach using lubridate and base R.

In your provided data set, your first observation is 01-01-2017 12:39:00, but in your desired output there is a count for 00:00 - 01:00. In the code below, 12:39:00 would be regarded as 12:39 P.M, therefore I will assume you mean 00:39:00. Let me know if that is not the case

library(lubridate)
# the data
txt <- "Date,ID      
01-01-2017 00:39:00,CDF
01-01-2017 01:39:00,WED
01-01-2017 02:39:00,QWE
01-01-2017 05:39:00,TYU
01-01-2017 17:39:00,ERT
02-01-2017 02:30:34,DEF"

df <- read.table(text = txt,sep = ",", header = TRUE)
# transforming the date strings into dates
dates <- as.POSIXct(strptime(df$Date, "%d-%m-%Y %H:%M:%S"))
# creating an hourly time sequence from start to end
total_time <- seq(from = floor_date(min(dates), "hour"), to = 
ceiling_date(max(dates), "hour"), by = "hour")

# in case there is more than one occurrence per interval  
count <-  sapply(total_time, function(x) {       
          sum(floor_date(dates,"hour") %in% x) })

data.frame(Date = strftime(total_time, format = "%d-%m-%Y"),
           hours = paste(strftime(total_time, format = "%H:%M"), 
                    strftime(total_time + 60*60, format="%H:%M"),         
                    sep = " - "),
           Count = count)

#          Date         hours Count
# 1  01-01-2017 00:00 - 01:00     1
# 2  01-01-2017 01:00 - 02:00     1
# 3  01-01-2017 02:00 - 03:00     1
# 4  01-01-2017 03:00 - 04:00     0
# 5  01-01-2017 04:00 - 05:00     0
# 6  01-01-2017 05:00 - 06:00     1
# 7  01-01-2017 06:00 - 07:00     0

Upvotes: 1

Related Questions