Reputation: 1000
I have been using the following dplyr code to generate hourly averages from 1-minute time-series data. The code has been working for months, but has recently started producing incorrect results. Has something changed with any of the following functions: group_by(), cut(), or summarise()?
df <- structure(list(date = structure(c(1505187300, 1505187360, 1505187420, 1505187480, 1505187540, 1505187600, 1505187660, 1505187720, 1505201580, 1505201640), class = c("POSIXct", "POSIXt"), tzone = "UTC"), co = c(0.149,0.149,0.149, 0.106, 0.149, 0.149, 0.192, 0.149, 0.149, 0.149), co2 = c(544L, 545L, 544L, 543L, 546L, 546L, 548L, 547L, 549L, 554L), VOC = c(22.55, 22.55, 22.8198, 23.2602, 22.9501, 23.2154, 23.4262, 23.0231, 23.0525, 22.7911), RH = c(77.02, 76.9, 77.2, 76.6, 76.99, 76.83, 77.13, 77.81, 77.48, 77.1), ugm3 = c(12.862, 13.408, 14.188, 12.342, 13.278, 12.81, 10.834, 13.018, 12.992, 12.498), temp = c(62.06, 62.02, 62.02, 61.98, 61.94, 61.9, 61.86, 61.78, 61.8, 61.8)), .Names = c("date", "co", "co2", "VOC", "RH", "ugm3", "temp"), row.names = c(NA, 10L), class = "data.frame")
new_df <- df %>%
  group_by(date = cut(date, breaks = "1 hour")) %>%
  summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC),
            RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))
new_df
Expected output:
expected_output <- structure(list(date = structure(c(1L, 5L), .Label = c("2017-09-12 03:00:00", "2017-09-12 04:00:00", "2017-09-12 05:00:00", "2017-09-12 06:00:00", "2017-09-12 07:00:00"), class = "factor"), co = c(0.149, 0.149), co2 = c(545.375, 551.5), VOC = c(22.97435, 22.9218), RH = c(77.06, 77.29), ugm3 = c(12.8425, 12.745), temp = c(61.945, 61.8)), class = c("tbl_df", "tbl", "data.frame"), .Names = c("date", "co", "co2", "VOC", "RH", "ugm3", "temp"), row.names = c(NA, -2L))
Actual output:
actual_output <- structure(list(co = 0.149, co2 = 546.6, VOC = 22.96384, RH = 77.106, ugm3 = 12.823, temp = 61.916), .Names = c("co", "co2", "VOC", "RH", "ugm3", "temp"), class = "data.frame", row.names = c(NA, -1L))
Prior to this week, this code would have generated a new df with two observations, one for the 03:00:00 hour and one for the 07:00:00 hour. While the group_by() function appears to be assigning the new hourly timestamps correctly, the summarize() function is not behaving properly. Any insight is appreciated. Thanks!
If there are more robust ways to aggregate time-series data into specific intervals, I'm all ears!
Upvotes: 2
Views: 841
Reputation: 47340
You loaded plyr after dplyr.
library(dplyr)
# ...
library(plyr)
# ------------------------------------------------------------------------------
#
# Attaching package: ‘plyr’
#
# The following objects are masked from ‘package:dplyr’:
#
#     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
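We should always read those warnings :). Because plyr was attached last, the bare name summarize now refers to plyr's version, which knows nothing about dplyr's groups and collapses the whole data frame into a single row. Let's see what happens: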
df %>%
  group_by(date = cut(date, breaks = "1 hour")) %>%
  summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC),
            RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))
#      co   co2      VOC     RH   ugm3   temp
# 1 0.149 546.6 22.96384 77.106 12.823 61.916
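To confirm which package a bare function name resolves to in your own session, something like the following sketch may help. It assumes both plyr and dplyr are installed and attaches them in the same problematic order as above; it only inspects the session and does not change anything.

```r
library(dplyr)
library(plyr)   # attached second, so its functions mask dplyr's

# Which namespace does the bare name `summarize` come from right now?
masked_by <- environmentName(environment(summarize))
masked_by
# "plyr"

# Packages attached later appear earlier on the search path, so they win lookups
plyr_first <- match("package:plyr", search()) < match("package:dplyr", search())
plyr_first
# TRUE
```

If the first result is "plyr" rather than "dplyr", grouped summaries will silently lose their groups.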
If you load dplyr after plyr, or use dplyr::summarize, you'll get the expected behavior.
df %>%
  group_by(date = cut(date, breaks = "1 hour")) %>%
  dplyr::summarize(co = mean(co), co2 = mean(co2), VOC = mean(VOC),
                   RH = mean(RH), ugm3 = mean(ugm3), temp = mean(temp))
# # A tibble: 2 x 7
#                  date    co     co2      VOC    RH    ugm3   temp
#                <fctr> <dbl>   <dbl>    <dbl> <dbl>   <dbl>  <dbl>
# 1 2017-09-12 03:00:00 0.149 545.375 22.97435 77.06 12.8425 61.945
# 2 2017-09-12 07:00:00 0.149 551.500 22.92180 77.29 12.7450 61.800
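As for a more robust way to bin by interval, one possible option (not part of the original code, and assuming the lubridate package is installed) is to group on lubridate::floor_date() and namespace the dplyr verbs explicitly, so the result does not depend on the order in which packages were attached. The data frame df_small below is a hypothetical four-row stand-in built from a few of the question's readings, with only the co2 column kept for brevity.

```r
library(dplyr)
library(lubridate)  # assumption: lubridate is available; it is not used in the original post

# Hypothetical stand-in: three readings in the 03:00 hour, one in the 07:00 hour
df_small <- data.frame(
  date = as.POSIXct(c("2017-09-12 03:35:00", "2017-09-12 03:36:00",
                      "2017-09-12 03:40:00", "2017-09-12 07:33:00"), tz = "UTC"),
  co2  = c(544, 545, 546, 549)
)

hourly <- df_small %>%
  # floor_date() rounds each timestamp down to the start of its hour
  dplyr::group_by(date = lubridate::floor_date(date, unit = "hour")) %>%
  # explicit dplyr:: prefix, so plyr cannot mask this call
  dplyr::summarise(co2 = mean(co2))

hourly
```

Unlike cut(), which returns a factor of labels, floor_date() keeps the date column as POSIXct, which tends to be easier to plot or join on afterwards.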
Upvotes: 3