Reputation: 93
My data frame is as follows:
df <- tibble::tribble(
~date, ~pcp,
"9/27/2017 9:00", 0,
"9/27/2017 10:00", 0,
"9/27/2017 11:00", 0,
"9/27/2017 12:00", 0,
"9/27/2017 13:00", 0,
"9/27/2017 14:00", 0,
"9/27/2017 15:00", 0,
"9/27/2017 16:00", 0,
"9/27/2017 17:00", 0,
"9/27/2017 18:00", 0,
"9/27/2017 19:00", 0,
"9/27/2017 20:00", 0,
"9/27/2017 21:00", 0,
"9/27/2017 22:00", 0,
"9/27/2017 23:00", 0,
"9/28/2017 0:00", 0,
"9/28/2017 1:00", 0,
"9/28/2017 2:00", 0,
"9/28/2017 3:00", 0,
"9/28/2017 4:00", 0,
"9/28/2017 5:00", 0,
"9/28/2017 6:00", 0,
"9/28/2017 7:00", 0.15,
"9/28/2017 8:00", 8.76,
"9/28/2017 9:00", 0.02,
"9/28/2017 10:00", 0,
"9/28/2017 11:00", 0,
"9/28/2017 12:00", 0,
"9/28/2017 13:00", 0,
"9/28/2017 14:00", 0,
"9/28/2017 15:00", 0,
"9/28/2017 16:00", 0,
"9/28/2017 17:00", 0,
"9/28/2017 18:00", 0,
"9/28/2017 19:00", 0,
"9/28/2017 20:00", 0,
"9/28/2017 21:00", 0,
"9/28/2017 22:00", 0,
"9/28/2017 23:00", 0,
"9/29/2017 0:00", 0,
"9/29/2017 1:00", 0,
"9/29/2017 2:00", 0,
"9/29/2017 3:00", 0,
"9/29/2017 4:00", 0,
"9/29/2017 5:00", 0,
"9/29/2017 6:00", 0,
"9/29/2017 7:00", 0,
"9/29/2017 8:00", 0.31
)
I would like a daily aggregate (sum) of the data. Instead of aggregating from 00:00 to 23:59 of the same day, I would like each window to start at 09:00 of day i and end at 08:59 of day i + 1 (24 hours later).
The desired output is like the following:
9/28/2017,8.91
9/29/2017,0.33
I did it manually in Excel, but I'm not sure what code to use for this problem. The provided example is an extract of a much longer data frame. Thanks.
Upvotes: 2
Views: 315
Reputation: 19783
A solution using base R and data.table (data.table often works better when performance matters, or as an alternative to dplyr and/or lubridate):
Step 1: set up and read the data into a data.table:
library(data.table)
mydt = fread(input = "date,pcp
9/27/2017 9:00,0
9/27/2017 10:00,0
9/27/2017 11:00,0
9/27/2017 12:00,0
9/27/2017 13:00,0
9/27/2017 14:00,0
9/27/2017 15:00,0
9/27/2017 16:00,0
9/27/2017 17:00,0
9/27/2017 18:00,0
9/27/2017 19:00,0
9/27/2017 20:00,0
9/27/2017 21:00,0
9/27/2017 22:00,0
9/27/2017 23:00,0
9/28/2017 0:00,0
9/28/2017 1:00,0
9/28/2017 2:00,0
9/28/2017 3:00,0
9/28/2017 4:00,0
9/28/2017 5:00,0
9/28/2017 6:00,0
9/28/2017 7:00,0.15
9/28/2017 8:00,8.76
9/28/2017 9:00,0.02
9/28/2017 10:00,0
9/28/2017 11:00,0
9/28/2017 12:00,0
9/28/2017 13:00,0
9/28/2017 14:00,0
9/28/2017 15:00,0
9/28/2017 16:00,0
9/28/2017 17:00,0
9/28/2017 18:00,0
9/28/2017 19:00,0
9/28/2017 20:00,0
9/28/2017 21:00,0
9/28/2017 22:00,0
9/28/2017 23:00,0
9/29/2017 0:00,0
9/29/2017 1:00,0
9/29/2017 2:00,0
9/29/2017 3:00,0
9/29/2017 4:00,0
9/29/2017 5:00,0
9/29/2017 6:00,0
9/29/2017 7:00,0
9/29/2017 8:00,0.31")
Step 2: parse date and time into a new column ts containing the timestamp:
mydt[, ts := as.POSIXct(date, format="%m/%d/%Y %H:%M")]
Step 3: compute the time intervals and place dates into intervals according to the rules described in the question. Define boundaries (in epoch seconds - see below) that contain all dates from the frame, starting and ending at the 9th hour:
epoch_start = as.integer(format(trunc(min(mydt$ts), "day") - (24 - 9) * 60 * 60, "%s"))
epoch_end = as.integer(format(trunc(max(mydt$ts), "day") + (24 + 9) * 60 * 60, "%s"))
Epoch time is the number of seconds that have elapsed since January 1, 1970 (UTC).
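As a quick illustration (my own addition, not from the original answer), the epoch origin itself maps to zero seconds, and format(..., "%s") - the idiom used in the boundary code above - returns that same count:
as.numeric(as.POSIXct("1970-01-01 00:00:00", tz = "UTC"))
# [1] 0
as.integer(format(as.POSIXct("1970-01-01 01:00:00", tz = "UTC"), "%s"))
# [1] 3600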
Having computed the time span encompassing all dates in the frame, we can build our custom 24-hour intervals starting at the 9th hour using base R:
time_seconds_intervals = seq(epoch_start, epoch_end, 60 * 60 * 24)
If you would like to see what date and time lie behind any epoch value, use the anytime package:
anytime::anytime(epoch_start)
[1] "2017-09-26 09:00:00 CDT"
anytime::anytime(epoch_end)
[1] "2017-09-30 09:00:00 CDT"
and for the intervals we just built:
anytime::anytime(time_seconds_intervals)
[1] "2017-09-26 09:00:00 CDT" "2017-09-27 09:00:00 CDT" "2017-09-28 09:00:00 CDT" "2017-09-29 09:00:00 CDT"
[5] "2017-09-30 09:00:00 CDT"
Step 4: assign each row to the interval it belongs to using findInterval, creating a new column day_group:
mydt[, day_group := findInterval(as.integer(format(ts, "%s")), time_seconds_intervals)]
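In case findInterval is new to you, here is a tiny standalone illustration (my own example): for each value it returns the index of the last boundary that the value falls on or after:
findInterval(c(5, 12, 25), c(0, 10, 20))
# [1] 1 2 3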
Step 5: summarize, grouping by the newly defined day_group:
mydt[, .(date_group = min(ts), sum_pcp = sum(pcp)), by=day_group]
which produces:
day_group date_group sum_pcp
1: 2 2017-09-27 09:00:00 8.91
2: 3 2017-09-28 09:00:00 0.33
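If you prefer labels matching the desired output in the question (the calendar date on which each 24-hour window ends), one possible tweak - my own suggestion, not part of the original answer - is to add 24 hours to the window start and format it:
res = mydt[, .(window_start = min(ts), sum_pcp = sum(pcp)), by = day_group]
# label each group with the date its window ends on, e.g. 09/28/2017
res[, date_label := format(window_start + 24 * 60 * 60, "%m/%d/%Y")]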
P.S. I tried using anytime to parse the dates in Step 2 above, but unfortunately it failed due to the single-digit hours used in your strings. anytime is much faster than anything lubridate or base R offer for parsing dates (many examples exist; here is one I posted recently: https://stackoverflow.com/a/44183836/59470), but unless you change the hours to always contain two digits, like 09, it won't work on your data. If you do change them, this is how Step 2 would look:
mydt[, ts := anytime::anytime(date)]
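For completeness, here is one way you could zero-pad the single-digit hours first (a sketch of my own, not from the original answer), so anytime can handle the strings:
# turn "9/27/2017 9:00" into "9/27/2017 09:00" by padding the lone hour digit
mydt[, date := sub(" (\\d):", " 0\\1:", date)]
mydt[, ts := anytime::anytime(date)]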
Upvotes: 1
Reputation: 4220
If you want the data aggregated for each calendar date:
library(tidyverse)
library(lubridate)
df %>%
  mutate(datetime = parse_date_time(date, "mdy H:M"),
         date = date(datetime)) %>%
  group_by(date) %>%
  summarise(sum_pcp = sum(pcp))
Will produce
# A tibble: 3 x 2
date sum_pcp
<date> <dbl>
1 2017-09-27 0
2 2017-09-28 8.93
3 2017-09-29 0.31
If you want to count from 9:00 to 9:00 of the following day, you could introduce a subjective_day by subtracting 9 hours from the original datetime object:
df %>%
  mutate(datetime = parse_date_time(date, "mdy H:M"),
         subjective_day = datetime - hours(9)) %>%
  group_by(subjective_day = floor_date(subjective_day, "1 day")) %>%
  summarise(sum_pcp = sum(pcp))
Will produce
subjective_day sum_pcp
<dttm> <dbl>
1 2017-09-27 00:00:00 8.91
2 2017-09-28 00:00:00 0.33
Note that subjective_day will always be labeled one day behind your desired output (the window starting at 9/27 09:00 is labeled 2017-09-27 here but 9/28/2017 in your example), so you can adjust for that, or keep in mind that somewhere in the world with a 9-hour time difference this would actually be the correct datetime :)
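A minimal sketch of that adjustment (my own addition): shift the label forward by one day after summarising, so it matches the dates in your desired output:
df %>%
  mutate(datetime = parse_date_time(date, "mdy H:M"),
         subjective_day = datetime - hours(9)) %>%
  group_by(subjective_day = floor_date(subjective_day, "1 day")) %>%
  summarise(sum_pcp = sum(pcp)) %>%
  mutate(subjective_day = date(subjective_day) + days(1))  # relabel: 2017-09-27 -> 2017-09-28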
Upvotes: 1