Reputation: 1477
I have a data frame with the following structure:
        startdatetime         enddatetime  type amount
1 2019-02-01 03:35:00 2019-02-03 06:35:00 prod1  1e+03
2 2019-02-03 06:35:00 2019-02-05 09:35:00 prod1  5e+03
3 2019-02-05 09:35:00 2019-02-06 01:35:00 prod2  3e+07
4 2019-02-06 01:35:00 2019-02-06 03:35:00 prod1  1e+02
Each row represents an amount produced during a time span (from startdatetime to enddatetime). I want to aggregate these data on a daily basis. Let's ignore the incomplete day 2019-02-01 and start with 2019-02-02. The first batch of prod1 was produced between 2019-02-01 03:35:00 and 2019-02-03 06:35:00, and 1000 kg were produced in total. So on 2019-02-02, 24/51 * 1000 = 470.59 kg of prod1 were produced, because the run lasted 20 h 25 min + 24 h + 6 h 35 min = 51 h.
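The arithmetic above can be checked with a few lines of base R. This is only a sketch for the first row; it assumes the timestamps can be treated as UTC so that every calendar day has exactly 24 hours, and the variable names are illustrative.

```r
# Worked check of the daily split for the first row (base R only).
# Assumption: timestamps are interpreted as UTC, so every day has 24 hours.
start  <- as.POSIXct("2019-02-01 03:35:00", tz = "UTC")
end    <- as.POSIXct("2019-02-03 06:35:00", tz = "UTC")
amount <- 1000

# Midnight boundaries of every calendar day touched by the interval
days      <- seq(as.Date(start), as.Date(end), by = "day")
day_start <- as.POSIXct(format(days), tz = "UTC")
day_end   <- day_start + 24 * 60 * 60

# Overlap of the interval with each day, in hours
overlap_h <- as.numeric(difftime(pmin(day_end, end), pmax(day_start, start),
                                 units = "hours"))
total_h   <- as.numeric(difftime(end, start, units = "hours"))

# Amount attributed to each day, proportional to the overlap
share <- amount * overlap_h / total_h
names(share) <- format(days)
round(share, 2)  # the 2019-02-02 entry is 24/51 * 1000, about 470.59
```

The three shares add up to the full 1000 kg, which is a quick sanity check for any implementation of the split.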
My current solution is based on a for loop and a while loop, but I suspect there is a faster solution, perhaps using the lubridate package, that I have not found.
Any suggestions?
My code is below:
# create test data set
mydata <- data.frame(startdatetime = c(as.POSIXct("2019-02-01 03:35:00"), as.POSIXct("2019-02-03 06:35:00"), as.POSIXct("2019-02-05 09:35:00"), as.POSIXct("2019-02-06 01:35:00")),
                     enddatetime = c(as.POSIXct("2019-02-03 06:35:00"), as.POSIXct("2019-02-05 09:35:00"), as.POSIXct("2019-02-06 01:35:00"), as.POSIXct("2019-02-06 03:35:00")),
                     type = c("prod1", "prod1", "prod2", "prod1"),
                     amount = c(1000, 5000, 30000000, 100))
# take only full days into account and ignore the first and the last day
minstartday = min(mydata$startdatetime) + 24*60*60
maxendday = max(mydata$enddatetime) - 24*60*60
# create a day index
timesindex <- seq(from = as.Date(format(minstartday, format = "%Y/%m/%d")),
                  to = as.Date(format(maxendday, format = "%Y/%m/%d")), by = "day")
# create an empty dataframe which will be filled with the production data for each day
prodperday <- data.frame(Date = as.Date(timesindex),
                         prod1 = replicate(length(timesindex), 0),
                         prod2 = replicate(length(timesindex), 0),
                         stringsAsFactors = FALSE)
# loop over all entries and separate them into produced fractions per day
for (irow in 1:dim(mydata)[1]){
  timestart = mydata[irow, "startdatetime"]
  datestart = as.Date(format(timestart, format = "%Y/%m/%d"))
  timeend = timestart
  tota_run_time_in_h = (as.numeric((mydata[irow, "enddatetime"] - mydata[irow, "startdatetime"])))*24.
  while (timeend < mydata[irow, "enddatetime"]){
    timeend = min(as.POSIXct(datestart, format = "%Y/%m/%d %H:%M:%S") + 23*60*60 - 1,
                  mydata[irow, "enddatetime"])
    tdiff = as.numeric(timeend - timestart)
    fraction_prod = (tdiff/tota_run_time_in_h)*mydata[irow, "amount"]
    if (datestart %in% prodperday$Date){
      prodperday[prodperday$Date == datestart, as.character(mydata[irow, "type"])] =
        prodperday[prodperday$Date == datestart, as.character(mydata[irow, "type"])] + fraction_prod
    }
    timestart = timeend + 1
    datestart = as.Date(format(timestart, format = "%Y/%m/%d"))
    timeend = timestart
  }
}
and the result:
        Date     prod1   prod2
1 2019-02-02  470.5828       0
2 2019-02-03 1836.5741       0
3 2019-02-04 2352.9139       0
4 2019-02-05  939.5425 1126280
Reputation: 5893
Here's what I would do:
The start date contributes 24 - starttime production hours, the end date contributes endtime hours, and every day in between contributes a full 24 hours, so the split is easy to calculate.
library(lubridate)
library(tidyverse)

pmap_dfr(mydata, ~ {
  # ..1 = startdatetime, ..2 = enddatetime, ..3 = type, ..4 = amount
  hours <- abs(as.numeric(difftime(..1, ..2, units = "hours")))
  day_seq <- seq(as_date(..1), as_date(..2), by = "days")
  hours_start <- hour(..1) + minute(..1) / 60
  hours_end <- hour(..2) + minute(..2) / 60
  production <- if (length(day_seq) == 1) {
    # start and end fall on the same day: the whole amount belongs to it
    ..4
  } else {
    c(
      ..4 * (24 - hours_start) / hours,
      rep(..4 * 24 / hours, max(length(day_seq) - 2, 0)),
      ..4 * hours_end / hours
    )
  }
  tibble(
    day = day_seq,
    amount = production,
    type = ..3
  )
}) %>%
  group_by(day, type) %>%
  summarise(amount = sum(amount)) %>%
  spread(type, amount) %>%
  replace_na(list(prod1 = 0, prod2 = 0))

# A tibble: 6 x 3
# Groups:   day [6]
  day        prod1    prod2
  <date>     <dbl>    <dbl>
1 2019-02-01  400.        0
2 2019-02-02  471.        0
3 2019-02-03 1837.        0
4 2019-02-04 2353.        0
5 2019-02-05  940. 27031250
6 2019-02-06  100   2968750.
The first and last rows can then be dropped at the end if you want to exclude the incomplete days.
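The same start/middle/end logic can also be written as a general "overlap with each calendar day" computation, which covers intervals that start and end on the same day without a special case. The following base R version is a sketch, not a tested replacement: `split_by_day` is a hypothetical helper name, and the timestamps are assumed to be UTC so that every day has 24 hours.

```r
# Hypothetical helper: split one interval's amount across calendar days,
# proportionally to the hours of overlap with each day.
# Assumption: timestamps are UTC, so every day has exactly 24 hours.
split_by_day <- function(start, end, type, amount) {
  days      <- seq(as.Date(start), as.Date(end), by = "day")
  day_start <- as.POSIXct(format(days), tz = "UTC")
  day_end   <- day_start + 24 * 60 * 60
  overlap_h <- as.numeric(difftime(pmin(day_end, end), pmax(day_start, start),
                                   units = "hours"))
  total_h   <- as.numeric(difftime(end, start, units = "hours"))
  data.frame(day = days, type = type, amount = amount * overlap_h / total_h,
             stringsAsFactors = FALSE)
}

# Same test data as in the question, but with an explicit time zone
mydata <- data.frame(
  startdatetime = as.POSIXct(c("2019-02-01 03:35:00", "2019-02-03 06:35:00",
                               "2019-02-05 09:35:00", "2019-02-06 01:35:00"),
                             tz = "UTC"),
  enddatetime   = as.POSIXct(c("2019-02-03 06:35:00", "2019-02-05 09:35:00",
                               "2019-02-06 01:35:00", "2019-02-06 03:35:00"),
                             tz = "UTC"),
  type   = c("prod1", "prod1", "prod2", "prod1"),
  amount = c(1000, 5000, 30000000, 100),
  stringsAsFactors = FALSE
)

# Index by row so the POSIXct class is kept when passing single values
pieces <- do.call(rbind, lapply(seq_len(nrow(mydata)), function(i) {
  split_by_day(mydata$startdatetime[i], mydata$enddatetime[i],
               mydata$type[i], mydata$amount[i])
}))

# Sum per day and product; combinations without production show up as 0
daily <- xtabs(amount ~ day + type, data = pieces)
daily
```

Because each interval's pieces are proportional to its hours of overlap, the column totals of `daily` equal the total amounts in `mydata`, which makes the result easy to verify.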
Reputation: 1714
The solution I propose is not perfect because it has problems at the interval boundaries, but transforming the data into production per hour and then aggregating by day is probably a good idea.
As requested, I use these two libraries:
library(lubridate)
library(dplyr)
The reference times:
ref.times <- seq(from = min(mydata$startdatetime),
to = max(mydata$enddatetime),
by = "hour")
Construction of a data frame with production by hour:
hours <- floor_date(ref.times, unit = "hour")
newdata <- data.frame(hour = hours,
                      prod1 = 0,
                      prod2 = 0,
                      # use the local vector here: newdata does not exist yet
                      day = floor_date(hours, unit = "day"))
for(i in 1:nrow(mydata)){
  ref.times <- seq(from = mydata$startdatetime[i],
                   to = mydata$enddatetime[i],
                   by = "hour")
  n <- length(floor_date(ref.times, "hour"))
  if(mydata[i, 3] == "prod1"){
    newdata[newdata$hour %in% floor_date(ref.times, unit = "hour"), 2] <-
      rep(mydata[i, 4] / n, n)
  }else{
    newdata[newdata$hour %in% floor_date(ref.times, unit = "hour"), 3] <-
      rep(mydata[i, 4] / n, n)
  }
}
Aggregation by day:
newdata %>% group_by(day) %>% summarise(prod1 = sum(prod1),
prod2 = sum(prod2))
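One likely source of the boundary problems is that an hourly sequence running from the start to the end of an interval includes both endpoints, so an interval of n hours yields n + 1 time points, and consecutive intervals share their boundary hour. A small sketch using the third row of the question's data (treated as UTC here, which is an assumption):

```r
# An hourly sequence that includes both endpoints has one point too many.
# Assumption: timestamps are interpreted as UTC.
start <- as.POSIXct("2019-02-05 09:35:00", tz = "UTC")
end   <- as.POSIXct("2019-02-06 01:35:00", tz = "UTC")

pts <- seq(from = start, to = end, by = "hour")
length(pts)                                       # 17 points
as.numeric(difftime(end, start, units = "hours")) # but only 16 hours

# Dropping the last point leaves one point per elapsed hour, so the
# boundary hour is no longer shared with the next interval
hour_starts <- head(pts, -1)
length(hour_starts)                               # 16
```

Using `hour_starts` instead of the full sequence would be one way to keep each hour attributed to a single interval, although hours are still assigned to the clock hour in which they begin, so the daily totals remain an approximation.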