Reputation: 997
I have a dataframe as below
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
tmpdf
licensePlate starttime endtime
1 Y80901 2015-09-18 09:55 2015-09-18 17:55
2 Y80901 2015-09-18 23:00 2015-09-20 11:00
3 Y80901 2015-09-20 15:00 2015-09-21 12:00
4 AMG-999 2015-09-17 15:42 2015-09-18 13:00
5 AMG-999 2015-09-21 09:22 2015-09-21 14:22
6 W3188 2015-09-17 09:00 2015-09-21 12:00
7 W3188 2015-09-21 14:00 2015-09-21 16:00
I want to compute, for each of the last n days (for example, the last 5 days, from 9/17 to 9/21), how much time each licensePlate was in use per day. My expected result (Used Time in hours) is as below:
Period LicensePlate Used Time
1 2015-09-17 Y80901 0
2 2015-09-17 AMG-999 8.3
3 2015-09-17 W3188 15
4 2015-09-18 Y80901 9
5 2015-09-18 AMG-999 13
6 2015-09-18 W3188 24
7 2015-09-19 Y80901 24
8 2015-09-19 AMG-999 0
9 2015-09-19 W3188 24
10 2015-09-20 Y80901 20
11 2015-09-20 AMG-999 0
12 2015-09-20 W3188 24
13 2015-09-21 Y80901 12
14 2015-09-21 AMG-999 5
15 2015-09-21 W3188 14
I think dplyr/data.table and lubridate can be used to obtain my result, and I may need to measure time period in days, but I don't know how to cut within a start/end interval when start/end will differ per row.
Upvotes: 5
Views: 938
Reputation: 770
Take a deep breath. Here is my solution
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
library(lubridate) # provides hours(), minutes() and seconds() used below
# converting to POSIXct for better date/time handling
tmpdf$starttime <- as.POSIXct(tmpdf$starttime, tz = "GMT")
tmpdf$endtime <- as.POSIXct(tmpdf$endtime, tz = "GMT")
To perform the required operations your complete usage data has to be converted to daily usage data. So I have written the following function to prepare the data into a required format.
# splits each multi-day usage row into two rows at the first midnight
splitToTwo <- function(list){
  newList <- NULL
  # seq_len() avoids iterating over 1:0 when there is nothing to split
  for (i in seq_len(nrow(list))){
    tmp <- list[i,]
    # set the end time of the first split to 23:59:59 on the start day
    list[i,]$endtime <- as.Date(list[i,]$starttime) + hours(23) + minutes(59) + seconds(59)
    # set the start time of the second split to 00:00:01 on the next day
    tmp$starttime <- list[i,]$endtime + seconds(2)
    # append both halves to the result
    tmp <- rbind(tmp,list[i,])
    newList <- rbind(newList,tmp)
  }
  return(newList)
}
# recursive function: keep splitting the usage data in two until every row is confined to a single day
setDailyUsage <- function(tmpdf){
  # subset of rows where the usage spans more than one day
  list <- tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ]
  # keep in tmpdf only the usage that started and ended on the same day
  tmpdf <- tmpdf[ as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) == 0,]
  # split the rows whose usage spans more than one day
  split <- splitToTwo(list)
  # add the split rows back to the single-day rows
  tmpdf <- rbind(tmpdf,split)
  # recurse while any row still spans more than one day
  if (nrow(tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ])>0){
    tmpdf <- setDailyUsage(tmpdf)
  }
  return(tmpdf)
}
Our prepared data
preparedData <- setDailyUsage(tmpdf)
licensePlate starttime endtime
1 Y80901 2015-09-18 09:55:00 2015-09-18 17:55:00
5 AMG-999 2015-09-21 09:22:00 2015-09-21 14:22:00
7 W3188 2015-09-21 14:00:00 2015-09-21 16:00:00
21 Y80901 2015-09-18 23:00:00 2015-09-18 23:59:59
3 Y80901 2015-09-21 00:00:01 2015-09-21 12:00:00
31 Y80901 2015-09-20 15:00:00 2015-09-20 23:59:59
4 AMG-999 2015-09-18 00:00:01 2015-09-18 13:00:00
41 AMG-999 2015-09-17 15:42:00 2015-09-17 23:59:59
61 W3188 2015-09-17 09:00:00 2015-09-17 23:59:59
2 Y80901 2015-09-20 00:00:01 2015-09-20 11:00:00
211 Y80901 2015-09-19 00:00:01 2015-09-19 23:59:59
611 W3188 2015-09-18 00:00:01 2015-09-18 23:59:59
612 W3188 2015-09-19 00:00:01 2015-09-19 23:59:59
6 W3188 2015-09-21 00:00:01 2015-09-21 12:00:00
613 W3188 2015-09-20 00:00:01 2015-09-20 23:59:59
Now we create a new DF which represents the data in the required format. This will initially have empty values in the UsedTime column.
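# fix the units explicitly; plain subtraction lets R pick the units automatically
preparedData$duration <- difftime(preparedData$endtime, preparedData$starttime, units = "mins")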
noOfUniquePlates <- length(unique(preparedData$licensePlate))
Period <- rep(seq(from=(min(as.Date(preparedData$starttime))),to=(max(as.Date(preparedData$starttime))), by="day"),noOfUniquePlates)
noOfUniqueDays <- length(unique(Period))
LicensePlate <- rep(unique(preparedData$licensePlate),each=noOfUniqueDays)
UsedTime <- 0
newDF <- data.frame(Period,LicensePlate,UsedTime)
Now a simple mapply call over each row of newDF looks up the matching usage data in the preparedData df.
findUsage <- function(p,l){
sum(preparedData[as.Date(preparedData$starttime) == p & as.Date(preparedData$endtime) == p & preparedData$licensePlate == l , ]$duration)
}
newDF$UsedTime <- mapply( findUsage, newDF$Period, newDF$LicensePlate)
newDF$UsedTime <- newDF$UsedTime/60 # convert minutes to hours
> newDF[with(newDF,order(Period)),]
Period LicensePlate UsedTime
1 2015-09-17 Y80901 0.000000
6 2015-09-17 AMG-999 8.299722
11 2015-09-17 W3188 14.999722
2 2015-09-18 Y80901 8.999722
7 2015-09-18 AMG-999 12.999722
12 2015-09-18 W3188 23.999444
3 2015-09-19 Y80901 23.999444
8 2015-09-19 AMG-999 0.000000
13 2015-09-19 W3188 23.999444
4 2015-09-20 Y80901 19.999444
9 2015-09-20 AMG-999 0.000000
14 2015-09-20 W3188 23.999444
5 2015-09-21 Y80901 11.999722
10 2015-09-21 AMG-999 5.000000
15 2015-09-21 W3188 13.999722
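The small shortfalls in the output above (for example 23.999444 instead of 24) come from the 23:59:59 and 00:00:01 boundaries, which drop two seconds at every split. As a minimal alternative sketch in base R, not part of the approach above, the hours per day can be computed directly as the overlap between each usage interval and each calendar day. The helper name `hours_on_day` and the data frame name `usage` are hypothetical, and all times are assumed to be in UTC.

```r
# Sample data from the question, parsed as UTC (an assumption of this sketch).
usage <- data.frame(
  licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
  starttime = as.POSIXct(c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00",
                           "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00",
                           "2015-09-21 14:00"), tz = "UTC"),
  endtime = as.POSIXct(c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00",
                         "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00",
                         "2015-09-21 16:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: hours of each [start, end] interval that fall on one day.
hours_on_day <- function(start, end, day) {
  day_start <- as.numeric(as.POSIXct(format(day), tz = "UTC")) # midnight, in seconds
  day_end <- day_start + 86400                                 # next midnight
  overlap <- pmin(as.numeric(end), day_end) - pmax(as.numeric(start), day_start)
  pmax(overlap, 0) / 3600 # negative overlap means the interval misses this day
}

# One row per (day, plate) combination, so unused days show up as 0.
days <- seq(as.Date("2015-09-17"), as.Date("2015-09-21"), by = "day")
grid <- expand.grid(Period = days, LicensePlate = unique(usage$licensePlate),
                    stringsAsFactors = FALSE)

grid$UsedTime <- vapply(seq_len(nrow(grid)), function(i) {
  rows <- usage[usage$licensePlate == grid$LicensePlate[i], ]
  sum(hours_on_day(rows$starttime, rows$endtime, grid$Period[i]))
}, numeric(1))

daily <- grid[order(grid$Period), ]
print(daily)
```

Because nothing is cut at 23:59:59, a full day counts as exactly 24 hours. The cost is one pass over a plate's rows per (day, plate) pair, which is fine for small data but would need rethinking for large tables.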
I had to limit explanations to keep the answer short. Let me know if you need any clarification in comments.
Upvotes: 1
Reputation: 92292
Here's something to get you started. This is almost your desired output, except that it doesn't show the licensePlate values that are missing from a given period.
The first step is to convert your dates to a valid POSIXct class, then expand the data to a per-minute level (probably the most costly part of this solution), and then aggregate by licensePlate and by Period while summing up the results (I'm not using as.Date here because it mishandles POSIXct values that fall between midnight and 1 am).
library(data.table)
setDT(tmpdf)[, `:=`(starttime = as.POSIXct(starttime), endtime = as.POSIXct(endtime))]
res <- tmpdf[, .(licensePlate, Period = seq(starttime, endtime, by = "1 min")), by = 1:nrow(tmpdf)]
res[, .(Used_Time = round(.N/60L, 1L)), keyby = .(Period = substr(Period, 1L, 10L), licensePlate)]
# Period licensePlate Used_Time
# 1: 2015-09-17 AMG-999 8.3
# 2: 2015-09-17 W3188 15.0
# 3: 2015-09-18 AMG-999 13.0
# 4: 2015-09-18 W3188 24.0
# 5: 2015-09-18 Y80901 9.0
# 6: 2015-09-19 W3188 24.0
# 7: 2015-09-19 Y80901 24.0
# 8: 2015-09-20 W3188 24.0
# 9: 2015-09-20 Y80901 20.0
# 10: 2015-09-21 AMG-999 5.0
# 11: 2015-09-21 W3188 14.0
# 12: 2015-09-21 Y80901 12.0
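To also list the plates that have no usage in a period, the aggregated result can be joined onto the full grid of periods and plates, with the gaps filled by 0. The following is a minimal base R sketch rather than a continuation of the data.table code above: the data frame `res` is typed in by hand from the printed output, and the names `all_combos` and `full` are hypothetical.

```r
# Aggregated result, copied from the output printed above.
res <- data.frame(
  Period = c("2015-09-17", "2015-09-17", "2015-09-18", "2015-09-18", "2015-09-18",
             "2015-09-19", "2015-09-19", "2015-09-20", "2015-09-20",
             "2015-09-21", "2015-09-21", "2015-09-21"),
  licensePlate = c("AMG-999", "W3188", "AMG-999", "W3188", "Y80901",
                   "W3188", "Y80901", "W3188", "Y80901",
                   "AMG-999", "W3188", "Y80901"),
  Used_Time = c(8.3, 15, 13, 24, 9, 24, 24, 24, 20, 5, 14, 12),
  stringsAsFactors = FALSE
)

# Every (Period, licensePlate) pair in the reporting window, as character dates
# so that they match the substr()-based Period column.
periods <- format(seq(as.Date("2015-09-17"), as.Date("2015-09-21"), by = "day"))
all_combos <- expand.grid(Period = periods, licensePlate = unique(res$licensePlate),
                          stringsAsFactors = FALSE)

# Left join: pairs without usage get NA, which is then replaced by 0.
full <- merge(all_combos, res, by = c("Period", "licensePlate"), all.x = TRUE)
full$Used_Time[is.na(full$Used_Time)] <- 0
print(full)
```

Building the periods with `seq()` rather than from the data means a day on which no plate was used at all would still appear in the result.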
Upvotes: 3
Reputation: 3492
Try this and see whether it helps:
tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00", "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
tmpdf
str(tmpdf)
library(lubridate)
# parse the character columns into date-times
tmpdf$starttime=ymd_hm(paste(tmpdf$starttime))
tmpdf$endtime=ymd_hm(paste(tmpdf$endtime))
# day of the month on which each usage started
tmpdf$Period=day(tmpdf$starttime)
# total duration of each usage row
tmpdf$diff=difftime(tmpdf$endtime,tmpdf$starttime)
tmpdf
Upvotes: 0
Reputation: 352
You are correct that plyr can be used to solve this problem. One possible solution:
library(plyr) # provides ddply() and .()
tmpdf$starttime <- as.POSIXct(tmpdf$starttime) #convert date/time columns to date/time values in R
tmpdf$endtime <- as.POSIXct(tmpdf$endtime) #convert date/time columns to date/time values in R
newdf <- ddply(tmpdf,.(as.Date(starttime),licensePlate),function(df){
df$diffdays <- as.double(difftime(df$endtime,df$starttime,units='days'))
df
})
#If you want to only have the Period, LicensePlate, and Used Time columns remaining:
newdf <- subset(newdf,select=c(1,2,5))
colnames(newdf) <- c('Period','LicensePlate','UsedTime')
Hope it helps!
Upvotes: 0