Reputation: 997

R group variables in days calculated from interval between start and end time

I have a dataframe as below

tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),  
starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00",  "2015-09-18 13:00",  "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))
    tmpdf
      licensePlate        starttime          endtime
    1       Y80901 2015-09-18 09:55 2015-09-18 17:55
    2       Y80901 2015-09-18 23:00 2015-09-20 11:00
    3       Y80901 2015-09-20 15:00 2015-09-21 12:00
    4      AMG-999 2015-09-17 15:42 2015-09-18 13:00
    5      AMG-999 2015-09-21 09:22 2015-09-21 14:22
    6        W3188 2015-09-17 09:00 2015-09-21 12:00
    7        W3188 2015-09-21 14:00 2015-09-21 16:00

For the last n days (for example, the last 5 days, from 9/17 to 9/21), I want to compute the time used per day for each licensePlate. My expected result is as below:

    Period      LicensePlate  Used Time
1   2015-09-17  Y80901        0
2   2015-09-17  AMG-999       8.3
3   2015-09-17  W3188         15
4   2015-09-18  Y80901        9
5   2015-09-18  AMG-999       13
6   2015-09-18  W3188         24
7   2015-09-19  Y80901        24
8   2015-09-19  AMG-999       0
9   2015-09-19  W3188         24
10  2015-09-20  Y80901        20
11  2015-09-20  AMG-999       0
12  2015-09-20  W3188         24
13  2015-09-21  Y80901        12
14  2015-09-21  AMG-999       5
15  2015-09-21  W3188         14

I think dplyr/data.table and lubridate can be used to obtain my result, and I may need to measure time period in days, but I don't know how to cut within a start/end interval when start/end will differ per row.

Upvotes: 5

Answers (4)

pauljeba

Reputation: 770

Take a deep breath. Here is my solution

Initializing Data

tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),  
                starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
                endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00",  "2015-09-18 13:00",  "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))

# convert to POSIXct for better date/time handling
tmpdf$starttime <- as.POSIXct(tmpdf$starttime, tz = "GMT")
tmpdf$endtime <- as.POSIXct(tmpdf$endtime, tz = "GMT")

Data Preparation

To perform the required operations your complete usage data has to be converted to daily usage data. So I have written the following function to prepare the data into a required format.

library(lubridate)  # provides hours(), minutes() and seconds()

# splits each multi-day usage row into two rows at midnight of its start day
splitToTwo <- function(list){
  newList <- NULL

  # seq_len() avoids a spurious iteration when there are no rows to split
  for (i in seq_len(nrow(list))){

    tmp <- list[i,]

    # set the end time of the first split as 23:59:59
    list[i,]$endtime <- as.Date(list[i,]$starttime) + hours(23) + minutes(59) + seconds(59)

    # set the start time of the second split as 00:00:01 of the next day
    tmp$starttime <- list[i,]$endtime + seconds(2)

    # add both halves to the result
    tmp <- rbind(tmp, list[i,])
    newList <- rbind(newList, tmp)
  }
  return(newList)
}


# recursive function: keep splitting usage rows in two until every row starts and ends on the same day
setDailyUsage <- function(tmpdf){

  # subset of rows where the usage spans more than one day
  list <- tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ]

  # keep in tmpdf only the usage that started and ended on the same day
  tmpdf <- tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) == 0, ]

  # split each multi-day row in two
  split <- splitToTwo(list)

  # add the split rows back to the same-day rows
  tmpdf <- rbind(tmpdf, split)

  # recurse while any row still spans more than one day
  if (nrow(tmpdf[as.Date(tmpdf$endtime) - as.Date(tmpdf$starttime) > 0, ]) > 0){
    tmpdf <- setDailyUsage(tmpdf)
  }

  return(tmpdf)
}

Prepared Data

Our prepared data

preparedData <- setDailyUsage(tmpdf)
    licensePlate           starttime             endtime
1         Y80901 2015-09-18 09:55:00 2015-09-18 17:55:00
5        AMG-999 2015-09-21 09:22:00 2015-09-21 14:22:00
7          W3188 2015-09-21 14:00:00 2015-09-21 16:00:00
21        Y80901 2015-09-18 23:00:00 2015-09-18 23:59:59
3         Y80901 2015-09-21 00:00:01 2015-09-21 12:00:00
31        Y80901 2015-09-20 15:00:00 2015-09-20 23:59:59
4        AMG-999 2015-09-18 00:00:01 2015-09-18 13:00:00
41       AMG-999 2015-09-17 15:42:00 2015-09-17 23:59:59
61         W3188 2015-09-17 09:00:00 2015-09-17 23:59:59
2         Y80901 2015-09-20 00:00:01 2015-09-20 11:00:00
211       Y80901 2015-09-19 00:00:01 2015-09-19 23:59:59
611        W3188 2015-09-18 00:00:01 2015-09-18 23:59:59
612        W3188 2015-09-19 00:00:01 2015-09-19 23:59:59
6          W3188 2015-09-21 00:00:01 2015-09-21 12:00:00
613        W3188 2015-09-20 00:00:01 2015-09-20 23:59:59

Data Manipulation

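Now we create a new DF which represents the data in the required format. The UsedTime column is initially filled with zeros.

# duration in minutes; units are set explicitly because difftime otherwise picks them from the data
preparedData$duration <- as.numeric(difftime(preparedData$endtime, preparedData$starttime, units = "mins"))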
noOfUniquePlates <- length(unique(preparedData$licensePlate))
Period <- rep(seq(from=(min(as.Date(preparedData$starttime))),to=(max(as.Date(preparedData$starttime))), by="day"),noOfUniquePlates)
noOfUniqueDays <- length(unique(Period))
LicensePlate <- rep(unique(preparedData$licensePlate),each=noOfUniqueDays)
UsedTime <- 0

newDF <- data.frame(Period,LicensePlate,UsedTime)

Now a simple mapply call runs over each row of newDF, looking up the matching usage data in the preparedData df.

findUsage <- function(p,l){
  sum(preparedData[as.Date(preparedData$starttime) == p & as.Date(preparedData$endtime) == p & preparedData$licensePlate == l ,  ]$duration)
}
newDF$UsedTime <- mapply( findUsage, newDF$Period, newDF$LicensePlate)
newDF$UsedTime <- newDF$UsedTime/60  # durations are in minutes here, so this gives hours
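
A note on the division by 60: subtracting two POSIXct vectors returns a difftime whose units are chosen from the data, so the result is in minutes only when the shortest interval is between one minute and one hour. The sketch below illustrates this with two of the intervals from the prepared data; the variable names are made up for the example.

```r
# Two intervals taken from the prepared data: 8 hours, and just under 1 hour
start <- as.POSIXct(c("2015-09-18 09:55:00", "2015-09-18 23:00:00"), tz = "GMT")
end   <- as.POSIXct(c("2015-09-18 17:55:00", "2015-09-18 23:59:59"), tz = "GMT")

units(end - start)        # "mins": the shortest interval is under an hour
units(end[1] - start[1])  # "hours": the only interval is 8 hours long

# Pinning the units makes the later conversion to hours independent of the data
mins <- as.numeric(difftime(end, start, units = "mins"))
hours_used <- mins / 60
```

With a different data set the automatic choice could be seconds, hours or days, in which case dividing by 60 would no longer give hours.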

    > newDF[with(newDF,order(Period)),]
       Period LicensePlate  UsedTime
1  2015-09-17       Y80901  0.000000
6  2015-09-17      AMG-999  8.299722
11 2015-09-17        W3188 14.999722
2  2015-09-18       Y80901  8.999722
7  2015-09-18      AMG-999 12.999722
12 2015-09-18        W3188 23.999444
3  2015-09-19       Y80901 23.999444
8  2015-09-19      AMG-999  0.000000
13 2015-09-19        W3188 23.999444
4  2015-09-20       Y80901 19.999444
9  2015-09-20      AMG-999  0.000000
14 2015-09-20        W3188 23.999444
5  2015-09-21       Y80901 11.999722
10 2015-09-21      AMG-999  5.000000
15 2015-09-21        W3188 13.999722
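
For comparison, the same per-day figures can be sketched without splitting rows, by intersecting each usage interval with each calendar day. This is not the method used above, only an alternative under the assumption that all times are in GMT; the names days, plates, grid and the helper logic are made up for illustration, and only base R is used.

```r
tmpdf <- data.frame(
  licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),
  starttime = as.POSIXct(c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00",
                           "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00",
                           "2015-09-21 14:00"), tz = "GMT"),
  endtime = as.POSIXct(c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00",
                         "2015-09-18 13:00", "2015-09-21 14:22", "2015-09-21 12:00",
                         "2015-09-21 16:00"), tz = "GMT"),
  stringsAsFactors = FALSE
)

# One row per (day, plate) combination
days   <- seq(as.Date("2015-09-17"), as.Date("2015-09-21"), by = "day")
plates <- unique(tmpdf$licensePlate)
grid <- data.frame(
  Period = rep(days, times = length(plates)),
  LicensePlate = rep(plates, each = length(days)),
  stringsAsFactors = FALSE
)

grid$UsedTime <- vapply(seq_len(nrow(grid)), function(i) {
  rows <- tmpdf[tmpdf$licensePlate == grid$LicensePlate[i], ]
  # Day boundaries in seconds since the epoch (assumes GMT/UTC timestamps)
  lo <- as.numeric(grid$Period[i]) * 86400
  hi <- lo + 86400
  s <- as.numeric(rows$starttime)
  e <- as.numeric(rows$endtime)
  # Overlap of each interval with [lo, hi], clipped at zero when they do not meet
  overlap <- pmax(0, pmin(e, hi) - pmax(s, lo))
  sum(overlap) / 3600  # seconds to hours
}, numeric(1))

grid[order(grid$Period), ]
```

Because the day boundaries here are exact midnights, a fully used day comes out as 24 rather than 23.999444.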

I had to limit explanations to keep the answer short. Let me know if you need any clarification in comments.

Upvotes: 1

David Arenburg

Reputation: 92292

Here's something to get you started. This is almost your desired output, except that it doesn't show licensePlates with no usage in a given period.

The first step is to convert your dates to a valid POSIXct class, then expand the data to a per-minute level (probably the most costly part of this solution), and then aggregate by licensePlate and by Period while counting the minutes (I'm not using as.Date here because it badly handles POSIXct values between 00:00 and 01:00).

library(data.table)
setDT(tmpdf)[, `:=`(starttime = as.POSIXct(starttime), endtime = as.POSIXct(endtime))]
res <- tmpdf[, .(licensePlate, Period = seq(starttime, endtime, by = "1 min")), by = 1:nrow(tmpdf)]
res[, .(Used_Time = round(.N/60L, 1L)), keyby = .(Period = substr(Period, 1L, 10L), licensePlate)]
#         Period licensePlate Used_Time
#  1: 2015-09-17      AMG-999       8.3
#  2: 2015-09-17        W3188      15.0
#  3: 2015-09-18      AMG-999      13.0
#  4: 2015-09-18        W3188      24.0
#  5: 2015-09-18       Y80901       9.0
#  6: 2015-09-19        W3188      24.0
#  7: 2015-09-19       Y80901      24.0
#  8: 2015-09-20        W3188      24.0
#  9: 2015-09-20       Y80901      20.0
# 10: 2015-09-21      AMG-999       5.0
# 11: 2015-09-21        W3188      14.0
# 12: 2015-09-21       Y80901      12.0
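
The missing Period/licensePlate combinations could be filled in with zeros afterwards. The following is only a sketch in base R on a few rows copied from the output above; the names res, full and out are made up for the example, and it assumes every day of interest appears at least once in the result.

```r
# A few rows shaped like the aggregated result (values copied from the output above)
res <- data.frame(
  Period = c("2015-09-17", "2015-09-17", "2015-09-18"),
  licensePlate = c("AMG-999", "W3188", "Y80901"),
  Used_Time = c(8.3, 15.0, 9.0),
  stringsAsFactors = FALSE
)

# Every Period/licensePlate combination that should appear in the final table
full <- expand.grid(
  Period = unique(res$Period),
  licensePlate = unique(res$licensePlate),
  stringsAsFactors = FALSE
)

# Left join onto the full grid, then treat combinations with no usage as 0
out <- merge(full, res, by = c("Period", "licensePlate"), all.x = TRUE)
out$Used_Time[is.na(out$Used_Time)] <- 0
out
```

If a whole day could be missing for all plates, the Period values would need to come from an explicit date sequence rather than from unique(res$Period).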

Upvotes: 3

Ajay Ohri

Reputation: 3492

Try this. Does it help?

tmpdf <- data.frame(licensePlate = c("Y80901", "Y80901", "Y80901", "AMG-999", "AMG-999", "W3188", "W3188"),  
                    starttime= c("2015-09-18 09:55", "2015-09-18 23:00", "2015-09-20 15:00", "2015-09-17 15:42", "2015-09-21 09:22", "2015-09-17 09:00", "2015-09-21 14:00"),
                    endtime = c("2015-09-18 17:55", "2015-09-20 11:00", "2015-09-21 12:00",  "2015-09-18 13:00",  "2015-09-21 14:22", "2015-09-21 12:00", "2015-09-21 16:00"))

tmpdf
str(tmpdf)
library(lubridate)
tmpdf$starttime=ymd_hm(paste(tmpdf$starttime))
tmpdf$endtime=ymd_hm(paste(tmpdf$endtime))
tmpdf$Period=day(tmpdf$starttime)
tmpdf$diff=difftime(tmpdf$endtime,tmpdf$starttime)
tmpdf

Upvotes: 0

rhozzy

Reputation: 352

The plyr package (a predecessor of the dplyr you mention) can be used to solve this problem. One possible solution:

library(plyr) # provides ddply() and .()
tmpdf$starttime <- as.POSIXct(tmpdf$starttime) #convert date/time columns to date/time values in R
tmpdf$endtime <- as.POSIXct(tmpdf$endtime) #convert date/time columns to date/time values in R
newdf <- ddply(tmpdf,.(as.Date(starttime),licensePlate),function(df){
  #diffdays is the full length of each interval in days, attributed to its start date
  df$diffdays <- as.double(difftime(df$endtime,df$starttime,units='days'))
  df
})
#If you want to only have the Period, LicensePlate, and Used Time columns remaining:
newdf <- subset(newdf,select=c(1,2,5))
colnames(newdf) <- c('Period','LicensePlate','UsedTime')

Hope it helps!