Reputation: 33
I have a dataset with +1M rows. It has a start_date and an end_date using the format "aaaa-mm-dd hh-mm-ss"
.
I want to add a new column to the dataset with the duration of the time between the end and start date for each row.
So far, I'm able to get the time difference using the difftime
function:
difftime("2020-11-01 13:45:40", "2020-11-01 13:36:00", units = "mins")
This gets me the following output: Time difference of 9.666667 mins
, which I would like to replicate for the entire +1M-rows dataset.
For my test I'm working with a small tibble. I tried using the mutate
function with rowwise
and list
. My code goes as follows:
rowwise() %>%
mutate(trip_duration = list((difftime(as.Date(df$`end time`), as.Date(df$`start time`), units = "mins"))))
This provides the following output:
# A tibble: 3 × 3
# Rowwise:
`start time` `end time` trip_duration
<chr> <chr> <list>
1 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
2 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
3 2020-11-01 13:36:00 2020-11-01 13:45:40 <drtn [3]>
The new column doesn't show what I'm looking for, it just shows the number 3
for each row no matter if I ask for the result in mins
, secs
, or even hours
, and now I'm stuck trying to figure out the way to do the calculation.
Thanks in advance to anyone able to help, cheers!
Upvotes: 2
Views: 2627
Reputation: 1285
I know this has been answered, but I wanted to add a lubridate
approach.
First, import and ensure your columns are in the correct format:
library(lubridate)
df$`end time`<- lubridate::as_datetime(df$`end time`)
df$`start time` <- lubridate::as_datetime(df$`start time`)
Then simply add the column (ensure you put the latest date first, or you'll get a negative number):
df$trip_duration <- time_length(df$`end time` - df$`start time`, unit="days")
Upvotes: 0
Reputation: 389325
You need to make some changes in your code.
First and foremost, don't use $
in dplyr
pipes. Pipes (%>%
) were created to avoid using df$column_name
everytime you want to use variable from the dataframe. Using $
can have unintended consequences when grouping the data or using rowwise
as you can see in your case.
Secondly, difftime
is vectorised so no need of rowwise
here.
Finally, if you want time difference in minutes you should change the values to POSIXct
type and not dates. Try the following -
library(dplyr)
df <- df %>%
mutate(trip_duration = difftime(as.POSIXct(`end time`),
as.POSIXct(`start time`), units = "mins"))
Upvotes: 2