R filtering taking abnormally high time to load

Question

I have been encountering a strange performance issue with R.

I have a CSV file that contains close to 600,00 lines and 11 columns. The last column contains dates. I am trying to filter rows based on whether the date in the last column falls on a weekend or a weekday. As the output below shows, this relatively simple filter takes 12 seconds.

> library(lubridate)
> data335 = read.csv("data335.csv")
> Sys.time()
[1] "2017-10-29 00:50:16 IST"
> delete_variable = data335[ifelse((wday(data335$ticket_date) %in% c("1","6")), T , F),][11]
> Sys.time()
[1] "2017-10-29 00:50:28 IST"

However, filtering on other column values takes a second or two at most.

> Sys.time()
[1] "2017-10-29 00:58:58 IST"
> delete_variable = data335[(data335$route_no == "V-335EUP")  ,][11]
> Sys.time()
[1] "2017-10-29 00:58:58 IST"

I'm sure I am not doing the first filter the idiomatic R way. Is there a way to bring the filtering time under 2 seconds?

Alex P · Accepted Answer

On my machine, your original code ran in ~7 seconds. I noticed that data335$ticket_date was stored as a factor, so I read it in as a string and coerced it to Date format. The time dropped to about 0.1 seconds.
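If the data is already loaded, the same conversion can be applied without re-reading the file. A minimal sketch, assuming a small made-up data frame: the sample values are hypothetical, and only the column name and the "%d-%m-%Y" format come from this thread.

```r
# Hypothetical two-row stand-in for data335; the real data comes from the CSV
df <- data.frame(ticket_date = c("27-10-2017", "28-10-2017"),
                 stringsAsFactors = TRUE)
class(df$ticket_date)   # "factor"

# Go through character first, then parse once into Date
df$ticket_date <- as.Date(as.character(df$ticket_date), format = "%d-%m-%Y")
class(df$ticket_date)   # "Date"
```

Since R 4.0.0, read.csv defaults to stringsAsFactors = FALSE, so on newer versions the column would arrive as character rather than factor and only the as.Date step would be needed.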

I also took out the ifelse() call, because %in% already returns a logical vector, and used numeric values instead of character ones in c(1,7). You had c("1", "6"), but with lubridate's default settings wday() returns 1 for Sunday and 7 for Saturday, so if you are looking for weekends I think you want 1 and 7. Those changes resulted in minor speed improvements.
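To check the day numbering, here is a small sketch on four hand-picked dates. The dates are illustrative and not taken from the original file, and the numbering assumes lubridate's default week start (Sunday).

```r
library(lubridate)

# Illustrative dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2017-10-27", "2017-10-28", "2017-10-29", "2017-10-30"))

# With the default week start, Sunday is 1 and Saturday is 7
day_numbers <- wday(dates)

# %in% already yields a logical vector suitable for row indexing
is_weekend <- day_numbers %in% c(1, 7)

# Wrapping it in ifelse() gives back the same vector
same_result <- identical(is_weekend, ifelse(is_weekend, TRUE, FALSE))
```

Here is_weekend is TRUE only for the Saturday and the Sunday, whereas c(1, 6) would have flagged the Friday instead of the Saturday.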

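library(lubridate)

# Read the dates as character strings rather than factors
data335 <- read.csv('Downloads/data335.csv', stringsAsFactors=FALSE)
# Convert the column to Date once, up front
data335$ticket_date <- as.Date(data335$ticket_date, format="%d-%m-%Y")

start <- Sys.time()
# Keep weekend rows (1 = Sunday, 7 = Saturday by default), then column 11
delete_variable <- data335[wday(data335$ticket_date) %in% c(1,7),][11]
end <- Sys.time()
end - start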
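As an alternative to subtracting two Sys.time() values, base R's system.time() can wrap the expression being measured. This is a rough sketch on synthetic dates, since the original CSV is not available here. The dates vector and its size are assumptions made for illustration.

```r
library(lubridate)

# Synthetic stand-in for the ticket_date column (hypothetical data)
set.seed(1)
dates <- as.Date("2017-01-01") + sample(0:364, 1e5, replace = TRUE)

# system.time() evaluates the expression and reports user/system/elapsed time
timing <- system.time({
  is_weekend <- wday(dates) %in% c(1, 7)
})

timing[["elapsed"]]   # wall-clock seconds; varies by machine
```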
