Reputation: 1575
I have been encountering a strange performance issue with R.
I have a CSV file that contains close to 600,000 lines and 11 columns. The last column contains dates. I am trying to filter rows based on whether the date in the last column falls on a weekend or a weekday. As you can see from the output below, this relatively simple filtering takes 12 seconds.
> library(lubridate)
> data335 = read.csv("data335.csv")
> Sys.time()
[1] "2017-10-29 00:50:16 IST"
> delete_variable = data335[ifelse((wday(data335$ticket_date) %in% c("1","6")), T , F),][11]
> Sys.time()
[1] "2017-10-29 00:50:28 IST"
However, filtering on other column values hardly takes a second or two.
> Sys.time()
[1] "2017-10-29 00:58:58 IST"
> delete_variable = data335[(data335$route_no == "V-335EUP") ,][11]
> Sys.time()
[1] "2017-10-29 00:58:58 IST"
I'm sure that in the first filtering case I am not doing it the R way. Is there a way to bring the filtering time down to under 2 seconds?
Upvotes: 1
Views: 72
Reputation: 1494
On my machine, your original code ran in ~7 seconds. I noticed that data335$ticket_date was stored as a factor, so I read it in as a string and coerced it to Date format. The time dropped to 0.1 seconds.
I also took out the ifelse call, because %in% already returns a logical vector, and used numeric instead of character values for c(1,7) (you had c("1", "6"), but if you are looking for weekends, I think you want 1 and 7). Those changes resulted in minor speed improvements.
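library(lubridate)
# Read dates as character strings rather than factors
data335 <- read.csv('Downloads/data335.csv', stringsAsFactors=FALSE)
# Parse once into Date objects (the file stores dates as day-month-year)
data335$ticket_date <- as.Date(data335$ticket_date, format="%d-%m-%Y")
start <- Sys.time()
# Keep weekend rows: wday() gives 1 for Sunday and 7 for Saturday by default
delete_variable <- data335[wday(data335$ticket_date) %in% c(1,7),][11]
end <- Sys.time()
end - start  # elapsed time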
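If you would rather not depend on lubridate, the same weekend filter can be sketched in base R. This is only an illustration under assumptions: the tickets data frame, its values, and the route "V-500D" are made up to stand in for data335, and the dates are assumed to use the same day-month-year layout. Note that it swaps wday() for as.POSIXlt()$wday, which numbers the days differently.

```r
# Hypothetical toy data standing in for data335 (values are invented).
tickets <- data.frame(
  route_no    = c("V-335EUP", "V-335EUP", "V-500D", "V-335EUP"),
  ticket_date = c("27-10-2017", "28-10-2017", "29-10-2017", "30-10-2017"),
  stringsAsFactors = FALSE
)

# Parse once, up front, so the filter works on Date values rather than strings.
tickets$ticket_date <- as.Date(tickets$ticket_date, format = "%d-%m-%Y")

# as.POSIXlt()$wday counts from 0 (Sunday) to 6 (Saturday),
# so weekends are 0 and 6 here, not 1 and 7.
is_weekend <- as.POSIXlt(tickets$ticket_date)$wday %in% c(0, 6)

weekend_rows <- tickets[is_weekend, ]
weekday_rows <- tickets[!is_weekend, ]
```

Building the logical vector once and reusing it (is_weekend and !is_weekend) avoids computing the day of the week twice when you need both subsets.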
Upvotes: 4