Reputation: 33

Optimizing For loop with nested if in R

I am trying to merge multiple csv files into a single dataframe and trying to manipulate the resultant dataframe using a for loop. The resultant dataframe may have anywhere between 1,500,000 to 2,000,000 rows.

I am using the below code for the same.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

However the code is running for very long. Is there a better and faster way of doing the same operation?

Upvotes: 2

Answers (3)

akrun

Reputation: 887128

We can make this more optimized by reading with fread from data.table and then using cut/findInterval. This will become more pronounced when it is run in multiple cores, nodes on a server where fread utilize all the nodes and execute parallelly

library(data.table)
merge_data <- function(path) { 
   files = dir(path, pattern = '\\.csv', full.names = TRUE)
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
 }

Data <- merge_data("D:/Projects")
Data[, Category := cut(Data1, breaks = c(-Inf, 30, 60, 90, Inf), 
      labels = c("<=30", "31-60", "61-90", ">90"))]

Upvotes: 2

Daniel

Reputation: 33

Akrun is indeed right that fread is substantially faster read.csv.

However, in addition to his post, I would also add that your for loop is totally unnecessary. He replaced it with cut/findInterval, which I am not familiar with. In terms of simple R programming though, for loops are necessary when some factor in your calculation is changing by row. However, in your code, this is not the case and there is no need for a for loop.

Essentially you are running a calculation up to 2 million times when you only need to run the calculation over the column once.

You can replace your for loop with something like this:

Data1$category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90",">90")))

and your code will run waaaaaay faster

Upvotes: 0

ulfelder

Reputation: 5335

You're already using dplyr, so why not just:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf), 
                        labels = c("<=30", "31-60", "61-90", ">90"))

Upvotes: 1

Optimizing For loop with nested if in R

Answers (3)

Related Questions