Reputation: 33
I am trying to merge multiple csv files into a single dataframe and trying to manipulate the resultant dataframe using a for loop. The resultant dataframe may have anywhere between 1,500,000 to 2,000,000 rows.
I am using the below code for the same.
setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path)
{
files = dir(path, pattern = '\\.csv', full.names = TRUE)
tables = lapply(files, read_csv)
do.call(rbind, tables)
}
Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)
for (i in 1:nrow(Data1))
{
Data1$Category[i] = ""
Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
ifelse(Data1$Days[i] <= 60, "31-60",
ifelse(Data1$Days[i] <= 90, "61-90",">90")))
}
However the code is running for very long. Is there a better and faster way of doing the same operation?
Upvotes: 2
Views: 102
Reputation: 887128
We can make this more optimized by reading with fread
from data.table
and then using cut/findInterval
. This will become more pronounced when it is run in multiple cores, nodes on a server where fread
utilize all the nodes and execute parallelly
library(data.table)
merge_data <- function(path) {
files = dir(path, pattern = '\\.csv', full.names = TRUE)
rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}
Data <- merge_data("D:/Projects")
Data[, Category := cut(Data1, breaks = c(-Inf, 30, 60, 90, Inf),
labels = c("<=30", "31-60", "61-90", ">90"))]
Upvotes: 2
Reputation: 33
Akrun is indeed right that fread is substantially faster read.csv.
However, in addition to his post, I would also add that your for loop is totally unnecessary. He replaced it with cut/findInterval, which I am not familiar with. In terms of simple R programming though, for loops are necessary when some factor in your calculation is changing by row. However, in your code, this is not the case and there is no need for a for loop.
Essentially you are running a calculation up to 2 million times when you only need to run the calculation over the column once.
You can replace your for loop with something like this:
Data1$category = ifelse(Data1$Days <= 30, "<=30",
ifelse(Data1$Days <= 60, "31-60",
ifelse(Data1$Days <= 90, "61-90",">90")))
and your code will run waaaaaay faster
Upvotes: 0
Reputation: 5335
You're already using dplyr
, so why not just:
Data = merge_data("D:/Projects") %>%
select(8, 9, 17) %>%
mutate(Category = cut(Days,
breaks = c(-Inf, 30, 60, 90, Inf),
labels = c("<=30", "31-60", "61-90", ">90"))
Upvotes: 1