Gregory

Reputation: 4279

How to speed up loading data into R?

The problem:

Data sets take 6-12 hours to load into R. Much larger data sets are coming, and my current import process clearly isn't ready for them. Once it's all in a data frame the size isn't a problem; most operations take only a few seconds, so my hardware probably isn't the issue.

Note: This question is not a duplicate of similar questions because I have already implemented most of the advice from related threads, e.g. specify colClasses.

The data:

Rows in the tab-delimited text files look like this:

20  -0.5    1   2   1   1   19  0   119 30  exp(-31.3778)

Loading the data:

I have defined a couple of functions that together loop over the files, load the data into a single data frame, and then save it as a blob. This is the process that takes hours. The process predictably slows down and uses more memory as it progresses; top indicates that R is using > 95% of the CPU and (more importantly?) > 1.5 GB of real memory by the time it's halfway through the data files.

# get numeric log from character data
extract_log <- function(x) {
  expr <- "exp\\((.*)\\)"
  substring <- sub(expr, "\\1", x)
  log <- as.numeric(substring)
  return(log)
}

# reads .dat files into data frames
read_dat <- function(x, colClasses = c(rep("numeric", 10), "character")) {
  df <- read.table(x, header = TRUE, sep = "\t", comment.char = "",
                   colClasses = colClasses)
  df <- cbind(df, log_likelihood = sapply(df$likelihood, extract_log))
  df$likelihood <- exp(df$log_likelihood)
  # drop nat. log col, add log10 column shifting data to max = 0
  df <- transform(df,
                  rlog_likelihood = log10(likelihood) - max(log10(likelihood)))
  return(df)
}

# creates a single data frame from many .dat files
df_blob <- function(path = getwd(), filepattern = "*.dat$",
                    outfile = 'df_blob.r', ...) {
  files <- list.files(path = path, pattern = filepattern, full.names = TRUE)
  progress_bar <- txtProgressBar(min = 0, max = length(files),
                                 title = "Progress", style = 3)
  df <- read_dat(files[1])
  setTxtProgressBar(progress_bar, 1)
  for (f in 2:length(files)) {
    df <- rbind(df, read_dat(files[f]))
    setTxtProgressBar(progress_bar, f)
  }
  close(progress_bar)
  save(df, file = outfile)
}

The Solution

Time required has been reduced from hours to seconds.

  1. Concatenate the data files with a shell script (time required ~12 seconds)
  2. Load the concatenated file with sqldf (time required ~6 seconds)

Both steps are exactly as described in JD Long's answer to a related question and in his blog post; a rough sketch of the process is below.
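For reference, a minimal sketch of the two steps. The file names and the assumption that every .dat file carries the same single header line are illustrative, not my exact setup:

library(sqldf)

# 1. concatenate the tab-delimited files outside R, keeping only the
#    first file's header, e.g. with a shell one-liner along the lines of
#      head -n 1 first.dat > all_data.txt && tail -q -n +2 *.dat >> all_data.txt

# 2. pull the concatenated file into a data frame via SQLite
big_df <- read.csv.sql("all_data.txt", sql = "select * from file",
                       header = TRUE, sep = "\t")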

Lessons learned

Comments by Justin and Joran significantly improved the efficiency of my read.table() approach, and for smaller data sets that approach should work fine. In particular, Justin's advice to replace looping rbind(df, read_dat(files[f])) over the files with do.call(rbind, lapply(files, read_dat)) cut the execution time by about 2/3 (sketched below). Improvements from the other suggestions were more modest but still worthwhile.
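A sketch of that change, using the same read_dat() defined above:

# read every file into a list, then bind the pieces once
# instead of growing the data frame inside a loop
files <- list.files(pattern = "*.dat$", full.names = TRUE)
df <- do.call(rbind, lapply(files, read_dat))
save(df, file = "df_blob.r")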

Upvotes: 4

Views: 5221

Answers (1)

Richie Cotton

Reputation: 121177

The big problem you have is that read.table isn't very fast. You can tweak it by setting colClasses and nrows, but at the end of the day, if your data takes 12 hours to load, you need to use different technology.

A faster approach is to import your data into a database and then read it into R. JD Long demonstrates a method using a sqlite database and the sqldf package in this answer. MonetDB and the MonetDB.R package are designed for doing this sort of thing very quickly and are worth investigating.


As Justin and Joran both spotted, incrementally growing a data frame in a loop using rbind(df, read_dat(files[f])) is a huge bottleneck. Where the full dataset fits in RAM, a far better approach is do.call(rbind, lapply(files, read_dat)). (Where it doesn't, use the above method of reading everything into a database and pulling only what you need into R; see the sketch below.)
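For example, with the sqldf approach you can filter at read time so that only the rows you need ever reach R. The column name some_column is a placeholder, not a field from the question's data:

library(sqldf)

# only rows satisfying the WHERE clause are returned to R;
# "some_column" stands in for whatever field you actually filter on
subset_df <- read.csv.sql("all_data.txt",
                          sql = "select * from file where some_column > 100",
                          header = TRUE, sep = "\t")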

Upvotes: 6
