Reputation: 821
I have a large text file (>10 million rows, >1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5) {
  print(c(address(x), refs(x)))
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains:
4
18
2
14
16
Following the advice at http://adv-r.had.co.nz/memory.html, I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place? Is there a better way to load, process, and then save these data? The current method is slow!
Upvotes: 4
Views: 831
Reputation: 946
- Is there a better way to load, process, and then save these data? The current method is slow!
The slowest part of your method appears to be the call that reads each line individually. We can 'chunk' the data, i.e. read in several lines at a time, so that we stay under the memory limit while likely speeding things up.
Here's the plan:
- Read in a chunk of the data
- Double the variable of interest in that chunk
- Push that chunk back into a new file to save for later
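The skeleton of that plan looks roughly like the sketch below. The file names and chunk size are just placeholders for illustration; the full function further down also reads the header and picks the column by name.
library(readr)

# Step 1: open a connection and read the file a chunk of lines at a time
con <- file("./test_set2.csv", open = "r")
header <- readLines(con, n = 1)                       # skip the header line
first <- TRUE
while (length(lines <- readLines(con, n = 50000)) > 0) {
  chunk <- read.csv(text = lines, header = FALSE)     # parse the chunk
  # Step 2: do the per-chunk work (here, double the first column)
  doubled <- chunk[[1]] * 2
  # Step 3: push the result into a new file, appending after the first chunk
  write_csv(data.frame(doubled), "./doubled.csv", append = !first)
  first <- FALSE
}
close(con)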
library(readr)
# Make a test file: 100,000 rows by 10 columns of random values
x <- data.frame(matrix(rnorm(1000000), 100000, 10))
write_csv(x, "./test_set2.csv")
# Create a function to read a file in chunks and double one variable
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 1L

  # Gather column names and the position of the objective column
  connection.names <- file(calc.file, open = "r")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find the length of the file, counting lines chunk by chunk
  connection.len <- file(calc.file, open = "r")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
    i <- i + 1L
  }
  close(connection.len)

  # Make a connection for the doubling pass
  # Loop through the file and double the chosen variable
  connection.double <- file(calc.file, open = "r")
  for (j in seq_along(lines.per)) {
    # Read in a chunk of the file; skipping the header on the first chunk
    # stops read.table from breaking
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }
    # Grab the column we need and double it
    double <- data[, I(col.name)] * 2
    # Append after the first chunk so earlier output isn't overwritten
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }
    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change double <- data[, I(col.name)] * 2 to whatever operation you need to apply to each chunk.
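If you would rather land the results in a big.matrix than in a .csv (as in the question), the same chunked loop can fill one directly: assigning with [<- writes into the existing big.matrix, so it is modified in place rather than copied. Here is a minimal sketch under the question's setup (one integer per line in x.csv); the row count n and the chunk size are assumptions for illustration, e.g. n could come from a counting pass like the one in calcDouble.
library(bigmemory)

# Sketch: fill a big.matrix chunk by chunk instead of writing a .csv.
# Assumes x.csv holds one integer per line and n (the row count) is known.
n <- 5L
out <- big.matrix(nrow = n, ncol = 1, type = 'integer')

con <- file("x.csv", open = "r")
read.size <- 2L           # tiny here; use something like 500000 for a real file
row <- 1L
while (length(chunk <- readLines(con, n = read.size)) > 0) {
  vals <- 2L * as.integer(chunk)
  out[row:(row + length(vals) - 1L), 1] <- vals   # writes into the same object
  row <- row + length(vals)
}
close(con)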
Upvotes: 2