slabofguinness

Reputation: 821

Memory problems using bigmemory to load large dataset in R

I have a large text file (>10 million rows, >1 GB) that I wish to process one line at a time, to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:

library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')

for (i in 1:5) {
  print(c(address(x), refs(x)))  # track x's address across iterations
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}

close(con)

where x.csv contains

4
18
2
14
16

Following the advice at http://adv-r.had.co.nz/memory.html, I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:

[1] "0x101e854d8" "2"          
[1] "0x101d8f750" "2"          
[1] "0x102380d80" "2"          
[1] "0x105a8ff20" "2"          
[1] "0x105ae0d88" "2"   
  1. Can big.matrix objects be modified in place?

  2. Is there a better way to load, process, and then save these data? The current method is slow!

Upvotes: 4

Views: 831

Answers (1)

Steve Bronder

Reputation: 946

  2. Is there a better way to load, process, and then save these data? The current method is slow!

The slowest part of your method appears to be the call that reads each line individually. We can 'chunk' the data, i.e. read several lines at a time, to stay within the memory limit while potentially speeding things up.

Here's the plan:

  1. Figure out how many lines we have in a file
  2. Read in a chunk of those lines
  3. Perform some operation on that chunk
  4. Push that chunk back into a new file to save for later

    library(readr)
    # Make an example file: 100,000 rows by 10 columns
    x <- data.frame(matrix(rnorm(100000 * 10), 100000, 10))
    
    write_csv(x, "./test_set2.csv")
    
    # Create a function to read a variable in a file, chunk by chunk, and double it
    calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                           read.size = 500000, variable = "X1") {
      # Set up variables
      num.lines <- 0
      lines.per <- NULL
      i <- 1L
    
      # Gather column names and the position of the objective column
      connection.names <- file(calc.file, open = "r")
      data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
      close(connection.names)
      col.name <- which(colnames(data.names) == variable)
    
      # Find the length of the file, counting read.size lines at a time
      connection.len <- file(calc.file, open = "r")
      while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
        lines.per[i] <- linesread
        num.lines <- num.lines + linesread
        i <- i + 1L
      }
      close(connection.len)
    
      # Make a connection for the doubling pass
      # Loop through the file chunk by chunk and double the chosen variable
      connection.double <- file(calc.file, open = "r")
      for (j in 1:length(lines.per)) {
        # The first chunk includes the header line, so skip it and read
        # one fewer data row; later chunks are read in full
        if (j == 1) {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             skip = 1, nrows = lines.per[j] - 1, comment.char = "")
        } else {
          data <- read.table(connection.double, sep = ",", header = FALSE,
                             nrows = lines.per[j], comment.char = "")
        }
        # Grab the column we need and double it
        double <- data[, col.name] * 2
        # Write the first chunk with a header, then append the rest
        if (j == 1) {
          write_csv(data.frame(double), outputFile)
        } else {
          write_csv(data.frame(double), outputFile, append = TRUE)
        }
    
        message(paste0("Reading from chunk ", j, " of ", length(lines.per)))
      }
      close(connection.double)
    }
    
    calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")

So we get back a .csv file with the manipulated data. You can change double <- data[, col.name] * 2 to whatever operation you need to apply to each chunk.
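For instance, here is a minimal sketch of one such replacement, swapping the doubling for a per-row log transform (the transform itself is just an illustration, not something from the question):

    # Hypothetical replacement for the doubling step: a per-row log
    # transform of the chosen column. Any vectorized, row-wise
    # operation can be dropped in here the same way.
    double <- log(abs(data[, col.name]) + 1)

Note that chunking only works cleanly for operations that treat rows independently; anything that needs the whole column at once (a mean, a rank, a cumulative sum over the file) has to carry running state from one chunk to the next instead.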

Upvotes: 2
