Reputation: 821
I have a large text file (>10 million rows, >1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5) {
  print(c(address(x), refs(x)))
  y <- readLines(con, n = 1, warn = FALSE)
  x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains:
4
18
2
14
16
Following the advice at http://adv-r.had.co.nz/memory.html, I have printed the memory address of my big.matrix object, and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place? Is there a better way to load, process, and then save these data? The current method is slow!
Upvotes: 4
Views: 831
Reputation: 946
- Is there a better way to load, process, and then save these data? The current method is slow!
The slowest part of your method appears to be the call that reads each line individually. We can 'chunk' the data, i.e. read in several lines at a time, so that we stay under the memory limit while likely speeding things up.
Here's the plan:
- Read in a chunk of the data
- Double the variable of interest in that chunk
- Push that chunk back into a new file to save for later
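The skeleton of that plan looks roughly like the sketch below. The file names and chunk size are just placeholders for illustration; the full function further down also reads the header and picks the column by name.
library(readr)

# Step 1: open a connection and read the file a chunk of lines at a time
con <- file("./test_set2.csv", open = "r")
header <- readLines(con, n = 1)                       # skip the header line
first <- TRUE
while (length(lines <- readLines(con, n = 50000)) > 0) {
  chunk <- read.csv(text = lines, header = FALSE)     # parse the chunk
  # Step 2: do the per-chunk work (here, double the first column)
  doubled <- chunk[[1]] * 2
  # Step 3: push the result into a new file, appending after the first chunk
  write_csv(data.frame(doubled), "./doubled.csv", append = !first)
  first <- FALSE
}
close(con)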
library(readr)
# Make a test file: 100,000 rows by 10 columns of random values
x <- data.frame(matrix(rnorm(1000000), 100000, 10))
write_csv(x, "./test_set2.csv")
# Create a function to read a file in chunks and double one variable
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
  # Set up variables
  num.lines <- 0
  lines.per <- NULL
  i <- 1L

  # Gather column names and the position of the objective column
  connection.names <- file(calc.file, open = "r")
  data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
  close(connection.names)
  col.name <- which(colnames(data.names) == variable)

  # Find the length of the file, counting lines chunk by chunk
  connection.len <- file(calc.file, open = "r")
  while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
    lines.per[i] <- linesread
    num.lines <- num.lines + linesread
    i <- i + 1L
  }
  close(connection.len)

  # Make a connection for the doubling pass
  # Loop through the file and double the chosen variable
  connection.double <- file(calc.file, open = "r")
  for (j in seq_along(lines.per)) {
    # Read in a chunk of the file; skipping the header on the first chunk
    # stops read.table from breaking
    if (j == 1) {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         skip = 1, nrows = lines.per[j], comment.char = "")
    } else {
      data <- read.table(connection.double, sep = ",", header = FALSE,
                         nrows = lines.per[j], comment.char = "")
    }
    # Grab the column we need and double it
    double <- data[, I(col.name)] * 2
    # Append after the first chunk so earlier output isn't overwritten
    if (j != 1) {
      write_csv(data.frame(double), outputFile, append = TRUE)
    } else {
      write_csv(data.frame(double), outputFile)
    }
    message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
  }
  close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change double <- data[, I(col.name)] * 2 to whatever operation you need to apply to each chunk.
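If you would rather land the results in a big.matrix than in a .csv (as in the question), the same chunked loop can fill one directly: assigning with [<- writes into the existing big.matrix, so it is modified in place rather than copied. Here is a minimal sketch under the question's setup (one integer per line in x.csv); the row count n and the chunk size are assumptions for illustration, e.g. n could come from a counting pass like the one in calcDouble.
library(bigmemory)

# Sketch: fill a big.matrix chunk by chunk instead of writing a .csv.
# Assumes x.csv holds one integer per line and n (the row count) is known.
n <- 5L
out <- big.matrix(nrow = n, ncol = 1, type = 'integer')

con <- file("x.csv", open = "r")
read.size <- 2L           # tiny here; use something like 500000 for a real file
row <- 1L
while (length(chunk <- readLines(con, n = read.size)) > 0) {
  vals <- 2L * as.integer(chunk)
  out[row:(row + length(vals) - 1L), 1] <- vals   # writes into the same object
  row <- row + length(vals)
}
close(con)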
Upvotes: 2