Fast reading (by chunk?) and processing of a file with dummy lines at regular interval in R

Question

I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:

library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append=TRUE)
  z = matrix(runif(nx*ny), nrow = ny) # ny x nx matrix of random numbers
  write.fwf(z, myfile, append=TRUE, rownames=FALSE, colnames=FALSE) # write in fixed-width format
}

With nx=5 and ny=2, I would have a file like this:

# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...

I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?

Since the output is regular, I thought readr might be a good fit. The only way I can think of is to read the file manually in chunks, skipping the useless info lines:

library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size 
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
  z = as.vector(t(z))
  ifirst = (i-1)*ny*nx + 1 # appropriate index
  ztot[ifirst:(ifirst+nx*ny-1)] = z
}

# The arrays are actually spatial rasters. Compute the coordinates 
# and put everything in DF for future analysis:
x = rep(rep(seq_len(nx), ny), niter)
y = rep(rep(seq_len(ny), each=nx), niter)

myDF = data.frame(x=x, y=y, z=ztot) # ztot holds all arrays; z is only the last one

But this is not fast enough. How can I achieve this faster?

Is there a way to read everything at once and delete the useless rows afterwards?

Alternatively, is there a reading function that accepts a vector of line positions to skip, rather than a single number of initial rows?

PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...

EDIT The following solution (reading all lines with readLines, removing the unwanted ones, then processing the rest) is a faster alternative when niter is very large:

bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # then process the output of readLines
library(data.table)
ztot <- fread(asOneChar)
ztot <- c(t(ztot))

A discussion of how to process the output of readLines can be found here
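
For reference, a minimal base-R sketch of the same read-once-then-drop idea, assuming the layout described above (one info line followed by ny rows of nx numbers, repeated niter times). The small dimensions, the temporary file `demo_file` and the names `all_lines`, `info_idx` and `data_lines` are assumptions for illustration only. It parses with scan(text = ...) rather than fread, which may well be slower on very large inputs:

```r
# Illustrative dimensions and file (assumptions, much smaller than the real data)
nx <- 5; ny <- 2; niter <- 3
demo_file <- tempfile(fileext = ".txt")

# Write a small file with the same layout: 1 info line + ny data rows per block
set.seed(1)
blocks <- lapply(seq_len(niter), function(i) matrix(runif(nx * ny), nrow = ny))
for (i in seq_len(niter)) {
  cat(i, "is the current iteration\n", file = demo_file, append = TRUE)
  write.table(blocks[[i]], demo_file, append = TRUE,
              row.names = FALSE, col.names = FALSE)
}

# Read everything once, then drop the info lines by position
all_lines <- readLines(demo_file)
info_idx <- seq(1, by = ny + 1, length.out = niter)
data_lines <- all_lines[-info_idx]

# Parse the remaining lines in one call; values come back row by row
ztot <- scan(text = data_lines, quiet = TRUE)

# Coordinates matching the row-by-row order of ztot
myDF <- data.frame(
  x = rep(rep(seq_len(nx), times = ny), times = niter),
  y = rep(rep(seq_len(ny), each = nx), times = niter),
  z = ztot
)
```

Because the info lines sit at fixed positions, no pattern matching is needed here; the index vector alone identifies them.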

ztl · Accepted Answer

Pre-processing the file with a command-line tool (i.e., outside R) is actually much faster. For example, with awk:

library(data.table) # for fread
tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))

Lines can be removed by pattern or by index (line number), for example. This was suggested by @Roland here.
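
Removal by index, as opposed to by pattern, could look like the sketch below. It assumes awk is available on the PATH in a Unix-like shell and a reasonably recent data.table whose fread accepts a `cmd` argument, which also avoids the intermediate file. The demo file, its contents and the name `awk_cmd` are made up for illustration:

```r
library(data.table)

ny <- 2  # rows per array (illustrative value)

# Small demo file with the same layout: 1 info line + ny data rows per block
demo_file <- tempfile(fileext = ".txt")
writeLines(c("1 is the current iteration", "0.1 0.2 0.3", "0.4 0.5 0.6",
             "2 is the current iteration", "0.7 0.8 0.9", "1.0 1.1 1.2"),
           demo_file)

# awk keeps a line only when NR modulo (ny + 1) is not 1,
# i.e. it drops lines 1, ny + 2, 2 * ny + 3, ... (the info lines)
awk_cmd <- sprintf("awk 'NR %% %d != 1' %s", ny + 1, shQuote(demo_file))

# Let fread run the command and parse its output directly
ztot <- fread(cmd = awk_cmd, header = FALSE)
ztot <- c(t(ztot))  # flatten row by row, as in the answer above
```

The index-based filter does not depend on the wording of the info lines, but it does rely on every block having exactly ny data rows.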
