Reputation: 2592
I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:
library(gdata)
nx = 150 # ncol of my arrays
ny = 130 # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10
for (i in 1:niter) {
write(paste(i, 'is the current iteration'), myfile, append=T)
z = matrix(runif(nx*ny), nrow = ny) # random numbers in a matrix with ny rows and nx columns
write.fwf(z, myfile, append=T, rownames=F, colnames=F) #write in fixed width format
}
With nx=5 and ny=2, I would have a file like this:
# 1 is the current iteration
# 0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
# 0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
# 2 is the current iteration
# 0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
# 0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
# 3 is the current iteration
# ...
I want to read the successive arrays as fast as possible and put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?
Given that the output is regular, I thought readr would be a good idea (?).
The only way I can think of is to read it manually in chunks, in order to eliminate the useless info lines:
library(readr)
ztot = numeric(niter*nx*ny) # allocate a vector with final size
# (the arrays will be vectorized and successively appended to each other)
for (i in 1:niter) {
nskip = (i-1)*(ny+1) + 1 # number of lines to skip, including the info lines
z = read_table(myfile, skip = nskip, n_max = ny, col_names=F)
z = as.vector(t(z))
ifirst = (i-1)*ny*nx + 1 # appropriate index
ztot[ifirst:(ifirst+nx*ny-1)] = z
}
# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in DF for future analysis:
x = rep(rep(seq_len(nx), ny), niter)
y = rep(rep(seq_len(ny), each=nx), niter)
myDF = data.frame(x=x, y=y, z=ztot) # use the full vector ztot, not the last chunk z
But this is not fast enough. How can I achieve this faster?
Is there a way to read everything at once and delete the useless rows afterwards?
Alternatively, is there a reading function that accepts a vector of line positions as its skip argument, rather than a single number of initial rows?
PS: note the reading operation is to be repeated on many files (same structure) located in different directories, in case it influences the solution...
EDIT
The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:
bylines <- readLines(myfile)
dummylines = seq(1, by=(ny+1), length.out=niter)
bylines = bylines[-dummylines] # remove dummy, undesirable lines
asOneChar <- paste(bylines, collapse='\n') # Then process output from readLines
library(data.table)
ztot <- fread(asOneChar) # parse the collapsed string (the variable defined above)
ztot <- c(t(ztot))
A discussion of how to process the output of readLines can be found here
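As a rough base-R sketch of the same idea (read everything, drop the info lines by index, then parse), scan() can replace the paste/fread step. The small dimensions and the temporary file are assumptions made so the example runs on its own, and write.table stands in for write.fwf to avoid the gdata dependency. Whether this is faster than fread on a real file would need to be timed.

```r
# Hypothetical small dimensions so the sketch runs quickly
nx <- 5; ny <- 2; niter <- 3
myfile <- tempfile(fileext = ".txt")
for (i in seq_len(niter)) {
  write(paste(i, "is the current iteration"), myfile, append = TRUE)
  z <- matrix(runif(nx * ny), nrow = ny)
  write.table(z, myfile, append = TRUE, row.names = FALSE, col.names = FALSE)
}

bylines <- readLines(myfile)
# Info lines sit at positions 1, ny + 2, 2 * ny + 3, ...
infolines <- seq(1, by = ny + 1, length.out = niter)
# scan() parses the remaining lines field by field, i.e. row by row,
# so the result has the same ordering as c(t(ztot)) in the fread version
ztot <- scan(text = bylines[-infolines], quiet = TRUE)

# Coordinates follow the same row-by-row ordering
myDF <- data.frame(
  x = rep(rep(seq_len(nx), ny), niter),
  y = rep(rep(seq_len(ny), each = nx), niter),
  z = ztot
)
```

Because scan() returns a plain numeric vector, no transpose is needed before building the data.frame.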
Upvotes: 0
Views: 117
Reputation: 2592
Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:
tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand) # call the command from R
ztot <- fread(tmpfile)
ztot <- c(t(ztot))
Lines can be removed on the basis of a pattern or of indices for example. This was suggested by @Roland from here.
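For the index-based variant, a minimal sketch could look like the following. It assumes awk is available on the PATH (typically true on Linux and macOS, not necessarily on Windows), uses a small temporary file with hypothetical dimensions, and reads through a base-R pipe() connection instead of fread so that no extra package is needed.

```r
# Hypothetical small dimensions and a temporary file for illustration
nx <- 5; ny <- 2; niter <- 3
myfile <- tempfile(fileext = ".txt")
for (i in seq_len(niter)) {
  write(paste(i, "is the current iteration"), myfile, append = TRUE)
  z <- matrix(runif(nx * ny), nrow = ny)
  write.table(z, myfile, append = TRUE, row.names = FALSE, col.names = FALSE)
}

# Info lines are lines 1, ny + 2, 2 * ny + 3, ..., i.e. those whose line
# number NR leaves remainder 1 when divided by (ny + 1); keep all the others.
# "%%" in sprintf() produces a literal "%" for awk.
cmd <- sprintf("awk 'NR %% %d != 1' %s", ny + 1, shQuote(myfile))
clean <- read.table(pipe(cmd))
ztot <- c(t(as.matrix(clean)))
```

Reading from the pipe avoids writing an intermediate file; the temporary-file approach shown above works just as well.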
Upvotes: 2
Reputation: 2757
I am not sure I have understood your problem correctly. Running your script created a file with 1310 lines, with "This is iteration 1", "This is iteration 2", and so on printed at these lines:
Line 1: This is iteration 1
Line 132: This is iteration 2
Line 263: This is iteration 3
Line 394: This is iteration 4
Line 525: This is iteration 5
Line 656: This is iteration 6
Line 787: This is iteration 7
Line 918: This is iteration 8
Line 1049: This is iteration 9
Line 1180: This is iteration 10
You want to read the data between these lines and skip these 10 strings.
You can do this by tricking read.table: set its comment.char to "T", which makes read.table treat all lines starting with the letter "T" as comments and skip them.
data<-read.table("bigFile.txt",comment.char = "T")
This will give you a data.frame of 1300 observations with 150 variables.
> dim(data)
[1] 1300 150
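A tiny self-contained illustration of the comment.char trick, using a made-up six-line file (the file contents are an assumption for the example only). Note that read.table ignores everything from the comment character to the end of a line, so this only works when that character never appears in the data itself.

```r
# Made-up miniature file: info lines start with "T", data lines are numeric
tmp <- tempfile(fileext = ".txt")
writeLines(c("This is iteration 1",
             "0.1 0.2 0.3",
             "0.4 0.5 0.6",
             "This is iteration 2",
             "0.7 0.8 0.9",
             "1.0 1.1 1.2"), tmp)

# Lines that consist only of a comment are dropped entirely
dat <- read.table(tmp, comment.char = "T")
dim(dat)
```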
If the strings are not consistent, read your data with read.table using the fill=TRUE flag. This will not break the input process.
data<-read.table("bigFile.txt",fill=TRUE)
Your data looks like this
> head(data)
V1 V2 V3 V4 V5 V6 V7
1: 1.0000000 is the current iteration NA NA
2: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
3: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
4: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
5: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
6: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
Now you can see how the strings are distributed across the columns, so you can simply subset your data set with pattern matching, dropping the rows whose columns match these strings. For example
library(data.table)
data<-as.data.table(data)
cleaned_data<-data[!(V3 %like% "the"),]
> head(cleaned_data)
V1 V2 V3 V4 V5 V6 V7
1: 0.4231829 0.142353335 0.3813622692 0.07224282 0.037681101 0.7761575 0.1132471
2: 0.1113989 0.587115721 0.2960257430 0.49175715 0.642754463 0.4036675 0.4940814
3: 0.9750350 0.691093967 0.8610487920 0.08208387 0.826175117 0.8789275 0.3687355
4: 0.1831840 0.001007096 0.2385952028 0.85939856 0.646992019 0.5783946 0.9095849
5: 0.7648907 0.204005372 0.8512769730 0.10731854 0.299391995 0.9200760 0.7814541
6: 0.3943193 0.508373900 0.2131134905 0.92474343 0.432134031 0.4585807 0.9811607
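One point worth checking after this kind of subsetting: columns that held a word from the info lines are read as character, so they may still need converting to numeric once those rows are gone. A base-R sketch of the full sequence, on a made-up miniature file (contents and object names are assumptions for illustration):

```r
# Made-up miniature file in the question's format
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 is the current iteration",
             "0.1 0.2 0.3 0.4 0.5 0.6",
             "0.7 0.8 0.9 1.0 1.1 1.2",
             "2 is the current iteration",
             "1.3 1.4 1.5 1.6 1.7 1.8",
             "1.9 2.0 2.1 2.2 2.3 2.4"), tmp)

# fill = TRUE pads the short info lines; stringsAsFactors = FALSE keeps the
# mixed columns as character rather than factor on older R versions
raw <- read.table(tmp, fill = TRUE, stringsAsFactors = FALSE)

# Drop the info rows, identified here by the word in the third column
cleaned <- raw[raw$V3 != "the", ]

# Columns V2..V5 are still character at this point; convert everything
cleaned[] <- lapply(cleaned, as.numeric)
str(cleaned)
```

On a large file the character-to-numeric conversion adds a second pass over the data, which is worth keeping in mind when comparing this approach with the others.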
Upvotes: 0