Reputation: 6104
I cannot load the file into RAM (assume a user might want the first billion lines of a file with ten billion records).
Here is my solution, but I think there has got to be a faster way?
Thanks
# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000
# my attempt
incon <- file( infile , "r")
outcon <- file( outfile , "w")
for ( i in seq( num_lines ) ){
line <- readLines( incon , 1 )
writeLines( line , outcon )
}
close( incon )
close( outcon )
Upvotes: 8
Views: 1273
Reputation: 269644
Try the head utility. It should be available on all operating systems that R supports (on Windows this assumes you have Rtools installed and the Rtools bin directory on your path). For example, to copy the first 100 lines from in.dat to out.dat:
shell("head -n 100 in.dat > out.dat")
Upvotes: 2
Reputation: 8105
C++ solution
It is not too difficult to write some C++ code for this:
#include <fstream>
#include <R.h>
#include <Rdefines.h>

extern "C" {

// [[Rcpp::export]]
SEXP dump_n_lines(SEXP rin, SEXP rout, SEXP rn) {
  // no checks on types and size
  std::ifstream strin(CHAR(STRING_ELT(rin, 0)));
  std::ofstream strout(CHAR(STRING_ELT(rout, 0)));
  int N = INTEGER(rn)[0];
  int n = 0;
  while (strin && n < N) {
    int c = strin.get();
    if (!strin) break;                  // stop at end of file without writing a stray byte
    if (c == '\n') ++n;
    strout.put(static_cast<char>(c));
  }
  strin.close();
  strout.close();
  return R_NilValue;
}

}
When saved as yourfile.cpp, you can do
Rcpp::sourceCpp('yourfile.cpp')
From RStudio you don't have to load anything; from a plain R console you will have to load Rcpp first. On Windows you will probably have to install Rtools.
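Once sourced, the function is called like any other R function; a minimal usage sketch (the count is passed as 1000L because the C++ code reads it with INTEGER(rn)):
dump_n_lines("/some/big/file.txt", "/some/smaller/file.txt", 1000L)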
More efficient R code
By reading larger blocks instead of single lines, your R code also speeds up:
dump_n_lines2 <- function(infile, outfile, num_lines, block_size = 1E6) {
  incon <- file(infile, "r")
  outcon <- file(outfile, "w")
  remain <- num_lines
  while (remain > 0) {
    size <- min(remain, block_size)
    lines <- readLines(incon, n = size)
    writeLines(lines, outcon)
    # check for eof:
    if (length(lines) < size) break
    remain <- remain - size
  }
  close(incon)
  close(outcon)
}
Benchmark
lines <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean commodo
imperdiet nunc, vel ultricies felis tincidunt sit amet. Aliquam id nulla eu mi
luctus vestibulum ac at leo. Integer ultrices, mi sit amet laoreet dignissim,
orci ligula laoreet diam, id elementum lorem enim in metus. Quisque orci neque,
vulputate ultrices ornare ac, interdum nec nunc. Suspendisse iaculis varius
dapibus. Donec eget placerat est, ac iaculis ipsum. Pellentesque rhoncus
maximus ipsum in hendrerit. Donec finibus posuere libero, vitae semper neque
faucibus at. Proin sagittis lacus ut augue sagittis pulvinar. Nulla fermentum
interdum orci, sed imperdiet nibh. Aliquam tincidunt turpis sit amet elementum
porttitor. Aliquam lectus dui, dapibus ut consectetur id, mollis quis magna.
Donec dapibus ac magna id bibendum."
lines <- rep(lines, 1E6)
writeLines(lines, con = "big.txt")
infile <- "big.txt"
outfile <- "small.txt"
num_lines <- 1E6L
library(microbenchmark)
microbenchmark(
solution0(infile, outfile, num_lines),
dump_n_lines2(infile, outfile, num_lines),
dump_n_lines(infile, outfile, num_lines)
)
Results in (solution0 is the OP's original solution):
Unit: seconds
expr min lq mean median uq max neval cld
solution0(infile, outfile, num_lines) 11.523184 12.394079 12.635808 12.600581 12.904857 13.792251 100 c
dump_n_lines2(infile, outfile, num_lines) 6.745558 7.666935 7.926873 7.849393 8.297805 9.178277 100 b
dump_n_lines(infile, outfile, num_lines) 1.852281 2.411066 2.776543 2.844098 2.965970 4.081520 100 a
The C++ solution can probably be sped up by reading large blocks of data at a time. However, this would make the code much more complex. Unless this is something I had to do on a very regular basis, I would probably stick with the pure R solution.
Remark: when your data is tabular, you can use my LaF package to read arbitrary lines and columns from your data set without having to read all of the data into memory.
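A short sketch of that approach (the column layout here is invented; LaF needs the column types up front):
library(LaF)
# open the file without reading it; three columns are assumed purely for illustration
laf <- laf_open_csv("/some/big/file.txt",
                    column_types = c("integer", "string", "double"))
first_rows <- laf[1:1000, ]   # reads only these rows into a data.frame
close(laf)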
Upvotes: 6
Reputation: 3440
The operating system is the best level at which to do big-file manipulations. This is quick, and comes with a benchmark (which seems important, given the poster asked for a faster method):
# create test file in shell
echo "hello
world" > file.txt
for i in {1..29}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done
wc -l file.txt
# about a billion rows
This takes a few seconds for a billion rows. Change 29 to 32 in order to get about ten billion.
Then in R, using ten million rows from the billion (a hundred million is way too slow to compare with the poster's solution):
# in R, copy first ten million rows of the billion
system.time(
system("head -n 10000000 file.txt > out.txt")
)
# poster's solution
system.time({
infile <- "file.txt"
outfile <- "out.txt"
num_lines <- 1e7
incon <- file( infile , "r")
outcon <- file( outfile , "w")
for ( i in seq( num_lines )) {
line <- readLines( incon , 1 )
writeLines( line , outcon )
}
close( incon )
close( outcon )
})
And the results on a mid-range MacBook Pro, a couple of years old:
Rscript head.R
user system elapsed
1.349 0.164 1.581
user system elapsed
620.665 3.614 628.260
Would be interested to see how fast the other solutions are.
Upvotes: 3
Reputation: 3198
The "right" or best answer for this would be to use a language that works much more easily with filehandles. For instance, while perl is an ugly language in many ways, this is where it shines. Python can also do this very well, in a more verbose fashion.
However, you have explicitly stated you want things in R. First, I'll assume that this thing might not be a CSV or other delimited flat file.
Use the readr library. Within that library, use read_lines(). Something like this (first, get the number of lines in the entire file, using something like what is shown here):
library(readr)
# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000
# readr attempt
# num_lines_tot is found via the method shown in the link above
num_loops <- ceiling(num_lines_tot / num_lines)
outcon <- file(outfile, "w")
for (i in seq(num_loops)) {
  # pass the path (not an open connection) so that skip/n_max
  # select the right chunk on every pass
  lines <- read_lines(infile, skip = (i - 1) * num_lines,
                      n_max = num_lines)
  writeLines(lines, outcon)
}
close(outcon)
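num_lines_tot is not defined above; a minimal sketch that counts the lines in chunks without loading the whole file (the method behind the original link may differ):
count_lines <- function(path, chunk = 1e6) {
  con <- file(path, "r")
  on.exit(close(con))
  n <- 0
  repeat {
    got <- length(readLines(con, n = chunk))
    n <- n + got
    if (got < chunk) break
  }
  n
}
num_lines_tot <- count_lines(infile)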
A few things to note:
- There is no writing function in readr that is as generic as it seems you want. (There is, for instance, write_delim, but you did not specify delimited.)
- Whatever is already in "outfile" will be lost. I am not sure if you meant to open "outfile" in append mode ("a"), but I suspect that would be helpful.
- If the file does turn out to be delimited, look at read_csv or read_delim within the readr package.
Upvotes: 2
Reputation: 37879
You can use ff::read.table.ffdf for this. It stores the data on the hard disk and does not use any RAM.
library(ff)
infile <- read.table.ffdf(file = "/some/big/file.txt")
Essentially you can use the above function in the same way as base::read.table, with the difference that the resulting object will be stored on the hard disk.
You can also use the nrows argument to load a specific number of rows. The documentation is here if you want to have a read. Once you have read the file, you can subset the specific rows you need and even convert them to data.frames if they fit in RAM.
There is also a write.table.ffdf function that will allow you to write an ffdf object (resulting from read.table.ffdf), which makes the process even easier.
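For the question's use case, a sketch under the assumption that the big file parses with read.table's defaults (nrows limits how much is read):
library(ff)
first_rows <- read.table.ffdf(file = "/some/big/file.txt", nrows = 1000)
write.table.ffdf(first_rows, file = "/some/smaller/file.txt")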
As an example of how to use read.table.ffdf (or read.delim.ffdf, which is pretty much the same thing) see the following:
# writing a file in my current directory
# note that there is no standard number of columns
sink(file='test.txt')
cat('foo , foo, foo\n')
cat('foo, foo\n')
cat('bar bar , bar\n')
sink()
#read it with read.delim.ffdf or read.table.ffdf
read.delim.ffdf(file='test.txt', sep='\n', header=F)
Output:
ffdf (all open) dim=c(3,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
PhysicalName VirtualVmode PhysicalVmode AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
V1 V1 integer integer FALSE FALSE FALSE 1 1 1 TRUE
ffdf data
V1
1 foo , foo, foo
2 foo, foo
3 bar bar , bar
If you are using a txt file then this is a general solution, as each line will finish with a \n character.
Upvotes: 7
Reputation: 4614
I usually speed up such loops by reading and writing in chunks of, say, 1000 lines. If num_lines is a multiple of 1000, the code becomes:
# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000000
# my attempt
incon <- file( infile, "r")
outcon <- file( outfile, "w")
step1 <- 1000
nsteps <- ceiling(num_lines / step1)
for (i in 1:nsteps) {
  line <- readLines(incon, step1)
  writeLines(line, outcon)
}
close( incon )
close( outcon )
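If num_lines is not a multiple of the chunk size, the last read can simply be shortened; a small variant of the same loop (reusing infile, outfile and num_lines from above):
incon  <- file(infile, "r")
outcon <- file(outfile, "w")
step1  <- 1000
nsteps <- ceiling(num_lines / step1)
for (i in seq_len(nsteps)) {
  n_this <- min(step1, num_lines - (i - 1) * step1)  # shorter final chunk
  writeLines(readLines(incon, n_this), outcon)
}
close(incon)
close(outcon)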
Upvotes: 3
Reputation: 368241
I like pipes for that as we can use other tools. And conveniently, the (truly excellent) connections interface in R supports it:
## scratch file
filename <- "foo.txt"
## create a file, no header or rownames for simplicity
write.table(1:50, file=filename, col.names=FALSE, row.names=FALSE)
## sed command: print from first address to second, here 4 to 7
## the -n suppresses output unless selected
cmd <- paste0("sed -n -e '4,7p' ", filename)
##print(cmd) # to debug if needed
## we use the cmd inside pipe() as if it was file access so
## all other options to read.csv (or read.table) are available too
val <- read.csv(pipe(cmd), header=FALSE, col.names="selectedRows")
print(val, row.names=FALSE)
## clean up
unlink(filename)
If we run this, we get rows four to seven as expected:
edd@max:/tmp$ r piper.R
selectedRows
4
5
6
7
edd@max:/tmp$
Note that our use of sed made no assumptions about the file structure beyond ordinary text lines terminated by newlines. If you assumed binary files with different record separators, we could suggest different solutions.
Also note that you control the command passed to the pipe() function. So if you want rows 1000004 to 1000007, the usage is exactly the same: you just give the first and last row (of each segment; there can be several). And instead of read.csv(), readLines() could be used equally well.
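A sketch of that readLines() variant for such a later segment (the path is just the question's placeholder):
bigfile <- "/some/big/file.txt"
con  <- pipe(paste0("sed -n -e '1000004,1000007p' ", bigfile))
rows <- readLines(con)   # the four selected lines, as character strings
close(con)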
Lastly, sed is available everywhere and, if memory serves, is also part of Rtools. The basic filtering functionality can also be obtained with Perl or a number of other tools.
Upvotes: 7
Reputation: 189
Try using:
line <- read.csv(infile, nrows = 1000)
write.csv(line, file = outfile, row.names = FALSE)
Upvotes: -2