Reputation: 11082
I want to process a file (1.9 GB) that contains 100,000,000 datasets in R. Actually, I only want every 1000th dataset. Each dataset contains 3 columns, separated by a tab. I tried data <- read.delim("file.txt"), but R was not able to manage all the datasets at once. Can I tell R directly to load only every 1000th dataset from the file?
After reading the file I want to bin the data of column 2. Is it possible to bin the number in column 2 directly? Is it possible to read the file line by line, without loading the whole file into memory?
Thanks for your help.
Sven
Upvotes: 1
Views: 5699
Reputation: 33782
You should pre-process the file with another tool before reading it into R.
To write every 1000th line to a new file, you can use sed, like this:
sed -n '0~1000p' infile > outfile
Then read the new file into R:
datasets <- read.table("outfile", sep = "\t", header = FALSE)
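If you also want to bin column 2, as asked in the question, cut() works directly on the data frame column. A minimal sketch; the choice of 10 equal-width bins is an assumption, adjust the breaks to your data:

# bin the second column into 10 equal-width intervals (assumed bin count)
bins <- cut(datasets[[2]], breaks = 10)
# count how many values fall into each bin
table(bins)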
Upvotes: 7
Reputation: 368231
You may want to look at the manual devoted to R Data Import/Export.
Naive approaches always load all the data. You don't want that. You may want another script (written in awk, perl, python, C, ...) that reads the file line by line and emits only every N-th line. You can then read that program's output directly in R via a pipe -- see the help on connections.
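As a minimal sketch of the pipe approach, assuming a Unix-like system with awk on the PATH and the file name file.txt from the question:

# awk prints only the lines whose number is a multiple of 1000;
# read.table consumes that output through a pipe connection,
# so the full file is never loaded into R
datasets <- read.table(pipe("awk 'NR % 1000 == 0' file.txt"),
                       sep = "\t", header = FALSE)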
In general, working with data this large requires some understanding of R's memory handling. Be patient: you will get this to work, but once again, a naive approach requires lots of RAM and a 64-bit operating system.
Upvotes: 7