R_User

Reputation: 11082

Open large files with R

I want to process a file (1.9 GB) that contains 100,000,000 datasets in R. Actually, I only want every 1000th dataset. Each dataset contains 3 columns, separated by a tab. I tried: data <- read.delim("file.txt"), but R was not able to manage all datasets at once. Can I tell R directly to load only every 1000th dataset from the file?

After reading the file, I want to bin the data of column 2. Is it possible to bin the numbers in column 2 directly? Is it possible to read the file line by line, without loading the whole file into memory?

Thanks for your help.

Sven

Upvotes: 1

Views: 5699

Answers (3)

neilfws

Reputation: 33782

You should pre-process the file using another tool before reading into R.

To write every 1000th line to a new file, you can use sed, like this:

sed -n '0~1000p' infile > outfile

Then read the new file into R:

datasets <- read.table("outfile", sep = "\t", header = FALSE)
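
A minimal sketch for the second part of the question (not part of the original answer): once the thinned file is loaded, column 2 can be binned with base R's cut(). The choice of 50 equal-width bins is an assumption; adapt it to the actual value range.

breaks <- 50                               # assumed number of bins
bins   <- cut(datasets$V2, breaks = breaks)  # V2 is the default name of column 2
table(bins)                                # counts per bin
# hist(datasets$V2, breaks = breaks) gives a quick histogram instead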

Upvotes: 7

Luciano Selzer

Reputation: 10016

Maybe the colbycol package could be useful to you.

Upvotes: 1

Dirk is no longer here

Reputation: 368231

You may want to look at the manual devoted to R Data Import/Export.

Naive approaches always load all the data. You don't want that. You may want another script which reads line-by-line (written in awk, perl, python, C, ...) and emits only every N-th line. You can then read the output from that program directly in R via a pipe -- see the help on Connections.
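
A minimal sketch of that pipe approach (not part of the original answer), assuming a Unix-like system with awk on the PATH and the tab-separated file.txt from the question; the 1000-line step comes from the question as well:

# let awk emit only every 1000th line and read its output straight into R
data <- read.table(pipe("awk 'NR % 1000 == 0' file.txt"),
                   sep = "\t", header = FALSE)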

In general, working with very large data sets requires some understanding of how R manages memory. Be patient; you will get this to work, but once again, a naive approach requires lots of RAM and a 64-bit operating system.

Upvotes: 7
