ben_aaron

Reputation: 1522

Use readLines in successive chunks in R

I've got a file with more than 2 million lines.

To avoid overloading memory, I want to read these lines in chunks and then do further processing on each chunk.

I read that readLines is the fastest option, but I could not find a way to read chunks with readLines.

raw = readLines(target_file, n = 500)

But what I'd want next is a readLines call for lines 501:1000, e.g.:

raw = readLines(target_file, n = 501:1000)

Is there a way to do this in R?
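
(Edit: from what I can tell, one workaround is to keep a connection open, since successive readLines calls on the same connection continue where the last one stopped — a minimal sketch, with a placeholder chunk size and processing step:)

con <- file(target_file, open = "r")
repeat {
  chunk <- readLines(con, n = 500)   # reads the *next* 500 lines on each call
  if (length(chunk) == 0) break      # stop at end of file
  # ... process `chunk` here ...
}
close(con)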

Upvotes: 1

Views: 793

Answers (2)

ben_aaron

Reputation: 1522

Maybe this helps someone in the future:

The readr package has just what I was looking for: a function to read lines in chunks.

read_lines_chunked reads a file in chunks of lines and runs a callback on each chunk.

Let f be the callback that stores each chunk for later use:

f = function(x, pos){
  # x: character vector with the lines of the current chunk
  # pos: line number of the first line in the chunk
  filename = paste0("./chunks/chunk_", pos, ".RData")
  save(x, file = filename)
}

Then I can use this in the main wrapper as:

library(readr)

read_lines_chunked(file = target_file
                   , chunk_size = 10000
                   , callback = SideEffectChunkCallback$new(f)
                   )

Works.
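
To pull a chunk back in later (a sketch, assuming the ./chunks/ directory written by f above):

# Files are named by the chunk's starting line (pos), e.g. chunk_1.RData.
chunk_files <- list.files("./chunks", pattern = "^chunk_.*\\.RData$", full.names = TRUE)
load(chunk_files[1])   # load() restores the saved object under its original name, `x`
head(x)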

Upvotes: 3

PavoDive

Reputation: 6496

I don't know how many variables (columns) you have, but data.table::fread is a very fast alternative for what you want:

library(data.table)

raw <- fread(target_file)  # fast read of the whole file into a data.table
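
If memory is the concern, fread can also read a slice directly via its skip and nrows arguments, which maps onto the chunked reads from the question (a sketch; header = FALSE keeps the columns aligned across chunks):

# Read only lines 501-1000, mirroring the readLines(target_file, n = 501:1000) idea:
raw <- fread(target_file, skip = 500, nrows = 500, header = FALSE)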

Upvotes: 0
