Chris

Reputation: 313

Reading in very, very large NDJSON

I have a 33GB NDJSON file I need to read into a data.table in R. It's gzipped down to a 2GB file; ideally I would like to keep it compressed.

The structure isn't so important except that, when imported via jsonlite::stream_in, the data I need are in only a few simple columns. The vast majority of the weight of the data is held in lists within three columns that I want to discard as soon as possible.

My two challenges are: how can I parallelize the read-in, and how can I limit memory usage (right now my worker on this file is using 175GB of memory)?

What I'm doing now:

library(jsonlite); library(data.table)
dt.x <- data.table(flatten(stream_in(gzfile("source.gz"))[, -c(5:7)]))

Ideas:

Maybe there is some way to ignore a portion of the NDJSON during stream_in?

Could I parse the gzfile connection, e.g. with regex, before it goes to stream_in, to remove the excess data?

Can I do something like readLines on the gzfile connection to read the data in chunks of, say, 1 million lines per worker? (A rough sketch follows.)
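
Something along the lines of this sketch is what I have in mind (untested; it relies on stream_in accepting a textConnection, and reuses the 5:7 column indices from above):

library(jsonlite)
library(data.table)

con <- gzfile("source.gz", "r")
pieces <- list()
i <- 0
while (length(lines <- readLines(con, n = 1e6)) > 0) {
  i <- i + 1
  chunk <- stream_in(textConnection(lines), verbose = FALSE)
  # drop the three heavy list columns before they accumulate
  pieces[[i]] <- data.table(flatten(chunk[, -c(5:7)]))
}
close(con)
dt.x <- rbindlist(pieces, fill = TRUE)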

EDIT: If at all possible, my goal is to make this portable to other users and keep it entirely within R.

Upvotes: 3

Views: 2218

Answers (1)

peak

Reputation: 116850

Using jqr with readr

Here is a transcript illustrating how to use jqr to read a gzipped NDJSON (aka JSONL) file:

$ R --vanilla
> library(readr)
> library(jqr)
> read_lines("objects.json.gz") %>% jq('.a')
[
    1,
    2,
    3
]
> 

Using read_file() yields the same result. Since these functions must unzip the entire file, the memory requirements will be substantial.

Reading each JSON entity separately

Since the file is NDJSON, we can drastically reduce the amount of RAM required by reading in one JSON entity at a time:

con <- file("objects.json", "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  print(line %>% jq('.a'))
}
close(con)
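
To tie this back to the question: the same loop can read the gzipped file directly via gzfile() and keep only the columns of interest, accumulating them into a data.table. A sketch, where .a and .b stand in for the real field names:

library(jqr)
library(jsonlite)
library(data.table)

con <- gzfile("objects.json.gz", "r")
rows <- list()
i <- 0
while (length(line <- readLines(con, n = 1)) > 0) {
  i <- i + 1
  # keep only the fields we want; the names here are placeholders
  rows[[i]] <- fromJSON(as.character(jq(line, '{a: .a, b: .b}')))
}
close(con)
dt <- rbindlist(rows)

Since jq() accepts a character vector of JSON documents, readLines(con, n = 1e5) batches should also work and would amortize the loop overhead.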

jq

There are probably better ways to use jqr, but if the goal is both space and time efficiency, then it might be best to use the command-line version of jq.
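
For example, one could shell out from R to pre-filter the compressed stream before parsing. A sketch, assuming gzip and jq are on the PATH, with del(.e, .f, .g) standing in for the three heavy columns:

# decompress and strip the heavy columns outside R,
# then stream the slimmed-down NDJSON back in
cmd <- "gzip -dc source.gz | jq -c 'del(.e, .f, .g)'"
slim <- system2("sh", c("-c", shQuote(cmd)), stdout = TRUE)
dt <- jsonlite::stream_in(textConnection(slim))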

Count

If you need to count the number of lines in the (unzipped) file beforehand, then to save memory, I'd probably use system2 and wc if possible (an example follows the snippet below); failing that, you could run a snippet like this:

n <- 0
con <- file("objects.json", "r")
while (TRUE) {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break
  n <- n + 1
}
close(con)
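
For reference, the system2 and wc route might look like this (assuming a Unix wc; feeding the file via stdin makes wc print only the count):

n <- as.integer(system2("wc", "-l", stdout = TRUE, stdin = "objects.json"))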

Upvotes: 2
