Reputation: 313
I have a 33GB NDJSON file I need to read into a data.table in R. It's gzipped into a 2GB file, and ideally I would like to keep it compressed.
The structure isn't so important except that (when imported via jsonlite::stream_in), the data I need are in only a few simple columns. The vast majority of the weight of the data is held in lists within three columns I want to discard as soon as possible.
My two challenges are: how can I parallelize the read-in, and how can I limit memory usage (right now my worker on this file is using 175GB of memory)?
What I'm doing now:
dt.x <- data.table(flatten(stream_in(gzfile("source.gz"))[, -c(5:7)]))
Ideas:
Maybe there is some way to ignore a portion of the NDJSON during stream_in?
Could I parse the gzfile connection, e.g. with regex, before it goes to stream_in, to remove the excess data?
Can I do something like readLines on the gzfile connection to read the data 1 million lines per worker?
EDIT: If at all possible, my goal is to make this portable to other users and keep it entirely within R.
Upvotes: 3
Views: 2218
Reputation: 116850
Here is a transcript illustrating how to use jqr to read a gzipped NDJSON (aka JSONL) file:
$ R --vanilla
> library(readr)
> library(jqr)
> read_lines("objects.json.gz") %>% jq('.a')
[
1,
2,
3
]
>
Using read_file() yields the same result. Since these functions must unzip the entire file, the memory requirements will be substantial.
Since the file is NDJSON, we can drastically reduce the amount of RAM required by reading in one JSON entity at a time:
con <- file("objects.json", "r")
# Read and process one JSON record per iteration
while (length(line <- readLines(con, n = 1)) > 0) {
  print(line %>% jq('.a'))
}
close(con)
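Since the question already uses jsonlite::stream_in, the same chunk-at-a-time idea can also be sketched with its handler argument, which is called once per page of records so the heavy columns can be dropped before the next page is parsed. This is only a sketch under assumptions: the field names id, score and payload and the tiny temporary file are made up for illustration, and it assumes every page contains the columns being kept.

```r
library(jsonlite)
library(data.table)

# Tiny stand-in for the real file; the field names are hypothetical.
path <- tempfile(fileext = ".json.gz")
out <- gzfile(path, "w")
writeLines(c(
  '{"id": 1, "score": 0.5, "payload": [1, 2, 3]}',
  '{"id": 2, "score": 0.7, "payload": [4, 5]}',
  '{"id": 3, "score": 0.9, "payload": []}'
), out)
close(out)

keep <- c("id", "score")  # columns to retain (hypothetical names)
chunks <- list()

# stream_in() calls the handler once per page of `pagesize` records,
# so only one page of fully parsed data is held at a time.
# The unopened gzfile connection is opened and closed by stream_in() itself.
stream_in(gzfile(path), handler = function(df) {
  chunks[[length(chunks) + 1L]] <<- as.data.table(df[, keep, drop = FALSE])
}, pagesize = 2, verbose = FALSE)

# Combine the slimmed-down pages into one data.table.
dt.x <- rbindlist(chunks)
```

For a real file a much larger pagesize (thousands of records) would be more sensible. Each page is still parsed in full, so this limits peak memory rather than parsing time.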
There are probably better ways to use jqr, but if the goal is both space and time efficiency, then it might be best to use the command-line version of jq.
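If leaving pure R is acceptable, one way to combine the command-line jq with R might look like the sketch below: the file is decompressed and filtered outside R, and only the slimmed-down records are streamed in through a pipe() connection. It assumes a Unix-like shell with gzip and jq on the PATH; jq_select_cmd is a hypothetical helper, and the field names id and score are placeholders that must be simple top-level keys.

```r
library(jsonlite)

# Build a shell pipeline that decompresses the file and lets jq keep only
# the named top-level fields. `jq_select_cmd` is a hypothetical helper.
jq_select_cmd <- function(path, fields) {
  # jq shorthand: {id, score} is the same as {id: .id, score: .score}
  filter <- paste0("{", paste(fields, collapse = ", "), "}")
  paste("gzip -dc", shQuote(path), "| jq -c", shQuote(filter))
}

cmd <- jq_select_cmd("source.gz", c("id", "score"))

# Only run the pipeline if the file and both command-line tools are present.
if (file.exists("source.gz") &&
    nzchar(Sys.which("jq")) && nzchar(Sys.which("gzip"))) {
  slim <- stream_in(pipe(cmd), verbose = FALSE)
}
```

The large list columns never reach R this way, at the cost of depending on external tools.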
If you need to count the number of lines in the (unzipped) file beforehand, then to save memory, I'd probably use system2 and wc if possible; all else failing, you could run a snippet like so:
n <- 0
con <- file("objects.json", "r")
while (TRUE) {
  line <- readLines(con, n = 1)  # assign the result so the length check sees it
  if (length(line) == 0) { break }
  n <- n + 1
}
close(con)
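The same count can be done in larger blocks and directly on the compressed file, which should be noticeably faster than one line per call. A minimal base-R sketch, where count_lines and chunk_size are illustrative names rather than anything from a package:

```r
# Count lines of a (possibly gzipped) text file without loading it whole.
# `count_lines` and `chunk_size` are illustrative names.
count_lines <- function(path, chunk_size = 100000L) {
  con <- gzfile(path, "r")   # gzfile() also reads uncompressed files
  on.exit(close(con))
  n <- 0                     # double, so very large counts do not overflow
  repeat {
    k <- length(readLines(con, n = chunk_size))
    if (k == 0L) break       # nothing left to read
    n <- n + k
  }
  n
}

# Small self-check on a temporary gzipped file.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "w")
writeLines(c('{"a": 1}', '{"a": 2}', '{"a": 3}'), out)
close(out)
n_lines <- count_lines(tmp, chunk_size = 2L)
```

Only chunk_size lines are held in memory at any time, so the block size trades speed against RAM.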
Upvotes: 2