Reputation: 55
I have a 100+ GB JSON file, and when I try to read it with jq my computer keeps running out of RAM. Is there a way to read the file while limiting memory usage, or some other way to read a VERY huge JSON file?
What I typed on the command line: jq 'keys' fileName.json
Upvotes: 4
Views: 2459
Reputation: 116740
jq's streaming parser (invoked using the --stream option) can generally handle very, very large files (and even arbitrarily large files provided certain conditions are met), but it is typically very slow and often quite cumbersome.
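For instance, if the top level of fileName.json turns out to be a single huge JSON object, here is a sketch (not part of the original answer, and slow because it visits every leaf event) of how --stream can answer the original 'keys' question while keeping memory usage bounded:
jq -cn --stream 'inputs | select(length==2) | .[0][0]' fileName.json | sort -u
Here .[0][0] is the first component of each leaf path, i.e. a top-level key, and sort -u removes the duplicates.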
In practice, I find that tools such as jstream and/or my own jm work very nicely in conjunction with jq when dealing with ginormous files. When used this way, they are both very easy to use, though installation is potentially a bit of a hassle.
Unfortunately, if you know nothing at all about the contents of a JSON file except that jq empty takes too long or fails, then there is no CLI tool that I know of that can produce a useful schema automagically. However, looking at the first few bytes of the file will usually provide enough information to get going. Or you could start with jm count to obtain a count of the top-level objects, and go from there. jm -s | jq 'keys[]' will give you the list of top-level keys if the top level is a JSON object.
Here's an example. Suppose we have determined that the large size of the file ginormous.json is primarily because it consists of a very long top-level array. Then assuming that schema.jq (already mentioned elsewhere on this page) is in the pwd, you have some hope of finding an informative schema by running:
jm ginormous.json |
jq -n 'include "schema" {search: "."}; schema(inputs)'
See also jq to recursively profile JSON object for a simpler schema-inference engine.
Upvotes: 2
Reputation: 265231
I posted a related question here: Difference between slurp, null input, and inputs filter
If your file is huge, but the documents inside the file aren't that big (just many, many smaller ones), jq -n 'inputs' could get you started:
jq -n 'inputs | keys'
Here's an example (with a small file):
$ jq -n 'inputs | keys' <<JSON
{"foo": 21, "bar": "less interesting data"}
{"foo": 42, "bar": "more interesting data"}
JSON
[
"bar",
"foo"
]
[
"bar",
"foo"
]
This approach will not work if you have a single top-level object that is gigabytes big or has millions of keys.
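In that case, a possible workaround (a sketch based on the fromstream/truncate_stream idiom from the jq manual, assuming each individual top-level value is itself of manageable size) is to let the streaming parser split the single big object into its top-level values:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' fileName.json
Each top-level value is then emitted as a separate JSON document that can be piped into further filters.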
Upvotes: 1
Reputation: 116740
One generic way to determine the structure of a very large file containing a single JSON entity would be to run the following query:
jq -nc --stream -f structural-paths.jq huge.json | sort -u
where structural-paths.jq contains:
inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )
Note that the '0's in the output signify that there is at least one valid array index at the corresponding position, not that '0' is actually a valid index at that position.
Note also that for very large files, using jq --stream to process the entire file could be quite slow.
Given {"a": {"b": [0,1, {"c":2}]}}
, the result of the above incantation would be:
["a","b",0,"c"]
["a","b",0]
If you just want more information about the top-level structure, you could simplify the above jq program to:
inputs | select(length==1)[0][0] | if type == "number" then 0 else . end
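For example, assuming this one-liner is saved as top-level.jq (a file name chosen here purely for illustration), it could be run in the same way as before:
jq -nc --stream -f top-level.jq huge.json | sort -u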
If the command-line sort fails, then you might want to limit the number of paths by considering them only to a certain depth. If the depth is not too great, then hopefully your command-line sort will be able to manage; if not, then using the command-line uniq would at least trim the output somewhat.
A better option might be to define uniques(stream) in jq, and then use it, as illustrated here:
# Output: a stream of the distinct `tostring` values of the items in the stream
def uniques(stream):
  foreach (stream|tostring) as $s ({};
    if .[$s] then .emit = false else .emit = true | .item = $s | .[$s]=true end;
    if .emit then .item else empty end );

def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniques(spaths($depth))
A suitable invocation of jq would then look like:
jq -nr --argjson depth 3 --stream -f structural-paths.jq huge.json
Besides avoiding the costs of sorting, using uniques/1 will preserve the ordering of paths in the original JSON.
If you want to convert array path expressions to "JSON Pointer" strings (e.g. for use with jm or jstream), simply append the following to the relevant jq program:
| "/" + join("/")
Upvotes: 2