Reputation: 55
I have a 100+ GB JSON file, and when I try to read it with jq my computer keeps running out of RAM. Is there a way to read the file while limiting memory usage, or some other way to read a VERY huge JSON file?
What I typed on the command line: jq 'keys' fileName.json
Upvotes: 4
Views: 2459
Reputation: 116740
jq's streaming parser (invoked using the --stream option) can generally handle very, very large files (and even arbitrarily large files provided certain conditions are met), but it is typically very slow and often quite cumbersome.
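For instance, if the top level of fileName.json turns out to be a single huge JSON object, here is a sketch (not part of the original answer, and slow because it visits every leaf event) of how --stream can answer the original 'keys' question while keeping memory usage bounded:
jq -cn --stream 'inputs | select(length==2) | .[0][0]' fileName.json | sort -u
Here .[0][0] is the first component of each leaf path, i.e. a top-level key, and sort -u removes the duplicates.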
In practice, I find that tools such as jstream and/or my own jm work very nicely in conjunction with jq when dealing with ginormous files. When used this way, they are both very easy to use, though installation is potentially a bit of a hassle.
Unfortunately, if you know nothing at all about the contents of a JSON file except that jq empty takes too long or fails, then there is no CLI tool that I know of that can produce a useful schema automagically. However, looking at the first few bytes of the file will usually provide enough information to get going. Or you could start with jm count to obtain a count of the top-level objects, and go from there. jm -s | jq 'keys[]' will give you the list of top-level keys if the top level is a JSON object.
Here's an example. Suppose we have determined that the large size of the file ginormous.json is primarily because it consists of a very long top-level array. Then assuming that schema.jq (already mentioned elsewhere on this page) is in the pwd, you have some hope of finding an informative schema by running:
jm ginormous.json |
jq -n 'include "schema" {search: "."}; schema(inputs)'
See also jq to recursively profile JSON object for a simpler schema-inference engine.
Upvotes: 2
Reputation: 265231
I posted a related question here: Difference between slurp, null input, and inputs filter
If your file is huge, but the documents inside the file aren't that big (just many, many smaller ones), jq -n 'inputs' could get you started:
jq -n 'inputs | keys'
Here's an example (with a small file):
$ jq -n 'inputs | keys' <<JSON
{"foo": 21, "bar": "less interesting data"}
{"foo": 42, "bar": "more interesting data"}
JSON
[
"bar",
"foo"
]
[
"bar",
"foo"
]
This approach will not work if you have a single top-level object that is gigabytes big or has millions of keys.
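In that case, a possible workaround (a sketch based on the fromstream/truncate_stream idiom from the jq manual, assuming each individual top-level value is itself of manageable size) is to let the streaming parser split the single big object into its top-level values:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' fileName.json
Each top-level value is then emitted as a separate JSON document that can be piped into further filters.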
Upvotes: 1
Reputation: 116740
One generic way to determine the structure of a very large file containing a single JSON entity would be to run the following query:
jq -nc --stream -f structural-paths.jq huge.json | sort -u
where structural-paths.jq contains:
inputs
| select(length == 2)
| .[0]
| map( if type == "number" then 0 else . end )
Note that the '0's in the output signify that there is at least one valid array index at the corresponding position, not that '0' is actually a valid index at that position.
Note also that for very large files, using jq --stream to process the entire file could be quite slow.
Given {"a": {"b": [0,1, {"c":2}]}}
, the result of the above incantation would be:
["a","b",0,"c"]
["a","b",0]
If you just want more information about the top-level structure, you could simplify the above jq program to:
inputs | select(length==1)[0][0] | if type == "number" then 0 else . end
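For example, assuming this one-liner is saved as top-level.jq (a file name chosen here purely for illustration), it could be run in the same way as before:
jq -nc --stream -f top-level.jq huge.json | sort -u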
If the command-line sort fails, then you might want to limit the number of paths by considering them only to a certain depth. If the depth is not too great, then hopefully your command-line sort will be able to manage; if not, then using the command-line uniq would at least trim the output somewhat.
A better option might be to define uniques(stream) in jq, and then use it, as illustrated here:
# Output: a stream of the distinct `tostring` values of the items in the stream
def uniques(stream):
  foreach (stream|tostring) as $s ({};
    if .[$s] then .emit = false else .emit = true | .item = $s | .[$s]=true end;
    if .emit then .item else empty end );

def spaths($depth):
  inputs
  | select(length==1)[0][0:$depth]
  | map(if type == "number" then 0 else . end);

uniques(spaths($depth))
A suitable invocation of jq would then look like:
jq -nr --argjson depth 3 --stream -f structural-paths.jq huge.json
Besides avoiding the costs of sorting, using uniques/1 will preserve the ordering of paths in the original JSON.
If you want to convert array path expressions to "JSON Pointer" strings (e.g. for use with jm or jstream), simply append the following to the relevant jq program:
| "/" + join("/")
Upvotes: 2