Honoki

Reputation: 461

Handle a very large input file without slurp

I am working with JSON output from a tool (massdns) that is formatted as follows:

{"query_name":"1eaff.example.com.","query_type":"A","resp_name":"ns02.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1cf0e.example.com.","query_type":"A","resp_name":"ns01.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1cf0e.example.com.","query_type":"A","resp_name":"ns02.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1fwsjz2f4ok1ot2hh2illyd1-wpengine.example.com.","query_type":"A","resp_name":"ns01.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1fwsjz2f4ok1ot2hh2illyd1-wpengine.example.com.","query_type":"A","resp_name":"ns02.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1a811.example.com.","query_type":"A","resp_name":"ns01.example.com.","resp_type":"A","data":"<ip>"}

I am able to use jq with slurp (-s) to beautifully output the results in the format I need:

jq -s '{ a: "xxx", "b": 123, domains: map(select(.resp_type=="A") | .resp_name[:-1] ) | unique }'

This yields a JSON object like:

{
  "a": "xxx",
  "b": 123,
  "domains": [
    "ns01.example.com",
    "ns02.example.com"
  ]
}

(See JQPlay example here.)

My problem occurs when my input scales to hundreds of thousands of lines (GBs of data), in which case slurp becomes too memory-consuming, and jq exits with an error.

I have discovered the --stream option, which allows handling large inputs, but I am struggling to find a way to obtain the same output with it. Is there a way to use --stream (and not --slurp) to get the desired output from a very large input file with jq?

Upvotes: 2

Views: 431

Answers (1)

oguz ismail

Reputation: 50750

--stream would overcomplicate this task. Use the --null-input/-n option in conjunction with reduce instead:

{a: "xxx", b: 123}
| .domains = (reduce (inputs|select(.resp_type == "A").resp_name) as $d
  ({}; . + {($d): null}) | keys_unsorted | map(.[:-1]))

Keeping the domains as keys of an object rather than as elements of an array makes this script more efficient in both memory consumption and CPU time. In jq, objects are added by merging, that is, by inserting all the key-value pairs from both objects into a single combined object; if both objects contain a value for the same key, the object on the right of the + wins. A repeated name therefore overwrites its existing key instead of adding a new entry, so there is no need for unique.
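To make the merging behaviour concrete, here is a small sketch that adds three single-key objects, two of which share a key. The function name dedupe_keys is only illustrative, and keys (sorted) is used instead of keys_unsorted purely so that the printed order is predictable.

```shell
# Adding objects merges their keys, so a repeated key is stored only once.
# dedupe_keys is a hypothetical helper name used for this illustration.
dedupe_keys() {
  jq -n -c '
    {"ns01.example.com.": null}
    + {"ns02.example.com.": null}
    + {"ns01.example.com.": null}   # same key again: merged, not duplicated
    | keys
  '
}

dedupe_keys
# prints ["ns01.example.com.","ns02.example.com."]
```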

Trimming the last character (.[:-1]) off every resp_name as it is read would slow the process down as well; applying map(.[:-1]) once to the resulting array of unique names is more efficient.

See it on jqplay.
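For completeness, a minimal sketch of how the filter might be run from a shell. The temporary sample file and the summarize function name are assumptions made for this illustration; in practice the path of the real massdns output would be passed instead. The filter here tests resp_type, mirroring the command in the question.

```shell
# Hypothetical sample file standing in for the real massdns output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"query_name":"1eaff.example.com.","query_type":"A","resp_name":"ns02.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1cf0e.example.com.","query_type":"A","resp_name":"ns01.example.com.","resp_type":"A","data":"<ip>"}
{"query_name":"1cf0e.example.com.","query_type":"A","resp_name":"ns02.example.com.","resp_type":"A","data":"<ip>"}
EOF

# -n stops jq from consuming input on its own; `inputs` then yields one
# record at a time, so only the object of distinct names is kept in memory.
# summarize is a hypothetical wrapper name; $1 is the path of the input file.
summarize() {
  jq -n -c '
    {a: "xxx", b: 123}
    | .domains = (reduce (inputs | select(.resp_type == "A").resp_name) as $d
        ({}; . + {($d): null}) | keys_unsorted | map(.[:-1]))
  ' "$1"
}

summarize "$sample"
rm -f "$sample"
```

Note that keys_unsorted does not sort its result, so the order of domains may differ from what unique produced in the slurp version; replacing it with keys should give sorted output if that matters.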

Upvotes: 3
