net_solv

Reputation: 20

Consolidate stream of json objects with jq

Premise: I want to parse a stream of objects from a JSON log file, keep only the connections that meet certain conditions, and output the total number of times each "id.orig_h" connects to each "id.resp_h".

Sample json input:

jq --slurp --raw-output . 
   
  {
    "ts": 1636606.998991,
    "uid": "CgbTrLvhqHAa",
    "id.orig_h": "10.8.21.11",
    "id.orig_p": 54858,
    "id.resp_h": "10.8.21.66",
    "id.resp_p": 5044,
    "proto": "tcp",
    "conn_state": "S0",
    "local_orig": true,
    "local_resp": true,
    "missed_bytes": 0,
    "history": "S",
    "orig_pkts": 1,
    "orig_ip_bytes": 60,
    "resp_pkts": 0,
    "resp_ip_bytes": 0
  },
  {
    "ts": 1636638.028568,
    "uid": "CFNumGx3XYWW7",
    "id.orig_h": "fe80::ba:61:fe3f:80",
    "id.orig_p": 130,
    "id.resp_h": "ff02::1",
    "id.resp_p": 131,
    "proto": "icmp",
    "duration": 3420.447889374,
    "orig_bytes": 2608,
    "resp_bytes": 0,
    "conn_state": "OTH",
    "local_orig": false,
    "local_resp": false,
    "missed_bytes": 0,
    "orig_pkts": 163,
    "orig_ip_bytes": 11736,
    "resp_pkts": 0,
    "resp_ip_bytes": 0
  },
  {
    "ts": 1636526872.598889,
    "uid": "Cq9JTE1OweOW6mi",
    "id.orig_h": "fe::63:88:14f5:b5",
    "id.orig_p": 131,
    "id.resp_h": "ff02::fb",
    "id.resp_p": 130,
    "proto": "icmp",
    "duration": 81086.88094513,
    "orig_bytes": 64000,
    "resp_bytes": 0,
    "conn_state": "OTH",
    "local_orig": false,
    "local_resp": false,
    "missed_bytes": 0,
    "orig_pkts": 4000,
    "orig_ip_bytes": 288000,
    "resp_pkts": 0,
    "resp_ip_bytes": 0
  },
  {
    "ts": 1636604547.798971,
    "uid": "Cs41IjaZTAdF7f",
    "id.orig_h": "fe::63:88:14f5:b5",
    "id.orig_p": 131,
    "id.resp_h": "ff02::1:ff:b5",
    "id.resp_p": 130,
    "proto": "icmp",
    "duration": 3414.3990546265,
    "orig_bytes": 2608,
    "resp_bytes": 0,
    "conn_state": "OTH",
    "local_orig": false,
    "local_resp": false,
    "missed_bytes": 0,
    "orig_pkts": 163,
    "orig_ip_bytes": 11736,
    "resp_pkts": 0,
    "resp_ip_bytes": 0
   }

I believe the conditions part is good:

    jq -r '. | select(.resp_ip_bytes > 0 and .orig_ip_bytes > 0 and .duration > 0 and .orig_bytes > 0 and .resp_bytes >0)'

However, every time I add a

    group_by([."id.orig_h", ."id.resp_h"])

I get this error: Cannot index number with string "id.orig_h"

Desired Output:

1.1.1.1 -> 2.2.2.2 | XXXX <- # of times

Here is the command and its output without the join(" "):

jq -sr 'map(select(.resp_ip_bytes > 0 and .orig_ip_bytes > 0 and .duration > 0 and .orig_bytes > 0 and .resp_bytes >0)) | group_by([."id.orig_h", ."id.resp_h"]) | map(length as $count | .[] | .count = $count) | sort_by([-.count, -.resp_ip_bytes]) | first | [."id.orig_h", "->", ."id.resp_h", "|", .count]'
[
  "10.8.21.11",
  "->",
  "10.8.21.123",
  "|",
  225 <--(not sure it matters but output on .count is yellow, all other output is green)
]

With the join(" ") added, I get this error instead:

    string (" ") and number (225) cannot be added

Upvotes: 0

Views: 319

Answers (2)

peak

Reputation: 116870

Although the built-in group_by is convenient, it can be very inefficient with respect to both space and time, and the following alternative may be especially appropriate in the context of very long log files.

Notice that inputs is used with the -n command-line option:

< log.json jq -nr '
# Emit a stream of arrays, each array being a group defined by a value of f,
# which can be any jq filter that produces exactly one value for each item in `stream`.
def GROUPS_BY(stream; f): 
   reduce stream as $x ({};
     ($x|f) as $s
     | ($s|type) as $t
     | (if $t == "string" then $s else ($s|tojson) end) as $y
     | .[$t][$y] += [$x] )
   | .[][] ;

GROUPS_BY(inputs
          | select(.resp_ip_bytes > 0 and
                   .orig_ip_bytes > 0 and
                   .duration > 0 and .orig_bytes > 0 and
                   .resp_bytes >0);
          [."id.orig_h", ."id.resp_h"] ) 
| (first | "\(."id.orig_h") -> \(."id.resp_h")" ) +
  " | \(length)"
'

Notes:

a) GROUPS_BY as defined above is stream-oriented, both with respect to its input and its output. Apart from that, the main functional difference between GROUPS_BY and group_by is that the latter involves a sort.

b) GROUPS_BY/2 as defined above is relatively complex because it is designed to have the full generality of group_by, for which it is almost a plug-in alternative. Specifically, E | group_by(F) is functionally equivalent to:

[GROUPS_BY(E[]; F)] | sort_by(first | F)
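
The equivalence in note b) can be checked on a small array. The following is only a sketch: the sample data and the wrapper name groups_demo are made up for illustration, and because each group is an array, the sort key is taken from the group's first item.

```shell
# groups_demo is a hypothetical wrapper name used only for this illustration.
groups_demo() {
  jq -nc '
    def GROUPS_BY(stream; f):
      reduce stream as $x ({};
        ($x|f) as $s
        | ($s|type) as $t
        | (if $t == "string" then $s else ($s|tojson) end) as $y
        | .[$t][$y] += [$x] )
      | .[][] ;

    # Made-up sample data, not taken from the log in the question.
    [ {"k":"b","v":1}, {"k":"a","v":2}, {"k":"b","v":3} ] as $E
    | ($E | group_by(.k)) as $builtin
    | ([GROUPS_BY($E[]; .k)] | sort_by(first | .k)) as $streamed
    # Compare the two results; prints true when they are identical.
    | $builtin == $streamed
  '
}

groups_demo
```

For this sample the comparison prints true; it illustrates the claim on one input rather than proving it in general.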

Upvotes: 0

pmf

Reputation: 36251

Try this:

jq --slurp --raw-output '
  map(select(.resp_ip_bytes > 0 ... your conditions here ...))
  | group_by([."id.orig_h", ."id.resp_h"])
  | map(length as $count | .[] | .count = $count)
  | sort_by([-.count, -.resp_ip_bytes]) | first
  | [."id.orig_h", "->", ."id.resp_h", "|", .count]
  | join(" ")
'

This adds a count field holding the number of connections in each group, sorts by highest count first and then by highest resp_ip_bytes, takes the first match, and formats the output as desired.
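
If join(" ") still fails with "string and number cannot be added", the installed jq is probably an older release whose join does not convert numbers to strings (an assumption about your setup, not something I can verify). String interpolation avoids the problem, and dropping the sort and first steps prints one line per address pair. A minimal sketch, where count_pairs is a hypothetical helper name and the records are made up:

```shell
# count_pairs is a hypothetical helper name; it reads JSON objects on stdin.
count_pairs() {
  jq --slurp --raw-output '
    group_by([."id.orig_h", ."id.resp_h"])
    # Interpolation turns the numeric length into text, so no join is needed.
    | map("\(.[0]."id.orig_h") -> \(.[0]."id.resp_h") | \(length)")
    | .[]
  '
}

# Made-up records for illustration only.
printf '%s\n' \
  '{"id.orig_h":"10.0.0.1","id.resp_h":"10.0.0.2"}' \
  '{"id.orig_h":"10.0.0.3","id.resp_h":"10.0.0.4"}' \
  '{"id.orig_h":"10.0.0.1","id.resp_h":"10.0.0.2"}' |
  count_pairs
```

With these three records the output is "10.0.0.1 -> 10.0.0.2 | 2" followed by "10.0.0.3 -> 10.0.0.4 | 1". To apply your conditions, put the map(select(...)) step in front of group_by as in the filter above.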

Upvotes: 1
