Reputation: 20
Premise: Looking to parse a stream of JSON objects from a log file and, based on certain conditions, output the total number of times "id.orig_h" connects to "id.resp_h".
Sample json input:
jq --slurp --raw-output .
{
"ts": 1636606.998991,
"uid": "CgbTrLvhqHAa",
"id.orig_h": "10.8.21.11",
"id.orig_p": 54858,
"id.resp_h": "10.8.21.66",
"id.resp_p": 5044,
"proto": "tcp",
"conn_state": "S0",
"local_orig": true,
"local_resp": true,
"missed_bytes": 0,
"history": "S",
"orig_pkts": 1,
"orig_ip_bytes": 60,
"resp_pkts": 0,
"resp_ip_bytes": 0
},
{
"ts": 1636638.028568,
"uid": "CFNumGx3XYWW7",
"id.orig_h": "fe80::ba:61:fe3f:80",
"id.orig_p": 130,
"id.resp_h": "ff02::1",
"id.resp_p": 131,
"proto": "icmp",
"duration": 3420.447889374,
"orig_bytes": 2608,
"resp_bytes": 0,
"conn_state": "OTH",
"local_orig": false,
"local_resp": false,
"missed_bytes": 0,
"orig_pkts": 163,
"orig_ip_bytes": 11736,
"resp_pkts": 0,
"resp_ip_bytes": 0
},
{
"ts": 1636526872.598889,
"uid": "Cq9JTE1OweOW6mi",
"id.orig_h": "fe::63:88:14f5:b5",
"id.orig_p": 131,
"id.resp_h": "ff02::fb",
"id.resp_p": 130,
"proto": "icmp",
"duration": 81086.88094513,
"orig_bytes": 64000,
"resp_bytes": 0,
"conn_state": "OTH",
"local_orig": false,
"local_resp": false,
"missed_bytes": 0,
"orig_pkts": 4000,
"orig_ip_bytes": 288000,
"resp_pkts": 0,
"resp_ip_bytes": 0
},
{
"ts": 1636604547.798971,
"uid": "Cs41IjaZTAdF7f",
"id.orig_h": "fe::63:88:14f5:b5",
"id.orig_p": 131,
"id.resp_h": "ff02::1:ff:b5",
"id.resp_p": 130,
"proto": "icmp",
"duration": 3414.3990546265,
"orig_bytes": 2608,
"resp_bytes": 0,
"conn_state": "OTH",
"local_orig": false,
"local_resp": false,
"missed_bytes": 0,
"orig_pkts": 163,
"orig_ip_bytes": 11736,
"resp_pkts": 0,
"resp_ip_bytes": 0
}
I believe the conditions part is good
jq -r '. | select(.resp_ip_bytes > 0 and .orig_ip_bytes > 0 and .duration > 0 and .orig_bytes > 0 and .resp_bytes >0)'
However, every time I try group_by([."id.orig_h", ."id.resp_h"]), I get
--> Cannot index number with string "id.orig_h"
Desired Output:
1.1.1.1 -> 2.2.2.2 | XXXX <- # of times
Here is the output without the join(" ")
jq -sr 'map(select(.resp_ip_bytes > 0 and .orig_ip_bytes > 0 and .duration > 0 and .orig_bytes > 0 and .resp_bytes >0)) | group_by([."id.orig_h", ."id.resp_h"]) | map(length as $count | .[] | .count = $count) | sort_by([-.count, -.resp_ip_bytes]) | first | [."id.orig_h", "->", ."id.resp_h", "|", .count]'
[
"10.8.21.11",
"->",
"10.8.21.123",
"|",
225 <--(not sure it matters but output on .count is yellow, all other output is green)
]
With the join(" "), I get
--> string (" ") and number (225) cannot be added
Upvotes: 0
Views: 319
Reputation: 116870
Although the built-in group_by is convenient, it can be very inefficient with respect to both space and time, and the following alternative may be especially appropriate in the context of very long log files.
Notice that inputs is used with the -n command-line option:
< log.json jq -nr '
# Emit a stream of arrays, each array being a group defined by a value of f,
# which can be any jq filter that produces exactly one value for each item in `stream`.
def GROUPS_BY(stream; f):
  reduce stream as $x ({};
    ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s else ($s|tojson) end) as $y
    | .[$t][$y] += [$x] )
  | .[][] ;

GROUPS_BY(inputs
          | select(.resp_ip_bytes > 0 and
                   .orig_ip_bytes > 0 and
                   .duration > 0 and
                   .orig_bytes > 0 and
                   .resp_bytes > 0);
          [."id.orig_h", ."id.resp_h"] )
| (first | "\(."id.orig_h") -> \(."id.resp_h")") + " | \(length)"
'
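With the sample above nothing gets past the filters (every object has resp_ip_bytes equal to 0), but on data that does, each surviving group is emitted as one raw line of the desired shape, e.g. (count made up):

10.8.21.11 -> 10.8.21.66 | 42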
Notes:
a) GROUPS_BY as defined above is stream-oriented, both with respect to its input and its output. Apart from that, the main functional difference between GROUPS_BY and group_by is that the latter involves a sort.
b) GROUPS_BY/2 as defined above is relatively complex because it is designed to have the full generality of group_by, for which it is almost a plug-in alternative. Specifically, E | group_by(F) is functionally equivalent to:
[GROUPS_BY(E[]; F)] | sort_by(first | F)
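As a quick sanity check on made-up data (small integers grouped by parity, nothing to do with the log format): the streamed form emits the groups one by one, and the collected-and-sorted form should match plain group_by:

jq -nc '
  def GROUPS_BY(stream; f):
    reduce stream as $x ({};
      ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      | .[$t][$y] += [$x] )
    | .[][] ;

  # streamed groups, in order of first appearance:
  GROUPS_BY(1,2,3,4,5; . % 2),                               # [1,3,5] then [2,4]

  # collected and sorted, which should match group_by:
  ([GROUPS_BY(1,2,3,4,5; . % 2)] | sort_by(first | . % 2)),  # [[2,4],[1,3,5]]
  ([1,2,3,4,5] | group_by(. % 2))                            # [[2,4],[1,3,5]]
'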
Upvotes: 0
Reputation: 36251
Try this
jq --slurp --raw-output '
map(select(.resp_ip_bytes > 0 ... your conditions here ...))
| group_by([."id.orig_h", ."id.resp_h"])
| map(length as $count | .[] | .count = $count)
| sort_by([-.count, -.resp_ip_bytes]) | first
| [."id.orig_h", "->", ."id.resp_h", "|", .count]
| join(" ")
'
This inserts another field, count, holding the number of connections, then sorts first by the highest count, then by the highest resp_ip_bytes, takes the first match, and formats the output as desired.
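One caveat, which is only an assumption about the jq version in play: older jq releases refuse to join(" ") an array that mixes strings and numbers, which is exactly the error reported in the question. In that case the count can be run through tostring first, or the final two lines can be replaced with string interpolation, roughly:

| sort_by([-.count, -.resp_ip_bytes]) | first
| "\(."id.orig_h") -> \(."id.resp_h") | \(.count)"

With --raw-output the interpolated string is printed without surrounding quotes either way.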
Upvotes: 1