Reputation: 65
I have a json file like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
....
I tried:
jq -s 'unique_by(.field1)'
but this will remove all the duplicated items, I,m looking to keep just one of the duplicated items, to get the file like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"234432","cust_name":"ghi"}
....
Upvotes: 1
Views: 2633
Reputation: 116880
If the file consists of a sequence (stream) of JSON objects, then a very simple way to produce a stream of the distinct objects would be to use the invocation:
jq -s `unique[]`
A similar alternative would be:
jq -n `[inputs] | unique[]`
For large files, however, the above will probably be too inefficient, both with respect to RAM and run-time. Note that both unique
and unique_by
entail a sort.
A far better alternative would be to take advantage of the fact that the input is a stream, and to avoid the built-in unique
and unique_by
filters. This can be done with the assistance of the following filters, which are not yet built-in but likely to become so:
# emit a dictionary
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
# distinct entities in the stream s
def distinct(s): set(s)[];
We now have only to add:
distinct(inputs)
to achieve the objective, provided jq is invoked with the -n command-line option.
This approach will also preserve the original ordering.
If the input is an array, then using distinct
as defined above still has the advantage of not requiring a sort. For arrays that are too large to fit comfortably in memory, it would be advisable to use jq's streaming parser to create a stream.
One possibility would be to proceed in two steps (jq --stream .... | jq -n ...
), but it might be better to do everything in one step (jq -cn --stream ...
), using the following "main" program:
distinct(fromstream(inputs
| (.[0] |= .[1:] )
| select(. != [[]])))
Upvotes: 1
Reputation: 4882
With field1
, I doubt you are getting anything in the output, since there is no key/field with the given name. If you simply change your command to jq -s 'unique_by(.caller_id)'
it will give you desired result containing unique & sorted objects based on caller_id
key. It will ensure in result you have atleast & atmost one object for each caller_id
.
NOTE: Same as what @Jeff Mercado has explained in the comments.
Upvotes: 1