Hisham

Reputation: 65

jq to remove one of the duplicated objects

I have a json file like this:

{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
....

I tried:

jq -s 'unique_by(.field1)' 

but this removes all of the duplicated items. I'm looking to keep just one of each set of duplicated items, so that the file looks like this:

{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"234432","cust_name":"ghi"}
....

Upvotes: 1

Views: 2633

Answers (2)

peak

Reputation: 116880

If the file consists of a sequence (stream) of JSON objects, then a very simple way to produce a stream of the distinct objects would be to use the invocation:

jq -s 'unique[]'

A similar alternative would be:

jq -n '[inputs] | unique[]'

For large files, however, the above will probably be too inefficient, both with respect to RAM and run-time. Note that both unique and unique_by entail a sort.
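As a minimal sketch of the slurp-and-unique invocation, the following runs it against the sample stream from the question. The temporary file and the variable names `input` and `unique_out` are assumptions made for illustration, and `-c` is added only to get one object per line. Because `unique` sorts, the output is in sorted order rather than input order.

```shell
# Sample input: the JSON stream from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
EOF

# -s slurps the whole stream into a single array (so it must fit in memory);
# unique sorts that array and drops duplicates; [] turns it back into a stream.
unique_out=$(jq -cs 'unique[]' "$input")
printf '%s\n' "$unique_out"

rm -f "$input"
```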

A far better alternative would be to take advantage of the fact that the input is a stream, and to avoid the built-in unique and unique_by filters. This can be done with the assistance of the following filters, which are not yet built-in but likely to become so:

# emit a dictionary
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);

# distinct entities in the stream s
def distinct(s): set(s)[];

We now have only to add:

distinct(inputs)

to achieve the objective, provided jq is invoked with the -n command-line option.

This approach will also preserve the original ordering.
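Putting the pieces together, a complete invocation might look like the sketch below. It assumes jq 1.5 or later (for `inputs`); the temporary file and the variable names `input` and `distinct_out` are hypothetical, and `-c` is added only for compact output.

```shell
# Sample input: the JSON stream from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
EOF

# -n stops jq from consuming the first input itself, so `inputs` sees every object.
# Each object is stored in a dictionary under a key built from its type and its
# string form, so a repeated object overwrites its own entry instead of adding one.
distinct_out=$(jq -cn '
  def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
  def distinct(s): set(s)[];
  distinct(inputs)
' "$input")
printf '%s\n' "$distinct_out"

rm -f "$input"
```

Only the dictionary of distinct objects is held in memory, not the whole input, which is where the saving over `-s` comes from when duplicates are common.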

If the input is an array ...

If the input is an array, then using distinct as defined above still has the advantage of not requiring a sort. For arrays that are too large to fit comfortably in memory, it would be advisable to use jq's streaming parser to create a stream.

One possibility would be to proceed in two steps (jq --stream .... | jq -n ...), but it might be better to do everything in one step (jq -cn --stream ...), using the following "main" program:

distinct(fromstream(inputs 
                    | (.[0] |= .[1:] )
                    | select(. != [[]]))) 
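The one-step variant could be run as in the following sketch, assuming the input is a single top-level array of objects. The small array file and the names `array_file` and `stream_out` are made up for illustration; the jq program is the one given above with the two definitions prepended.

```shell
# Hypothetical sample: a top-level array containing one duplicated object.
array_file=$(mktemp)
cat > "$array_file" <<'EOF'
[{"caller_id":"123321","cust_name":"abc"},
 {"caller_id":"123443","cust_name":"def"},
 {"caller_id":"123321","cust_name":"abc"}]
EOF

# --stream turns the array into [path, value] events instead of loading it whole.
# .[0] |= .[1:] strips the leading array index from each path, select drops the
# event that closes the outer array, and fromstream rebuilds each element.
stream_out=$(jq -cn --stream '
  def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
  def distinct(s): set(s)[];
  distinct(fromstream(inputs
                      | (.[0] |= .[1:])
                      | select(. != [[]])))
' "$array_file")
printf '%s\n' "$stream_out"

rm -f "$array_file"
```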

Upvotes: 1

Amith Kumar

Reputation: 4882

With field1, I doubt you are getting the output you expect, since no object has a key with that name: .field1 evaluates to null for every object, so they all count as duplicates of one another. If you simply change your command to jq -s 'unique_by(.caller_id)', it will give you the desired result: an array of unique objects, sorted by the caller_id key, with exactly one object for each caller_id.

NOTE: Same as what @Jeff Mercado has explained in the comments.
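For illustration, here is the corrected command run against the sample from the question. The trailing `[]` and the `-c` flag are additions (not part of the command above) that turn the resulting array back into one object per line; the temporary file and the variable names `input` and `by_id_out` are likewise assumptions.

```shell
# Sample input: the JSON stream from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
EOF

# unique_by(.caller_id) keeps one object per distinct caller_id, sorted by that key.
by_id_out=$(jq -cs 'unique_by(.caller_id)[]' "$input")
printf '%s\n' "$by_id_out"

rm -f "$input"
```

Note that unique_by(.caller_id) compares only the caller_id value, so two objects with the same caller_id but different cust_name would be collapsed into one.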

Upvotes: 1
