Reputation: 7056
I have a large jsonl file like so:
# source.jsonl
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "k9489", "content": "content goes here"}
{"id": "p48947", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
I have a banned id list like so:
#banned_list.txt
k9489
p48947
</snip>
I now want to delete the lines whose "id" matches any of the ids in the banned list text file. So I am looking for this result:
{"id": "y88979", "content": "content goes here"}
{"id": "h93794", "content": "content goes here"}
{"id": "i8408", "content": "content goes here"}
Iterating over this jsonl file (20 GB) in Python would be too slow. From what I have seen, jq is the best tool for this, but I am unsure about the syntax that would let it take all the ids from a list.
Upvotes: 1
Views: 114
Reputation: 150
I will explain how to implement a blacklist (your case) and a whitelist:
$ cat banned_list.txt
k9489
p48947
You can create a simple JSON list (note: the awk step strips any '\r' characters):
$ awk '{gsub(/\r/,"",$0);print $0}' banned_list.txt | jq --raw-input '.' | jq '.' -s > list.json
$ cat list.json
[
"k9489",
"p48947"
]
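If the awk step is not wanted, the same list could probably be built in a single jq call. This is only a sketch: it assumes a jq version that has rtrimstr, and the scratch directory and the CRLF sample file are made up for illustration.

```shell
# Scratch directory and sample file are hypothetical, for illustration only.
workdir=$(mktemp -d)

# Sample banned list with Windows line endings (CRLF).
printf 'k9489\r\np48947\r\n' > "$workdir/banned_list.txt"

# -R reads each line as a raw string; -n together with `inputs` collects
# all lines into one array. rtrimstr("\r") drops a trailing carriage
# return when there is one and leaves other lines unchanged.
jq -Rn '[inputs | rtrimstr("\r")]' "$workdir/banned_list.txt" > "$workdir/list.json"

cat "$workdir/list.json"
```

The resulting list.json should be usable in the same way as the one produced by the awk pipeline.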
Blacklist implementation:
Select all objects whose ".id" is not in the $black array. Here we read list.json into the "black" jq variable:
$ jq --argfile black list.json 'select(.id|IN($black[])|not) ' source.jsonl -c
{"id":"y88979","content":"content goes here"}
{"id":"h93794","content":"content goes here"}
{"id":"i8408","content":"content goes here"}
Whitelist implementation:
Select all objects whose ".id" is in the $white array. Here we read list.json into the "white" jq variable:
$ jq --argfile white list.json 'select(.id|IN($white[])) ' source.jsonl -c
{"id":"k9489","content":"content goes here"}
{"id":"p48947","content":"content goes here"}
Upvotes: 0
Reputation: 36296
You could read the banned ids using --rawfile, split the text at newline characters, and check for each JSON line whether its .id is contained in the result:
jq -c --rawfile b banned_list.txt 'select(IN(.id; $b | (. / "\n")[]) | not)' source.jsonl
This, however, splits the list of banned ids over and over again, once per input line, so it would be better to prepare it beforehand as proper JSON strings using another call to jq, and to read those strings into an array with --slurpfile:
jq -c --slurpfile b <(jq -R . banned_list.txt) 'select(IN(.id; $b[]) | not)' source.jsonl
Output:
{"id":"y88979","content":"content goes here"}
{"id":"h93794","content":"content goes here"}
{"id":"i8408","content":"content goes here"}
You could improve on this further by sorting the list of banned ids and using bsearch for a binary search.
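A minimal sketch of the bsearch idea, under a few assumptions: the banned list is sorted once into a separate file (sorted_banned.json, a made-up name), that file is loaded with --slurpfile so the sorted array sits at $b[0], and every .id is a string. bsearch returns a negative number when the value is not found, which is what the filter tests for.

```shell
# Scratch directory and file names other than source.jsonl and
# banned_list.txt are hypothetical.
workdir=$(mktemp -d)

# Sample data taken from the question.
printf '%s\n' \
  '{"id": "y88979", "content": "content goes here"}' \
  '{"id": "h93794", "content": "content goes here"}' \
  '{"id": "k9489", "content": "content goes here"}' \
  '{"id": "p48947", "content": "content goes here"}' \
  '{"id": "i8408", "content": "content goes here"}' > "$workdir/source.jsonl"
printf '%s\n' k9489 p48947 > "$workdir/banned_list.txt"

# Build one sorted JSON array of banned ids, once, up front.
jq -Rn '[inputs] | sort' "$workdir/banned_list.txt" > "$workdir/sorted_banned.json"

# --slurpfile wraps the file contents in an array, so the sorted list
# is $b[0]. bsearch gives an index >= 0 on a hit and a negative number
# on a miss, so "< 0" keeps only the ids that are not banned.
jq -c --slurpfile b "$workdir/sorted_banned.json" \
  'select(.id as $id | $b[0] | bsearch($id) < 0)' \
  "$workdir/source.jsonl" > "$workdir/filtered.jsonl"

cat "$workdir/filtered.jsonl"
```

With a long banned list this should reduce the per-line cost from a linear scan to a logarithmic lookup; whether that matters for the 20 GB file would have to be measured.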
Upvotes: 2