Reputation: 13
I'm trying to find all JSON objects in my JSONL file that contain the same identifier value.
So if my data looks like:
{
    "data": {
        "value": 42,
        "url": "url.com",
        "details": {
            "timestamp": "07:32:29",
            "identifier": "123ABC"
        }
    },
    "message": "string"
}
I want to find every object that shares an identifier value with another object. The file is too large to load all at once, so instead I read it line by line and store just the identifier values. This has the drawback of missing the first object with a given identifier (i.e., if objects A, B, and C all have the same identifier, I only end up with B and C saved). To pick up those first occurrences, I read through the file a second time, keeping only the first time each duplicate identifier appears. This is where I run into problems.
This part works as intended:
import json_lines

identifiers = set()
duplicates = []
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)

dup_IDs = {dup["data"]["details"]["identifier"] for dup in duplicates}
But when I read through the file a second time:
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
        else:
            continue
        if len(dup_IDs) == 0:
            break
        else:
            continue
It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) the problem is in my code rather than my computer, since the code is easier to fix.
Upvotes: 1
Views: 512
Reputation: 1094
import json_lines

duplicates = []
nb = {}
i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            # Seen before: convert the stored first-occurrence index to int,
            # which marks this identifier as duplicated without losing the index.
            nb[ID] = int(nb[ID])
        else:
            # First occurrence: store its line index as a string for now.
            nb[ID] = str(i)
        i += 1

# Line indices of the first occurrence of every duplicated identifier.
k = set(v for v in nb.values() if isinstance(v, int))
del nb

i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1
print(duplicates)
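The idea is to keep only line indices in memory rather than whole objects: an identifier's first line index is stored as a str and converted to an int once the identifier shows up again, so the int values mark exactly the first occurrences of duplicated identifiers. The second pass then collects the objects at those indices, i.e. the first occurrences that the question's own first pass misses.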
Upvotes: -1
Reputation: 37
If the file is too large to work with comfortably in memory, I'd suggest loading the data into a SQL database and using SQL queries to filter what you need, for example along the lines of the sketch below.
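A minimal sketch with Python's built-in sqlite3, assuming the record layout from the question (the database file, table, and column names here are invented for illustration):

import gzip
import json
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (line INTEGER, identifier TEXT, raw TEXT)")

# Stream the file once, storing each line's identifier alongside the raw JSON.
with gzip.open("file.jsonlines.gz", "rt") as f:
    for line_no, line in enumerate(f):
        item = json.loads(line)
        conn.execute(
            "INSERT INTO records VALUES (?, ?, ?)",
            (line_no, item["data"]["details"]["identifier"], line),
        )
conn.commit()

# Let SQL pick every object (first occurrences included) whose identifier repeats.
rows = conn.execute("""
    SELECT raw FROM records
    WHERE identifier IN (
        SELECT identifier FROM records GROUP BY identifier HAVING COUNT(*) > 1
    )
    ORDER BY line
""")
duplicates = [json.loads(raw) for (raw,) in rows]

SQLite does the grouping on disk, so memory use stays flat no matter how many lines the file has.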
Upvotes: 2