Faenatek

Reputation: 13

Python: finding duplicates in large jsonl file

I'm trying to find all JSON objects in my JSONL file that share the same identifier value.

So if my data looks like:

{
   "data": {
      "value": 42,
      "url": "url.com",
      "details": {
         "timestamp": "07:32:29",
         "identifier": "123ABC"
      }
   },
   "message": "string"
}

I want to find every object that has the same identifier value. The file is too large to load all at once, so instead I check it line by line and store just the identifier values. This has the drawback of missing the first object with a given identifier (i.e., if objects A, B, and C all share the same identifier, I would only end up with B and C saved). To find the first occurrence of each duplicate identifier, I read through the file a second time and pick up only the first time each duplicate identifier is found. This is where I run into problems.

This part works as intended:

import gzip
import json_lines
import jsonlines
from itertools import groupby

identifiers=set()
duplicates=[]

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in identifiers:
            duplicates.append(item)
        else:
            identifiers.add(ID)

dup_IDs={dup["data"]["details"]["identifier"] for dup in duplicates}

But when I read through the file a second time:

with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in dup_IDs:
            duplicates.append(item)
            dup_IDs.remove(ID)
        else:
            continue

        if len(dup_IDs)==0:
            break
        else:
            continue

It runs for ~30 minutes and eventually crashes my computer. I'm assuming (hoping) the problem is with my code rather than my computer, since the code is easier to fix.

Upvotes: 1

Views: 512

Answers (2)

Narcisse Doudieu Siewe

Reputation: 1094

import json_lines

duplicates = []
nb = {}   # identifier -> line index of first occurrence (str = seen once, int = duplicated)
i = 0     # current line index

# First pass: remember where each identifier first appears; when it shows up
# again, convert the stored index to int to mark the identifier as duplicated.
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        ID = item["data"]["details"]["identifier"]
        if ID in nb:
            if not isinstance(nb[ID], int):
                nb[ID] = int(nb[ID])
        else:
            nb[ID] = str(i)
        i += 1

# Line indices of the first occurrence of every duplicated identifier.
k = {v for v in nb.values() if isinstance(v, int)}
del nb

# Second pass: collect the items sitting at those line indices.
i = 0
with json_lines.open('file.jsonlines.gz') as f:
    for item in f:
        if i in k:
            duplicates.append(item)
        i += 1
print(duplicates)
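(The str-to-int switch above is just a compact way of flagging "seen more than once" while keeping the line index of the first occurrence; the second pass then recovers those first occurrences by position instead of re-checking identifiers.)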

Upvotes: -1

lucas rshane

Reputation: 37

If the file is too large, I'd suggest loading the data into a SQL database and using SQL queries to filter what you need.
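For what it's worth, here is a minimal sketch of that idea using Python's built-in gzip, json, and sqlite3 modules, assuming the same nested layout as in the question; the database file and table name are just placeholders:

import gzip
import json
import sqlite3

# Index identifier -> raw JSON line in an on-disk table
# ("records.db" and the table name are illustrative).
conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (identifier TEXT, raw TEXT)')

with gzip.open('file.jsonlines.gz', 'rt') as f:
    conn.executemany(
        'INSERT INTO records (identifier, raw) VALUES (?, ?)',
        ((json.loads(line)["data"]["details"]["identifier"], line) for line in f),
    )
conn.commit()

# Every object whose identifier occurs more than once, first occurrences included.
query = '''
    SELECT raw FROM records
    WHERE identifier IN (
        SELECT identifier FROM records
        GROUP BY identifier
        HAVING COUNT(*) > 1
    )
'''
duplicates = [json.loads(row[0]) for row in conn.execute(query)]
conn.close()

Adding an index on identifier should keep the duplicate lookup fast even for very large files.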

Upvotes: 2
