python-coder

Reputation: 2148

How to find unique values in a large JSON file?

I have two JSON files: data_large (150.1 MB) and data_small (7.5 KB). Each file contains a list of objects of the form [{"score": 68},{"score": 78}]. I need to find the unique scores in each file.

With data_small, I did the following and could view its content within 0.1 seconds.

import json

with open('data_small') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

But when I did the same with data_large, my system became slow and unresponsive, and I had to force a shutdown to bring it back to normal speed. It took around 2 minutes to print the content.

import json

with open('data_large') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

How can I make the program more efficient when dealing with large data sets?

Upvotes: 6

Views: 18934

Answers (2)

miki725

Reputation: 27861

Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing the whole file to stdout is pretty slow.

# Get the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the one-liner below, which has to
# build another list with all the values before filtering it for unique ones:
# values = set([i['score'] for i in content])

# It is faster to save the results to a file than to print them.
# Text mode ('w') works on both Python 2 and 3; json.dump writes str, not bytes.
with open('results.json', 'w') as fid:
    # json cannot serialize sets, hence the conversion to a list
    json.dump(list(values), fid)

If you need to process even bigger files, look for libraries that can parse a JSON file iteratively, such as ijson.

Upvotes: 7

vinod

Reputation: 2368

If you want to iterate over the JSON file in smaller chunks to conserve RAM, I suggest the approach below, based on your comment that you did not want to use ijson for this. It only works because your sample input is so simple: an array of dictionaries, each with one key and one value. It would get complicated with more complex data, and at that point I would go with an actual JSON streaming library.

import json

bytes_to_read = 10000
unique_scores = set()

# Binary mode, so the relative seek below is also allowed on Python 3.
with open('tmp.txt', 'rb') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find indices of dictionaries in chunk
        if b'{' not in chunk:
            break
        opening = chunk.index(b'{')
        ending = chunk.rindex(b'}')

        # Load JSON and set scores.
        complete = chunk[opening:ending + 1].decode('utf-8')
        score_dicts = json.loads('[' + complete + ']')
        for s in score_dicts:
            # Each dictionary holds exactly one value: the score.
            unique_scores.add(list(s.values())[0])

        # Read next chunk from last processed dict.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)

print(unique_scores)
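
For elements more complex than a single key/value pair, the same chunked idea can be sketched with the standard library's json.JSONDecoder.raw_decode, which parses one JSON value from the start of a string and reports the index where it ended. The helper names iter_array_items and unique_scores and the default chunk size are assumptions made for illustration. Treat this as a minimal sketch for a file holding one top-level array, not as a substitute for a real streaming parser.

```python
import json


def iter_array_items(path, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: only the current buffer is kept in memory,
    not the whole parsed document.
    """
    decoder = json.JSONDecoder()
    with open(path) as f:
        buf = f.read(chunk_size).lstrip()
        if not buf.startswith('['):
            raise ValueError('expected a top-level JSON array')
        buf = buf[1:]
        while True:
            # Skip whitespace and the comma between two elements.
            buf = buf.lstrip(' \t\r\n,')
            if buf.startswith(']'):
                return
            try:
                item, end = decoder.raw_decode(buf)
            except ValueError:
                # Could not decode: assume the element is cut off.
                end = len(buf)
            if end == len(buf):
                # The element touches the end of the buffer, so it may be
                # incomplete (for example a number split in two). A valid
                # array always has ',' or ']' after an element, so more
                # input must exist.
                more = f.read(chunk_size)
                if not more:
                    raise ValueError('truncated or invalid JSON array')
                buf += more
                continue
            yield item
            buf = buf[end:]


def unique_scores(path, key='score', chunk_size=65536):
    """Collect the distinct values stored under `key` in each element."""
    return {item[key] for item in iter_array_items(path, chunk_size)}
```

Calling unique_scores('data_large') would then return the set of scores without loading the parsed list into memory. Incomplete elements are decoded again after each extra read, so a chunk size well above the size of a single element keeps that overhead small.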

Upvotes: 0
