Reputation: 2148
I have two JSON files: data_large (150.1 MB) and data_small (7.5 KB). The content of each file has the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores in each file.
With data_small, I did the following and was able to view its content in 0.1 seconds.
import json

with open('data_small') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
But with data_large, I did the following and my system hung and became so slow that I had to force a shutdown to get it back to normal speed. It took around 2 minutes to print the content.
with open('data_large') as f:
    content = json.load(f)
print(content)  # I'll be applying the logic to find the unique values later.
How can I increase the efficiency of the program while dealing with large data-sets?
Upvotes: 6
Views: 18934
Reputation: 27861
Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:
import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content, since printing it to stdout will be pretty slow.

# Get the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the alternative below,
# which has to create another list with all the values
# and then filter it for unique values:
# values = set([i['score'] for i in content])

# It's faster to save the results to a file than to print them.
# Text mode ('w') is used because json.dump writes str in Python 3.
with open('results.json', 'w') as fid:
    # json can't serialize sets, hence the conversion to a list.
    json.dump(list(values), fid)
If you need to process even bigger files, look for libraries that can parse a JSON file iteratively.
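To give a rough idea of what iterative parsing looks like, here is a minimal sketch that uses only the standard library's `json.JSONDecoder.raw_decode` rather than a dedicated streaming library. The names `iter_json_array` and `unique_scores` and the `chunk_size` default are made up for illustration. It assumes the file holds a single top-level array, and it is lenient about separators, so treat it as a starting point and not as a validating parser.

```python
import io
import json


def iter_json_array(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    `fp` is a text-mode file object. Only about `chunk_size` characters
    plus one partially read element are buffered at any moment.
    """
    decoder = json.JSONDecoder()
    buf, pos = '', 0
    started = False
    while True:
        # Skip whitespace and the commas between elements.
        while pos < len(buf) and buf[pos] in ' \t\r\n,':
            pos += 1
        if pos < len(buf):
            if not started:
                if buf[pos] != '[':
                    raise ValueError('expected a top-level JSON array')
                started = True
                pos += 1
                continue
            if buf[pos] == ']':
                return
            try:
                item, end = decoder.raw_decode(buf, pos)
            except ValueError:
                # The element is cut off at the end of the buffer.
                end = len(buf)
            # Accept an element only if something follows it in the buffer,
            # so a value split across two chunks is never yielded early.
            if end < len(buf):
                yield item
                pos = end
                continue
        # More data is needed: drop the consumed part and read on.
        more = fp.read(chunk_size)
        if not more:
            raise ValueError('truncated or malformed JSON array')
        buf = buf[pos:] + more
        pos = 0


def unique_scores(path):
    """Collect the distinct 'score' values without loading the whole file."""
    with open(path) as f:
        return {item['score'] for item in iter_json_array(f)}


if __name__ == '__main__':
    sample = io.StringIO('[{"score": 68},{"score": 78},{"score": 68}]')
    print(sorted(item['score'] for item in iter_json_array(sample, chunk_size=8)))
```

With this, `unique_scores('data_large')` would keep only the set of scores and a small buffer in memory. A real streaming library is still the better choice once the data gets more complex.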
Upvotes: 7
Reputation: 2368
If you want to iterate over the JSON file in smaller chunks to preserve RAM, I suggest the approach below, based on your comment that you did not want to use ijson to do this. This only works because your sample input data is so simple and consists of an array of dictionaries with one key and one value. It would get complicated with more complex data, and I would go with an actual JSON streaming library at that point.
import json

bytes_to_read = 10000
unique_scores = set()

# Binary mode, so that the relative seek below also works in Python 3.
with open('tmp.txt', 'rb') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find indices of dictionaries in chunk.
        if b'{' not in chunk:
            break
        opening = chunk.index(b'{')
        ending = chunk.rindex(b'}')
        # Load JSON and collect scores.
        text = chunk[opening:ending + 1].decode('utf-8')
        score_dicts = json.loads('[' + text + ']')
        for s in score_dicts:
            # Each dictionary holds exactly one value: the score.
            unique_scores.add(list(s.values())[0])
        # Seek back to just after the last processed dict, then read on.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)

print(unique_scores)
Upvotes: 0