Reputation: 38949
I've seen a lot of similar questions to this, but nothing that really matched. Most other questions seemed to relate to speed. What I'm experiencing is a single JSON dictionary in a 1.1 GB file on my local box that takes up all 16 GB of my memory when I try to load it using anything along the lines of:
import json  # same behavior with ujson, yajl, etc.

f = open(some_file, "rb")
new_dictionary = json.load(f)
This happens regardless of what json library I use (I've tried ujson, json, yajl), and regardless of whether I read things in as a byte stream or not. This makes absolutely no sense to me. What's with the crazy memory usage, and how do I get around it?
In case it helps, the dictionary is just a bunch of nested dictionaries all having ints point to other ints. A sample looks like:
{"0":{"3":82,"4":503,"15":456},"956":{"56":823,"678":50673,"35":1232}...}
UPDATE: When I run this with simplejson, it actually only takes up 8 gigs. No idea why that one takes up so much less than all the others.
UPDATE 2: So I did some more investigation. I loaded up my dictionary with simplejson, and tried converting all the keys to ints (per Liori's suggestion that strings might take up more space). Space stayed the same at 8 gigs. Then I tried Winston Ewert's suggestion of running a gc.collect(). Space still remained at 8 gigs. Finally, annoyed and curious, I pickled my new data structure, exited Python, and reloaded. Lo and behold, it still takes up 8 gigs. I guess Python just wants that much space for a big 2d dictionary. Frustrating, for sure, but at least now I know it's not a JSON problem so long as I use simplejson to load it.
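Roughly, the load-and-convert step looked like this (a reconstructed sketch, not the exact script I ran):

import gc
import simplejson

with open(some_file, "rb") as f:
    new_dictionary = simplejson.load(f)

# Convert both levels of string keys to ints, then force a collection.
new_dictionary = {
    int(outer): {int(inner): value for inner, value in sub.items()}
    for outer, sub in new_dictionary.items()
}
gc.collect()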
Upvotes: 7
Views: 2675
Reputation: 38949
Gabe really figured this out in a comment, but since it's been a few months and he hasn't posted it as an answer, I figured I should just answer my own question, so posterity sees that there is an answer.
Anyway, the answer is that a 2D dictionary just takes up that much space in Python. Each one of those inner dictionaries carries its own per-object overhead, and since there are a lot of them, the data balloons from 1.1 GB on disk to 8 GB in memory. There's nothing you can do about it except use a different data structure or get more RAM.
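As a rough illustration of that overhead (a generic example, not numbers from my data set):

import sys

inner = {"3": 82, "4": 503, "15": 456}
print(sys.getsizeof(inner))  # size of the dict object alone, not counting its keys and values

Even a three-entry dict carries a couple hundred bytes of object overhead on a 64-bit build, so millions of small inner dicts easily turn a 1.1 GB text file into several gigabytes of live objects.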
Upvotes: 0
Reputation: 45039
A little experimentation on my part suggests that calling gc.collect() after the JSON object has been parsed drops memory usage back to where it was when the object was originally constructed.
Here are the results I get for memory usage on a smaller scale:
Build. No GC: 762912
Build. GC: 763000
Standard Json. Unicode Keys. No GC: 885216
Standard Json. Unicode Keys. GC: 744552
Standard Json. Int Keys. No GC: 885216
Standard Json. Int Keys. GC: 744724
Simple Json. Unicode Keys. No GC: 894352
Simple Json. Unicode Keys. GC: 745520
Simple Json. Int Keys. No GC: 894352
Simple Json. Int Keys. GC: 744884
Basically, running gc.collect() appears to clean up some sort of garbage produced during the JSON parsing process.
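For reference, here is a minimal sketch of how numbers like these can be gathered (the /proc-based memory read is Linux-only, and this is a simplified stand-in for the test script, not the original):

import gc
import json

def rss_kb():
    # Current resident set size in kB, read from /proc (Linux-specific).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

# Build a nested dict roughly shaped like the one in the question.
data = {str(i): {str(j): i + j for j in range(50)} for i in range(20000)}
print("Build. No GC", rss_kb())
gc.collect()
print("Build. GC   ", rss_kb())

# Round-trip through JSON and measure again, before and after a collection.
text = json.dumps(data)
del data
parsed = json.loads(text)
print("Json. No GC ", rss_kb())
gc.collect()
print("Json. GC    ", rss_kb())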
Upvotes: 2
Reputation: 13766
You could try a streaming API, of which there are a couple of Python wrappers:
https://github.com/rtyler/py-yajl/
https://github.com/pykler/yajl-py
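For a rough idea of what streaming parsing looks like, here is a sketch using the ijson library instead (a different wrapper than the two above, chosen only because its API is compact; assumes ijson 3.x, and the filename is a placeholder):

import ijson

result = {}
with open("some_file.json", "rb") as f:
    # Yield one (key, value) pair of the top-level object at a time, so the
    # fully parsed structure never has to sit in memory alongside the raw text.
    for outer_key, inner in ijson.kvitems(f, ""):
        result[int(outer_key)] = {int(k): v for k, v in inner.items()}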
Upvotes: 3
Reputation: 8158
I can't believe I'm about to say this, but JSON is actually a very simple format; it wouldn't be too difficult to build your own parser.
That said, it would only make sense if:
Upvotes: 0