Eli

Reputation: 38949

Load a Single Large Python Dictionary Encoded as Json Without Killing Memory Usage?

I've seen a lot of similar questions to this, but nothing that really matched; most of them seemed to be about speed. What I'm experiencing is that a single JSON dictionary sitting in a 1.1 gig file on my local box takes up all 16 gigabytes of my memory when I try to load it with anything along the lines of:

import json

f = open(some_file, "rb")
new_dictionary = json.load(f)

This happens regardless of what json library I use (I've tried ujson, json, yajl), and regardless of whether I read things in as a byte stream or not. This makes absolutely no sense to me. What's with the crazy memory usage, and how do I get around it?

In case it helps, the dictionary is just a bunch of nested dictionaries all having ints point to other ints. A sample looks like:

{"0":{"3":82,"4":503,"15":456},"956":{"56":823,"678":50673,"35":1232}...}

UPDATE: When I run this with simplejson, it actually only takes up 8 gigs. No idea why that one takes up so much less than all the others.

UPDATE 2: So I did some more investigation. I loaded up my dictionary with simplejson, and tried converting all the keys to ints (per Liori's suggestion that strings might take up more space). Space stayed the same at 8 gigs. Then I tried Winston Ewert's suggestion of running a gc.collect(). Space still remained at 8 gigs. Finally, annoyed and curious, I pickled my new data structure, exited Python, and reloaded. Lo and behold, it still takes up 8 gigs. I guess Python just wants that much space for a big 2d dictionary. Frustrating, for sure, but at least now I know it's not a JSON problem so long as I use simplejson to load it.
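
For reference, a rough sketch of those steps, assuming placeholder file names ("big.json", "big.pickle"), since the actual paths and key-conversion code aren't shown here:

import gc
import pickle
import simplejson

with open("big.json") as f:
    data = simplejson.load(f)

# Convert both levels of string keys to ints (Liori's suggestion).
data = {int(k): {int(ik): iv for ik, iv in v.items()} for k, v in data.items()}
gc.collect()  # Winston Ewert's suggestion

# Pickle the converted structure so it can be reloaded in a fresh process.
with open("big.pickle", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)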

Upvotes: 7

Views: 2675

Answers (4)

Eli

Reputation: 38949

Gabe really figured this out in a comment, but since it's been a few months and he hasn't posted it as an answer, I figured I should just answer my own question, so posterity sees that there is an answer.

Anyway, the answer is that a 2D dictionary just takes up that much space in Python. Each one of those inner dictionaries winds up with some space overhead of its own, and since there are a lot of them, it balloons from 1.1 gig to 8 gigs, and there's nothing you can do about it except try a different data structure or get more RAM.
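
To make the overhead concrete, here is a small illustration (a sketch only: the exact sizes vary by Python version and platform, and the flat tuple-keyed layout is just one possible alternative, not something from the original discussion):

import sys

nested = {0: {3: 82, 4: 503, 15: 456}, 956: {56: 823, 678: 50673, 35: 1232}}

# Every inner dict carries its own hash-table overhead, typically a couple
# hundred bytes even when it only holds a few entries.
print(sys.getsizeof(nested[0]))

# A single flat dict keyed by (outer, inner) tuples pays that overhead once.
flat = {(outer, inner): value
        for outer, row in nested.items()
        for inner, value in row.items()}
print(flat[(0, 3)])  # 82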

Upvotes: 0

Winston Ewert

Reputation: 45039

A little experimentation on my part suggests that calling gc.collect() after the json object has been parsed drops memory usage to where it was when the object was originally constructed.

Here are the results I get for memory usage on a smaller scale:

Build. No GC                          762912
Build. GC                             763000
Standard Json. Unicode Keys. No GC    885216
Standard Json. Unicode Keys. GC       744552
Standard Json. Int Keys. No GC        885216
Standard Json. Int Keys. GC           744724
Simple Json. Unicode Keys. No GC      894352
Simple Json. Unicode Keys. GC         745520
Simple Json. Int Keys. No GC          894352
Simple Json. Int Keys. GC             744884

Basically, running gc.collect() appears to clean up some sort of garbage produced during the JSON parsing process.
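
If you want to reproduce the effect, here is a minimal sketch of the pattern (not the original benchmark): it assumes Linux, since it reads the current resident set size from /proc/self/status, and "big.json" is a placeholder. Whether the number actually drops also depends on the allocator returning freed memory to the OS.

import gc
import json

def rss_kb():
    # Current resident set size in kB (Linux-only).
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

with open("big.json") as f:
    data = json.load(f)

print("after parse:", rss_kb())
gc.collect()  # free cyclic garbage left behind by the parser
print("after gc.collect():", rss_kb())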

Upvotes: 2

Marco Mariani

Reputation: 13766

You could try a streaming API:

http://lloyd.github.com/yajl/

for which there are a couple of Python wrappers:

https://github.com/rtyler/py-yajl/

https://github.com/pykler/yajl-py
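
For example, a minimal sketch using ijson (a streaming parser that can use yajl as its backend; the file name and the process() handler are placeholders, and kvitems needs a reasonably recent ijson):

import ijson

with open("huge.json", "rb") as f:
    # kvitems yields one (key, value) pair of the top-level object at a time,
    # so only a single inner dict needs to live in memory per iteration.
    for outer_key, inner in ijson.kvitems(f, ""):
        process(int(outer_key), {int(k): v for k, v in inner.items()})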

Upvotes: 3

guidoism

Reputation: 8158

I can't believe I'm about to say this, but JSON is actually a very simple format; it wouldn't be too difficult to build your own parser.

That said, it would only make sense if (see the sketch after this list):

  • You don't need the full dictionary at the end (i.e., you can consume the data as you read it)
  • You have a good idea what sort of structure the data is in (an arbitrarily deep dictionary would make this much more difficult)
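
As a rough illustration only, a purpose-built scanner for the exact shape in the question (a single top-level object whose values are flat int-to-int objects, encoded without whitespace) could stream one entry at a time; the file name and chunk size are placeholders:

import re

PAIR = re.compile(r'"(\d+)":(\d+)')        # one "int": int pair
ENTRY = re.compile(r'"(\d+)":\{(.*?)\}')   # "outer": { ...pairs... }

def stream_entries(path, chunk_size=1 << 20):
    buf = ""
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            pos = 0
            # Emit every complete "key":{...} entry currently buffered.
            for match in ENTRY.finditer(buf):
                inner = {int(k): int(v) for k, v in PAIR.findall(match.group(2))}
                yield int(match.group(1)), inner
                pos = match.end()
            buf = buf[pos:]  # keep only the unfinished tail

# Consume entries one at a time instead of building the whole dict.
for outer, inner in stream_entries("huge.json"):
    pass  # replace with whatever you need to do per entry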

Upvotes: 0
