Reputation: 1029
I have written code that takes input from a very large data file, performs some simple processing on it, and then stores it in shelve dictionary format. I have 41 million entries to process. However, after I write 35 million entries to the shelve dict, performance suddenly drops and eventually it halts completely. Any idea what I can do to avoid this?
My data is from Twitter, and it maps user screen names to their IDs, like so:
Jack 12
Mary 13
Bob 15
I need to access each of these by name very quickly, e.g. my_dict['Jack'] should return 12.
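For reference, a minimal sketch of the shelve-based approach described above (the filenames and the whitespace-separated input format are assumptions for illustration):

    import shelve

    # Build the name -> ID mapping from a whitespace-separated input file.
    db = shelve.open('users.shelve')
    with open('users.txt') as f:
        for line in f:
            name, user_id = line.split()
            db[name] = int(user_id)
    db.close()

    # Later, look up an ID by screen name.
    db = shelve.open('users.shelve')
    print(db['Jack'])  # -> 12
    db.close()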
Upvotes: 0
Views: 334
Reputation: 77454
Consider using something more low-level. shelve performance can be quite poor, unfortunately. That doesn't explain the slowdown you are seeing, though.
For many disk-based indexes it helps if you can initialize them with an expected size, so they do not need to reorganize themselves on the fly. I've seen this have a huge performance impact with on-disk hash tables in various libraries.
As for your actual goal, have a look at http://docs.python.org/library/persistence.html, in particular the gdbm, dbhash, bsddb, dumbdbm, and sqlite3 modules.
sqlite3 is probably not the fastest, but it is the easiest one to use; after all, it has a command-line SQL client. bsddb is probably faster, in particular if you tune nelem and similar parameters for your data size. And it has a lot of language bindings, too; likely even more than sqlite.
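A minimal sketch of the sqlite3 route, assuming the same whitespace-separated input file as in the question (filenames and table layout are illustrative):

    import sqlite3

    conn = sqlite3.connect('users.db')
    # The screen name is the primary key, so lookups by name hit an index.
    conn.execute('CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY, id INTEGER)')

    with open('users.txt') as f:
        # Load everything in one transaction; commit once at the end.
        conn.executemany('INSERT INTO users VALUES (?, ?)',
                         (line.split() for line in f))
    conn.commit()

    # Fast lookup by screen name.
    cur = conn.execute('SELECT id FROM users WHERE name = ?', ('Jack',))
    print(cur.fetchone()[0])  # -> 12
    conn.close()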
Try to create your database with an initial size of 41 million, so it can optimize for this size!
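For example, with the (Python 2) bsddb module, nelem can be passed to hashopen so the hash table is pre-sized; a sketch under the assumption of the same 41-million-entry dataset and input file:

    import bsddb

    # Pre-size the on-disk hash table for ~41 million entries so it does not
    # have to grow and reorganize itself during the bulk load.
    db = bsddb.hashopen('users.bsddb', 'c', nelem=41000000)

    with open('users.txt') as f:
        for line in f:
            name, user_id = line.split()
            db[name] = user_id  # bsddb stores byte strings only

    db.close()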
Upvotes: 1