mlissner

Reputation: 18206

Performance of Large Data Structures in Python

I'm looking for some help understanding the performance characteristics of large lists, dicts or arrays in Python. I have about 1M key-value pairs that I need to store temporarily (this will grow to maybe 10M over the next year). The keys are database IDs ranging from 0 to about 1.1M (with some gaps) and the values are floats.

I'm calculating pagerank, so my process is to initialize each ID with a value of 1, then look it up in memory and update it about ten times before saving it back to the database.

  1. I'm theorizing that lists or arrays will be fastest if I use the database ID as the index of the array/list. This will create a gappy data structure, but I don't understand how fast lookups or updates will be. I also don't yet understand if there's a big gain to be had from using arrays instead of lists. (A rough timing sketch comparing the two in-memory approaches follows this list.)

  2. Using a dict for this is very natural, with key-value pairs, but I get the impression that building the dict the first time would be slow and memory-intensive as it grows to accommodate all the entries.

  3. I also read that SQLite might be a good solution for this using the :memory: flag, but I haven't dug into that too much yet.
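
To make the comparison concrete, a minimal timing sketch of the two in-memory approaches might look like this (Python 3; the sizes and the += 0.5 update are placeholders, not the real pagerank step):

    import random
    import time

    N = 1_100_000                  # roughly the range of database IDs
    ids = list(range(N))
    random.shuffle(ids)            # simulate updates arriving in random order

    # Approach 1: a list indexed directly by database ID (gaps simply keep their initial value).
    start = time.time()
    ranks_list = [1.0] * N
    for i in ids:
        ranks_list[i] += 0.5       # stand-in for a pagerank-style update
    print("list build+update:", time.time() - start)

    # Approach 2: a dict keyed by database ID.
    start = time.time()
    ranks_dict = {i: 1.0 for i in range(N)}
    for i in ids:
        ranks_dict[i] += 0.5
    print("dict build+update:", time.time() - start)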

Anyway, just looking for some guidance here. Any thoughts would be much appreciated as I'm digging in.

Upvotes: 3

Views: 5145

Answers (3)

aychedee

Reputation: 25619

Just start with a dictionary. Even if you are running on WinXP, 10 million keys shouldn't be a problem. But I hope for your sake that you aren't :)

A dictionary will be easier to code and probably faster to build and update, especially if you are updating the values in random order.

It's often best to start by coding a prototype and use it to identify performance issues. Your bottleneck will most likely be wherever you are requesting the data from, not entering or retrieving it from a dictionary.
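
For example, a bare-bones prototype of that initialize / update / save cycle with a plain dict might look like the sketch below, where fetch_ids() and save_rank() are hypothetical stand-ins for your database layer and the *= 0.85 line is only a placeholder for the real pagerank step:

    def fetch_ids():
        # Hypothetical placeholder: pretend these IDs came from the database.
        return range(1_100_000)

    def save_rank(doc_id, rank):
        # Hypothetical placeholder: write the final value back to the database.
        pass

    ranks = {doc_id: 1.0 for doc_id in fetch_ids()}   # initialize every ID to 1

    for _ in range(10):                               # roughly ten update passes
        for doc_id in ranks:
            ranks[doc_id] *= 0.85                     # placeholder for the real update

    for doc_id, rank in ranks.items():
        save_rank(doc_id, rank)

Profiling a prototype like this will quickly show whether the dict operations or the database round-trips dominate.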

Upvotes: 4

Marcin

Reputation: 49886

Well, in general, if you have too much data to keep in memory, you need to use some kind of external storage; and if all your data does fit in memory, you don't need to do anything fancy.

The biggest problem you're likely to hit is having more data than your operating system will allow in a single process image; in that case, again, you will need external storage.

In both cases this comes down to: use a database, whether SQL or not. If it's a SQL database, you might like to use an ORM to make that easier.

However, until you hit this problem, just store everything in memory and serialise to disk. I suggest using cPickle or an ORM + SQLite.
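
A minimal sketch of that serialise-to-disk step, assuming the in-memory structure is a dict (pickle is the Python 3 name for what cPickle was in Python 2):

    import pickle  # cPickle in Python 2; in Python 3 the plain pickle module already uses the C implementation

    ranks = {doc_id: 1.0 for doc_id in range(1000)}   # whatever structure you hold in memory

    # Serialise the whole structure to disk...
    with open("ranks.pickle", "wb") as f:
        pickle.dump(ranks, f, protocol=pickle.HIGHEST_PROTOCOL)

    # ...and load it back later.
    with open("ranks.pickle", "rb") as f:
        ranks = pickle.load(f)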

Upvotes: 2

leeladam

Reputation: 1758

Looking up data takes O(1) time in a dictionary thanks to the built-in hashing of keys. Of course, with a large amount of data there will be some hash collisions, which add overhead to resolve, but a dict with 10M items should still work fine. Do not search for data in long lists, because that takes linear (O(n)) time.

However, consider using numpy depending on what you plan to do with your data. If you only need to store and retrieve values, dicts are perfect, but calculations over tons of data can be greatly accelerated using numpy's vectorization instead of Python loops.
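
For instance, if the database IDs are used directly as array indices, a rough sketch of the vectorized version (assuming numpy is installed; the 0.15 + 0.85 * ranks line is only a stand-in for a pagerank-style update) would be:

    import numpy as np

    N = 1_100_000
    ranks = np.ones(N)             # index = database ID, value = current rank

    # Vectorized update: one C-level operation across all 1.1M values...
    ranks = 0.15 + 0.85 * ranks

    # ...versus the equivalent pure-Python loop, which is far slower:
    # for i in range(N):
    #     ranks[i] = 0.15 + 0.85 * ranks[i]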

SQL comes into play when you need to do more complicated queries (searching for multiple keys or defining conditions to match). For simple key-value pairs, SQL seems to be overkill.

Upvotes: 2
