Reputation: 4479
I have a large collection of data, about 10 million entries and part of my program required very many membership checks...
if a in data:
return True
return False
right now I have data as dictionary entries with all their values equal to '1'
I also have a program that uses an algorithm to figure out the same information, but for now it is slower then the dictionary method however I expect the size of data to continue growing...
For my current dictionary solution, would type(data) as a frozenset, or set (or something else?) be faster?
And for the future to find out when I need to switch to my program, does anyone know how the speed of checking membership correlated with increasing the size of a type that's hashable? Is a dictionary with 1 billion entries still fast?
Upvotes: 10
Views: 1704
Reputation: 189
There are several bytes overhead per entry in a hash-able (whether dictionary or set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+Gb of memory for the application. I would start looking for a fast DB
For frozenset you also need to have all data in memory in some acceptable form at creation time, which probably doubles the amount of mem needed
Upvotes: 2
Reputation: 21
For this amount of data RyPeck is right - a DB will do the job much better.
One more point: Something seems odd to me in what you've written: If you use a dictionary to store the objects of the memberships, what the value of said key-value pair in the dictionary is '1'? Shouldn't the key-value pair of the dictionary be: "id of a"-"a" where 'a' is the object.
Upvotes: 2
Reputation: 8147
On Principal
If you expect the data to keep growing you can't use a frozenset.
A set would be smaller than a dictionary storage wise for testing if an element exist in it. It would be similar in speed to a dictionary lookup since the keys and items of a set are both hashed for storage and always unique. If you don't need data associated with the username, use a set.
Practically speaking...
When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.
Upvotes: 7