lonewarrior556
lonewarrior556

Reputation: 4479

Dictionary, set or frozenset?

I have a large collection of data, about 10 million entries and part of my program required very many membership checks...

if a in data:
    return True
return False

right now I have data as dictionary entries with all their values equal to '1'

I also have a program that uses an algorithm to figure out the same information, but for now it is slower then the dictionary method however I expect the size of data to continue growing...

For my current dictionary solution, would type(data) as a frozenset, or set (or something else?) be faster?

And for the future to find out when I need to switch to my program, does anyone know how the speed of checking membership correlated with increasing the size of a type that's hashable? Is a dictionary with 1 billion entries still fast?

Upvotes: 10

Views: 1704

Answers (3)

Zelda
Zelda

Reputation: 189

There are several bytes overhead per entry in a hash-able (whether dictionary or set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+Gb of memory for the application. I would start looking for a fast DB

For frozenset you also need to have all data in memory in some acceptable form at creation time, which probably doubles the amount of mem needed

Upvotes: 2

yuvalb9
yuvalb9

Reputation: 21

For this amount of data RyPeck is right - a DB will do the job much better.

One more point: Something seems odd to me in what you've written: If you use a dictionary to store the objects of the memberships, what the value of said key-value pair in the dictionary is '1'? Shouldn't the key-value pair of the dictionary be: "id of a"-"a" where 'a' is the object.

Upvotes: 2

RyPeck
RyPeck

Reputation: 8147

On Principal

If you expect the data to keep growing you can't use a frozenset.

A set would be smaller than a dictionary storage wise for testing if an element exist in it. It would be similar in speed to a dictionary lookup since the keys and items of a set are both hashed for storage and always unique. If you don't need data associated with the username, use a set.

Practically speaking...

When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.

Upvotes: 7

Related Questions