Brumdog22

Reputation: 251

Efficient Dictionary Searching?

I had a quick question about the efficiency of searching through large dictionaries in Python. I am reading a large comma-separated file and getting a key and value from each line. If the key is already in the dictionary, I add the value to the value stored in the dictionary; if the key doesn't exist, I simply add the value. Previously I was using this:

if key in data_dict.keys():
    data_dict[key] = data_dict[key] + value
else:
    data_dict[key] = value

This starts pretty fast, but as the dictionary grows it becomes slower and slower, to the point where I can't use it at all. I changed the way I search for the key in the dictionary to this:

try:
    # This raises KeyError if the key is not present
    data_dict[keyStr] = data_dict[keyStr] + load_val
except KeyError:
    data_dict[keyStr] = load_val

This is infinitely faster, and can read and write over 350,000 lines in 3 seconds.

My question was: why does if key in data_dict.keys(): take so much longer than try: data_dict[keyStr]? And why wouldn't Python use the faster lookup automatically when searching for a key in a dictionary?

Upvotes: 24

Views: 76704

Answers (9)

martineau

Reputation: 123491

As several others have noted, the problem is that key in data_dict.keys() searches the unordered list returned by the keys() method (in Python 2.x), which takes linear time, O(n), so the running time grows with the dictionary's size. On top of that, generating the list of keys itself takes longer and longer as the size goes up. (In Python 3, keys() returns a view that supports constant-time membership tests, but calling it still creates an extra object for no benefit.)

On the other hand, key in data_dict only requires constant time, O(1), on average, regardless of the size of the dictionary, because internally it does a hash table look-up. In addition, this hash table already exists, since it's part of the internal representation of dictionaries, and therefore doesn't have to be generated before using it.

Python doesn't do this automatically because the in operator only knows the type of its two operands, not their sources, so it can't automatically optimize the first case where all it sees is the key and a list.

However, in this case the issue of search speed can probably be avoided altogether by storing the data in a specialized version of a dictionary called a defaultdict found in the built-in collections module. Here's how your code might look if you used one:

from collections import defaultdict

data_dict = defaultdict(float)  # guessing the value type from the question
...
data_dict[keyStr] = data_dict[keyStr] + load_val

When there's no preexisting entry for data_dict[keyStr], one is generated automatically with a default value (0.0 for float in this example). As you can see, the code is shorter and very likely faster, all without the need for any if tests or exception handling.

Upvotes: 5

pegah

Reputation: 859

As an extra analysis, I did a simple performance test to see how the try/except method mentioned in the question compares with the proposed solution of using "if key in data_dict" instead of "if key in data_dict.keys()" (I am using Python 3.7):

    import timeit

    # Assumed setup (not shown in the original post): data_dict is a
    # large dictionary with string keys, e.g.
    # data_dict = {str(n): n for n in range(100000)}

    k = '84782005'  # this key exists in the dictionary
    def t1():
        if k in data_dict:
            pass
    def t2():
        if k in data_dict.keys():
            pass
    def t3():
        try:
            a = data_dict[k]
        except KeyError:
            pass

    print(timeit.timeit(t1, number=100000))
    print(timeit.timeit(t2, number=100000))
    print(timeit.timeit(t3, number=100000))

    >> 0.01741484600097465
    >> 0.025949209000827977
    >> 0.017266065000512754

For a key that already exists in the dictionary, the search time is about the same for try/except and the proposed solution. However, if the key does not exist:

    k = '8'  # this key does NOT exist in the dictionary
    def t1():
        if k in data_dict:
            pass
    def t2():
        if k in data_dict.keys():
            pass
    def t3():
        try:
            a = data_dict[k]
        except KeyError:
            pass

    print(timeit.timeit(t1, number=100000))
    print(timeit.timeit(t2, number=100000))
    print(timeit.timeit(t3, number=100000))

    >> 0.014406295998924179
    >> 0.0236777299996902
    >> 0.035819852999338764

Raising the exception takes much more time than even using '.keys()'! So I second the solution proposed by Mark Ransom.

Upvotes: 1

SQLesion

Reputation: 165

There is a dict method akin to the try approach that should help you: dict.get(key, default), which returns default when the key is missing:

data_dict[keyStr] = data_dict.get(keyStr, 0) + load_val
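A minimal sketch of the full accumulation loop using get() (the sample rows and the 0.0 default are assumptions, since the question doesn't show its value type; keyStr/load_val names are reused from the question):

```python
data_dict = {}

# Simulated (key, value) pairs as they might come from the CSV
rows = [("a", 1.0), ("b", 2.0), ("a", 3.5)]

for keyStr, load_val in rows:
    # get() falls back to 0.0 when the key is absent, so no
    # membership test or try/except is needed
    data_dict[keyStr] = data_dict.get(keyStr, 0.0) + load_val

print(data_dict)  # {'a': 4.5, 'b': 2.0}
```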

Upvotes: 1

Christian Heimes

Reputation: 1415

Back in the old days we used setdefault:

data_dict.setdefault(keyStr, []).append(load_val)
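A short sketch of how this groups values into lists (the sample data is made up; the keyStr/load_val names are reused from the question):

```python
data_dict = {}

for keyStr, load_val in [("a", 1.0), ("b", 2.0), ("a", 3.5)]:
    # setdefault inserts an empty list only when the key is missing,
    # then returns the stored list either way
    data_dict.setdefault(keyStr, []).append(load_val)

print(data_dict)  # {'a': [1.0, 3.5], 'b': [2.0]}
```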

Upvotes: 2

William Gaul

Reputation: 3181

This is because data_dict.keys() returns a list containing the keys in the dictionary (at least in Python 2.x). Determining whether a key is in that list requires a linear search.

Whereas accessing an element of the dict directly takes advantage of the hash table underlying dictionaries, so the lookup is almost instantaneous.

Upvotes: 3

Mark Ransom

Reputation: 308392

The problem is that for every test you're generating a new list of keys with .keys(). As the list of keys gets longer, the time required goes up. Also as noted by dckrooney, the search for the key becomes linear instead of taking advantage of the hash-table structure of the dictionary.

Replace with:

if key in data_dict:
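Applied to the question's loop, the fix might look like this (the sample rows are an assumption; the question reads them from a CSV):

```python
data_dict = {}
rows = [("a", 1.0), ("b", 2.0), ("a", 3.5)]

for key, value in rows:
    if key in data_dict:   # O(1) hash lookup; no list of keys is built
        data_dict[key] += value
    else:
        data_dict[key] = value

print(data_dict)  # {'a': 4.5, 'b': 2.0}
```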

Upvotes: 37

Corley Brigman

Reputation: 12401

You can also simply use

if key in data_dict:

instead of

if key in data_dict.keys():

As mentioned, the first is a direct hash lookup - the key's slot is computed directly and then checked - which is roughly O(1), whereas the check against keys() is a linear search, which is O(n).

In [258]: data_dict = dict([(x, x) for x in range(100000)])

In [259]: %timeit 999999 in data_dict.keys()
100 loops, best of 3: 3.47 ms per loop

In [260]: %timeit 999999 in data_dict
10000000 loops, best of 3: 49.3 ns per loop

Upvotes: 7

dckrooney

Reputation: 3121

data_dict.keys() returns an unsorted list of keys in the dictionary. Thus each time you check to see if a given key is in the dictionary, you're doing a linear search across the list of keys (an O(n) operation). The longer your list, the longer it takes to search for a given key.

Contrast this to data_dict[keyStr]. This performs a hash lookup, which is an O(1) operation. It doesn't (directly) depend on the number of keys in the dictionary; even as you add more keys, the time to check if a given key is in the dictionary stays constant.

Upvotes: 10

wflynny

Reputation: 18531

This doesn't answer the question, but rather avoids it. Try using collections.defaultdict; you won't need the if/else or the try/except.

from collections import defaultdict

data_dict = defaultdict(list)
for keyStr, load_val in data:
    data_dict[keyStr].append(load_val)

Upvotes: 4
