Reputation: 251
I had a quick question about efficiency of searching through large dictionaries in Python. I am reading a large comma-separated file and getting a key and value from each line. If my key is already in the dictionary, I'm adding the value to the value listed in the dictionary, if the key doesn't exist in the dictionary, I simply add the value. Previously I was using this:
if key in data_dict.keys():
add values
else:
data_dict[key] = value
This starts pretty fast, but as the dictionary grows it becomes slower and slower, to the point where I can't use it at all. I changed the way I search for the key in the dictionary to this:
try:
# This will fail if key not present
data_dict[keyStr] = input_data[keyStr] + load_val
except:
data_dict[keyStr] = load_val
This is infinitely faster, and can read / write over 350,000 lines of code in 3 seconds.
My question was why does the if key in data_dict.keys():
command take sooo much longer than the call to try: data_dict[keyStr]
? And why wouldn't Python utilize the try
statement when searching for a key in a dictionary?
Upvotes: 24
Views: 76704
Reputation: 123491
As several others have noted, the problem lies in the fact that key in data_dict.keys()
uses the unordered list
returned from the keys()
method (in Python 2.x), which takes linear time O(n) to search through, meaning that the running time increases linearly with the dictionary's size, plus generating the list of keys itself will take longer and longer as the size goes up.
On the other hand, key in data_dict
only requires constant time O(1), on average, to perform a search regardless of the size of the dictionary because internally it does a hash table look-up. In addition, this hash table already exists since its part of the internal representation of dictionaries, and therefore doesn't have to be generated before using it.
Python doesn't do this automatically because the in
operator only knows the type of its two operands, not their sources, so it can't automatically optimize the first case where all it sees is the key and a list.
However, in this case the issue of search speed can probably be avoided altogether by storing the data in a specialized version of a dictionary called a defaultdict
found in the built-in collections
module. Here's how your code might look if you used one:
from collections import defaultdict
input_data = defaultdict(float) # (guessing factory type)
...
data_dict[keyStr] = input_data[keyStr] + load_val
When there's no preexisting entry for input_data[keyStr]
one will be generated automatically with a default value (0.0
for float
in this example). As you can see, the code is shorter and very likely faster, all without the need for any if
tests or exception handling.
Upvotes: 5
Reputation: 859
As an extra analysis, I did a simple performance test to see how try/except method mentioned in the question compares with the proposed solution of using "if key in data_dict" instead of "if key in data_dict.keys()" (I am using Python 3.7):
import timeit
k = '84782005' # this keys exists in the dictionary
def t1():
if k in data_dict:
pass
def t2():
if k in data_dict.keys():
pass
def t3():
try:
a = data_dict[k]
except:
pass
print(timeit.timeit(t1,number= 100000))
print(timeit.timeit(t2,number= 100000))
print(timeit.timeit(t3,number= 100000))
>> 0.01741484600097465
>> 0.025949209000827977
>> 0.017266065000512754
For the key that already exists in the dictionary, the search time seems to be the same for try/except and offered solution. However, if the key does not exist:
k = '8' # this keys does NOT exist in the dictionary
def t1():
if k in data_dict:
pass
def t2():
if k in data_dict.keys():
pass
def t3():
try:
a = data_dict[k]
except:
pass
print(timeit.timeit(t1,number= 100000))
print(timeit.timeit(t2,number= 100000))
print(timeit.timeit(t3,number= 100000))
>> 0.014406295998924179
>> 0.0236777299996902
>> 0.035819852999338764
The exception seems to take much more time than even using '.keys()'! So, I second the solution proposed by Mark.
Upvotes: 1
Reputation: 165
There is something akin to the try function that should help you:
dict.get(key, default)
data_dict[keyStr] = data_dict.get(keyStr, '') + load_val
Upvotes: 1
Reputation: 1415
Back in the old days we used setdefault
:
data_dict.setdefault(keyStr, []).append(load_val)
Upvotes: 2
Reputation: 3181
This is because data_dict.keys()
returns a list containing the keys in the dictionary (at least in Python 2.x). Which, in order to find if a key is in the list, requires a linear search.
Whereas, trying to access an element of the dict directly takes advantage of the awesome properties of dictionaries so the access is almost instantaneous.
Upvotes: 3
Reputation: 308392
The problem is that for every test you're generating a new list of keys with .keys()
. As the list of keys gets longer, the time required goes up. Also as noted by dckrooney, the search for the key becomes linear instead of taking advantage of the hash-table structure of the dictionary.
Replace with:
if key in data_dict:
Upvotes: 37
Reputation: 12401
You can also simply use
if key in data_dict:
instead of
if key in data_dict.keys():
As mentioned, the first is a direct hash lookup - the intended offset is computed directly, and then checked - it is roughly O(1), whereas the check of keys is a linear search, which is O(n).
In [258]: data_dict = dict([(x, x) for x in range(100000)])
In [259]: %timeit 999999 in data_dict.keys()
100 loops, best of 3: 3.47 ms per loop
In [260]: %timeit 999999 in data_dict
10000000 loops, best of 3: 49.3 ns per loop
Upvotes: 7
Reputation: 3121
data_dict.keys()
returns an unsorted list of keys in the dictionary. Thus each time you check to see if a given key is in the dictionary, you're doing a linear search across the list of keys (an O(n) operation). The longer your list, the longer it takes to search for a given key.
Contrast this to data_dict[keyStr]
. This performs a hash lookup, which is an O(1) operation. It doesn't (directly) depend on the number of keys in the dictionary; even as you add more keys, the time to check if a given key is in the dictionary stays constant.
Upvotes: 10
Reputation: 18531
This doesn't answer the question, but rather avoids it. Try using collections.defaultdict
. You won't need the if/else
or try/except
.
from collections import defaultdict
data_dict = defaultdict(list)
for keyStr, load_val in data:
data_dict[keyStr].append(load_val)
Upvotes: 4