Reputation: 749
I have a dictionary "update" algorithm that I suspect is inefficient: as my program runs and I continuously merge each new dictionary into my existing dictionary, performance degrades significantly over time. I'd like to find a more performant approach.
I have a loop where each iteration processes a file and produces a "dict of lists of dicts": each top-level key maps to a list, and each item in that list is itself a dict, of which there can be several. In the example below, the list under 'B' contains two dicts. I might process the first file and get this result:
{'A': [{'filename': 6311, 'id': 6634, 'num_transactions': 4969, 'total': 7808}],
'B': [{'filename': 6311, 'id': 3578, 'type': 8268, 'diameter': 2281, 'width': 4617},
{'filename': 6311, 'id': 2289, 'type': 1553, 'diameter': 4104, 'width': 8725}]}
Then I might process another file and get this:
{'C': [{'filename': 7775, 'id': 177, 'count': 6139, 'needed': 7905}],
'B': [{'filename': 7775, 'id': 7540, 'type': 9854, 'diameter': 3729, 'width': 9145},
{'filename': 7775, 'id': 27, 'type': 2380, 'diameter': 7209, 'width': 6023}]}
I then merge these dicts into a master dict, continuously concatenating the lists that share a key. Merging the two dicts above would produce (ordering is arbitrary, but sorted here for readability):
{'A': [{'filename': 6311, 'id': 6634, 'num_transactions': 4969, 'total': 7808}],
'B': [{'filename': 6311, 'id': 3578, 'type': 8268, 'diameter': 2281, 'width': 4617},
{'filename': 6311, 'id': 2289, 'type': 1553, 'diameter': 4104, 'width': 8725},
{'filename': 7775, 'id': 7540, 'type': 9854, 'diameter': 3729, 'width': 9145},
{'filename': 7775, 'id': 27, 'type': 2380, 'diameter': 7209, 'width': 6023}],
'C': [{'filename': 7775, 'id': 177, 'count': 6139, 'needed': 7905}]}
Note that I must end up with a final master_dict that contains the combined data across all of my dictionaries; this is non-negotiable.
Below is a complete program that generates random cur_dicts and continuously adds their results to the master_dict. The function add_to_master_dict() represents my update algorithm.
import random
import timeit

import matplotlib.pyplot as plt

random.seed(0)

a_keys = ['id', 'num_transactions', 'total']
b_keys = ['id', 'type', 'diameter', 'width']
c_keys = ['id', 'count', 'needed']
key_dict = {'A': a_keys, 'B': b_keys, 'C': c_keys}

def generate_cur_dict(key_dict):
    cur_dict = {}
    filename_int = random.randint(0, 10000)
    # list() is needed on Python 3.11+, where random.sample() no longer
    # accepts dict views
    for main in random.sample(list(key_dict),
                              random.randint(1, len(key_dict))):
        cur_dict[main] = []
        num_rows = random.choice([1, 1, random.randint(1, 3)])
        for _ in range(num_rows):
            temp_dict = {}
            temp_dict['filename'] = filename_int
            for k in key_dict[main]:
                temp_dict[k] = random.randint(0, 10000)
            cur_dict[main].append(temp_dict)
    return cur_dict

# Hacky use of variable scope by assuming the existence of cur_dict and
# master_dict, but it's the easiest way to pass them to timeit
def add_to_master_dict():
    if not master_dict:  # master_dict is empty
        master_dict.update(cur_dict)
    else:
        for k in cur_dict.keys():
            if k in master_dict:
                # In case of a None value rather than a list
                if cur_dict[k] is None:
                    continue
                else:
                    # Combine the two lists based on key
                    master_dict[k] = master_dict[k] + cur_dict[k]
            else:
                # If the key is not in the master dict, just add the
                # cur_dict value to the master_dict
                master_dict[k] = cur_dict[k]

master_dict = {}
times = []
for i in range(50001):
    cur_dict = generate_cur_dict(key_dict)
    times.append(timeit.timeit(add_to_master_dict, number=1))
    # Easy visual way to see how much it slows down over time
    if i % 1000 == 0:
        print(i)

plt.figure(figsize=(10, 6))
plt.plot(times)
plt.show()
I know this isn't the most elegant way to use timeit - I'm not averaging multiple executions, so there's a lot of variation - but I'm just trying to demonstrate the concept. It should be clear that if you run this for any significant number of iterations, add_to_master_dict() gets bogged down quite a bit, so the total cost of my updates is likely growing superlinearly (probably quadratically).
Any suggestions for how I might perform my update operation in a way that (hopefully) achieves linear time? I've been able to find dict/list update algorithms that perform well in simple cases, but nothing for my dict-of-lists use case.
Upvotes: 1
Views: 56
Reputation: 55799
This line
master_dict[k] = master_dict[k] + cur_dict[k]
creates a brand-new list every time it's executed. Extending the existing list in place
master_dict[k] += cur_dict[k]
is much faster. On my machine, execution time went from 1 min 46.857 seconds to 8.027 seconds.
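Not from the original post, but for reference, here is a minimal sketch of the same merge written with dict.setdefault() and list.extend(); the function name and explicit parameters are my own choices, and taking the dicts as arguments also removes the reliance on globals and the empty-master_dict special case:

def add_to_master_dict(master_dict, cur_dict):
    for k, rows in cur_dict.items():
        # Preserve the original None-value guard
        if rows is None:
            continue
        # setdefault() returns the existing list for k, or inserts and
        # returns a fresh [] if k is new; extend() then appends the rows
        # in place, at amortized O(1) per element
        master_dict.setdefault(k, []).extend(rows)

A side benefit: because a new key starts from a fresh [] that is then extended, master_dict never aliases the lists inside cur_dict, which the original master_dict.update(cur_dict) fast path does.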
I'm not an expert, but I suspect the work per element is roughly* linear in both versions. The difference is that the original code must construct a new list of length n + k on every execution of that line, so each update costs time proportional to the accumulated length n and the total work grows quadratically, whereas in the improved version the existing list is extended by only k elements, which requires far less memory allocation and object construction.
* Extending a list runs in amortized linear time - see https://wiki.python.org/moin/TimeComplexity
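To see the growth difference in isolation, here is a hypothetical micro-benchmark (my own illustration, not the timings quoted above) that times repeated concatenation against repeated in-place extension of a single growing list:

import timeit

# 'a = a + b' rebuilds the whole accumulated list on every statement, so
# the total work across all iterations is quadratic in the iteration count
concat = timeit.timeit('a = a + b', setup='a = []; b = [0]', number=20000)

# 'a += b' extends the list in place at amortized O(1) per appended
# element, so the total work is linear
inplace = timeit.timeit('a += b', setup='a = []; b = [0]', number=20000)

print(f'a = a + b: {concat:.4f} s')
print(f'a += b:    {inplace:.4f} s')

The gap between the two widens as the iteration count grows, which matches the slowdown visible in the question's plot.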
Upvotes: 1