Reputation: 721
I have two fairly big list of dictionaries, the length of both is around 1 million dictionaries. What I want to do, is to compare both lists and to detect whether the list1 has any dictionaries that are not present in the list2. I am using the following code to achieve this:
def compare_lists(list1, list2):
new_items = [i for i in list1 if i not in list2]
return new_items
It works as intended, but the problem is, it is very slow - with the length of both lists, it takes over an hour to run the comparison.
Is there a way to make it run quicker? I have to compare full dictionaries, not only certain items, as each key:value pair can possibly differ across the two lists.
Upvotes: 1
Views: 2873
Reputation: 17156
Approach
Using the idea from this answer Convert a list of dictionaries into a set of dictionaries
Code
from json import dumps, loads
def find_difference(lst1, lst2):
# Returns elements in lst1 which are not in lst2
set1 = dics_to_set(lst1)
set2 = dics_to_set(lst2)
# Deserialize elements in set1 that are not in set2
return [loads(x) for x in set1.difference(set2)] # items in set1 that are not in set2
def dics_to_set(lst):
'''
Convert list of dicts to set
'''
return set(dumps(x, sort_keys=True) for x in lst) # sort_keys to control order of keys
Performance
Summary
Test setup:
Test Code
def rand_dicts(n):
'''
Create random dictionary of n elements
'''
mydict = {}
for i in range(n):
mydict[f'key{i}'] = randrange(100)
return mydict
# List of random dictionaries with 5 elements each
lst2 = [rand_dicts(5) for _ in range(100000)]
# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)]
Timing using timeit module
# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2]
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Test of proposed code
%timeit find_difference(lst1, lst2)
Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 6