Quick way to compare two big lists of dictionaries?

Question

I have two fairly big list of dictionaries, the length of both is around 1 million dictionaries. What I want to do, is to compare both lists and to detect whether the list1 has any dictionaries that are not present in the list2. I am using the following code to achieve this:

def compare_lists(list1, list2):
    new_items = [i for i in list1 if i not in list2]
    return new_items

It works as intended, but the problem is, it is very slow - with the length of both lists, it takes over an hour to run the comparison.

Is there a way to make it run quicker? I have to compare full dictionaries, not only certain items, as each key:value pair can possibly differ across the two lists.

DarrylG · Accepted Answer

Approach

Using the idea from this answer Convert a list of dictionaries into a set of dictionaries

Dictionaries in two lists are serialized to string (using json)
Serialized dictionaries are placed into two sets corresponding to each list
Difference is computed between the two sets
Find new elements using set difference
Deserialize elements in the set different to get list of new elements

Code

from json import dumps, loads

def find_difference(lst1, lst2):
    # Returns elements in lst1 which are not in lst2
    set1 = dics_to_set(lst1)
    set2 = dics_to_set(lst2)
    
    # Deserialize elements in set1 that are not in set2
    return [loads(x) for x in set1.difference(set2)]  # items in set1 that are not in set2

def dics_to_set(lst):
    '''
        Convert list of dicts to set
    '''
    return set(dumps(x, sort_keys=True) for x in lst)  # sort_keys to control order of keys

Performance

Summary

With 100K values 2600X faster (2.06 s vs. 9min 3s)

Test setup:

List 2: 100K random dictionaries each with 5 keys
List 1: copy of List 2 plus one additional random dictionary

Test Code

def rand_dicts(n):
    '''
        Create random dictionary of n elements
    '''
    mydict = {}
    for i in range(n):
        mydict[f'key{i}'] = randrange(100)

    return mydict

# List of random dictionaries with 5 elements each
lst2 =  [rand_dicts(5) for _ in range(100000)]

# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)]

Timing using timeit module

# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2] 
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Test of proposed code
%timeit find_difference(lst1, lst2)
Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quick way to compare two big lists of dictionaries?

Answers (1)

Related Questions