Aleister Crowley
Aleister Crowley

Reputation: 721

Quick way to compare two big lists of dictionaries?

I have two fairly big list of dictionaries, the length of both is around 1 million dictionaries. What I want to do, is to compare both lists and to detect whether the list1 has any dictionaries that are not present in the list2. I am using the following code to achieve this:

def compare_lists(list1, list2):
    new_items = [i for i in list1 if i not in list2]
    return new_items

It works as intended, but the problem is, it is very slow - with the length of both lists, it takes over an hour to run the comparison.

Is there a way to make it run quicker? I have to compare full dictionaries, not only certain items, as each key:value pair can possibly differ across the two lists.

Upvotes: 1

Views: 2873

Answers (1)

DarrylG
DarrylG

Reputation: 17156

Approach

Using the idea from this answer Convert a list of dictionaries into a set of dictionaries

  • Dictionaries in two lists are serialized to string (using json)
  • Serialized dictionaries are placed into two sets corresponding to each list
  • Difference is computed between the two sets
  • Find new elements using set difference
  • Deserialize elements in the set different to get list of new elements

Code

from json import dumps, loads

def find_difference(lst1, lst2):
    # Returns elements in lst1 which are not in lst2
    set1 = dics_to_set(lst1)
    set2 = dics_to_set(lst2)
    
    # Deserialize elements in set1 that are not in set2
    return [loads(x) for x in set1.difference(set2)]  # items in set1 that are not in set2

def dics_to_set(lst):
    '''
        Convert list of dicts to set
    '''
    return set(dumps(x, sort_keys=True) for x in lst)  # sort_keys to control order of keys

Performance

Summary

  • With 100K values 2600X faster (2.06 s vs. 9min 3s)

Test setup:

  • List 2: 100K random dictionaries each with 5 keys
  • List 1: copy of List 2 plus one additional random dictionary

Test Code

def rand_dicts(n):
    '''
        Create random dictionary of n elements
    '''
    mydict = {}
    for i in range(n):
        mydict[f'key{i}'] = randrange(100)

    return mydict

# List of random dictionaries with 5 elements each
lst2 =  [rand_dicts(5) for _ in range(100000)]

# Copy of list 2 with one more random dictionary added
lst1 = lst2 + [rand_dicts(1)] 

Timing using timeit module

# Test of Posted Code
%timeit [x for x in lst1 if x not in lst2] 
# Output: 9min 3s ± 13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Test of proposed code
%timeit find_difference(lst1, lst2)
Output: 2.06 s ± 90.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 6

Related Questions