ocp1000
ocp1000

Reputation: 601

Effectively search if part of a tuple exist in a list of tuples

I have a tuple list which contains tuples of 6 digits, ranging from 01 to 99. For example:

tuple_list = {(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)}

For every tuple on this list I need to effectively search if there are any other tuples that have at least 5 numbers in common with it (excluding the searched number). So for the above example, the result will be:

(01,02,03,04,05,06) -> (02,03,04,05,06,99)
(20,22,24,26,28,30) -> []
(02,03,04,05,06,99) -> (01,02,03,04,05,06)

The list itself is big and can hold up to 1,000,000 records.
I tried the naive approach of scanning the list one-by-one, but this has an O(n^2) complexity and takes a lot of time.
I thought about maybe using a dict but I can't find a way to search for part of a key (it would have worked fine if I needed to search for the exact key). Maybe some sort of a suffix/prefix tree variation is needed, but I can't seem to figure it out.

Any help will be appreciated.

Upvotes: 0

Views: 103

Answers (2)

Aziz Sonawalla
Aziz Sonawalla

Reputation: 2502

The code below generates a dict where they key is a 5-tuple and the value is a list of all the tuples that have those 5 elements.

It runs in O(nm) where n is the size of the tuple list and m is the size of each tuple. For 6-tuples, it runs in O(6n). See test results below

def getCombos(tup):
    """
    Produces all combinations of the tuple with 1 missing
    element from the original
    """
    combos = []
    # sort the input tuple here if it's not already sorted
    for i in range(0, len(tup)):
        tupAsList = list(tup)
        del tupAsList[i]
        combos.append(tupAsList)
    return combos
    
def getKey(combo):
    """
    Creates a string key for a given combination
    """
    strCombo = [str(i) for i in combo]
    return ",".join(strCombo)

def findMatches(tuple_list):
    """
    Returns dict of tuples that match
    """
    matches = {}

    for tup in tuple_list:
        combos = getCombos(tup)
        for combo in combos:
            key = getKey(combo)
            if key in matches:
                matches[key].append(tup)
            else:
                matches[key] = [tup]
                
    # filter out matches with less than 2 elements (optional)
    matches = {k: v for k, v in matches.items() if len(v) > 1}

    return matches
    
    
tuple_list = [(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)]

print(findMatches(tuple_list)) # output: {'2,3,4,5,6': [(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 99)]}

I tested this code against the brute force solution. For 1000 tuples, the brute force version took 5.5s whereas this solution took 0.03s. See repl here

You can rearrange the output by iterating through the values but that may be unnecessary depending on how you're using it

Upvotes: 1

Prune
Prune

Reputation: 77910

This process is inherently O(N^2): you're making a comparison of N items to each of the other N-1 items. This is a distance metric, and the same theoretical results apply (you can look up all-to-all distance algorithms on Stack Overflow and elsewhere). In most cases, there is not enough information to gather from f(A, B) and f(B, C) to predict whether f(A, C) is greater or less than 5.

First of all, quit using tuples: they don't match your use case. Tuples are indexed, and you don't care about the ordering. (01, 02, 03, 04, 05, 06) and (05, 99, 03, 02, 01, 06) match in five numbers, despite having only two positional matches.

Use the natural data types: sets. Your comparison operation is len(A.intersection(B)). Note that you can flip the logic to a straight distance metric: mismatch = len(A-B) and have a little triangle logic, given that all the sets are the same size (see "triangle inequality"). For instance, if len(A-B) is 1, then 5 numbers match. If you also get len(A-C) is 5, then you know that that C differs from B in either 4 or 5 numbers, depending on which number did match.

Given the sparsity of your sets (6 number from at least 99), you can gain a small amount of performance here ... but the overhead and extra checking will likely consume your savings, and the resulting algorithm is still O(N^2).

Upvotes: 0

Related Questions