Reputation: 601
I have a tuple list which contains tuples of 6 digits, ranging from 01 to 99. For example:
tuple_list = {(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)}
For every tuple on this list I need to effectively search if there are any other tuples that have at least 5 numbers in common with it (excluding the searched number). So for the above example, the result will be:
(01,02,03,04,05,06) -> (02,03,04,05,06,99)
(20,22,24,26,28,30) -> []
(02,03,04,05,06,99) -> (01,02,03,04,05,06)
The list itself is big and can hold up to 1,000,000 records.
I tried the naive approach of scanning the list one-by-one, but this has an O(n^2)
complexity and takes a lot of time.
I thought about maybe using a dict
but I can't find a way to search for part of a key (it would have worked fine if I needed to search for the exact key).
Maybe some sort of a suffix/prefix tree variation is needed, but I can't seem to figure it out.
Any help will be appreciated.
Upvotes: 0
Views: 103
Reputation: 2502
The code below generates a dict where they key is a 5-tuple and the value is a list of all the tuples that have those 5 elements.
It runs in O(nm) where n is the size of the tuple list and m is the size of each tuple. For 6-tuples, it runs in O(6n). See test results below
def getCombos(tup):
"""
Produces all combinations of the tuple with 1 missing
element from the original
"""
combos = []
# sort the input tuple here if it's not already sorted
for i in range(0, len(tup)):
tupAsList = list(tup)
del tupAsList[i]
combos.append(tupAsList)
return combos
def getKey(combo):
"""
Creates a string key for a given combination
"""
strCombo = [str(i) for i in combo]
return ",".join(strCombo)
def findMatches(tuple_list):
"""
Returns dict of tuples that match
"""
matches = {}
for tup in tuple_list:
combos = getCombos(tup)
for combo in combos:
key = getKey(combo)
if key in matches:
matches[key].append(tup)
else:
matches[key] = [tup]
# filter out matches with less than 2 elements (optional)
matches = {k: v for k, v in matches.items() if len(v) > 1}
return matches
tuple_list = [(01,02,03,04,05,06), (20,22,24,26,28,30), (02,03,04,05,06,99)]
print(findMatches(tuple_list)) # output: {'2,3,4,5,6': [(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 99)]}
I tested this code against the brute force solution. For 1000 tuples, the brute force version took 5.5s whereas this solution took 0.03s. See repl here
You can rearrange the output by iterating through the values but that may be unnecessary depending on how you're using it
Upvotes: 1
Reputation: 77910
This process is inherently O(N^2): you're making a comparison of N items to each of the other N-1 items. This is a distance metric, and the same theoretical results apply (you can look up all-to-all distance algorithms on Stack Overflow and elsewhere). In most cases, there is not enough information to gather from f(A, B)
and f(B, C)
to predict whether f(A, C)
is greater or less than 5.
First of all, quit using tuples: they don't match your use case. Tuples are indexed, and you don't care about the ordering. (01, 02, 03, 04, 05, 06) and (05, 99, 03, 02, 01, 06) match in five numbers, despite having only two positional matches.
Use the natural data types: sets. Your comparison operation is len(A.intersection(B))
. Note that you can flip the logic to a straight distance metric: mismatch = len(A-B)
and have a little triangle logic, given that all the sets are the same size (see "triangle inequality").
For instance, if len(A-B)
is 1, then 5 numbers match. If you also get len(A-C)
is 5, then you know that that C differs from B in either 4 or 5 numbers, depending on which number did match.
Given the sparsity of your sets (6 number from at least 99), you can gain a small amount of performance here ... but the overhead and extra checking will likely consume your savings, and the resulting algorithm is still O(N^2).
Upvotes: 0