Godrebh
Godrebh

Reputation: 574

Matching two lists containing slightly differing float values by allowing a tolerance

I have two sorted lists containing float values. The first contains the values I am interested in (l1) and the second list contains values I want to search (l2). However, I am not looking for exact matches and I am tolerating differences based on a function. Since I have do this search very often (>>100000) and the lists can be quite large (~5000 and ~200000 elements), I am really interested in runtime. At first, I thought I could somehow use numpy.isclose(), but my tolerance is not fixed, but depending on the value of interest. Several nested for loops work, but are really slow. I am sure that there is some efficient way to do this.

#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False

#imagine another huge for loop around everything
l1  = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2  = [132.0317, 132.0318, 132.8678, 132.8861, 132.8862, 133.5851999, 133.7500]

d = {i:[] for i in l1}
for i in l1:
    for j in l2:
        if matching(i, j):
            d[i].append(j)

fyi: As an alternative to the matching function, I could also create a dictionary first, mapping the values of interest from l1 to the window (min ,max) I would allow. e.g. {132.0317:(132.0314359366, 132.0319640634), ...}, but I think checking for each value from l2 if it lies within one of the windows from this dictionary would be even slower...

This would be how to generate the dictionary containing min/max values for each value from l1:

def calcMinMaxMZ(mz, delta_ppm=2):
    minmz = mz- (mz* +delta_ppm)/1000000
    maxmz = mz- (mz* -delta_ppm)/1000000
    return minmz, maxmz

minmax_d = {mz:calcMinMaxMZ(mz, delta_ppm=2) for mz in l1}

The result may be a dictionary like this: d = {132.0317: [132.0317, 132.0318], 132.8677: [132.8678], 132.8862: [132.8862, 132.8861], 133.5852: [133.5851999], 133.7507: []} But I actually do much more, when there is a match.

Any help is appreciated!

Upvotes: 6

Views: 746

Answers (3)

Alain T.
Alain T.

Reputation: 42133

If you transpose your formula to produce a range of mz2 values for a given mz1, you could use a binary search to find the first match in the sorted l2 list, then work your way up sequentially until you reach the end of the range.

def getRange(mz1):
    minimum = mz1/(1+2/1000000) 
    maximum = mz1/(1-2/1000000)
    return minimum,maximum

l1  = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2  = [132.0317, 132.0318, 132.8678, 132.8862, 132.8861, 133.5851999, 133.7500]

l2  = sorted(l2)
from bisect import bisect_left
d = { mz1:[] for mz1 in l1 }
for mz1 in l1:
    lo,hi = getRange(mz1)
    i = bisect_left(l2,lo)
    while i < len(l2) and l2[i]<= hi:
        d[mz1].append(l2[i])
        i+=1

Sorting l2 will cost O(NlogN) and the dictionary creation will cost O(MlogN) where N is len(l2) and M is len(l1). You will only be applying the tolerance/range formula M times instead of N*M times which should save a lot of processing.

Upvotes: 1

Andrej Kesely
Andrej Kesely

Reputation: 195553

I re-implemented the for loop using itertools. For it working, the inputs must be sorted. For benchmark I generated 1000 items from <130.0, 135.0> for l1 and 100_000 items from <130.0, 135.0> for l2:

from timeit import timeit
from itertools import tee
from random import uniform

#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False

#imagine another huge for loop around everything
l1 = sorted([uniform(130.00, 135.00) for _ in range(1000)])
l2 = sorted([uniform(130.00, 135.00) for _ in range(100_000)])

def method1():
    d = {i:[] for i in l1}
    for i in l1:
        for j in l2:
            if matching(i, j):
                d[i].append(j)
    return d

def method2():
    iter_2, last_match = tee(iter(l2))
    d = {}
    for i in l1:
        d.setdefault(i, [])
        found = False
        while True:
            j = next(iter_2, None)
            if j is None:
                break
            if matching(i, j):
                d[i].append(j)
                if not found:
                    iter_2, last_match = tee(iter_2)
                    found = True
            else:
                if found:
                    break
        iter_2, last_match = tee(last_match)
    return d

print(timeit(lambda: method1(), number=1))
print(timeit(lambda: method2(), number=1))

Prints:

16.900722101010615
0.030588202003855258

Upvotes: 3

Green Cloak Guy
Green Cloak Guy

Reputation: 24691

Your lists are already sorted, so you can maybe use paradigm similar to the "Merge" part of MergeSort: keep track of the current element of both idx1 and idx2, and when one of them is acceptable, process it and advance only that index.

d = {i:[] for i in l1}
idx1, idx2 = 0, 0
while idx1 < len(l1):
    while matching(l1[idx1], l2[idx2]) and idx2 < len(l2):
        d[l1[idx1]].append(l2[idx2])
        idx2 += 1
    idx1 += 1

print(d)
# {132.0317: [132.0317, 132.0318], 132.8677: [132.8678], 132.8862: [132.8862, 132.8861], 133.5852: [133.5851999], 133.7507: []}

this is O(len(l1) + len(l2)), since it executes exactly once for each element of both lists.

The big caveat here is that this never "steps back" - if the current element of l1 matches the current element of l2 but the next element of l1 would also match the current element of l2, then the latter does not get listed. Fixing this might require adding some sort of "look-back" functionality (which would drive the complexity class up by a magnitude of n in the worst case, but would still be quicker than iterating through both lists repeatedly). However, it does work for your given dataset.

Upvotes: 0

Related Questions