derived bee
derived bee

Reputation: 21

Fastest way to find indices of highest value in a matrix iteratively and exclusionarily

I'm attempting to find the "best hits" in a similarity matrix (i.e. an mxn matrix where index along each axis corresponds to the ith position in vector m and the jth position in vector n). The simplest way to explain this is finding the indices of the highest values in a matrix iteratively, excluding previously selected rows and columns. This results in min(m,n) indices chosen.

Here's my minimum reproducible example of my current implementation, using pandas:

import numpy as np
import pandas as pd

def pairwise_best_hit(sim):
    xdim,ydim = np.meshgrid(np.arange(sim.shape[1]),np.arange(sim.shape[0]))
    table = np.vstack((sim.ravel(),xdim.ravel(),ydim.ravel())).T
    df = pd.DataFrame(table).rename(columns={0:'sim',1:'index2',2:'index1'}).sort_values('sim',ascending=False)
    seq1_hits = []
    seq2_hits = []
    while len(df):
        index1 = df.iloc[0]['index1']
        index2 = df.iloc[0]['index2']
        seq1_hits.append(index1)
        seq2_hits.append(index2)
        df = df[(df['index1']!=index1)&(df['index2']!=index2)]
    return [seq1_hits,seq2_hits]

and for a matrix

sim = np.array([[1,5,6,2],[7,10,3,4],[1,5,3,7]])
pairwise_best_hit(sim)

returns

[[1, 2, 0], [1, 3, 2]]

Upvotes: 2

Views: 87

Answers (4)

Axel Donath
Axel Donath

Reputation: 1723

This basically the same answer as from @Julien. However rewritten as a recursive approach, which I think is slightly cleaner in intent:

def pairwise_best_hit(arr, indices):
    """Find the maximum value in a 2D array recursively."""
    if not np.any(np.isfinite(arr)):
        return np.transpose(indices)
    
    idx_max = np.argmax(arr)
    idxs = np.unravel_index(idx_max, arr.shape)
    arr[idxs[0], :] = -np.inf
    arr[:, idxs[1]] = -np.inf
    indices.append(idxs)
    return pairwise_best_hit(arr, indices)

pairwise_best_hit(sim.astype(float),  indices=[])

Which returns:

array([[1, 2, 0],
       [1, 3, 2]])

Upvotes: 0

Marce Puente
Marce Puente

Reputation: 639

This is probably not the most efficient way to do it (although it runs faster in my tests)... I also don't know if the order of the result has to be the same as the example provided (the order I get is different)...

Explanation:
First, we create two lists equal in size to the number of rows in the received matrix (outA and outB), in which we will store the obtained values, and a variable max where we will store the maximum value found in each row.

The outer for stores the row number in outA, while the inner for iterates through the current row, updating the values ​​of max and outB if necessary.

Once the inner for completes, the value of max is reset, and execution continues, finally returning a matrix created "on the fly" with both lists.

sim = [
    [ 1, 5, 6, 2 ],
    [ 7, 10, 3, 4 ],
    [ 1, 5, 3, 7 ]
]

def pairwise_best_hit( sim ):
    max = 0
    pos = [ 0, 0 ]
    outA = [ None ] * len( sim )
    outB = [ None ] * len( sim )
    
    for i in range( len( sim )):
        outA[ i ] = i
        for k in range( len( sim[ i ] )):
            if sim[ i ][ k ] > max:
                max = sim[ i ][ k ]
                outB[ i ] = k
        max = 0

    return [ outA, outB ]

print( pairwise_best_hit( sim )) ## print -> [[0, 1, 2], [2, 1, 3]]

Upvotes: 0

LastDuckStanding
LastDuckStanding

Reputation: 21

Your solution looks like it is O(min(m, n) * mn), because every time you are going through the whole DataFrame and removing those entries that are in the same row and column and recreating a new DataFrame. it looks like Julien's solution has this complexity (except that it probably uses parallel processing and SIMD well) as well because it still goes through the whole array for the argmax.

To achieve O(mn * log(mn)) just don't modify your DataFrame, but iterate through the entries one by one from the highest value and discard it when it is already in the same row or column as the entries you extracted. You can use two sets (or array of flags) to store the previously chosen rows and columns respectively.

import timeit
import numpy as np
import pandas as pd


def original(sim):
    xdim, ydim = np.meshgrid(np.arange(sim.shape[1]), np.arange(sim.shape[0]))
    table = np.vstack((sim.ravel(), xdim.ravel(), ydim.ravel())).T
    df = pd.DataFrame(table).rename(columns={
        0: 'sim',
        1: 'index2',
        2: 'index1'
    }).sort_values('sim', ascending=False)
    seq1_hits = []
    seq2_hits = []
    while len(df):
        index1 = df.iloc[0]['index1']
        index2 = df.iloc[0]['index2']
        seq1_hits.append(index1)
        seq2_hits.append(index2)
        df = df[(df['index1'] != index1) & (df['index2'] != index2)]
    return np.asanyarray([seq1_hits, seq2_hits])


def sort_w_set(mat):
    idxs = np.argsort(mat, None)
    hit_is = []
    hit_js = []
    hit_is_set = set()
    hit_js_set = set()
    num_entries = min(mat.shape[0], mat.shape[1])
    for idx in reversed(idxs):
        i, j = idx // mat.shape[1], idx % mat.shape[1]
        if i in hit_is_set or j in hit_js_set:
            continue
        hit_is.append(i)
        hit_is_set.add(i)
        hit_js.append(j)
        hit_js_set.add(j)
        if len(hit_is) == num_entries:
            break
    return np.asanyarray([hit_is, hit_js])


def julien(sim: np.ndarray):
    out = []
    copy = sim.copy()
    for _ in range(min(copy.shape)):
        ij = np.unravel_index(copy.argmax(), copy.shape)
        out.append(ij)
        copy[ij[0]] = -1.0
        copy[:, ij[1]] = -1.0
    return np.transpose(out)


def main():
    mat = np.array([[1, 5, 6, 2], [7, 10, 3, 4], [1, 5, 3, 7]])
    np_rand = np.random.default_rng()
    mat = np_rand.random([100, 100])
    res = None
    for f in [sort_w_set, julien, original]:
        res2 = f(mat)
        if res is None:
            res = res2
        else:
            assert np.all(res2 == res)
    for m in [100, 200, 500, 1000, 2000]:
        np_rand = np.random.default_rng()
        mat = np_rand.random([m, m])
        for f in [sort_w_set, julien, original]:
            print(f.__name__,
                  timeit.timeit((lambda f2: lambda: f2(mat))(f), number=1))


if __name__ == "__main__":
    main()

To achieve O(min(m, n) * log(mn))) you probably need to do something much more complicated (or maybe it's impossible).

Results:

sort_w_set 0.001746996000292711
julien     0.0006907330025569536
original   0.04714291199707077
sort_w_set 0.008418075001827674
julien     0.0040397619995928835
original   0.1409220829991682
sort_w_set 0.1129061859974172
julien     0.0967205240012845
original   1.0798262329990393
sort_w_set 0.4717207870016864
julien     0.6536550560012984
original   5.681719337997492
sort_w_set 1.68841458800307
julien     4.9140408969979035
original   39.55271187499966

Upvotes: 2

Julien
Julien

Reputation: 15226

Here's a candidate using numpy only, about 100x faster on your small example:

import numpy as np
import pandas as pd
import timeit

sim = np.array([[1,5,6,2],[7,10,3,4],[1,5,3,7]])

def mine(sim):
    out = []
    copy = sim.copy()
    MIN = np.iinfo(copy.dtype).min
    for _ in range(min(copy.shape)):
        ij = np.unravel_index(copy.argmax(), copy.shape)
        out.append(ij)
        copy[ij[0]] = MIN
        copy[:,ij[1]] = MIN
    return np.transpose(out)

def yours(sim):
    xdim,ydim = np.meshgrid(np.arange(sim.shape[1]),np.arange(sim.shape[0]))
    table = np.vstack((sim.ravel(),xdim.ravel(),ydim.ravel())).T
    df = pd.DataFrame(table).rename(columns={0:'sim',1:'index2',2:'index1'}).sort_values('sim',ascending=False)
    seq1_hits = []
    seq2_hits = []
    while len(df):
        index1 = df.iloc[0]['index1']
        index2 = df.iloc[0]['index2']
        seq1_hits.append(index1)
        seq2_hits.append(index2)
        df = df[(df['index1']!=index1)&(df['index2']!=index2)]
    return [seq1_hits,seq2_hits]

assert np.all(mine(sim) == yours(sim))

%timeit yours(sim)
# 1.05 ms ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit mine(sim)
# 8.18 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Comparison slowly degrades for larger arrays, but stays ahead (still 10x faster on 1000x1000 arrays):

sim = np.arange(10000)
np.random.shuffle(sim)
sim.shape = 100,100
%timeit yours(sim)
# 26.2 ms ± 64.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mine(sim)
# 397 µs ± 534 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

sim = np.arange(1000000)
np.random.shuffle(sim)
sim.shape = 1000,1000
%timeit yours(sim)
# 2.45 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit mine(sim)
#203 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 2

Related Questions