glpsx

Reputation: 679

Unpacking sparse matrix performance tuning

I'm using the sparse_dot_topn library created by the Data Scientists at ING to search for near duplicates in a large set of company names (nearly 1.5M records). A recent update of this library makes it possible to use multiple threads to compute the cross-product (i.e., the cosine similarity) between the two matrices. I ran a quick benchmark and the performance improvement is significant (depending on how many cores one can use on their machine/remote server):

+-----------+--------------+
| # threads | time (%M:%S) |
+-----------+--------------+
| 32        | 03:43:12     |
+-----------+--------------+
| 16        | 05:16:97     |
+-----------+--------------+
| 8         | 08:11:69     |
+-----------+--------------+
| 4         | 13:32:72     |
+-----------+--------------+
| 2         | 24:02:28     |
+-----------+--------------+
| 1         | 47:11:30     |
+-----------+--------------+
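For context, the threaded computation is invoked roughly like this (a sketch only: the vectorizer settings and the ntop/lower_bound values are illustrative, company_names stands in for my actual list of names, and the use_threads/n_jobs parameters may differ slightly between library versions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sparse_dot_topn import awesome_cossim_topn

company_names = ["acme ltd", "acme limited", "globex corp"]  # stand-in for my ~1.5M names

# character n-gram TF-IDF; the exact vectorizer settings here are illustrative
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(company_names)

# use_threads/n_jobs switch on the multi-threaded sparse matrix multiplication
matches = awesome_cossim_topn(tfidf, tfidf.T, ntop=10, lower_bound=0.8,
                              use_threads=True, n_jobs=32)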

In order to easily explore the outcome, I needed to unpack the resulting sparse matrix. Fortunately, I found the following helper function written by Chris van den Berg which does exactly that (link to Chris's blog post here):

import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)

    # look up the two company names and the similarity score for each non-zero entry
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]

    return pd.DataFrame(
        {"left_side": left_side, "right_side": right_side, "similairity": similairity}
    )

The above function has an optional argument to look only at the first n values, but I have to run it on the full data. My issue at the moment is that this takes a very long time to complete (roughly 1 hour).

Q: How could I increase the performance here (if possible), especially since I have quite a few cores that I'm not using for this job?

I'm no expert when it comes to performance tuning. One option I explored is Numba. I decorated the function with @njit(parallel=True) and used Numba's prange instead of range to specify that the loop can be parallelized, but this failed. My understanding is that Numba can't handle string values (i.e., the company names in my case).
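For reference, the decorated version looked roughly like this (a reconstruction for illustration; it fails to compile because the object-dtype arrays holding the company names are not supported in Numba's nopython mode):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def get_matches_numba(sparserows, sparsecols, data, name_vector, nr_matches):
    # the object-dtype arrays are what makes this fail: nopython mode cannot
    # allocate or fill arrays of Python strings
    left_side = np.empty(nr_matches, dtype=object)
    right_side = np.empty(nr_matches, dtype=object)
    similairity = np.zeros(nr_matches)
    for index in prange(nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = data[index]
    return left_side, right_side, similairity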

Any help on possible approaches to increase the performance would be greatly appreciated.

Upvotes: 1

Views: 368

Answers (2)

meTchaikovsky

Reputation: 7666

I assume the sparse_matrix is a correlation matrix, so sparse_matrix is symmetric.

First, create a name_vector and sparse_matrix to work with

import numpy as np
import string

N = 10

# create an array of names
name_vector = np.array(list(string.ascii_lowercase)[:N])
# create a correlation matrix (which is obviously symmetric)
sparse_matrix = np.random.rand(N,N)
sparse_matrix = (sparse_matrix + sparse_matrix.T)/2
zeros_mask = np.where(np.random.rand(N,N)>=0.5,False,True)
sparse_matrix[zeros_mask] = 0.

As you can see, the name_vector is an array

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='<U1')

which corresponds to the names of 10 companies. sparse_matrix is symmetric by construction, and some of its entries are zeroed out by the zeros_mask assignment.

With the two ingredients, here is my solution

top = None 

non_zeros = sparse_matrix.nonzero()
sparserows = non_zeros[0]
sparsecols = non_zeros[1]
# flat (row-major) indices of the non-zero entries, used with np.take below
sparse_idx = sparserows * sparse_matrix.shape[1] + sparsecols

if top:
    nr_matches = top
else:
    nr_matches = sparsecols.size

left_side = name_vector[sparserows[:nr_matches]]
right_side = name_vector[sparsecols[:nr_matches]]
similairity = np.take(sparse_matrix,sparse_idx[:nr_matches])

pd.DataFrame({"left_side": left_side, 
              "right_side": right_side, 
              "similairity": similairity})

and the DataFrame that is returned looks like

left_side   right_side  similairity
0   a   c   0.760297
1   a   d   0.441365
2   a   g   0.669365
3   b   a   0.221993
4   b   c   0.840993
...

Since advanced indexing is used instead of a Python for loop, this will be much faster.
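A quick way to sanity-check the speed-up on a larger random matrix (purely illustrative: the company_... names, the matrix size, and the 5% density are made up):

import time

import numpy as np
import pandas as pd

N = 2000
name_vector = np.array([f"company_{i}" for i in range(N)], dtype=object)
sparse_matrix = np.random.rand(N, N)
sparse_matrix[np.random.rand(N, N) >= 0.05] = 0.0

non_zeros = sparse_matrix.nonzero()
sparserows, sparsecols = non_zeros
nr_matches = sparsecols.size

# loop-based unpacking, as in the original helper function
start = time.perf_counter()
left = np.empty(nr_matches, dtype=object)
right = np.empty(nr_matches, dtype=object)
sim = np.zeros(nr_matches)
flat_data = sparse_matrix[sparserows, sparsecols]
for index in range(nr_matches):
    left[index] = name_vector[sparserows[index]]
    right[index] = name_vector[sparsecols[index]]
    sim[index] = flat_data[index]
print(f"loop:       {time.perf_counter() - start:.3f}s")

# vectorised unpacking with advanced indexing
start = time.perf_counter()
sparse_idx = sparserows * sparse_matrix.shape[1] + sparsecols
df = pd.DataFrame({"left_side": name_vector[sparserows],
                   "right_side": name_vector[sparsecols],
                   "similairity": np.take(sparse_matrix, sparse_idx)})
print(f"vectorised: {time.perf_counter() - start:.3f}s")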

Upvotes: 1

CJR

Reputation: 3985

Without example data I can't be sure this is what you're looking for, but I think it is. I'm confused about top in your example, because it just takes the first results rather than the results with the largest values.

import pandas as pd
from scipy import sparse
import random
import string

# a random sparse "similarity" matrix in COO format and random 6-character names for demonstration
arr = sparse.random(100, 100, density=0.02).tocoo()
name_vec = pd.Series(''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(6)) for _ in range(100))

pd.DataFrame({"left_side": name_vec[arr.row].tolist(), 
              "right_side": name_vec[arr.col].tolist(), 
              "similairity": arr.data})

In terms of runtime you might be able to clean this up more by avoiding the series -> list -> series step.
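One way to avoid that round trip (a sketch that reuses arr and name_vec from the snippet above, assuming a plain NumPy array of names is acceptable):

# index a NumPy array of names directly instead of going Series -> list -> Series
name_arr = name_vec.to_numpy()

df = pd.DataFrame({"left_side": name_arr[arr.row],
                   "right_side": name_arr[arr.col],
                   "similairity": arr.data})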

Upvotes: 1
