Shaun Han

Reputation: 2785

Fastest way to calculate the similarity kernel between matrices of strings?

Say I have a list of 5 matrices called data. Each matrix has an arbitrary number of rows but exactly 3 columns, and every cell holds a string. I want to train a Gaussian Process model, with data as my training set, and I need to calculate the similarity kernel based on the string matching of every pair of matrices. Let's say the 5 matrices look like this:

import numpy as np

data = np.array([
    # Matrix 0
    [['b', 'a', 'c'],
     ['a', 'a', 'b'],
     ['d', 'c', 'c'],
     ['a', 'b', 'd']],
    # Matrix 1
    [['d', 'a', 'c'],
     ['a', 'b', 'c'],
     ['a', 'd', 'd']],
    # Matrix 2
    [['a', 'b', 'b'],
     ['d', 'a', 'd'],
     ['d', 'b', 'a'],
     ['c', 'b', 'd']],
    # Matrix 3
    [['b', 'b', 'c'],
     ['a', 'b', 'b'],
     ['a', 'c', 'a'],
     ['c', 'b', 'a'],
     ['b', 'd', 'd']],
    # Matrix 4
    [['a', 'b', 'c'],
     ['c', 'a', 'b'],
     ['d', 'd', 'c'],
     ['a', 'a', 'a']]
], dtype=object)

I want to calculate the similarity between every pair of matrices. Take the first two matrices as an example: they have 4 and 3 rows, respectively, so I check the string matching of all 4 x 3 row pairs. Within each row pair, the strings are compared position-wise (0-0, 1-1, 2-2 only), and each position contributes 0 if the two strings are the same and 1 otherwise. This yields a binary vector diff, which is then fed into a squared exponential (RBF) kernel to get the similarity score for that row pair. I compute the similarity scores for all row pairs (0-0, 0-1, 0-2, 1-0, ..., 3-0, 3-1, 3-2) and sum them, and this sum is the final similarity between the first and second matrices. Doing this for all pairs of matrices gives the final similarity kernel R and its normalized version K.
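For example, the score for the first row of Matrix 0 against the first row of Matrix 1 works out like this (a small standalone check using the rows from the data array above):

import math
import numpy as np

row0 = np.array(['b', 'a', 'c'], dtype=object)    # first row of Matrix 0
row1 = np.array(['d', 'a', 'c'], dtype=object)    # first row of Matrix 1
diff = (row0 != row1).astype(int)                 # array([1, 0, 0])
k = math.exp(-np.dot(diff, diff) / (2 * 1.0**2))  # sigma = 1
print(k)  # exp(-0.5) ≈ 0.6065

Below is my implementation: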

from itertools import product
import math

def kernel(data, sigma=1.):
    # Initialize the similarity kernel
    R = np.zeros(shape=(len(data), len(data)))

    # Get every pair of matrices (including themselves)
    for iprod in product(enumerate(data), enumerate(data)):
        idxs, prod = zip(*iprod)
        ks = []

        # Get every pair of rows between the two matrices
        for pair in product(*prod):
            diff = (np.asarray(pair[0]) != np.asarray(pair[1])).astype(int)

            # Squared exponential kernel
            k = math.exp(-np.dot(diff, diff) / (2*sigma**2))
            ks.append(k)

        # Calculate sum and insert it into R[i,j] and R[j,i]
        ktot = np.sum(ks)
        R[idxs[0],idxs[1]] = ktot
        R[idxs[1],idxs[0]] = ktot

    # Normalize the similarity matrix R
    d = np.diag(R)**-0.5
    K = np.diag(d).dot(R).dot(np.diag(d))

    return K

K = kernel(data)
print(K)

Output:

[[1.         0.81009275 0.71374617 0.7365101  0.81061262]
 [0.81009275 1.         0.68228781 0.70349301 0.82009247]
 [0.71374617 0.68228781 1.         0.78137859 0.68976163]
 [0.7365101  0.70349301 0.78137859 1.         0.7365101 ]
 [0.81061262 0.82009247 0.68976163 0.7365101  1.        ]]

%timeit kernel(data)
2.86 ms ± 64.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

However, my code becomes very slow when there is much more training data or many more rows in each matrix. My suspicion is that this is because I use itertools.product so heavily and there are no matrix operations. I feel like there should be a fully vectorized NumPy way to do this. I have been thinking about np.cov and np.kron but don't see how they could work for strings. Any suggestion is appreciated.

Upvotes: 1

Views: 650

Answers (2)

Lith

Reputation: 803

You can get about one order of magnitude faster, for the list provided in your example, if you replace the inner loop with vectorized calls. I do not know how to vectorize the outer loop, as data consists of differently shaped arrays... Anyway, in the following snippet we use broadcasting to operate between arrays in a pairwise fashion, as well as einsum for extra coolness ;). The improvement is not as large as that obtained by @Jérôme Richard, but it only uses NumPy!

def kernel_oneLoop(data, sigma = 1.):
    # Convert matrices to arrays first
    data = np.array([np.asarray(d, dtype=object) for d in data], dtype=object)
    # Initialize the similarity kernel
    R = np.zeros(shape=(data.size, data.size))
    # Iterate through upper triangular matrix indices
    idx0, idx1 = np.triu_indices_from(R)
    for i in range(idx0.size):
        diff = (data[idx0[i]] != data[idx1[i]][:,None]).astype(int)
        # Squared exponential kernel (no need to square as they're 0's and 1's)
        k = np.exp(-diff.sum(axis=-1) / (2*sigma**2))
        # Calculate sum and insert it into R[i,j] and R[j,i]
        R[idx0[i],idx1[i]] = k.sum()
    # Normalize the similarity matrix R
    d = np.diag(R)**-0.5
    K = np.einsum("i,ij,j->ij", d, R, d)
    # Symmetrize
    K = K + K.T - np.diag(np.diag(K))
    
    return K

Edit: as pointed out by Jérôme, since the output is a symmetric array, we can just loop over the upper triangular indices. I updated the code accordingly.
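A quick sanity check, assuming the kernel function and the data array from the question are in scope, confirms that both implementations agree:

K_loop = kernel(data)         # original implementation from the question
K_vec = kernel_oneLoop(data)
print(np.allclose(K_loop, K_vec))  # True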

Upvotes: 1

Jérôme Richard

Reputation: 50488

You can use Numba's JIT to speed up the hottest loop of the computation. However, Numba does not support strings very well yet, so the strategy is to convert the strings into factors first (unique integers identifying the strings). A further benefit of this conversion is that it significantly speeds up the row comparisons. Moreover, several additional optimizations can be applied:

  • replacement of the slow product calls with simple nested loops
  • computation of the summations on the fly
  • manual computation of the squared norm (the result of != is a 0/1 value, so no squaring is needed)
  • use of the symmetry to halve the number of operations

Here is the resulting implementation:

import numba as nb
import numpy as np
from itertools import product


# Actual computational kernel
# Work on a list of 2D arrays containing 32-bit integers contiguously stored in memory
@nb.njit('float64[:,::1](ListType(int32[:,::1]), float64)')
def computePairs(data, sigma):
    # Initialize the similarity kernel
    matCount = len(data)
    R = np.ones(shape=(matCount, matCount))
    normFactor = -1.0 / (2*sigma**2)

    # Get every pair of matrices (including themselves)
    for i in range(matCount):
        for j in range(i, matCount): # Note the matrix is symmetric
            ktot = 0.0

            # Get every pair of rows between the two matrices
            for a in data[i]:
                for b in data[j]:
                    sqNorm = 0

                    for k in range(len(a)):
                        sqNorm += a[k] != b[k]

                    # Squared exponential kernel
                    ktot += np.exp(sqNorm * normFactor)

            R[i,j] = ktot
            R[j,i] = ktot

    return R

# Transform the list of lists of lists of strings into a list of 2D factor arrays
def convertData(data):
    labelToId = {}
    labelCount = 0
    result = []
    
    for labelMat in data:
        height = len(labelMat)
        factorMat = np.empty((height, 3), dtype=np.int32)

        for rowId in range(height):
            row = labelMat[rowId]
            for cellId in range(3):
                cell = row[cellId]

                if cell in labelToId:
                    labelId = labelToId[cell]
                else:
                    labelId = labelCount
                    labelToId[cell] = labelCount
                    labelCount += 1

                factorMat[rowId, cellId] = labelId

        result.append(factorMat)

    return nb.typed.List(result)

def kernel_fast(data, sigma=1.):
    data = convertData(data)

    R = computePairs(data, sigma)

    # Normalize the similarity matrix R
    d = np.diag(R)**-0.5
    K = np.diag(d).dot(R).dot(np.diag(d))

    return K

This is about 45 times faster on my machine on the small provided input example. About 70% of the time is spent in the data conversion, which means the computation could be roughly 3 times faster still if the input were already in the factor form the algorithm expects.
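For reference, the factor encoding simply replaces each distinct string with an integer in order of first appearance. For the data array from the question, the first matrix is encoded as:

print(convertData(data)[0])
# [[0 1 2]
#  [1 1 0]
#  [3 2 2]
#  [1 0 3]]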

If you plan to work on bigger matrices, or on more of them, note that you can parallelize the above algorithm to make it even faster (it is not worth it for small inputs); a sketch is shown below.
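For instance, here is a minimal sketch of that parallelization (an untested illustration, assuming the same typed-list input as computePairs), using Numba's prange on the outer loop:

import numba as nb
import numpy as np

@nb.njit('float64[:,::1](ListType(int32[:,::1]), float64)', parallel=True)
def computePairsParallel(data, sigma):
    matCount = len(data)
    R = np.ones((matCount, matCount))
    normFactor = -1.0 / (2*sigma**2)

    # Iteration i writes only row i and column i (from index i onward),
    # so no two parallel iterations ever write to the same cell
    for i in nb.prange(matCount):
        for j in range(i, matCount):
            ktot = 0.0
            for a in data[i]:
                for b in data[j]:
                    sqNorm = 0
                    for k in range(len(a)):
                        sqNorm += a[k] != b[k]
                    ktot += np.exp(sqNorm * normFactor)
            R[i,j] = ktot
            R[j,i] = ktot

    return R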

Upvotes: 1
