Norhther

Reputation: 500

Improving performance when calculating a kernel matrix

I have the following code:

import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from functools import partial
import pandas as pd

# Tanimoto similarity: sum of element-wise minima over sum of
# element-wise maxima.
def tanimotoKernel(xs, ys):
    a = 0
    b = 0
    for x, y in zip(xs, ys):
        a += min(x, y)
        b += max(x, y)
    return a / b

# gammaExp = 1 / (np.exp(gamma) - 1), calculated once outside the kernel
def tanimotoLambdaKernel(xs, ys, gamma, gammaExp):
    return np.exp(gamma * tanimotoKernel(xs, ys) - 1) * gammaExp

# Wraps a kernel so SVC's callable kernel= argument receives a function
# that returns the full Gram matrix.
class GramBuilder:
    def __init__(self, Kernel):
        self._Kernel = Kernel
    def generateMatrixBuilder(self, X1, X2):
        gram_matrix = np.zeros((X1.shape[0], X2.shape[0]))
        for i, x1 in enumerate(X1):
            for j, x2 in enumerate(X2):
                gram_matrix[i, j] = self._Kernel(x1, x2)
        return gram_matrix

gammaList = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
CList = [0.001, 0.01, 0.1, 1, 10, 100]


X, y = datasets.load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y)

svc_list = [
    (svm.SVC(
        kernel=GramBuilder(
            partial(tanimotoLambdaKernel, gamma=x, gammaExp=1/(np.exp(x) - 1)))
        .generateMatrixBuilder),
     x)
    for x in gammaList
]

gammas   = []
Cs       = []
accuracy = []
for svc, gamma in svc_list:
    print("Training gamma ", gamma)
    clf = GridSearchCV(svc, {'C': CList}, verbose=1, n_jobs=-1)
    clf.fit(x_train, y_train)
    gammas.append(gamma)
    Cs.append(clf.best_params_['C'])
    accuracy.append(clf.best_score_)

For this toy dataset, I have to wait approximately 50 minutes to perform all the cross-validations in the loop.

The first improvement I made was to calculate gammaExp outside the function, which saves millions of exponential evaluations. Also, since multiplication is faster than division, I precomputed the inverse of the exponential minus one to save even more time.
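For context, before this change the kernel presumably divided by np.exp(gamma) - 1 on every call, roughly like the reconstruction below (the pre-optimization code is not shown above, so treat this as a sketch):

def tanimotoLambdaKernelNaive(xs, ys, gamma):
    # Hypothetical pre-optimization version: the exponential and the
    # division are re-evaluated on every single kernel call.
    return np.exp(gamma * tanimotoKernel(xs, ys) - 1) / (np.exp(gamma) - 1)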

With those modifications I improved the training time a lot; however, I still need it to be faster, so I would appreciate any ideas. Thanks.

Upvotes: 1

Views: 109

Answers (1)

Jérôme Richard

Reputation: 50278

You can use NumPy to speed up the min/max operations. Then you can use Numba's JIT to speed up the code even more by inlining the calls.

import numba as nb

@nb.njit
def tanimotoKernel(xs, ys):
    a = np.minimum(xs, ys).sum()
    b = np.maximum(xs, ys).sum()
    return a / b

@nb.njit
def tanimotoLambdaKernel(xs, ys, gamma, gammaExp):
    return np.exp(gamma * tanimotoKernel(xs, ys) - 1) * gammaExp

# [...]

The above code should be correct and is more than 20 times faster on my machine. It actually took only a few minutes to complete.

I think you can speed things up even more by removing the partial call and using Numba for the GramBuilder class too (look at the Numba documentation on JIT-compiling classes; partial functions are probably not supported, but you can store the values in the class and do that part of the job yourself). Moreover, note that many operations seem to be performed multiple times in the kernel. It is probably possible to compute them once: the kernel is called with the same x2 multiple times and recomputes the max again and again. A sketch of both ideas follows.
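As a minimal illustration of both suggestions (the names tanimoto_lambda_gram and make_tanimoto_kernel are mine, not from the question): the sketch below replaces partial and the GramBuilder class with a plain closure, JIT-compiles the whole double loop, and exploits the identity min(x, y) + max(x, y) == x + y so the per-pair maximum reduces to row sums computed once per matrix.

import numpy as np
import numba as nb
from sklearn import svm

@nb.njit(parallel=True)
def tanimoto_lambda_gram(X1, X2, gamma, gammaExp):
    # Row sums are computed once: since min(x, y) + max(x, y) == x + y
    # element-wise, sum(max(x, y)) == sum(x) + sum(y) - sum(min(x, y)),
    # so np.maximum never has to be evaluated per pair.
    s1 = X1.sum(axis=1)
    s2 = X2.sum(axis=1)
    gram = np.empty((X1.shape[0], X2.shape[0]))
    for i in nb.prange(X1.shape[0]):  # rows are independent, so parallelize
        for j in range(X2.shape[0]):
            a = np.minimum(X1[i], X2[j]).sum()
            t = a / (s1[i] + s2[j] - a)  # Tanimoto similarity
            gram[i, j] = np.exp(gamma * t - 1) * gammaExp
    return gram

def make_tanimoto_kernel(gamma):
    # Precompute the constant factor once per gamma, as in the question.
    gammaExp = 1 / (np.exp(gamma) - 1)
    return lambda X1, X2: tanimoto_lambda_gram(X1, X2, gamma, gammaExp)

svc = svm.SVC(kernel=make_tanimoto_kernel(0.01))

A closure is enough here because SVC only requires a callable that takes two arrays and returns the Gram matrix of shape (len(X1), len(X2)).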

Upvotes: 2
