Reputation: 75
I work with a data set of about 200k objects. Every object has 4 features. I classify them with K nearest neighbors (KNN) using the Euclidean metric, and the whole process finishes in about 20 seconds.
Lately I have a reason to use a custom metric, which will probably give better results. With the custom metric implemented, KNN ran for more than an hour and I didn't wait for it to finish.
I assumed the metric itself was the cause, so I replaced its body with return 1. KNN still ran for more than an hour. I then assumed the cause was the Ball Tree algorithm, but KNN with Ball Tree and the Euclidean metric finishes in about 20 seconds.
Right now I have no idea what's wrong. I use Python 3 and sklearn 0.17.1; the process never finishes with the custom metric. I also tried algorithm brute, but it has the same effect. Upgrading and downgrading scikit-learn had no effect. Running the classification with the custom metric on Python 2 didn't help either. I also implemented the metric (just return 1) in Cython, with the same result.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def custom_metric(x: np.ndarray, y: np.ndarray) -> float:
    return 1  # stub metric: the slowdown is clearly not in the metric's math

clf = KNeighborsClassifier(n_jobs=1, metric=custom_metric)
clf.fit(X, Y)
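For illustration, this is roughly how I compare the built-in metric with the callable above on a random 20k subsample of my data (idx, X_sub, Y_sub are just placeholders for this sketch; exact numbers vary, but the callable is dramatically slower):
import time

idx = np.random.RandomState(0).choice(len(X), 20000, replace=False)
X_sub, Y_sub = X[idx], Y[idx]

for metric in ('euclidean', custom_metric):
    clf = KNeighborsClassifier(n_jobs=1, metric=metric)
    start = time.time()
    clf.fit(X_sub, Y_sub)
    clf.predict(X_sub[:100])
    print(metric, time.time() - start)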
Can I speed up KNN classification with a custom metric?
Sorry if my English is not clear.
Upvotes: 2
Views: 3951
Reputation: 66
Sklearn is optimized and uses Cython and multiple processes to run as fast as possible. Writing pure Python code, especially when it gets called many times, is what slows your code down. I recommend writing your custom metric in Cython. There is a tutorial you can follow right here: https://www.sicara.fr/blog-technique/2017-07-05-fast-custom-knn-sklearn-cython
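To make the call overhead concrete, here is a small sketch (toy data and a hypothetical counting wrapper, not from the question) that counts how many times a Python callable metric gets invoked for one fit and ten predictions:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

calls = 0

def counting_metric(x, y):
    global calls
    calls += 1
    return np.sqrt(np.sum((x - y) ** 2))  # plain Euclidean, just paid per call in Python

rng = np.random.RandomState(0)
X = rng.rand(2000, 4)
y = rng.randint(0, 2, size=2000)

# n_jobs left at its default so the counter is not split across processes
clf = KNeighborsClassifier(metric=counting_metric)
clf.fit(X, y)
clf.predict(X[:10])
print(calls)  # roughly on the order of 10**4 Python-level calls even for this tiny sample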
Upvotes: 4
Reputation: 798
I used a dataset with ~5k train rows, ~20 features, and ~1k rows for inference (validation). I compared the following:
scipy correlation function: https://github.com/scipy/scipy/blob/v1.10.0/scipy/spatial/distance.py#L577-L624
numba:
import numba
import numpy as np
from numba import jit

@jit(nopython=True)
def corr_numba(u, v):
    # correlation distance: 1 - Pearson correlation between u and v
    umu = np.average(u)
    vmu = np.average(v)
    u = u - umu
    v = v - vmu
    uv = np.average(u * v)
    uu = np.average(np.square(u))
    vv = np.average(np.square(v))
    dist = 1.0 - uv / np.sqrt(uu * vv)
    return dist

corr_numba(np.array([0, 1, 1]), np.array([1, 0, 0]))  # 2.0
cython:
%%cython --annotate
import numpy as np
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

def corr(double[:] u, double[:] v):
    # correlation distance: 1 - Pearson correlation between u and v
    cdef u_sum = 0
    cdef v_sum = 0
    cdef uv_sum = 0
    cdef uu_sum = 0
    cdef vv_sum = 0
    cdef n_elems = u.shape[0]

    for i in range(n_elems):
        u_sum += u[i]
        v_sum += v[i]
    u_sum = u_sum / n_elems
    v_sum = v_sum / n_elems

    for i in range(n_elems):
        uv_sum += (u[i] - u_sum) * (v[i] - v_sum)
        uu_sum += (u[i] - u_sum) * (u[i] - u_sum)
        vv_sum += (v[i] - v_sum) * (v[i] - v_sum)
    uv_sum = uv_sum / n_elems
    vv_sum = vv_sum / n_elems
    uu_sum = uu_sum / n_elems

    dist = 1.0 - uv_sum / sqrt(uu_sum * vv_sum)
    return dist

corr(np.array([0., 1, 1]), np.array([1., 0, 0]))  # 2.0
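For reference, a rough sketch of how the three metrics are passed to KNeighborsClassifier for runs like the ones below (random toy arrays of the same shapes stand in for my data, and the exact benchmarking harness is omitted):
import time
import numpy as np
from scipy.spatial.distance import correlation
from sklearn.neighbors import KNeighborsClassifier

# toy stand-ins with the same shapes as my data: ~5k train rows, 20 features, ~1k validation rows
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(5000, 20), rng.randint(0, 2, size=5000)
X_val = rng.rand(1000, 20)

for metric in (correlation, corr_numba, corr):  # scipy / numba / cython versions from above
    clf = KNeighborsClassifier(metric=metric)
    t0 = time.time()
    clf.fit(X_train, y_train)
    train_time = time.time() - t0
    t0 = time.time()
    clf.predict(X_val)
    print('metric', metric, 'train_time:', round(train_time, 3), 'infer_time:', round(time.time() - t0, 3))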
The results:
metric <function correlation at 0x7f8b2373d7e0> 0.934, train_time: 0.001, infer_time: 182.970
metric CPUDispatcher(<function corr_numba at 0x7f8b1a028af0>) 0.934, train_time: 0.001, infer_time: 6.239
metric <built-in function corr> 0.934, train_time: 0.001, infer_time: 32.485
metric <function correlation at 0x7f8b2373d7e0> 0.935, train_time: 1.252, infer_time: 80.716
metric CPUDispatcher(<function corr_numba at 0x7f8b1a028af0>) 0.935, train_time: 0.049, infer_time: 2.336
metric <built-in function corr> 0.935, train_time: 0.249, infer_time: 12.725
Upvotes: 0
Reputation: 810
As rightly pointed out by @Réda Boumahdi, the cause is using a custom metric defined in Python. This is a known issue discussed here. It was closed as "wontfix" at the end of the discussion. So the only suggested solution is to write your custom metric in Cython to avoid the GIL, which is what slows things down when a Python metric is used.
Upvotes: 1