Reputation: 412
I am trying to calculate cosine similarity with NumPy functions. My arrays (current_cart and data_matrix) contain only 0 and 1, so I am using np.int8 as the data type. To speed up the calculation I am using Numba, but I am getting the following error.
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
My code:
from numba import jit, prange
from numpy.linalg import norm
from numpy import zeros, dot, float64

@jit(nopython=True)
def cosine_similarity(a,b):
    dt = dot(a,b)
    if(abs(dt)<=1e-10):
        return 0
    else:
        return dt/(norm(a)*norm(b))

@jit(nopython=True, parallel=True)
def calculate_similarity_parallel(multiple_item, single_item):
    n = multiple_item.shape[0]
    scores = zeros(shape=(n), dtype=float64)
    for i in prange(n):
        scores[i] = cosine_similarity(a=single_item, b=multiple_item[i])
    return scores

scores = calculate_similarity_parallel(
    multiple_item=data_matrix,
    single_item=current_cart
)
The data looks like this:
data_matrix = [[1 0 0 ... 0 0 0]
               [0 1 0 ... 0 0 0]
               [0 0 1 ... 0 0 0]
               ...
               [0 0 0 ... 0 0 0]
               [0 0 0 ... 0 0 0]
               [0 0 0 ... 0 0 0]]
current_cart = [1 1 0 ... 0 0 0]
The full error is below:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function dot at 0x7f88a907f040>) found for signature:
>>> dot(array(int8, 1d, C), array(int8, 1d, C))
There are 4 candidate implementations:
- Of which 4 did not match due to:
Overload in function '_OverloadWrapper._build.<locals>.ol_generated': File: numba/core/overload_glue.py: Line 131.
With argument(s): '(array(int8, 1d, C), array(int8, 1d, C))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<intrinsic stub>) found for signature:
>>> stub(array(int8, 1d, C), array(int8, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Intrinsic in function 'stub': File: numba/core/overload_glue.py: Line 35.
With argument(s): '(array(int8, 1d, C), array(int8, 1d, C))':
Rejected as the implementation raised a specific error:
TypingError: np.dot() only supported on float and complex arrays
raised from /home/ak/Desktop/recommendation/venv/lib/python3.9/site-packages/numba/core/typing/npydecl.py:970
During: resolving callee type: Function(<intrinsic stub>)
During: typing of call at <string> (3)
File "<string>", line 3:
<source missing, REPL/exec in use?>
raised from /home/ak/Desktop/recommendation/venv/lib/python3.9/site-packages/numba/core/typeinfer.py:1086
During: resolving callee type: Function(<function dot at 0x7f88a907f040>)
During: typing of call at /home/ak/Desktop/recommendation/./cf_api/utils/utils.py (7)
File "cf_api/utils/utils.py", line 7:
def cosine_similarity(a,b):
dt = dot(a,b)
^
During: resolving callee type: type(CPUDispatcher(<function cosine_similarity at 0x7f88a8706f70>))
During: typing of call at /home/ak/Desktop/recommendation/./cf_api/utils/utils.py (18)
During: resolving callee type: type(CPUDispatcher(<function cosine_similarity at 0x7f88a8706f70>))
During: typing of call at /home/ak/Desktop/recommendation/./cf_api/utils/utils.py (18)
File "cf_api/utils/utils.py", line 18:
def calculate_similarity_parallel(multiple_item, single_item):
<source elided>
for i in prange(n):
scores[i] = cosine_similarity(a=single_item, b=multiple_item[i])
^
Any idea how to solve this?
Upvotes: 3
Views: 647
Reputation: 50308
The error simply means that dot is not implemented for the int8 type (the same applies to norm, by the way), so you need to reimplement it. This is unfortunately quite common with Numba. That said, it is not a big deal here, since a basic loop does the job. It even has an upside: it makes you think about which accumulator type to choose. NumPy uses the array's own type by default (i.e. int8), which causes a sneaky hidden overflow in dot as soon as the result exceeds 127. The best type depends heavily on the size of the arrays (which is not provided) and on the input values. It also affects performance: smaller types are generally faster in such a case because of the potential use of SIMD instructions.
Additionally, no actual multiplication is needed: the input contains only binary values, so the dot product can be computed with logical ANDs. Note also that abs(dt)<=1e-10 does not make much sense, since the result must be an integer. Finally, the norm can be optimized as well: squaring a binary value leaves it unchanged, so there is no need to actually square the values.
import numba as nb
import numpy as np

@nb.njit('float64(int8[::1], int8[::1])')
def cosine_similarity(a,b):
    # Large safe integer type (int16 is less safe but certainly faster)
    dt = np.int32(0)
    for i in range(a.size):
        dt += a[i] & b[i]
    if dt == 0:
        return 0.0
    sa, sb = np.int32(0), np.int32(0)
    for i in range(a.size):
        sa += a[i]
        sb += b[i]
    return dt / np.sqrt(sa * sb)

@nb.njit('float64[:](int8[:,::1], int8[::1])', parallel=True)
def calculate_similarity_parallel(multiple_item, single_item):
    n = multiple_item.shape[0]
    scores = np.zeros(n, np.float64)
    for i in nb.prange(n):
        scores[i] = cosine_similarity(single_item, multiple_item[i])
    return scores

scores = calculate_similarity_parallel(
    multiple_item=data_matrix,
    single_item=current_cart
)
On my machine, this is several hundred times faster than the initial code with Numba disabled. Note that sa is recomputed for every row of data_matrix, which may be slower than computing it once.
Upvotes: 2