Reputation: 2025
I am profiling a numpy dot product call:
numpy.dot(pseudo, pseudo)
Here pseudo is a numpy array of custom objects, defined as:
pseudo = numpy.array(
[[PseudoBinary(1), PseudoBinary(0), PseudoBinary(1)],
[PseudoBinary(1), PseudoBinary(0), PseudoBinary(0)],
[PseudoBinary(1), PseudoBinary(0), PseudoBinary(1)]])
PseudoBinary is a class with a custom multiply function: it ORs instead of multiplying. See below for the complete code of the PseudoBinary definition.
Type:
(Pdb) pseudo.dtype
dtype('O')
According to my profiling, the pseudo dot product is about 500 times slower than a dot product using matrices with integer values. A pointer to the profiling code is given below.
I am interested in the reasons for the slowness and whether there are ways to mitigate them.
Some possible reasons for the slowness:
The memory layout of pseudo would not use contiguous memory. According to this, numpy stores pointers for object dtypes, so during matrix multiplication a long series of pointer dereferences may occur instead of direct reads from contiguous memory. (A quick check of this is sketched below.)
Numpy multiplication may not use the optimized internal compiled implementations (BLAS, ATLAS, etc.). According to this, various conditions must hold for the call to be dispatched to the optimized implementation, and using custom objects may break them.
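A minimal sketch (my addition, assuming the PseudoBinary class defined below) for checking the pointer-layout point from an interpreter:
import numpy

pseudo = numpy.array([[PseudoBinary(1), PseudoBinary(0)],
                      [PseudoBinary(0), PseudoBinary(1)]])

print(pseudo.dtype)      # dtype('O'): each element is a pointer to a Python object
print(pseudo.itemsize)   # 8 on a 64-bit build: only the PyObject* is stored inline
print(pseudo.flags['C_CONTIGUOUS'])  # True, but only the pointers are contiguous,
                                     # not the PseudoBinary instances they point to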
Are there other factors in play? Any recommendations for improvement?
The starting point of all this was this question. There, the OP is looking for a "custom dot product": an operation that visits the elements of two matrices like the dot product does, but does something other than multiplying the corresponding elements of columns and rows. In an answer, I recommended a custom object that overrides the __mul__ function. But numpy.dot performance is very slow with that approach. The code that does the performance measurement can be seen in that answer as well.
Code showing the PseudoBinary class and the dot product execution:
#!/usr/bin/env python
from __future__ import absolute_import
from __future__ import print_function

import numpy


class PseudoBinary(object):
    def __init__(self, i):
        self.i = i

    def __mul__(self, rhs):
        return PseudoBinary(self.i or rhs.i)
    __rmul__ = __mul__
    __imul__ = __mul__

    def __add__(self, rhs):
        return PseudoBinary(self.i + rhs.i)
    __radd__ = __add__
    __iadd__ = __add__

    def __str__(self):
        return "P" + str(self.i)
    __repr__ = __str__


base = numpy.array(
    [[1, 0, 1],
     [1, 0, 0],
     [1, 0, 1]])

pseudo = numpy.array(
    [[PseudoBinary(1), PseudoBinary(0), PseudoBinary(1)],
     [PseudoBinary(1), PseudoBinary(0), PseudoBinary(0)],
     [PseudoBinary(1), PseudoBinary(0), PseudoBinary(1)]])

baseRes = numpy.dot(base, base)
pseudoRes = numpy.dot(pseudo, pseudo)

print("baseRes\n", baseRes)
print("pseudoRes\n", pseudoRes)
Prints:
baseRes
[[2 0 2]
[1 0 1]
[2 0 2]]
pseudoRes
[[P3 P2 P2]
[P3 P1 P2]
[P3 P2 P2]]
Upvotes: 0
Views: 812
Reputation: 231335
dot is an outer product followed by a sum on one axis.
For your pseudo, dot is only marginally faster than the equivalent outer product followed by a sum:
In [18]: timeit (pseudo[:,:,None]*pseudo[None,:,:]).sum(axis=1)
75.7 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [19]: timeit np.dot(pseudo, pseudo)
63.9 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For base, dot is substantially faster than the equivalent:
In [20]: timeit (base[:,:,None]*base[None,:,:]).sum(axis=1)
13.9 µs ± 24.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [21]: timeit np.dot(base,base)
1.58 µs ± 53.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So with a numeric array, dot can pass the whole task to optimized compiled code (BLAS or other).
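As a side note (my addition, not part of the original answer), you can check which compiled backend a given numpy build can hand the work to; the output varies by installation:
import numpy
numpy.show_config()  # lists the BLAS/LAPACK libraries numpy was built against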
We can get a further idea of how object dtype affects speed by creating a numeric object array and comparing simple element-wise products:
In [28]: baso = base.astype(object)
In [29]: timeit base*base
766 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [30]: timeit baso*baso
2.45 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [31]: timeit pseudo*pseudo
13.7 µs ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using bitwise or (|) instead of *, we can calculate the same thing with base that the pseudo dot computes:
In [34]: (base[:,:,None] | base[None,:,:]).sum(axis=1)
Out[34]:
array([[3, 2, 2],
[3, 1, 2],
[3, 2, 2]], dtype=int32)
In [35]: timeit (base[:,:,None] | base[None,:,:]).sum(axis=1)
15.1 µs ± 492 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Upvotes: 1
Reputation: 280207
Pretty much anything you do with object arrays is going to be slow. None of the reasons NumPy is usually fast apply to object arrays.
Every element operation goes through a dynamically dispatched, Python-level call to __add__ and __mul__. NumPy also allocates a new PseudoBinary object, and a new __dict__ for that object, on every element __add__ or __mul__. Basically, every reason doing Python math without NumPy is slow also applies to doing anything with object arrays.
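A tiny illustration of that per-element allocation cost (my addition, assuming the PseudoBinary class from the question):
a, b = PseudoBinary(1), PseudoBinary(0)
c = a * b               # a Python-level call to PseudoBinary.__mul__
print(c is a, c is b)   # False False: every element operation allocates a fresh object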
As for how to improve your performance? Don't use object arrays. Use regular arrays, and either find a way to implement the thing you want in terms of the operations NumPy provides, or write out the loops explicitly and use something like Numba or Cython to compile your code.
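For this particular OR-instead-of-multiply dot, here is a hedged sketch of both routes; or_dot and or_dot_numba are hypothetical names, and the second route assumes numba is installed:
import numpy as np
from numba import njit

def or_dot(a, b):
    # Pure-NumPy route: broadcast, OR the element pairs, then sum over
    # the shared axis, mirroring the broadcasting expression shown above.
    return (a[:, :, None] | b[None, :, :]).sum(axis=1)

@njit
def or_dot_numba(a, b):
    # Explicit-loop route compiled by Numba: a textbook dot product
    # with | in place of *.
    n, k = a.shape
    m = b.shape[1]
    out = np.zeros((n, m), dtype=np.int64)
    for i in range(n):
        for j in range(m):
            s = 0
            for kk in range(k):
                s += a[i, kk] | b[kk, j]
            out[i, j] = s
    return out

base = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [1, 0, 1]])
print(or_dot(base, base))        # [[3 2 2] [3 1 2] [3 2 2]]
print(or_dot_numba(base, base))  # same result via compiled loops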
Upvotes: 1