Reputation: 417
I am trying to get this code running fast in Python, but I am having trouble getting it anywhere near the speed it runs at in MATLAB. The problem seems to be this for loop, which takes about 2 seconds to run when "SRpixels" is approximately 25000.
I can't seem to find any way to trim this down further, and I am looking for suggestions.
The datatypes for the numpy arrays below are float32 for all except RIGrid1_Location and RIGrid2_Location, which are uint32.
for j in range(0, SRpixels):
    # Skip data if outside valid range
    if (abs(SR_pointCloud[j,0]) > SR_xMax or SR_pointCloud[j,2] > SR_zMax or SR_pointCloud[j,2] < 0):
        pass
    else:
        RIGrid1_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid1Center) / gridSize)
        RIGrid1_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid1LowerBound) / gridSize)
        RIGrid1_Count[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += 1
        RIGrid1_Sum[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid1_SumofSquares[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]
        RIGrid2_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid2Center) / gridSize)
        RIGrid2_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid2LowerBound) / gridSize)
        RIGrid2_Count[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += 1
        RIGrid2_Sum[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid2_SumofSquares[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]
I did attempt to use Cython, where I replaced j with a cdef int j and compiled. There was no noticeable performance gain. Does anyone have suggestions?
Upvotes: 1
Views: 445
Reputation: 97331
Try vectorizing the calculation first. If you must do the calculation element by element, here are some hints to speed it up:
Calculations with NumPy scalars are much slower than with builtin scalars. array[i, j] returns a NumPy scalar, while array.item(i, j) returns a builtin scalar.
Functions in the math module are faster than their NumPy counterparts for scalar calculations.
Here is an example:
import numpy as np
import math
a = np.array([[1.1, 2.2, 3.3],[4.4, 5.5, 6.6]])
%timeit np.floor(a[0,0]*2)
%timeit math.floor(a[0,0]*2)
%timeit np.floor(a.item(0,0)*2)
%timeit math.floor(a.item(0,0)*2)
output:
100000 loops, best of 3: 10.2 µs per loop
100000 loops, best of 3: 3.49 µs per loop
100000 loops, best of 3: 6.49 µs per loop
1000000 loops, best of 3: 851 ns per loop
So changing np.floor to math.floor, and SR_pointCloud[j,0] to SR_pointCloud.item(j,0), will speed up the loop a lot.
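As a rough sketch of how that might look for a loop shaped like the one in the question: the helper name accumulate_loop, its parameters, and the toy data are all made up for illustration, and the offsets are assumed to keep every grid index non-negative and inside the grid.

```python
import math
import numpy as np

def accumulate_loop(points, x_max, z_max, x_offset, y_offset, grid_size, shape):
    """Bin points into a 2-D grid, using builtin scalars inside the loop."""
    count = np.zeros(shape, dtype=np.float32)
    total = np.zeros(shape, dtype=np.float32)
    item = points.item  # look the method up once, outside the loop
    for j in range(points.shape[0]):
        # .item() returns plain Python floats instead of NumPy scalars
        x = item(j, 0)
        y = item(j, 1)
        z = item(j, 2)
        # Skip data outside the valid range
        if abs(x) > x_max or z > z_max or z < 0:
            continue
        # math.floor on builtin floats; int() keeps the index an integer
        row = int(math.floor((x + x_offset) / grid_size))
        col = int(math.floor((z + y_offset) / grid_size))
        count[row, col] += 1
        total[row, col] += y
    return count, total

# Toy point cloud (hypothetical values): columns are x, y, z
points = np.array([[0.5, 1.0, 0.5],
                   [0.6, 2.0, 0.7],
                   [9.0, 5.0, 0.5],   # skipped: |x| > x_max
                   [1.5, 3.0, 2.5]], dtype=np.float32)

count, total = accumulate_loop(points, x_max=4.0, z_max=4.0,
                               x_offset=0.0, y_offset=0.0,
                               grid_size=1.0, shape=(4, 4))
```

Computing row and col once per point and reusing them also avoids writing to the location array and reading it back for every update.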
Upvotes: 1
Reputation: 9888
Vectorization is almost always the best way to speed up numpy code, and much of this seems vectorizable. To start, for example, the location arrays seem quite simple to do:
# these are all of your j values
inds = np.arange(0,SRpixels)
# these are the j values you don't want to skip
sel = np.invert((abs(SR_pointCloud[inds,0]) > SR_xMax) | (SR_pointCloud[inds,2] > SR_zMax) | (SR_pointCloud[inds,2] < 0))
RIGrid1_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid1Center) / gridSize)
RIGrid1_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid1LowerBound) / gridSize)
RIGrid2_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid2Center) / gridSize)
RIGrid2_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid2LowerBound) / gridSize)
This has no Python loop.
The rest are trickier and will depend upon what you are doing, but should also be vectorizable if you think about them in this way.
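One possible way to vectorize the count/sum accumulation is np.add.at, which performs an unbuffered in-place add, so cells that several points fall into are accumulated once per point (a plain grid[rows, cols] += 1 would only count each repeated cell once). This is only a sketch: bin_points, its parameters, and the toy data are hypothetical, and it assumes the offsets keep every index non-negative and inside the grid.

```python
import numpy as np

def bin_points(points, x_max, z_max, x_offset, y_offset, grid_size, shape):
    """Vectorized count, sum and sum of squares of y per grid cell."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Boolean mask of the points that are inside the valid range
    sel = ~((np.abs(x) > x_max) | (z > z_max) | (z < 0))
    # Integer grid indices for the selected points only
    rows = np.floor((x[sel] + x_offset) / grid_size).astype(np.intp)
    cols = np.floor((z[sel] + y_offset) / grid_size).astype(np.intp)
    heights = y[sel]

    count = np.zeros(shape, dtype=np.float32)
    total = np.zeros(shape, dtype=np.float32)
    sum_sq = np.zeros(shape, dtype=np.float32)
    # Unbuffered adds: repeated (row, col) pairs each contribute
    np.add.at(count, (rows, cols), 1)
    np.add.at(total, (rows, cols), heights)
    np.add.at(sum_sq, (rows, cols), heights * heights)
    return count, total, sum_sq

# Toy point cloud (hypothetical values): columns are x, y, z
cloud = np.array([[0.5, 1.0, 0.5],
                  [0.6, 2.0, 0.7],
                  [9.0, 5.0, 0.5],   # skipped: |x| > x_max
                  [1.5, 3.0, 2.5]], dtype=np.float32)

count_v, total_v, sum_sq_v = bin_points(cloud, x_max=4.0, z_max=4.0,
                                        x_offset=0.0, y_offset=0.0,
                                        grid_size=1.0, shape=(4, 4))
```

Calling the same function once per grid, with that grid's own offsets, would cover both RIGrid1 and RIGrid2. Whether it beats the loop for your sizes is worth timing rather than assuming.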
If you really have something that can't be vectorized and must be done with a loop (I've only had this happen a few times), I'd suggest Weave over Cython. It's harder to use, but should give speeds comparable to C.
Upvotes: 5