Reputation: 33
I'm trying to get started with pyOpenCL and GPGPU in general.
For the below dot product code I'm getting fairly different results between the GPU and CPU versions. What am I doing wrong?
A difference of ~0.5% seems too large for floating-point error to account for. The difference also seems to grow with array size (~9e-8 relative difference at an array size of 10000). Maybe it's an issue with combining results across blocks...? Either way, color me disconcerted.
I don't know if it matters: I'm running this on a MacBook Air, Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz, with Intel HD Graphics 5000.
Thanks in advance.
import pyopencl as cl
import numpy
from pyopencl.reduction import ReductionKernel
import pyopencl.clrandom as cl_rand
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
dot = ReductionKernel(ctx,
        dtype_out=numpy.float32,
        neutral="0",
        map_expr="x[i]*y[i]",
        reduce_expr="a+b",
        arguments="__global const float *x, __global const float *y"
)
x = cl_rand.rand(queue, 100000000, dtype = numpy.float32)
y = cl_rand.rand(queue, 100000000, dtype = numpy.float32)
x_dot_y = dot(x,y).get() # GPU: array(25001304.0, dtype=float32)
x_dot_y_cpu = numpy.dot(x.get(), y.get()) # CPU: 24875690.0
print(abs(x_dot_y_cpu - x_dot_y)/x_dot_y) # 0.0050496689740063489
Upvotes: 3
Views: 966
Reputation: 9925
The order in which values are reduced will likely be very different between these two methods. Across large data sets, the tiny errors in floating point rounding can soon add up. There could also be other details about the underlying implementations that affect the precision of the result.
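To see how much the order of accumulation alone can matter, here is a small, self-contained sketch (not related to either library's actual reduction strategy) using values chosen so the effect is deterministic. In float32 the spacing between adjacent representable numbers near 1e8 is 8.0, so adding 1.0 to 1e8 is simply absorbed:

```python
import numpy as np

big = np.float32(1e8)
ones = np.full(16, 1.0, dtype=np.float32)

# Left-to-right accumulation: each +1.0 is absorbed, because the
# float32 spacing near 1e8 is 8.0 and 1e8 + 1 rounds back to 1e8.
seq = big
for v in ones:
    seq = np.float32(seq + v)

# Group the small values first, then add the large one.
grouped = np.float32(np.float32(ones.sum()) + big)

print(seq)      # 100000000.0 -- the sixteen 1.0s vanished
print(grouped)  # 100000016.0
```

A tree-style reduction (as a GPU kernel typically performs) effectively groups values before combining them, so its rounding behaviour can differ systematically from a sequential loop over the same data.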
I've run your example code on my own machine and get a similar sort of difference in the final result (~0.5%). As a data point, you can implement a very simple dot product in raw Python and see how much that differs from both the OpenCL result and from Numpy.
For example, you could add something simple like this to your example:
x_dot_y_naive = sum(a*b for a,b in zip(x.get(), y.get()))
Here's the results I get on my machine:
OPENCL: 25003466.000000
NUMPY: 24878146.000000 (0.5012%)
NAIVE: 25003465.601387 (0.0000%)
As you can see, the naive implementation is closer to the OpenCL result than Numpy is. One explanation could be that Numpy's dot function makes use of fused multiply-add (FMA) operations, which change how intermediate results are rounded. Without compiler options telling it otherwise, OpenCL should comply fully with the IEEE-754 standard rather than using the faster FMA operations.
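The FMA effect can be imitated in pure Python (this is an emulation for illustration, not Numpy's actual code path): for float32 inputs, the product a*b is exact in float64, so computing the whole expression in float64 and rounding once at the end behaves like a fused multiply-add, while rounding the product to float32 first behaves like separate multiply and add instructions. The constants below are chosen so the two results differ:

```python
import numpy as np

a = np.float32(1.0 + 2.0**-12)
c = np.float32(-1.0)

# Separate multiply then add: the product is rounded to float32 first,
# losing the 2**-24 term of (1 + 2**-12)**2 = 1 + 2**-11 + 2**-24.
separate = np.float32(np.float32(a * a) + c)

# Emulated FMA: the float64 product is exact for float32 inputs,
# so the result is rounded only once, at the final conversion.
fused = np.float32(np.float64(a) * np.float64(a) + np.float64(c))

print(separate)  # 2**-11
print(fused)     # 2**-11 + 2**-24
```

One rounding step versus two is a tiny difference per element, but across a hundred million multiply-adds those per-element discrepancies are exactly the kind of thing that accumulates into the ~0.5% gap you're seeing.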
Upvotes: 2