Reputation: 491
I have the following OpenCL code, using the C++ Wrapper and Intel's OpenCL toolkit:
#include <Eigen/StdVector>
...
typedef float Sample_t;
typedef std::vector<Sample_t, Eigen::aligned_allocator<Sample_t> > SampleArray;
...
SampleArray data(ns * nt);
...
mdata = cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(Sample_t) * data.size());
queue.enqueueWriteBuffer(mdata, CL_FALSE, 0, sizeof(Sample_t) * data.size(), &data[0]);
When it is compiled with the -O3, -march=native and -mtune=native flags, it crashes with the following segmentation fault, coming from TBB code:
__memcpy_sse2_unaligned() at memcpy-sse2-unaligned.S:116 0x7ffff6e64ba4
Without any optimizations, the program runs fine.
I traced the problem to the queue.enqueueWriteBuffer call; without it I have no problems whatsoever.
I have tried to comment out portions of the code that modify the variable "data", in case I was accessing invalid memory positions, but the problem persists.
If I remove the aligned_allocator from the std::vector, the build without optimizations also starts to crash.
In total I have 70MB that I'm trying to store in this buffer, far less than the 3.8GB reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE. But if I reduce the size of the array, the problem goes away. The size that I tried in this latter case was 5.
I also decided to print the address allocated by the vector and it is 0x7f21b797f010, so it is aligned to at least 16 bytes.
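The same alignment check can be done in code instead of by reading the hex digits. This is only a minimal sketch; is_aligned_address and is_aligned_ptr are hypothetical helper names used for illustration, not part of Eigen or OpenCL, and the alignment is assumed to be a power of two:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if `address` is a multiple of `alignment`.
// Assumption: `alignment` is a power of two (16, 32, 64, ...).
inline bool is_aligned_address(std::uint64_t address, std::uint64_t alignment) {
    return (address & (alignment - 1)) == 0;
}

// Same check for a real pointer, e.g. is_aligned_ptr(&data[0], 16).
inline bool is_aligned_ptr(const void* ptr, std::size_t alignment) {
    return is_aligned_address(reinterpret_cast<std::uintptr_t>(ptr), alignment);
}
```

For the address printed above, is_aligned_address(0x7f21b797f010ULL, 16) is true, while the same call with 32 is false, so the buffer is 16-byte aligned but not 32-byte aligned.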
EDIT: Regarding multithreading, the creation of the array and the OpenCL operations happen in the same method, in the main thread. The command queue was not created with asynchronous flags, and there is a flush() operation after the buffer write.
What could be the problem?
Thank you
Upvotes: 1
Views: 804
Reputation: 23438
As confirmed in the conversation in comments, the problem here is that the enqueueWriteBuffer() operation is non-blocking (CL_FALSE passed as the blocking argument) and that the source buffer (SampleArray vector) goes out of scope before the underlying copy operation is guaranteed to have completed.
There are at least 4 possible solutions, including:
- Use a blocking enqueueWriteBuffer(). As the documentation indicates, the source buffer will not be accessed once the function returns in that case.
- Wait on the write's event with clWaitForEvents() or call clFinish() before the SampleArray goes out of scope. This is only really preferable to the blocking variant if your program is doing anything substantial in the interim.
- Avoid enqueueWriteBuffer(): create a buffer with a NULL source, map it into your application's memory space, write the data to it, then unmap it. This potentially avoids copying altogether, at least on integrated GPUs/APUs.
These are roughly in increasing order of parallelism/efficiency.
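To make the lifetime rule behind the blocking write and the wait-before-scope-exit options concrete, here is a small self-contained model. It is a sketch under stated assumptions and not OpenCL code: FakeQueue, upload_blocking and upload_then_finish are hypothetical names, the "device buffer" is just a std::vector<float>, and deferred copies are run in finish() rather than by a real runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an OpenCL command queue; NOT the real API.
// A non-blocking write only records the source pointer, and the bytes are
// copied later in finish(), so the source must stay alive until then.
struct FakeQueue {
    struct Pending {
        void* dst;
        const void* src;
        std::size_t bytes;
    };
    std::vector<Pending> pending;

    // `blocking` plays the role of CL_TRUE / CL_FALSE.
    void enqueueWriteBuffer(std::vector<float>& dst, bool blocking,
                            std::size_t bytes, const void* src) {
        if (blocking) {
            std::memcpy(dst.data(), src, bytes);  // copied before returning
        } else {
            pending.push_back(Pending{dst.data(), src, bytes});  // copied later
        }
    }

    // Plays the role of clFinish(): completes every deferred copy.
    void finish() {
        for (const Pending& p : pending) {
            std::memcpy(p.dst, p.src, p.bytes);
        }
        pending.clear();
    }
};

// Blocking write: the source may be destroyed right after the call.
inline std::vector<float> upload_blocking(FakeQueue& queue) {
    std::vector<float> device(4, 0.0f);
    {
        std::vector<float> data = {1.0f, 2.0f, 3.0f, 4.0f};
        queue.enqueueWriteBuffer(device, true,
                                 data.size() * sizeof(float), data.data());
    }  // data goes out of scope here, after the copy has happened
    return device;
}

// Non-blocking write, then wait before the source goes out of scope.
inline std::vector<float> upload_then_finish(FakeQueue& queue) {
    std::vector<float> device(4, 0.0f);
    {
        std::vector<float> data = {1.0f, 2.0f, 3.0f, 4.0f};
        queue.enqueueWriteBuffer(device, false,
                                 data.size() * sizeof(float), data.data());
        queue.finish();  // the deferred copy runs while data is still alive
    }
    return device;
}
```

In the question's code, the first variant corresponds to passing CL_TRUE instead of CL_FALSE as the second argument of queue.enqueueWriteBuffer(), and the second to calling queue.finish() before data is destroyed. Note that flush() only submits the queued commands; it does not wait for them to complete.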
Upvotes: 3