Caian

Reputation: 491

OpenCL's enqueueWriteBuffer causes __memcpy_sse2_unaligned segmentation fault

I have the following OpenCL code, using the C++ Wrapper and Intel's OpenCL toolkit:

#include <Eigen/StdVector>

...

typedef float Sample_t;
typedef std::vector<Sample_t, Eigen::aligned_allocator<Sample_t> > SampleArray;

...

SampleArray data(ns * nt);

...

mdata = cl::Buffer(context, CL_MEM_READ_ONLY, sizeof(Sample_t) * data.size());
queue.enqueueWriteBuffer(mdata, CL_FALSE, 0, sizeof(Sample_t) * data.size(), &data[0]);

When it is compiled with the -O3, -march=native and -mtune=native flags, it causes the following segmentation fault coming from TBB code:

__memcpy_sse2_unaligned() at memcpy-sse2-unaligned.S:116 0x7ffff6e64ba4

Without any optimizations, the program runs fine.

I traced the problem to the queue.enqueueWriteBuffer call; without it I have no problems whatsoever.

I have tried to comment out portions of the code that modify the variable "data", in case I was accessing invalid memory positions, but the problem persists.

If I remove the aligned_allocator from the std::vector, the build without optimizations also starts to crash.

In total I have 70MB that I'm trying to store in this buffer, far less than the 3.8GB reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE. But if I reduce the size of the array, the problem stops. The size that I tried in this latter case was 5.

I also decided to print the address allocated by the vector and it is 0x7f21b797f010, so it is aligned to at least 16 bytes.

EDIT: Regarding multithreading, the creation of the array and the OpenCL operations happen in the same method, in the main thread. The command queue was not created with asynchronous flags, and there is a flush() operation after the buffer write.

What could be the problem?

Thank you

Upvotes: 1

Views: 804

Answers (1)

pmdj

Reputation: 23438

As confirmed in the conversation in comments, the problem here is that the enqueueWriteBuffer() operation is non-blocking (CL_FALSE passed as the blocking argument) and that the source buffer (SampleArray vector) goes out of scope before the underlying copy operation is guaranteed to have completed.

There are at least 4 possible solutions:

  1. Use the blocking form of enqueueWriteBuffer(). As the documentation indicates, the source buffer will not be accessed once the function returns in that case.
  2. Capture the event returned by the write and wait on it (clWaitForEvents(), or event.wait() in the C++ wrapper), or call clFinish() (queue.finish() in the C++ wrapper), before the SampleArray goes out of scope. This is only really preferable to the blocking variant if your program is doing anything substantial in the interim.
  3. Keep the source data alive until the write has completed, for example by giving the vector the same lifetime as the command queue.
  4. Don't use the copying form of enqueueWriteBuffer(): create a buffer with a NULL host pointer, map it into your application's memory space, write the data to it, then unmap it. This potentially avoids copying altogether, at least on integrated GPUs/APUs.

These are roughly in increasing order of parallelism/efficiency.
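
To make the lifetime rule behind options 2 and 3 concrete, here is a minimal sketch that uses only the standard library, so it runs without an OpenCL device. It is a toy model, not the OpenCL API: ToyQueue, PendingWrite and upload are hypothetical names invented for this illustration. enqueue_write() stands in for a non-blocking enqueueWriteBuffer() (it only records the source pointer), and finish() stands in for cl::CommandQueue::finish() or cl::Event::wait() (the copy actually happens there).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical record of one queued copy: nothing is copied when it is created.
struct PendingWrite {
    float*       dst;
    const float* src;
    std::size_t  count;
};

// Toy stand-in for a command queue (an assumption for illustration only).
class ToyQueue {
public:
    // Models a non-blocking write: remembers the pointers and returns at once.
    void enqueue_write(float* dst, const float* src, std::size_t count) {
        assert(dst != nullptr && src != nullptr);
        pending_.push_back({dst, src, count});
    }

    // Models finish(): performs every recorded copy. Each source buffer must
    // still be alive at this point, otherwise memcpy reads freed memory.
    void finish() {
        for (const PendingWrite& w : pending_)
            std::memcpy(w.dst, w.src, w.count * sizeof(float));
        pending_.clear();
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<PendingWrite> pending_;
};

// Copies a locally scoped vector into a "device" buffer and drains the queue
// before the local vector is destroyed.
std::vector<float> upload(ToyQueue& queue, std::size_t count) {
    std::vector<float> device(count, 0.0f);
    {
        std::vector<float> data(count, 1.5f);   // source buffer, local scope
        queue.enqueue_write(device.data(), data.data(), count);
        queue.finish();                         // drain while `data` is alive
    }                                           // `data` is destroyed here
    return device;
}
```

Calling upload(queue, 4) on a fresh ToyQueue returns a buffer filled from the source. If the queue.finish() line were moved below the inner closing brace, the memcpy would read from an already destroyed vector, which is the same kind of use-after-free that the non-blocking write in the question runs into.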

Upvotes: 3
