selena731

Reputation: 73

CL_MEM_ALLOC_HOST_PTR slower than CL_MEM_USE_HOST_PTR

I've been experimenting with OpenCL and measuring the speed of memory transfers between host and device, using the Intel OpenCL SDK on an Intel i5 processor with integrated graphics. I found that clEnqueueMapBuffer was almost 10 times faster than clEnqueueWriteBuffer when using pinned memory, like so:

int amt = 16*1024*1024;
...
k_a = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, a, NULL);
k_b = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, b, NULL);
k_c = clCreateBuffer(context,CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, sizeof(int)*amt, ret, NULL);

int* map_a = (int*) clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_b = (int*) clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_c = (int*) clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
clFinish(c_q); 

Here a, b and ret are 128-bit-aligned int arrays. The time came out to about 22.026186 ms, compared to 198.604528 ms using clEnqueueWriteBuffer. However, when I changed my code to

k_a = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
k_b = clCreateBuffer(context,CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);
k_c = clCreateBuffer(context,CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, NULL);

int* map_a = (int*)clEnqueueMapBuffer(c_q, k_a, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_b = (int*)clEnqueueMapBuffer(c_q, k_b, CL_TRUE, CL_MAP_READ, 0, sizeof(int)*amt, 0, NULL, NULL, &error);
int* map_c = (int*)clEnqueueMapBuffer(c_q, k_c, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &error);

/** initiate map_a and map_b **/

the time increases to 91.350065 ms.
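
To put the three timings on a common scale, it can help to convert them into effective throughput. The sketch below is plain C with no OpenCL dependency. The helper names `buffer_bytes` and `throughput_gib_per_s` are made up for illustration, and the usage note assumes each timing covers all three 16M-element int buffers, which the post does not actually state.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one buffer holding `count` ints. */
size_t buffer_bytes(size_t count)
{
    return count * sizeof(int);
}

/* Effective throughput in GiB/s for `bytes` moved in `ms` milliseconds. */
double throughput_gib_per_s(size_t bytes, double ms)
{
    const double gib = 1024.0 * 1024.0 * 1024.0;
    assert(ms > 0.0);
    return ((double)bytes / gib) / (ms / 1000.0);
}
```

With amt = 16*1024*1024 and 4-byte ints, each buffer is 64 MiB. Under the assumption above, `throughput_gib_per_s(3 * buffer_bytes(amt), ms)` works out to roughly 8.5 GiB/s for the 22.026186 ms run, 0.94 GiB/s for the 198.604528 ms run, and 2.05 GiB/s for the 91.350065 ms run.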

What could be the problem? Or is it a problem at all?

EDIT: This is how I initialize the arrays in the second code:

for (int i = 0; i < amt; i++)
{
    map_a[i] = i;
    map_b[i] = i;
}

Now that I check, map_a and map_b do contain the right elements at the end of the program, but map_c contains all 0's. I unmap the buffers like this:

clEnqueueUnmapMemObject(c_q, k_a, map_a, 0, NULL, NULL);
clEnqueueUnmapMemObject(c_q, k_b, map_b, 0, NULL, NULL);
clEnqueueUnmapMemObject(c_q, k_c, map_c, 0, NULL, NULL);

and my kernel is just

__kernel void test(__global int* a, __global int* b, __global int* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
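
One way to confirm whether the kernel actually produced output is to compare what is read back through map_c against a host-side reference. A minimal sketch in plain C follows. `count_mismatches` is a hypothetical helper, not part of OpenCL, and it assumes the output buffer is mapped for reading only after the kernel has finished.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for the `test` kernel: counts the elements where
   c[i] differs from a[i] + b[i]. Returns 0 when the output matches. */
size_t count_mismatches(const int *a, const int *b, const int *c, size_t n)
{
    size_t bad = 0;
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        if (c[i] != a[i] + b[i]) {
            bad++;
        }
    }
    return bad;
}
```

With a[i] = b[i] = i, an all-zero output buffer mismatches at every index except i = 0, so a result of n - 1 would suggest that nothing written by the kernel is visible through the pointer being inspected.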

Upvotes: 7

Views: 6355

Answers (2)

Austin

Reputation: 1020

My understanding is that CL_MEM_ALLOC_HOST_PTR allocates but doesn't copy. Does the 2nd block of code actually get any data onto the device?

Also, a buffer created by clCreateBuffer with CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR shouldn't require clEnqueueWriteBuffer, as the buffer is created with the memory pointed to by void *host_ptr.

Using "pinned" memory in OpenCL should be a process like:

    int amt = 16*1024*1024;
    int *Array = (int *)malloc(sizeof(int)*amt);  //Ordinary host memory (needs <stdlib.h>)
    cl_int Error = 0;

    //Note, since we are passing NULL for the data pointer, CL_MEM_USE_HOST_PTR and
    //CL_MEM_COPY_HOST_PTR are not allowed; CL_MEM_ALLOC_HOST_PTR asks the runtime
    //to allocate host-accessible memory for the buffer
    cl_mem B1 = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(int)*amt, NULL, &Error);

    //Map the buffer into the host address space to get a pointer to the pinned memory
    int *host_ptr = (int *)clEnqueueMapBuffer(queue, B1, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(int)*amt, 0, NULL, NULL, &Error);

    //Copy from host memory to the pinned host memory (needs <string.h>)
    memcpy(host_ptr, Array, sizeof(int)*amt);

    //Unmap before launching the kernel so that the device sees the data
    clEnqueueUnmapMemObject(queue, B1, host_ptr, 0, NULL, NULL);

    //Call your kernel and everything else, then map the buffer again and memcpy
    //the pinned data back to host memory when you are done

Edit: One final thing you can do to speed up the program is to make the memory reads/writes (and maps) non-blocking by passing CL_FALSE instead of CL_TRUE. Just make sure to call clFinish() before the data is accessed on the host, so that the command queue is emptied and all commands are processed. With a non-blocking map in particular, the returned pointer must not be used until the map command has completed.

Source: OpenCL In Action

Upvotes: 1

Dithermaster

Reputation: 6333

With the right combination of flags, you should be able to achieve "zero copy" (i.e. very fast) map/unmap on Intel Integrated Graphics. No "CPU to GPU" copy is needed because both use the same memory (that's what "Integrated" means). Read the memory section of the Intel OpenCL Optimization Guide.
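
As a rough illustration of what this tends to involve for CL_MEM_USE_HOST_PTR: Intel's guidance for its integrated GPUs, as best I recall it, is that the host pointer be aligned to a 4096-byte page and the buffer size be a multiple of the 64-byte cache line. Treat both numbers as assumptions to check against the guide. The helper names below are hypothetical, and the code is plain C11 with no OpenCL calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZC_PAGE_ALIGN 4096u   /* assumed pointer alignment requirement */
#define ZC_SIZE_MULT    64u   /* assumed size granularity */

/* Round `n` up to the next multiple of `m` (m must be non-zero). */
size_t round_up(size_t n, size_t m)
{
    assert(m != 0);
    return (n + m - 1) / m * m;
}

/* Returns 1 if ptr/size satisfy the assumed zero-copy constraints. */
int zero_copy_candidate(const void *ptr, size_t size)
{
    return ptr != NULL
        && ((uintptr_t)ptr % ZC_PAGE_ALIGN) == 0
        && (size % ZC_SIZE_MULT) == 0;
}

/* Allocate at least `bytes` bytes of page-aligned host memory, intended
   as a candidate for clCreateBuffer with CL_MEM_USE_HOST_PTR. The size is
   rounded up to a whole page because aligned_alloc (C11) expects a
   multiple of the alignment. Release the result with free(). */
void *alloc_zero_copy_candidate(size_t bytes, size_t *actual)
{
    size_t padded = round_up(bytes, ZC_PAGE_ALIGN);
    void *p = aligned_alloc(ZC_PAGE_ALIGN, padded);
    if (p != NULL && actual != NULL) {
        *actual = padded;
    }
    return p;
}
```

If those constraints hold, the question's 128-bit (16-byte) aligned arrays would fall short of them, whereas a pointer from `alloc_zero_copy_candidate(sizeof(int)*amt, &size)` would pass `zero_copy_candidate`. Whether the driver then actually avoids the copy is still something to measure.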

Upvotes: 0
