Reputation: 3931
I would like to execute some code on different cores of my GPU (Apple MacBook) in C++/OpenCL. I found the code below, which seems to execute code on only a single core. How can I adapt it to run the code on multiple cores?
int gpu = 1;
cl_device_id device_id;
cl_context context;
cl_command_queue commands;
cl_program program;
cl_kernel kernel;
cl_mem input, output;
size_t global, local;
cl_int err;

// Get a device ID (a NULL platform works on Apple's implementation;
// other platforms need a cl_platform_id from clGetPlatformIDs())
clGetDeviceIDs(NULL, gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
// Create a compute context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
// Create a command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);
// Build the program executable
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Create the compute kernel in the program we wish to run
kernel = clCreateKernel(program, "square", &err);
// Create the input and output arrays in device memory for our calculation
input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);
// Write our data set into the input array in device memory
clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);
// Set the arguments to our compute kernel
clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);
// Get the maximum work group size for executing the kernel on the device
clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
// Execute the kernel over the entire range of our 1d input data set,
// using the maximum number of work group items for this device
// (in OpenCL 1.x, global must be a multiple of local, or pass NULL for local)
global = count;
clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
// Wait for the command queue to be serviced before reading back results
clFinish(commands);
Upvotes: 0
Views: 165
Reputation: 5754
The host-side C++ code that launches OpenCL kernels is almost always single-threaded, and a single CPU core is more than enough to feed the GPU with kernels and keep the queue full at all times. Using several host threads to enqueue kernels faster makes little sense. The parallelism happens on the device: a single clEnqueueNDRangeKernel call already distributes the `global` work-items across all of the GPU's cores, so the code you posted is not limited to one core.
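For reference, the "square" kernel that this host code expects typically looks like the sketch below (the actual KernelSource is not shown in the question, so this is an assumption). Each work-item processes exactly one array element, and the OpenCL runtime spreads the work-items over the device's cores:

```c
// Hypothetical kernel matching the host code's "square" entry point.
// Each work-item squares the single element selected by its global ID.
__kernel void square(__global float* input,
                     __global float* output,
                     const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)                      // guard against padded global sizes
        output[i] = input[i] * input[i];
}
```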
If you want to run OpenCL kernels on a multi-core CPU instead of the GPU, just select the CPU as the OpenCL device. OpenCL code on a CPU always runs on all cores. While the CPU likely has more memory available, expect it to be much slower than even an integrated GPU.
For a simple start with OpenCL and C++, have a look at this open-source OpenCL-Wrapper.
Upvotes: 2