Reputation: 1128
I wrote a small OpenCL application which calculates the product of two matrices. Now I've noticed that if the size of the matrix exceeds 8192 x 8192 there is a significant performance drop (calculation for a 16384 x 16384 is ~80 times slower) and even the serial implementation is over 5 times faster. Here is the host code:
/*Make some includes and definitions here*/
#include "stdafx.h"
#include <CL/cl.hpp>
#include <vector>
#include <iostream>
#include "util.hpp" // utility library
#define __CL_ENABLE_EXCEPTIONS
#define ROWS (16384) // ROWS of vectors a, b, and c
#define COLUMNS (16384)
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"
/*Start main()*/
int main(void)
{
int A;
// Fill vectors X and Y with random float values
float* h_x = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
}
}
float* h_y = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
}
}
float* h_s = new float[ROWS*COLUMNS];
for (int i = 0; i < ROWS; ++i){
for (int j = 0; j < COLUMNS; ++j){
h_s[j + i*COLUMNS] = 0.0;
}
}
/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
// Get all platforms (drivers)
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
if (all_platforms.size() == 0){ // Check for issues
std::cout << " No platforms found. Check OpenCL installation!\n";
exit(1);
}
cl::Platform default_platform = all_platforms[0];
std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";
// Get default device of the default platform
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
if (all_devices.size() == 0){ // Check for issues
std::cout << " No devices found. Check OpenCL installation!\n";
exit(1);
}
cl::Device default_device = all_devices[0];
std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";
// Create an OpenCL context
cl::Context context({ default_device });
cl::Program program(context, util::loadProgram("saxy_kernel.cl"), true);
if (program.build({ default_device }) != CL_SUCCESS){
std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
getchar();
exit(1);
}
// create buffers on the device
cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));
//create queue to which we will push commands for the device.
cl::CommandQueue queue(context, default_device);
//write arrays A and B to the device
queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);
StartCounter();
//run the kernel
cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
kernel_add.setArg(0, buffer_X);
kernel_add.setArg(1, buffer_Y);
kernel_add.setArg(2, buffer_S);
kernel_add.setArg(3, buffer_A);
cl::NDRange global(ROWS*COLUMNS);
queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
queue.finish();
std::cout << "Kernel execution time: " << GetCounter() << "ms \n";
//read result C from the device to array C
queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);
/*Print vectors
std::cout << "\nMatrix #1: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_x[i] << "\t ";
}
std::cout << "\n\nMatrix #2: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_y[i] << "\t ";
}
std::cout << "\n\nResult: \n";
for (int i = 0; i<ROWS*COLUMNS; i++){
std::cout << "" << h_s[i] << "\t ";
}*/
getchar();
return 0;
}
and here is the kernel:
__kernel void kernel simple_add(
__global float* X,
__global float* Y,
__global float* S,
__global int *A){
S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];
}
Could you please explain me the reason? I know that I can achieve much better performance if I perform some algorithm optimizations, but I'm trying to figure out if this is the threshold of the "naive" implementation, or I'm doing something wrong (incorrect assignment of the work to groups).
EDIT: Because I was asked for in comments, the GPU I'm running the kernel is an AMD R9 270/2GB RAM. The CPU is an i7-4771 and the system has 8GB RAM.
Upvotes: 1
Views: 3666
Reputation: 129364
Writing an answer about "how to do more calculations per thread" because code-formatting is non-existent in comments, and also covering a little on memory usage...
So, most OpenCL implementatins will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit weaved from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers have their own ideas of how to make a very efficient unit, and they all tend to change their mind as time flows too... ;)
To do a little more work in one thread you could simply do:
#define NUM_PER_THREAD 16
__kernel void kernel simple_add(
__global float* X,
__global float* Y,
__global float* S,
__global int *A)
{
for(i = 0; i < NUM_PER_THREAD; i++)
{
size_t index = get_global_id(0)*NUM_PER_THREAD + i;
S[index] = X[index] * Y[index];
}
}
[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but can be done if you know the size (width) of the matrix]
Regarding memory: GPU's that have dedicated local memory (in other words most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches:
Note that I still (in the last hour or so) haven't got any particular insight in how the AMD/ATI GPU's work, or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, understanding of how OpenCL works in general, and calculating the memory needed to store the three different arrays of 16K x 16K using float
.
Upvotes: 2