Reputation: 11
The problem below is fixed in NVIDIA's new driver release 331.xx, which is currently available as a beta driver.
Thanks for all your comments!
I have a multi-platform application that does many fragment operations and GPGPU work on OpenGL textures. The application makes heavy use of GL/CL interop: each texture may be bound to an OpenCL image and manipulated with CL kernels.
The problem is that the application runs fast on AMD cards, on both Linux and Windows. On NVIDIA cards it runs fast on Linux, but very slowly on Windows 7. The problem seems to be enqueueAcquireGLObjects and enqueueReleaseGLObjects. I have created a minimal sample that demonstrates the bad performance by simply acquiring the two shared CL images, releasing them again, and calling finish().
Results (mean time for executing acquire, release, finish)
I have tried several different NVIDIA drivers, from the older 295.73 up to the current 326.80 beta; all show the same behaviour.
My question now is: is the NVIDIA driver seriously broken, or am I doing something wrong here? The code runs fast on Linux, so it cannot be a general problem with NVIDIA's OpenCL support. The code runs fast on AMD + Windows, so it cannot be a problem of my code not being optimized for Windows. Optimizing the code by, for example, making the CL images read-only or write-only is pointless, since the performance hit is almost a factor of 30.
Below you can find the relevant code of my test case; I could provide the full source code, too.
relevant code for context creation
{ // initialize GLEW
glewInit();
}
{ // initialize CL context, sharing the GL context
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);
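// note: platforms[0] is used blindly below; on a system with several OpenCL
// platforms installed, the platform that owns the current GL context would
// have to be selected instead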
cl_context_properties cps[] = {
CL_GL_CONTEXT_KHR,(cl_context_properties)wglGetCurrentContext(),
CL_WGL_HDC_KHR,(cl_context_properties)wglGetCurrentDC(),
CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0]()),
0};
std::vector<cl::Device> devices;
platforms[0].getDevices((cl_device_type)CL_DEVICE_TYPE_GPU, &devices);
context_ = new cl::Context(devices, cps, NULL, this);
queue_ = new cl::CommandQueue(*context_, devices[0]);
}
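If device-side timings are wanted later on, the queue could also be created with profiling enabled; a minimal sketch (not part of my test case):

// create the command queue with profiling enabled, so that events can report
// CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END timestamps
queue_ = new cl::CommandQueue(*context_, devices[0], CL_QUEUE_PROFILING_ENABLE);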
relevant code for creating textures and sharing CL images
width_ = 1600;
height_ = 1200;
float *data = new float[ width_ * height_ * 4 ];
textures_.resize(2);
glGenTextures(2, textures_.data());
for (int i=0;i<2;i++) {
glBindTexture(GL_TEXTURE_2D, textures_[i]);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
// "data" pointer holds random/uninitialized data, do not care in this example
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width_,height_, 0, GL_RGBA, GL_FLOAT, data);
}
delete[] data;
{ // create shared CL Images
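// note: cl::Image2DGL is deprecated in favour of cl::ImageGL in the
// OpenCL 1.2 C++ bindings, hence the #ifdef below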
#ifdef CL_VERSION_1_2
clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
clImages_.push_back(cl::ImageGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#else
clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[0]));
clImages_.push_back(cl::Image2DGL(*context_, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, textures_[1]));
#endif
}
relevant code for one acquire, release, finish cycle
try {
queue_->enqueueAcquireGLObjects( &clImages_ );
queue_->enqueueReleaseGLObjects( &clImages_ );
queue_->finish();
} catch (cl::Error &e) {
std::cout << e.what() << std::endl;
}
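A measurement loop for the mean cycle time could look roughly like this (a sketch using std::chrono on the host side; the iteration count is arbitrary and not taken from my actual benchmark):

#include <chrono>
#include <iostream>

// measure the mean wall-clock time of one acquire/release/finish cycle (sketch)
const int iterations = 100; // arbitrary
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iterations; ++i) {
    queue_->enqueueAcquireGLObjects( &clImages_ );
    queue_->enqueueReleaseGLObjects( &clImages_ );
    queue_->finish();
}
auto end = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(end - start).count();
std::cout << "mean cycle time: " << ms / iterations << " ms" << std::endl;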
Upvotes: 1
Views: 1714
Reputation: 2565
I'm going to assume that, since you are using OpenGL, you display something on the screen after the OpenCL computation.
Based on that assumption, my first thought would be to check in the NVIDIA Control Panel whether VSync is enabled and, if it is, to disable it and retest.
As far as I recall, the default vsync settings are different for AMD and NVIDIA, which would explain the difference between the two GPUs.
Just in case, here is a post that explains how vsync can slow down the rendering.
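If you want to rule out vsync directly from code as well, something like the following should work with GLEW (a sketch; wglSwapIntervalEXT comes from WGL_EXT_swap_control and is usable after glewInit() with a current context):

#include <GL/glew.h>
#include <GL/wglew.h>

// disable vsync on Windows via WGL_EXT_swap_control (sketch)
if (WGLEW_EXT_swap_control) {
    wglSwapIntervalEXT(0); // 0 = swap immediately, i.e. vsync off
}

Keep in mind that the driver control panel setting can still override what the application requests.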
Upvotes: 1