Dredok
Dredok

Reputation: 805

What is the best strategy to overlap kernel execution and data transfers in a GTX Titan card?

When I try to overlap data transfers and kernel execution It seems like the card is executing all memory transfers in-order, no matter what stream I use.

So, If I issue the following:

The MemcpyA_HtoD_2 will wait till the MemcpyA_DtoH_1 is completed. So no overlapping is achieved. No matter what config of streams I use, the Memcpy operations are always issued in order. So the only way for achieving overlapping involves buffering the outputs or delaying the output transfer till the next iteration.

I use CUDA 5.5, windows 7 x64 and a GTX Titan. All cpu memory is pinned and data_transfers are done using the async version.

See the following screens with the behavior:

issuing, host_to_device -> kernel -> device_to_host (normal behavior) and can not get overlap.

non overlapping

issuing host_to_device -> kernel (avoiding device_to_host after kernel) gets overlap ... because all memory copies are executed in-order, no matter what stream configuration I try.

overlapping

UPDATE

If someone is interested in reproducing this issue, I have coded a synthetic program that shows this undesired behavior. Its a complete VS2010 solution using CUDA 5.5

VS2010 Streams Not Working link

Could someone execute this on linux for testing overlapping?

#include "cuda_runtime.h"
#include "device_launch_parameters.h"


#include <stdio.h>
#define N 1024*1024

__global__ void someKernel(int *d_in, int *d_out) {
    for (int i = threadIdx.x; i < threadIdx.x + 1024; i++) {
        d_out[i] = d_in[i];
    }
}

int main () {
    int *h_bufferIn[100];
    int *h_bufferOut[100];
    int *d_bufferIn[100];
    int *d_bufferOut[100];

    //allocate some memory
    for (int i = 0; i < 100; i++) {
        cudaMallocHost(&h_bufferIn[i],N*sizeof(int));
        cudaMallocHost(&h_bufferOut[i],N*sizeof(int));
        cudaMalloc(&d_bufferIn[i], N*sizeof(int));
        cudaMalloc(&d_bufferOut[i], N*sizeof(int));
    }

    //create cuda streams
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    //trying to overlap computation and memcpys
    for (int i = 0; i < 100; i+=2) {
        cudaMemcpyAsync(d_bufferIn[i], h_bufferIn[i], N*sizeof(int), cudaMemcpyHostToDevice, st[i%2]);
        someKernel<<<1,256, 0, st[i%2]>>>(d_bufferIn[i], d_bufferOut[i]);
        cudaMemcpyAsync(h_bufferOut[i], d_bufferOut[i], N*sizeof(int), cudaMemcpyDeviceToHost, st[i%2]);
        cudaStreamQuery(0);

        cudaMemcpyAsync(d_bufferIn[i+1], h_bufferIn[i+1], N*sizeof(int), cudaMemcpyHostToDevice, st[(i+1)%2]);
        someKernel<<<1,256, 0, st[(i+1)%2]>>>(d_bufferIn[i+1], d_bufferOut[i+1]);
        cudaMemcpyAsync(h_bufferOut[i+1], d_bufferOut[i+1], N*sizeof(int), cudaMemcpyDeviceToHost, st[(i+1)%2]);
        cudaStreamQuery(0);
    }
    cudaDeviceSynchronize();
}

Upvotes: 4

Views: 1188

Answers (1)

Aperture Laboratories
Aperture Laboratories

Reputation: 264

TL;DR: The issue is caused by the WDDM TDR delay option in Nsight Monitor! When set to false, the issue appears. Instead, if you set the TDR delay value to a very high number, and the "enabled" option to true, the issue goes away.

Read below for other (older) steps followed until i came to the solution above, and some other possible causes.

I just recently were able to partially solve this problem! It is specific to windows and aero i think. Please try these steps and post your results to help others! I have tried it on GTX 650 and GT 640.

Before you do anything, consider using both onboard gpu(as display) and the discrete gpu (for computations), because there are verified issues with the nvidia driver for windows! When you use onboard gpu, said drivers don't get fully loaded, so many bugs are evaded. Also, system responsiveness is maintained while working!

  1. Make sure your concurrency problem is not related to other issues like old drivers (including bios) etc.
  2. Go to computer>properties
  3. Select advanced system settings on the left side
  4. Go to the Advanced tab
  5. On Performance click settings
  6. In the Visual Effects tab, select the "adjust for best performance" bullet.

This will disable aero and almost all visual effects. If this configuration works, you can try enabling one-by-one the boxes for visual effects until you find the precise one that causes problems!

Alternatively, you can:

  1. Right click on desktop, select personalize
  2. Select a theme from basic themes, that doesn't have aero.

This will also work as the above, but with more visual options enabled. For my two devices, this setting also works, so i kept it.

Please, when you try these solutions, come back here and post your findings!

For me, it solved the problem for most cases (a tiled dgemm i have made),but NOTE THAT i still can't run "simpleStreams" properly and achieve concurrency...

UPDATE: The problem is fully solved with a new windows installation!! The previous steps improved the behavior for some cases, but ONLY a fresh install solved ALL the problems!

I will try to find a less radical way of solving this problem, maybe restoring just the registry will be enough.

Upvotes: 1

Related Questions