Bub Espinja

Reputation: 4571

Concurrent execution of two processes sharing a Tesla K20

I have been seeing strange behaviour when I launch two instances of a kernel so that they run at the same time while sharing the GPU's resources.

I have developed a CUDA kernel that is meant to run on a single SM (multiprocessor), with each thread performing an operation many times inside a loop.

The kernel launches only one block, and therefore uses only one SM.

simple.cu

#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <helper_cuda.h>
using namespace std;

__global__ void increment(float *in, float *out)
{
    int it=0, i = blockIdx.x * blockDim.x + threadIdx.x;
    float a=0.8525852f;

    // repeat a cheap floating point operation many times to keep the SM busy
    for(it=0; it<99999999; it++)
        out[i] += (in[i]+a)*a-(in[i]+a);
}

int main( int argc, char* argv[])
{
    int i;
    int nBlocks = 1;
    int threadsPerBlock = 1024;
    float *A, *d_A, *d_B, *B;
    size_t size = 1024*13;   // element count (floats), not bytes

    A = (float *) malloc(size * sizeof(float));
    B = (float *) malloc(size * sizeof(float));

    for(i=0;i<size;i++){
            A[i]=0.74;
            B[i]=0.36;
    }

    cudaMalloc((void **) &d_A, size * sizeof(float));
    cudaMalloc((void **) &d_B, size * sizeof(float));

    // cudaMemcpy sizes are in bytes
    cudaMemcpy(d_A, A, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size * sizeof(float), cudaMemcpyHostToDevice);

    increment<<<nBlocks,threadsPerBlock>>>(d_A, d_B);

    cudaDeviceSynchronize();

    cudaMemcpy(B, d_B, size * sizeof(float), cudaMemcpyDeviceToHost);

    free(A);
    free(B);

    cudaFree(d_A);
    cudaFree(d_B);

    cudaDeviceReset();

    return (0);
}

So if I execute the kernel:

time ./simple

I get

real    0m36.659s
user    0m4.033s
sys     0m1.124s

However, if I execute two instances:

time ./simple & time ./simple

For each process, I get:

real    1m12.417s
user    0m29.494s
sys     0m42.721s

real    1m12.440s
user    0m36.387s
sys     0m8.820s

As far as I know, the two executions should run concurrently and finish in about the same time as one (about 36 seconds). Instead, each one takes twice the base time. The GPU has 13 SMs, and each kernel creates only 1 block, so each kernel should occupy a single SM of its own.

Are they being executed in the same SM?

Shouldn't they be running concurrently on different SMs?

EDITED

To make myself clearer, I am attaching the profiles of the concurrent execution, obtained with nvprof:

[nvprof timeline: first instance of simple.cu]

[nvprof timeline: second instance of simple.cu]

Now, I would like to show the behaviour of the same scenario, but running two instances of the matrixMul sample concurrently:

[nvprof timeline: first matrixMul instance]

[nvprof timeline: second matrixMul instance]

As you can see, in the first scenario one kernel waits for the other to finish, whereas in the second scenario (matrixMul) kernels from both contexts are running at the same time.

Thank you.

Upvotes: 0

Views: 228

Answers (1)

talonmies

Reputation: 72342

When you run two separate processes using the same GPU, each has its own context. CUDA doesn't support having multiple contexts active on the same device simultaneously. Instead, each context competes for the device in an undefined manner, with driver-level context switching. That is why the execution behaves as if the processes were serialised -- effectively they are, but at the driver rather than the GPU level.

There are technologies available (MPS, Hyper-Q) which can do what you want, but the way you are trying to do this won't work.
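
For contrast, two one-block launches issued from a single process into different streams share one context, and on a Kepler device the hardware is free to schedule them on different SMs at the same time. A minimal sketch along those lines (illustrative, adapted from the code in the question, not your exact program):

#include <cuda_runtime.h>

__global__ void increment(float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;

    // same busy loop as in the question
    for (int it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}

int main(void)
{
    float *d_in1, *d_out1, *d_in2, *d_out2;
    size_t bytes = 1024 * sizeof(float);
    cudaStream_t s1, s2;

    cudaMalloc((void **) &d_in1,  bytes);
    cudaMalloc((void **) &d_out1, bytes);
    cudaMalloc((void **) &d_in2,  bytes);
    cudaMalloc((void **) &d_out2, bytes);
    cudaMemset(d_in1, 0, bytes);  cudaMemset(d_out1, 0, bytes);
    cudaMemset(d_in2, 0, bytes);  cudaMemset(d_out2, 0, bytes);

    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // two one-block launches in different streams of the same context:
    // the scheduler can place them on different SMs concurrently
    increment<<<1, 1024, 0, s1>>>(d_in1, d_out1);
    increment<<<1, 1024, 0, s2>>>(d_in2, d_out2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_in1); cudaFree(d_out1);
    cudaFree(d_in2); cudaFree(d_out2);
    return 0;
}

Profiling that single process should show the two kernels overlapping; MPS achieves the same effect for separate processes by funnelling their launches through one shared context.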


Edit to respond to the update in your question:

The example you have added using the matrixMul sample doesn't show what you think it does. That application runs 300 short kernels and computes a performance number from the average of those 300 runs. Your profiling display is set to a very coarse timescale, so it looks like there is a single long-running kernel launch when in fact it is a series of very short kernels.
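
You can confirm this by asking nvprof for a per-kernel GPU trace rather than relying on the timeline zoom level, for example:

nvprof --print-gpu-trace ./matrixMul

which prints one line per kernel launch with its start time and duration.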

To illustrate this, consider the following:

This is a normal profiling run for a single matrixMul process running on a Kepler device. Note that there are many individual kernels running directly after one another.

[nvprof timeline: a single matrixMul process]

These are the profiling traces of two simultaneous matrixMul processes running on the same Kepler device:

[nvprof timeline: first matrixMul process]

[nvprof timeline: second matrixMul process]

Note that there are gaps in the profile trace of each process; this is where context switching between the two processes is occurring. The behaviour is identical to your original example, just at a much finer time granularity. As has been repeated a number of times by several different people in the course of this discussion -- CUDA doesn't support multiple contexts on the same device simultaneously using the standard runtime API. The MPS server does allow this by adding a daemon which reimplements the API with a large shared internal Hyper-Q pipeline, but you are not using it, and it has no bearing on the results you have shown in this question.
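
For reference, enabling MPS on Linux amounts to starting the control daemon before launching the client processes. A rough sketch (the device index is illustrative; the exact procedure, permissions, and environment variables are described in the MPS documentation):

export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # set compute mode (requires root)
nvidia-cuda-mps-control -d              # start the MPS control daemon
./simple & ./simple                     # clients now share one device context
echo quit | nvidia-cuda-mps-control     # shut the daemon down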

Upvotes: 3
