Bub Espinja

Reputation: 4571

Concurrent execution of two processes sharing a Tesla K20

I have been seeing strange behaviour when I launch two instances of a kernel so that they run at the same time while sharing the GPU's resources.

I have developed a CUDA kernel that is meant to run on a single SM (multiprocessor), with each thread performing an operation many times inside a loop.

The kernel launches only one block, and therefore uses only one SM.

simple.cu

#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <helper_cuda.h>
using namespace std;

__global__ void increment(float *in, float *out)
{
    int it=0, i = blockIdx.x * blockDim.x + threadIdx.x;
    float a=0.8525852f;

    // repeat a cheap floating point operation many times to keep the SM busy
    for(it=0; it<99999999; it++)
        out[i] += (in[i]+a)*a-(in[i]+a);
}

int main( int argc, char* argv[])
{
    int i;
    int nBlocks = 1;
    int threadsPerBlock = 1024;
    float *A, *d_A, *d_B, *B;
    size_t size = 1024*13;   // element count (floats), not bytes

    A = (float *) malloc(size * sizeof(float));
    B = (float *) malloc(size * sizeof(float));

    for(i=0;i<size;i++){
            A[i]=0.74;
            B[i]=0.36;
    }

    cudaMalloc((void **) &d_A, size * sizeof(float));
    cudaMalloc((void **) &d_B, size * sizeof(float));

    // cudaMemcpy sizes are in bytes
    cudaMemcpy(d_A, A, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size * sizeof(float), cudaMemcpyHostToDevice);

    increment<<<nBlocks,threadsPerBlock>>>(d_A, d_B);

    cudaDeviceSynchronize();

    cudaMemcpy(B, d_B, size * sizeof(float), cudaMemcpyDeviceToHost);

    free(A);
    free(B);

    cudaFree(d_A);
    cudaFree(d_B);

    cudaDeviceReset();

    return (0);
}

So if I execute the kernel:

time ./simple

I get

real    0m36.659s
user    0m4.033s
sys     0m1.124s

However, if I execute two instances:

time ./simple & time ./simple

For each process, I get:

real    1m12.417s
user    0m29.494s
sys     0m42.721s

real    1m12.440s
user    0m36.387s
sys     0m8.820s

As far as I know, the two executions should run concurrently and finish in about the same time as one (about 36 seconds). Instead, each one takes twice the base time. The GPU has 13 SMs, and each kernel creates only 1 block, so each kernel should occupy a single SM of its own.

Are they being executed in the same SM?

Shouldn't they be running concurrently on different SMs?

EDITED

To make myself clearer, I am attaching the profiles of the concurrent execution, obtained with nvprof:

[nvprof timeline: first instance of simple.cu]

[nvprof timeline: second instance of simple.cu]

Now, I would like to show the behaviour of the same scenario, but running two instances of the matrixMul sample concurrently:

[nvprof timeline: first matrixMul instance]

[nvprof timeline: second matrixMul instance]

As you can see, in the first scenario one kernel waits for the other to finish, whereas in the second scenario (matrixMul) kernels from both contexts are running at the same time.

Thank you.

Upvotes: 0

Views: 228

Answers (1)

talonmies

Reputation: 72342

When you run two separate processes using the same GPU, each has its own context. CUDA doesn't support having multiple contexts active on the same device simultaneously. Instead, each context competes for the device in an undefined manner, with driver-level context switching. That is why the execution behaves as if the processes were serialised -- effectively they are, but at the driver rather than the GPU level.

There are technologies available (MPS, Hyper-Q) which can do what you want, but the way you are trying to do this won't work.
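
For contrast, two one-block launches issued from a single process into different streams share one context, and on a Kepler device the hardware is free to schedule them on different SMs at the same time. A minimal sketch along those lines (illustrative, adapted from the code in the question, not your exact program):

#include <cuda_runtime.h>

__global__ void increment(float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 0.8525852f;

    // same busy loop as in the question
    for (int it = 0; it < 99999999; it++)
        out[i] += (in[i] + a) * a - (in[i] + a);
}

int main(void)
{
    float *d_in1, *d_out1, *d_in2, *d_out2;
    size_t bytes = 1024 * sizeof(float);
    cudaStream_t s1, s2;

    cudaMalloc((void **) &d_in1,  bytes);
    cudaMalloc((void **) &d_out1, bytes);
    cudaMalloc((void **) &d_in2,  bytes);
    cudaMalloc((void **) &d_out2, bytes);
    cudaMemset(d_in1, 0, bytes);  cudaMemset(d_out1, 0, bytes);
    cudaMemset(d_in2, 0, bytes);  cudaMemset(d_out2, 0, bytes);

    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // two one-block launches in different streams of the same context:
    // the scheduler can place them on different SMs concurrently
    increment<<<1, 1024, 0, s1>>>(d_in1, d_out1);
    increment<<<1, 1024, 0, s2>>>(d_in2, d_out2);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_in1); cudaFree(d_out1);
    cudaFree(d_in2); cudaFree(d_out2);
    return 0;
}

Profiling that single process should show the two kernels overlapping; MPS achieves the same effect for separate processes by funnelling their launches through one shared context.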


Edit to respond to the update in your question:

The example you have added using the matrixMul sample doesn't show what you think it does. That application runs 300 short kernels and computes a performance number from the average of those 300 runs. Your profiling display is set to a very coarse timescale, so it looks like there is a single long-running kernel launch when in fact it is a series of very short kernels.
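
You can confirm this by asking nvprof for a per-kernel GPU trace rather than relying on the timeline zoom level, for example:

nvprof --print-gpu-trace ./matrixMul

which prints one line per kernel launch with its start time and duration.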

To illustrate this, consider the following:

This is a normal profiling run for a single matrixMul process running on a Kepler device. Note that there are many individual kernels running directly after one another.

[nvprof timeline: a single matrixMul process]

These are the profiling traces of two simultaneous matrixMul processes running on the same Kepler device:

[nvprof timeline: first matrixMul process]

[nvprof timeline: second matrixMul process]

Note that there are gaps in the profile trace of each process; this is where context switching between the two processes is occurring. The behaviour is identical to your original example, just at a much finer time granularity. As has been repeated a number of times by several different people in the course of this discussion -- CUDA doesn't support multiple contexts on the same device simultaneously using the standard runtime API. The MPS server does allow this by adding a daemon which reimplements the API with a large shared internal Hyper-Q pipeline, but you are not using it, and it has no bearing on the results you have shown in this question.
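
For reference, enabling MPS on Linux amounts to starting the control daemon before launching the client processes. A rough sketch (the device index is illustrative; the exact procedure, permissions, and environment variables are described in the MPS documentation):

export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # set compute mode (requires root)
nvidia-cuda-mps-control -d              # start the MPS control daemon
./simple & ./simple                     # clients now share one device context
echo quit | nvidia-cuda-mps-control     # shut the daemon down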

Upvotes: 3
