theKunz

Reputation: 443

nvprof not picking up any API calls or kernels

I'm trying to get some benchmark timings for my CUDA program with nvprof, but it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right and found one on the NVIDIA developer blog here:

https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

Code:

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    return 0;
}

Command line:

-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test

So I replicated it word for word, line by line, and ran it with identical command-line arguments. Unfortunately my result was the same:

-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.

==85454== API calls:
No API activities were profiled. 

I am running CUDA Toolkit 7.5.

If anyone knows what I'm doing wrong, I'd be grateful to know the answer.

-----EDIT-----

So I modified the code to be:

#include<cuda_profiler_api.h>

int main()
{
    cudaProfilerStart();
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaProfilerStop();
    return 0;
}

Unfortunately it did not change things.

Upvotes: 2

Views: 5082

Answers (2)

Kratos

Reputation: 339

It's a bug with unified memory profiling. Running nvprof with the flag

nvprof --unified-memory-profiling off ./profile_test

resolves the problem for me.

Upvotes: 7

Grzegorz Szpetkowski

Reputation: 37904

You need to call cudaProfilerStop() (for the Runtime API) before the application exits. This allows nvprof to collect all the necessary data.

According to the CUDA documentation:

To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
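As a minimal sketch of that advice applied to the program from the question: synchronize first, then call cudaProfilerStop(), and check every return value so that a failing runtime call is not mistaken for a profiler problem. The CHECK macro is a hypothetical helper introduced here for illustration; it is not part of the original post or of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// CHECK is a hypothetical helper (an assumption, not from the original post):
// it prints the CUDA error string and exits if a runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);

    int *h_a = (int*)malloc(bytes);
    if (h_a == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    memset(h_a, 0, bytes);

    int *d_a = NULL;
    CHECK(cudaMalloc((void**)&d_a, bytes));

    // Same two transfers as in the question.
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));

    // Wait for all GPU work to finish, then flush buffered profile data.
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaProfilerStop());

    CHECK(cudaFree(d_a));
    free(h_a);
    return 0;
}
```

It can be built and run the same way as in the question (nvcc profile.cu -o profile_test, then nvprof ./profile_test). If one of the calls fails, the error message should point at the real cause rather than leaving an empty profile.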

Upvotes: 2
