skm

Reputation: 5659

Random Occupancy values returned by the "cudaOccupancyMaxActiveBlocksPerMultiprocessor"

I am trying to understand the usage and benefit of the “cudaOccupancyMaxActiveBlocksPerMultiprocessor” method.

I am using a slightly modified version of the sample program from the NVIDIA developer forum. Basically, I ask the user to provide the size of the array.

My GPU: NVIDIA GeForce GTX 1070

QUESTIONS:

(screenshot of the questions and program output; image not reproduced here)

SAMPLE CODE:

Source.cpp

#include "kernel_header.cuh"

#include <algorithm>
#include <iostream>

using namespace std;

int main(int argc, char* argv[])
{
    int N;
    int userSize = 0;

    //ask size to user
    cout << "\n\nType the size of 1D Array: " << endl;
    cin >> userSize;

    N = userSize > 0 ? userSize : 1024; //<<<<<<<<<<<<<<<-------PROBLEM

    int* array = (int*)calloc(N, sizeof(int));
    for (int i = 0; i < N; i++)
    {
        array[i] = i + 1;
        //cout << "i = " << i << " is " << array[i]<<endl;
    }

    launchMyKernel(array, N);

    free(array);


    return 0;
}

kernel_header.cuh

#ifndef KERNELHEADER
#define KERNELHEADER

void launchMyKernel(int* array, int arrayCount);

#endif

kernel.cu

#include "stdio.h"
#include "cuda_runtime.h"

__global__ void MyKernel(int* array, int arrayCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayCount)
    {
        array[idx] *= array[idx];
    }
}

void launchMyKernel(int* array, int arrayCount)
{
    int blockSize;   // The launch configurator returned block size 
    int minGridSize; // The minimum grid size needed to achieve the 
                     // maximum occupancy for a full device launch 
    int gridSize;    // The actual grid size needed, based on input size 

    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);

    // Round up according to array size 
    gridSize = (arrayCount + blockSize - 1) / blockSize;

    MyKernel<<<gridSize, blockSize>>>(array, arrayCount);

    cudaDeviceSynchronize();

    // calculate theoretical occupancy
    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
        MyKernel, blockSize,
        0);

    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);

    float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
        (float)(props.maxThreadsPerMultiProcessor /
            props.warpSize);


    printf("\n\nMax. Active blocks found: %d\nOur Kernel block size decided: %d\nWarp Size: %d\nNumber of threads per SM: %d\n\n\n\n", maxActiveBlocks
        , blockSize,
        props.warpSize,
        props.maxThreadsPerMultiProcessor);

    printf("Launched blocks of size %d. Theoretical occupancy: %f\n",
        blockSize, occupancy);
}

Upvotes: 0

Views: 251

Answers (1)

Robert Crovella

Reputation: 151944

Before asking others for help with a CUDA code that is not working the way you expect, I strongly encourage you to:

  1. Use proper CUDA error checking
  2. Run your code with a sanitizer, such as cuda-memcheck or compute-sanitizer

Even if you don't understand the results, the information reported will be useful for those trying to help you.
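For reference, a commonly used error-checking pattern (a conventional sketch, not code from the question; the macro name is my own) looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; report the error string, file, and line on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
        }                                                                 \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaDeviceSynchronize());
// and after a kernel launch (launches themselves return no status):
//   MyKernel<<<gridSize, blockSize>>>(array, arrayCount);
//   CUDA_CHECK(cudaGetLastError());
```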

In your case, you are doing something illegal with your kernel. Specifically, you have passed it host pointers (the one returned by calloc is a host pointer). You pretty much can't use such a pointer in CUDA (i.e. for CUDA device code), and this is a basic CUDA programming principle. To understand one method to structure such a code, so that your kernel can actually do something useful, please refer to the vectorAdd CUDA sample code.
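The structure that sample uses (sketched here under the question's names `array`, `N`, `gridSize`, `blockSize`; this is the pattern, not the sample's exact code) is to allocate a separate device buffer and copy explicitly:

```cuda
// Allocate device memory, copy input over, launch, copy the result back.
int* d_array;
cudaMalloc(&d_array, N * sizeof(int));
cudaMemcpy(d_array, array, N * sizeof(int), cudaMemcpyHostToDevice);

MyKernel<<<gridSize, blockSize>>>(d_array, N);

cudaMemcpy(array, d_array, N * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_array);
```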

When your kernel attempts to use this host pointer, it makes illegal accesses. At least in my case, when I enter 2048 for the data size and implement proper CUDA error checking, I observe that the kernel and all subsequent CUDA activity return an error code, including your call to cudaOccupancyMaxActiveBlocksPerMultiprocessor. That means the call is not doing what you expect, and the data it returns is garbage.

So that is at least one reason why you are getting garbage calculation values.

When I fix that issue (e.g. by replacing the calloc with a suitably designed call to cudaMallocManaged), then your code for me reports an occupancy calculation of 1.0, for input data sizes of 512, 1024, and 2048. So there is no variability that I can see, and at best, if you still have questions, I think you would need to restate them (in a new question).
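One way to make that replacement in main() (a sketch; the answer does not show its exact code) is:

```cuda
// Managed memory is accessible from both host and device,
// so the rest of the program can stay as-is.
int* array;
cudaMallocManaged(&array, N * sizeof(int));
for (int i = 0; i < N; i++)
    array[i] = i + 1;

launchMyKernel(array, N);

cudaFree(array);   // replaces free(array)
```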

I'm not suggesting that if you fix this, everything will be fine. But this problem is obscuring any ability to make useful analysis.

Upvotes: 3
