bqm1111

Reputation: 63

Calling a __host__ __device__ function inside a __global__ function causes overhead

This is a follow-up question to this thread. My __global__ function contains only a single call to the API Geoditic2ECEF(GPS gps). It took 35 ms to execute the __global__ function with that single call. However, if I write the entire body of __host__ __device__ Geoditic2ECEF(GPS gps) inline in the __global__ function rather than calling it as an API, the __global__ function takes only 2 ms to execute. It seems that calling a __host__ __device__ API inside a __global__ function causes a mysterious overhead.

This is the ptxas output when I use the API:

ptxas info    : Compiling entry function '_Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii' for 'sm_52'
ptxas info    : Function properties for _Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info    : Used 9 registers, 404 bytes cmem[0]

This is the ptxas output when I don't use the API:

ptxas info    : Compiling entry function '_Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii' for 'sm_52' 
ptxas info    : Function properties for _Z16cudaCalcDistanceP7RayInfoPK4GPS3PK6float6PK9ObjStatusPKfSB_SB_fiiiiii     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads 
ptxas info    : Used 2 registers, 404 bytes cmem[0]

The only difference is that the API version uses 9 registers while the non-API version uses 2 registers. What can I deduce from this information?

In the file utils.cu, I defined the following structs and API:

struct GPS {
    float latitude;
    float longtitude;
    float height;
};

struct Coordinate
{
    __host__ __device__ Coordinate(float x_ = 0, float y_ = 0, float z_= 0)
    {
        x = x_;
        y = y_;
        z = z_;
    }
    __host__ __device__ float norm()
    {
        return sqrtf(x * x + y * y + z * z);
    }

    float x;
    float y;
    float z;
};

__host__ __device__ Coordinate Geoditic2ECEF(GPS gps)
{
    Coordinate result;

    float a = 6378137;         // WGS-84 semi-major axis (m)
    float b = 6356752;         // WGS-84 semi-minor axis (m)
    float f = (a - b) / a;     // flattening
    float e_sq = f * (2 - f);  // first eccentricity squared

    float lambda = gps.latitude / 180 * M_PI;   // latitude in radians
    float phi = gps.longtitude / 180 * M_PI;    // longitude in radians

    float N = a / sqrtf(1 - e_sq * sinf(lambda) * sinf(lambda));  // prime vertical radius
    result.x = (gps.height + N) * cosf(lambda) * cosf(phi);
    result.y = (gps.height + N) * cosf(lambda) * sinf(phi);
    result.z = (gps.height + (1 - e_sq) * N) * sinf(lambda);
    return result;
}

In main.cu, I have the following functions:

__global__ void cudaCalcDistance(GPS* missile_cur,
                                 int num_faces, int num_partialPix)
{
    // IMUL is assumed to be a fast integer-multiply macro defined elsewhere,
    // e.g. #define IMUL(a, b) __mul24(a, b) as in the CUDA samples
    int partialPixIdx = threadIdx.x + IMUL(blockIdx.x, blockDim.x);
    int faceIdx = threadIdx.y + IMUL(blockIdx.y, blockDim.y);

    if(faceIdx < num_faces && partialPixIdx < num_partialPix)
    {
        Coordinate missile_pos;
        // API version
        missile_pos = Geoditic2ECEF(*missile_cur);  // missile_cur is a GPS*, so dereference it
        // non_API version
//        float a = 6378137;
//        float b = 6356752;
//        float f = (a - b) / a;
//        float e_sq = f * (2 - f);

//        float lambda = missile_cur->latitude / 180 * M_PI;
//        float phi = missile_cur->longtitude / 180 * M_PI;

//        float N = a / sqrtf(1 - e_sq * sinf(lambda) * sinf(lambda));
//        missile_pos.x = (missile_cur->height + N) * cosf(lambda) * cosf(phi);
//        missile_pos.y = (missile_cur->height + N) * cosf(lambda) * sinf(phi);
//        missile_pos.z = (missile_cur->height + (1 - e_sq) * N) * sinf(lambda);

    }
}
void calcDistance(GPS * data)
{
    int num_partialPix = 10000;
    int num_surfaces = 4000;

    const int threadsPerBlock = 16;
    dim3 blockDim(threadsPerBlock, threadsPerBlock);
    dim3 gridDim(ceil((float)num_partialPix / threadsPerBlock),
                 ceil((float)num_surfaces / threadsPerBlock));

    cudaCalcDistance<<<gridDim, blockDim>>>(data, num_surfaces, num_partialPix);
    gpuErrChk(cudaDeviceSynchronize());
}

int main()
{
    GPS data{11, 120, 32};  // aggregate initialization (GPS has no user-defined constructor)
    GPS *d_data;
    gpuErrChk(cudaMallocManaged((void**)&d_data, sizeof(GPS)));
    gpuErrChk(cudaMemcpy(d_data, &data, sizeof(GPS), cudaMemcpyHostToDevice));
    calcDistance(d_data);
    gpuErrChk(cudaFree(d_data));
}

Upvotes: 1

Views: 487

Answers (1)

Robert Crovella

Reputation: 152164

You don't seem to have asked a question that I can see, so I will assume your question is something like "what is this mysterious overhead and what are my options to mitigate it?"

When the call to a __device__ function is in a different compilation unit than the definition of that function, the compiler cannot inline that function (generally).
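
For example, with the file layout in the question, a relocatable-device-code build compiles Geoditic2ECEF in utils.cu and the kernel in main.cu separately, so the kernel's compilation unit sees only a declaration. A minimal sketch of such a build (the exact flags are an assumption about how the project is compiled):

nvcc -rdc=true -c utils.cu -o utils.o    # Geoditic2ECEF is compiled here
nvcc -rdc=true -c main.cu  -o main.o     # the kernel sees only a declaration
nvcc -rdc=true utils.o main.o -o app     # the device linker resolves the call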

This inability to inline can have a variety of performance impacts:

  • The call instruction itself adds some overhead.
  • The function call follows an ABI that reserves registers; this creates register pressure, which may affect code performance.
  • The compiler may have to pass additional function parameters outside of registers, via the stack, which adds further overhead.
  • The compiler cannot (generally) optimize across the function call boundary.

All of these can create performance impacts to varying degrees, and you can find other questions here on the cuda tag which mention these.
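
The call overhead itself can be reproduced even inside a single compilation unit by forcing a real call. A minimal sketch (a toy function of my own, not the code from the question) using the __noinline__ and __forceinline__ qualifiers:

__device__ __noinline__ float scale_called(float x)     // compiled as a genuine ABI call
{
    return 2.0f * x;
}

__device__ __forceinline__ float scale_inlined(float x) // always folded into the caller
{
    return 2.0f * x;
}

__global__ void demo(float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = scale_called(out[i]) + scale_inlined(out[i]); // first call pays the overhead
}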

The most common solutions I know of are:

  1. Move the definition of the function into the same compilation unit as the calling code (and, if possible, remove -rdc=true or -dc from the compilation command line); a sketch follows this list.
  2. In recent CUDA versions (11.2 and later), make use of device link-time optimization; an example invocation also follows.
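
A minimal sketch of option 1, assuming the definition is moved into a header (utils.cuh is a hypothetical name) included from main.cu, so the compiler can see and inline the body:

// utils.cuh (hypothetical header), #include'd from main.cu
__host__ __device__ __forceinline__ Coordinate Geoditic2ECEF(GPS gps)
{
    // ... same body as in the question ...
}

And a sketch of option 2 using device link-time optimization (the -dlto flag, available in CUDA 11.2 and later), which lets the device linker inline across compilation units even with -rdc=true:

nvcc -rdc=true -dlto -c utils.cu -o utils.o
nvcc -rdc=true -dlto -c main.cu  -o main.o
nvcc -dlto utils.o main.o -o app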

Upvotes: 3
