Reputation: 6623
I have a CUDA kernel with many operations and few branches. It looks like this:
__global__
void kernel(Real *randomValues, Real mu, Real sigma)
{
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if ( row >= cnTimeSteps || col >= cnPaths ) return;

    Real alphaLevel = randomValues[row*cnPaths+col];
    Real q = 0.0;
    Real x = 0.0;
    if ( alphaLevel < p_low )
    {
        q = sqrt( -2*log( alphaLevel ) );
        x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
    }
    else if ( alphaLevel < p_high )
    {
        q = alphaLevel-0.5;
        Real r = q*q;
        x = (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
    }
    else
    {
        q = sqrt( -2*log( 1.0-alphaLevel ) );
        x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
    }
    randomValues[row*cnPaths+col] = sigma * x + mu;
}
where all the a's, b's, c's and d's are constant values held in device constant memory:
static __device__ __constant__ Real a1 = 1.73687;
static __device__ __constant__ Real a2 = 1.12321100;
and so on.
After profiling the kernel I found that the theoretical occupancy is 100% but I am achieving no more than 60%.
I went through this and this GTC talk trying to optimize my kernel.
On one side, IPC reports an average of 1.32 issued and 0.62 executed instructions per cycle. Instruction serialization is about 50%, yet SM activity is almost 100%. On the other hand, there are around 38 active warps, of which about 8 are eligible to execute the next instruction, but warp issue efficiency shows that on around 70% of the cycles there is no eligible warp. The stall reasons are reported as "Other", which I think has to do with the computation of log and sqrt.
This is the first time I've used Nsight Visual Studio, so I'm trying to figure out the meaning of all the performance analysis results. BTW, my card is a Quadro K4000.
Upvotes: 0
Views: 234
Reputation: 11529
1) How can the SM activity be 99.82% if most of the cycles there is no eligible warp?
A warp is active if registers and a warp slot are allocated to the warp. An SM is active if at least 1 warp is active on the SM.
SM activity should not be confused with efficiency.
2) How can I reduce stall?
In the case of the code above, the warps are stalled waiting for the double precision execution units to be available. The Quadro K4000 has a throughput of 8 threads/cycle for double precision operations.
The remedies for this problem are:
a. Decrease the number of double precision operations. For example, moving consecutive operations to float may significantly improve performance, as single precision floating point throughput is 24x the double precision throughput (see the sketch after this list).
b. Execute the kernel on a GK110, which has 8x the double precision throughput of a GK10x.
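As an illustration of remedy (a), here is a minimal sketch, assuming float accuracy is acceptable for the application; c1f..c6f and d1f..d4f are hypothetical float copies of the tail coefficients, not names from the question:

// Hedged sketch: evaluate one tail branch entirely in single precision.
// Hypothetical float copies of the coefficients (values elided):
//   static __device__ __constant__ float c1f, ..., c6f, d1f, ..., d4f;
__device__ float tailPolyF(float alphaLevel)
{
    float q = sqrtf(-2.0f * logf(alphaLevel));            // sqrtf/logf never touch the DP units
    return (((((c1f*q + c2f)*q + c3f)*q + c4f)*q + c5f)*q + c6f)
         / ((((d1f*q + d2f)*q + d3f)*q + d4f)*q + 1.0f);  // every literal carries the f suffix
}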
Increasing the achieved occupancy may not increase the performance of this kernel on the K4000. You have provided insufficient information to determine why achieved occupancy is significantly less than theoretical occupancy.
The Achieved FLOPs experiment can be used to confirm if the kernel performance is bound by double precision throughput.
3) As threads in a warp may not go into the same branch, requests to constant memory are probably serialized, is this true? Should I put those constants in global memory (maybe use shared memory also)?
The code has no memory address divergence in the constant memory loads. Warp control flow divergence just means that on each request only a portion of the threads will be active.
The initial global load may not be coalesced. You would need to provide the value of cnPaths for someone to review this. You could also look at the Memory experiments or the Source Correlated experiments.
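If the row stride turns out to be the issue, one option (a sketch, assuming the data really is laid out as cnTimeSteps rows of cnPaths values) is to pad each row with cudaMallocPitch so every row starts on an aligned boundary:

// Hedged sketch: pitched allocation; the kernel-side indexing is shown as comments.
size_t pitchBytes;
Real *d_randomValues;
cudaMallocPitch((void **)&d_randomValues, &pitchBytes,
                cnPaths * sizeof(Real), cnTimeSteps);

// Inside the kernel, index through the pitch instead of row*cnPaths+col:
//   Real *rowPtr = (Real *)((char *)randomValues + row * pitchBytes);
//   Real alphaLevel = rowPtr[col];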
The if and else statements may be able to be coded in a more efficient manner to allow the compiler to use predication instead of divergent branches.
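A minimal illustration of such a formulation (my sketch, not the answer's code): scalar ternary selects like these can be compiled to predicated select instructions rather than divergent branches:

// Hedged sketch: fold the two tail cases into data selects.
Real tail = (alphaLevel < p_low) ? alphaLevel : (Real)1 - alphaLevel;  // tail probability
Real sign = (alphaLevel < p_low) ? (Real)1  : (Real)-1;                // sign of the result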
Upvotes: 3
Reputation: 5482
I assume your Real datatype is a typedef for float. You can try adding the f suffix to the constant values that are used, preventing the compiler from adding unnecessary casts.
E.g.
q = alphaLevel-0.5;
The constant 0.5 is a double value while alphaLevel is a Real = float value, so alphaLevel will be cast to a double. q is of type float, so the result of the subtraction must be cast down to a float again.
If Real is a typedef for double, all your calculations would mix double and float, resulting in the same up- and down-casting.
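For example, assuming Real is float, a sketch of the suffixed version:

// With Real = float, the f suffix keeps everything in single precision:
static __device__ __constant__ Real a1 = 1.73687f;   // constant declared as a float literal
// ...
Real q = alphaLevel - 0.5f;   // no promotion to double, no narrowing back to float
Real r = q * q;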
Upvotes: 1
Reputation: 212959
You could probably reduce the impact of warp divergence by simplifying:
if ( alphaLevel < p_low )
{
    q = sqrt( -2*log( alphaLevel ) );
    x = (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
else if ( alphaLevel < p_high )
{
    q = alphaLevel-0.5;
    Real r = q*q;
    x = (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
}
else
{
    q = sqrt( -2*log( 1.0-alphaLevel ) );
    x = -(((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
to:
if ( alphaLevel >= p_low && alphaLevel < p_high )
{
    q = alphaLevel-0.5;
    Real r = q*q;
    x = (((((a1*r+a2)*r+a3)*r+a4)*r+a5)*r+a6)*q / (((((b1*r+b2)*r+b3)*r+b4)*r+b5)*r+1);
}
else
{
    // The low tail keeps the original + sign; only the high tail is negated.
    Real sign = (alphaLevel < p_low) ? (Real)1 : (Real)-1;
    alphaLevel = alphaLevel >= p_low ? 1.0-alphaLevel : alphaLevel;
    q = sqrt( -2*log( alphaLevel ) );
    x = sign * (((((c1*q+c2)*q+c3)*q+c4)*q+c5)*q+c6) / ((((d1*q+d2)*q+d3)*q+d4)*q+1);
}
Upvotes: 0