Reputation: 1042
Hello my issue is this any advice will be greatfully accepted:
I have array of structs (representating Particles) but for simplify I have array containing only True values at start (Particle.exist = True). I am running my own CUDA kernel function on this array and in some cases the True value is changed to False. After that I have to move this Value to the end of array for better optimalization (No more working with dead Particle (exist = False)).
I have theoretically two options how to do this...
Second option should be better choice but I don´t know how to do this in parallel. I could Have 1 000 000 Particles so shifting in one thread is not good idea...
Here is example of my code. I put Todo in part where I need shift array
struct Particle
{
float2 position;
float angle;
bool exists;
};
__global__ void moveParticles(Particle* particles, const unsigned int lengthOfParticles, const Particle* leaders, const unsigned int lengthOfLeaders, const unsigned int sizeOfLeader, const float speedFactor, const cudaTextureObject_t heightMapTexture)
{
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
while (idx < lengthOfParticles)
{
// If particle does not exist then do nothing and skip
if (!particles[idx].exists) { idx += skip; continue; }
float bestLength = 3.40282e+038;
unsigned int bestLeaderIndex;
for (unsigned int i = 0; i < lengthOfLeaders; i++)
{
float currentLength = (
(particles[idx].position.x - leaders[i].position.x) * (particles[idx].position.x - leaders[i].position.x)
) + (
(particles[idx].position.y - leaders[i].position.y) * (particles[idx].position.y - leaders[i].position.y)
);
if (currentLength < bestLength)
{
bestLength = currentLength;
bestLeaderIndex = i;
}
}
Particle bestLeader = leaders[bestLeaderIndex];
float differenceX = bestLeader.position.x - particles[idx].position.x;
float differenceY = bestLeader.position.y - particles[idx].position.y;
float newLength = sqrtf(differenceX * differenceX + differenceY * differenceY);
// If the newLenght is equal to zero, then the particle is at the same position as leader
// TODO: HERE I NEED SORT NOT EXISTING PARTICLE TO THE END
if (newLength <= sizeOfLeader / 2) { particles[idx].exists = false; idx += skip; continue; }
// Current height at the position
const uchar4 texelOfHeight = tex2D<uchar4>(heightMapTexture, particles[idx].position.x, particles[idx].position.y);
// Normalize vector
differenceX /= newLength;
differenceY /= newLength;
int nextPositionOnMapX = round(particles[idx].position.x + differenceX);
int nextPositionOnMapY = round(particles[idx].position.y + differenceY);
// Height of the next position
const uchar4 texelOfNextPosition = tex2D<uchar4>(heightMapTexture, nextPositionOnMapX, nextPositionOnMapY);
float differenceHeight = texelOfHeight.x - texelOfNextPosition.x;
float speed = sqrtf(speedFactor + 2 * fabsf(differenceHeight));
// Multiply by speed
differenceX *= speed;
differenceY *= speed;
particles[idx].position.x += differenceX;
particles[idx].position.y += differenceY;
idx += skip;
}
}
One possible solution what am I thinking about is do own kernel function which will only shifting particles. Something like this
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle) {
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
//TODO: Shifting...
}
Upvotes: 1
Views: 230
Reputation: 50453
Sorting on GPUs is rather inefficient, so it is better to select the values to keep and perform a partition based on them. To do that easily, you can use CUB which is quite efficient (as it often implement best state-of-the-art algorithm or close to). You can use DevicePartition or two DeviceSelect (the former will likely be faster, except if you do not want to keep dead particles at all). You could also use block primitives if you want to perform some advanced tweaks/optimizations.
If you still want to do this yourself for some reason (eg. reducing the number of dependencies in your project), then you can use atomic adds on relatively new devices since they are very-well optimized by the hardware. On old device you could use scans to do that but it is a but harder to implement. The thing is atomics do not scale particularly when there is a lot of SM, so you need to perform some advanced blocking strategy. Here is an untested naive implementation to understand the idea:
// PS: what is the difference between sizeOfParticle and lengthOfParticles?
// pos must be initialized to 0 and contains the number of living particles (pivot) once the kernel finished its execution.
__global__ void shiftParticles(const Particle* particles, const unsigned int lengthOfParticles, const unsigned int sizeOfParticle, Particle* outParticles, int* pos) {
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int skip = gridDim.x * blockDim.x;
const bool exists = particles[idx].exists;
const int localPos = atomicAdd(pos, exists); // Here is the important point
const Particle current = particles[idx];
// outParticles is a needed temporary array or output one
// as the operation cannot be efficiently performed in parallel.
// It should likely be allocated and provided in argument to the kernel
if(exists)
{
// Move the current particle to the beginning
outParticles[localPos] = current;
}
else
{
// Move the current particle to the end
outParticles[lengthOfParticles-1-idx+localPos] = current;
}
}
Note that the ordering is not preserved due to the atomic operations. If you need to keep the order of the particles, then it gets significantly more complicated, especially on GPUs, since it would make the algorithm more sequential. A naive solution could be to use a stable sort in that case. Another solution is to use a global scan followed by an indirection to store the values (so with two pass). Implementing an efficient scan is a bit complex/tedious. Hopefully, CUB can help a lot in this case with its DeviceScan primitive.
Finally note that using array of structures is not efficient, especially on hardware using SIMD instructions like GPUs. The implementation should be significantly faster with structures of arrays (due to cache lines, coalescence, contiguity of access pattern, etc.).
Upvotes: 2