RdelCueto

Reputation: 63

How to shuffle data efficiently within device memory?

Question

When moving many scattered (non-coalesced) values within device global memory, what is the most efficient way to do it?

Note: "many" here means more than 500 values.

Context

I've been working on a Genetic Algorithm implementation for GPUs for some time now, and I've been struggling to balance the flexibility of my framework against micro-optimizing for the GPU architecture. The GA data resides on the GPU at all times; only the best solution of each generation is copied to host memory.

Detailed scenario

I'm optimizing the migration function. Here, comparatively little data is shuffled within device global memory, but the data is ordered so that accesses are coalesced for the GA operator kernels' threads. This makes shuffling a pair of "genomes" a matter of striding over single float values and swapping each one with the corresponding value of another genome in the same strided fashion.
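To make the layout concrete: gene g of genome i sits at population[g * stride + i] (the names here are simplified placeholders, not the ones in my framework), so exchanging two genomes boils down to swapping num_genes single floats, each stride elements apart. Written as plain host-style C just to show the index arithmetic:

// Illustrative only -- the array actually lives in device global memory.
// Swap genome i with genome j in the gene-major (strided) layout.
for (int g = 0; g < num_genes; ++g)
{
  float tmp                  = population[g * stride + i];
  population[g * stride + i] = population[g * stride + j];
  population[g * stride + j] = tmp;
}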

Known solutions

The problem isn't memory bandwidth, but a matter of call latency and thread blocks stalling the process.

  1. I wrote a couple of device kernels whose sole function is to move values between addresses. This does launch kernels with VERY low occupancy, divergent code, and random memory access (therefore the little code they run is effectively serialized), but it gets the work done with only two kernel calls to the device:

    • 1st kernel: copies the values to a buffer array.
    • 2nd kernel: swaps the values.
  2. I'm aware I could use cudaMemcpy for every value, but that would require many calls to cudaMemcpy, which I believe are synchronous calls.

Simplified code example (for option 2):

int needed_genome_idx = 0; // Some random index.
for(int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
  cudaMemcpy(genomes_buffer + nth_gene,
             src + needed_genome_idx + nth_gene * stride_size, // stride_size is a big number, usually equal to the GA population size.
             sizeof(float),
             cudaMemcpyDeviceToDevice);
}
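The asynchronous variant I have in mind would look roughly like this (the stream handle and the synchronization point are illustrative, not from my actual code):

cudaStream_t migration_stream; // Illustrative stream, not part of the framework.
cudaStreamCreate(&migration_stream);

for(int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
  cudaMemcpyAsync(genomes_buffer + nth_gene,
                  src + needed_genome_idx + nth_gene * stride_size,
                  sizeof(float),
                  cudaMemcpyDeviceToDevice,
                  migration_stream);
}
cudaStreamSynchronize(migration_stream); // Ensure all copies finished before the buffer is read.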

Is this a viable solution? Would using cudaMemcpyAsync help performance?

Is there a better way, or at least a more elegant way, to do such memory operations?

Upvotes: 6

Views: 1353

Answers (1)

Kill Console

Reputation: 2023

You can try writing a kernel to do the whole shuffle; that will probably be more efficient than calling cudaMemcpy so many times.
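A minimal sketch of such a swap kernel, assuming the gene-major layout described in the question (all names are illustrative, and it is untested against your framework):

// One thread per gene: swap genome idx_a with genome idx_b, where gene g
// of genome i is stored at population[g * stride + i].
__global__ void swap_genomes(float *population, int stride,
                             int num_genes, int idx_a, int idx_b)
{
  int g = blockIdx.x * blockDim.x + threadIdx.x;
  if (g < num_genes)
  {
    float tmp                      = population[g * stride + idx_a];
    population[g * stride + idx_a] = population[g * stride + idx_b];
    population[g * stride + idx_b] = tmp;
  }
}

// Launch example: a single kernel call replaces num_genes cudaMemcpy calls.
// swap_genomes<<<(num_genes + 255) / 256, 256>>>(population, stride,
//                                                num_genes, idx_a, idx_b);

The strided accesses are still non-coalesced, but since you say the problem is call latency rather than bandwidth, one launch like this avoids the per-call overhead of many small copies.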

Upvotes: 2
