Reputation: 23
Is there an optimal data structure for transferring data with cudaMemcpy(..., cudaMemcpyDeviceToHost)? I've found that arrays transfer a lot faster than structs. Is there a reason for this, and is there a more optimal method?
edit -
It seems that my timing wasn't being recorded correctly; the transfer times for the structs and arrays should be about equal. I will try using the CUDA events API to record the time.
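For reference, a minimal sketch of what timing a copy with CUDA events might look like (the buffer names h_data/d_data and the size N are placeholders, not my actual code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;                       // hypothetical element count
    float *h_data = (float*)malloc(N * sizeof(float));
    float *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                     // wait for the copy and the stop event

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
    printf("device-to-host copy: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    free(h_data);
    return 0;
}
```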
Upvotes: 0
Views: 143
Reputation: 8038
Personally, I am skeptical that the performance difference is due to the copy.
Perhaps your data structure is being padded for alignment in a way that leaves empty gaps between fields.
A second cause could be how memory pages are handled. Memory obtained with malloc can be fragmented, similar to the layout of a Windows filesystem. The level of fragmentation can vary, but it is not unreasonable to say that a single call to malloc gives you contiguous memory, while many calls can leave you with memory containing gaps.
CUDA's memory copy has to deal with this additional overhead by checking the pages one by one and manually moving them to the GPU.
The real solution to your problem will be to use cudaMallocHost, which allocates pinned (page-locked) host memory that the CPU doesn't have to shuffle around. Try doing this and see if it fixes your problem.
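If it helps, here is a minimal sketch of what that might look like (the buffer names and size are placeholders):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;                 // hypothetical element count
    float *h_pinned = nullptr, *d_data = nullptr;

    // One call to cudaMallocHost gives a single, page-locked (pinned) host block,
    // so the driver can transfer it directly instead of staging pageable memory.
    cudaMallocHost(&h_pinned, N * sizeof(float));
    cudaMalloc(&d_data, N * sizeof(float));

    cudaMemcpy(d_data, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels ...
    cudaMemcpy(h_pinned, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);                   // pinned memory is freed with cudaFreeHost
    return 0;
}
```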
Upvotes: 0
Reputation: 21818
A Structure-of-Arrays is usually better than an Array-of-Structs when loading data between global memory and shared memory/registers inside a kernel. However, I don't think there is any performance difference between SoA and AoS when copying the data between host and device in one big memcpy transaction. After all, the amount of data is the same.
The only exception is if extra padding bytes are added to the struct to achieve a certain memory alignment of the elements of the AoS, so the copy moves the padding as well.
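To illustrate the padding point, a small sketch (the Element struct is just a made-up example):

```cpp
#include <cstdio>

// Hypothetical element type for an Array-of-Structs.
struct Element {
    char   flag;    // 1 byte
    double value;   // 8 bytes, must be 8-byte aligned
};

int main() {
    // The compiler typically inserts 7 padding bytes after 'flag', so each
    // element occupies 16 bytes instead of 9, and an AoS copy moves the
    // padding too. An SoA layout (separate flag[] and value[] arrays) would
    // copy only 9 bytes per element.
    printf("sizeof(Element) = %zu\n", sizeof(Element));   // typically 16
    return 0;
}
```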
I think there might be some other reason why you are experiencing performance differences.
Upvotes: 1