Reputation: 132340
The cudaHostAlloc() API call has, among others, the following flags:
- cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
- cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
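For reference, here is a minimal sketch of how these two flags might be combined in practice. The buffer size is arbitrary, and on some older setups cudaSetDeviceFlags(cudaDeviceMapHost) may be required before the allocation:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 1 << 20;          /* 1 MiB, purely illustrative */
    float *h_buf = NULL;
    float *d_ptr = NULL;

    /* Pinned host allocation that is both mapped into the device address
       space and marked write-combined. */
    cudaError_t err = cudaHostAlloc((void **)&h_buf, bytes,
                                    cudaHostAllocMapped | cudaHostAllocWriteCombined);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* Device pointer aliasing the same pinned memory (zero-copy access). */
    err = cudaHostGetDevicePointer((void **)&d_ptr, h_buf, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... CPU writes h_buf, kernels read through d_ptr ... */

    cudaFreeHost(h_buf);
    return 0;
}
```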
I could not quite understand when exactly I would prefer the "write-combined" option. The documentation does not say the transfer is faster in only one direction, so why is the option recommended for only one direction? Also, which kinds of systems benefit from this "write-combining"?
I read this white paper, Section 4.7, and still could not work it out. OK, so reading by the CPU is inefficient; but what if other benefits offset that inefficiency? And if they cannot, why not?
An elucidation would be appreciated.
Upvotes: 4
Views: 1612
Reputation: 26225
Write-combined memory allows the CPU to combine multiple narrow writes into fewer wider writes, thus increasing the efficiency of memory writes. If memory serves, WC memory was first introduced with the Intel Pentium Pro around 1995 to speed up CPU writes into the frame buffer of video cards. I am not up to speed on which modern system platforms use or support this.
The efficiency of reads performed by the CPU is going to be the same for both cudaHostAllocMapped and cudaHostAllocWriteCombined. But because the latter allows more efficient writes by the CPU, it is recommended for "buffers that will be written by the CPU and read by the device", as stated in the quoted documentation.
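To make the recommended usage concrete, here is a rough sketch of the write-only staging pattern the documentation describes. The helper name, types, and error handling are my own illustration, not part of CUDA's API; the point is simply that the CPU only writes the WC buffer and the device only reads it:

```c
#include <cuda_runtime.h>
#include <string.h>

/* Hypothetical helper: the CPU only ever writes h_staging,
   the device only ever reads it (via a host->device copy). */
int stage_upload(const float *src, float *d_dst, size_t n)
{
    float *h_staging = NULL;
    const size_t bytes = n * sizeof(float);

    /* Write-combined pinned memory: sequential CPU writes and PCIe
       transfers are fast, but CPU reads are slow, so we never read it back. */
    if (cudaHostAlloc((void **)&h_staging, bytes,
                      cudaHostAllocWriteCombined) != cudaSuccess)
        return -1;

    memcpy(h_staging, src, bytes);                        /* CPU writes   */
    cudaMemcpy(d_dst, h_staging, bytes,
               cudaMemcpyHostToDevice);                   /* device reads */

    cudaFreeHost(h_staging);
    return 0;
}
```

If the same buffer also had to be read back by the CPU, a plain pinned allocation (no cudaHostAllocWriteCombined) would be the safer choice, since WC reads bypass the CPU caches.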
Upvotes: 4