Reputation: 5064
Suppose I have huge matrices (to be processed somehow, e.g. multiplied) that exceed the available device memory. Is there a standard way to handle such a case? For instance, does zero-copy memory implicitly give me chunk-by-chunk copying on demand?
Or do I have to handle this explicitly by loading the data in pieces?
Upvotes: 0
Views: 549
Reputation: 2856
CUDA provides a mechanism called CUDA Streams. While one chunk of data is being transferred, a previously transferred chunk can be processed at the same time.
This mechanism is commonly used for processing matrices that do not fit in GPU memory: the data is divided into a number of chunks, and the chunks are copied to the device asynchronously via streams (using `cudaMemcpyAsync`). Note that if you copy one chunk, process it, and only then copy the next, you have effectively serialized the program again. Instead, use an array of streams, at least 16 or 32 of them, so that copies in one stream overlap with kernel execution in the others.
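A minimal sketch of this pattern, under some assumptions: the `process` kernel is a hypothetical placeholder for the real per-chunk work, the sizes are illustrative, and the host buffer must be pinned (`cudaMallocHost`) for the asynchronous copies to actually overlap:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical element-wise kernel standing in for the real per-chunk work.
__global__ void process(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const size_t total = 1 << 24;   // total elements
    const size_t chunk = 1 << 20;   // elements per chunk resident on the device
    const int nStreams = 16;        // multiple streams, as suggested above

    float *h;                       // pinned host memory, required for overlap
    cudaMallocHost(&h, total * sizeof(float));
    for (size_t i = 0; i < total; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    float *d[nStreams];             // one device buffer per stream
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d[s], chunk * sizeof(float));
    }

    // Round-robin the chunks over the streams: copy in, process, copy out.
    // Reusing d[s] is safe because work within one stream executes in order.
    for (size_t off = 0, c = 0; off < total; off += chunk, ++c) {
        int s = (int)(c % nStreams);
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpyAsync(d[s], h + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(unsigned)((n + 255) / 256), 256, 0, streams[s]>>>(d[s], n);
        cudaMemcpyAsync(h + off, d[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d[s]);
    }
    cudaFreeHost(h);
    return 0;
}
```

Only `nStreams` chunk-sized buffers live on the device at any time, so the total data set can be much larger than device memory; the stream ordering guarantees that each buffer's next host-to-device copy is queued after its previous device-to-host copy has completed.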
Upvotes: 3