Reputation: 2731
Does cudaMalloc allocate contiguous chunks of memory (i.e., physical bytes next to each other)?
I have a piece of CUDA code that simply copies 128 bytes from global device memory to shared memory, using 32 threads. I am trying to find a way to guarantee that this transfer can be completed in one memory transaction of 128 bytes. If cudaMalloc allocates contiguous memory blocks, then this is easy to do.
Following is the code:
#include <iostream>
using namespace std;
#define SIZE 32 //SIZE of the array to store in shared memory
#define NUMTHREADS 32
__global__ void copy(uint* memPointer){
    extern __shared__ uint bits[];
    int tid = threadIdx.x;
    bits[tid] = memPointer[tid];
}
int main(){
    uint inputData[SIZE];
    uint* storedData;
    for(int i=0;i<SIZE;i++){
        inputData[i] = i;
    }
    cudaError_t e1=cudaMalloc((void**) &storedData, sizeof(uint)*SIZE);
    if(e1 == cudaSuccess){
        cudaError_t e3= cudaMemcpy(storedData, inputData, sizeof(uint)*SIZE, cudaMemcpyHostToDevice);
        if(e3==cudaSuccess){
            copy<<<1,NUMTHREADS, SIZE*4>>>(storedData);
            cudaError_t e6 = cudaFree(storedData);
            if(e6==cudaSuccess){
            }
            else{
                cout << "Error freeing memory storedData" << e6 << endl;
            }
        }
        else{
            cout << "Failed to copy" << " " << e3 << endl;
        }
    }
    else{
        cout << "Failed to allocate memory" << " " << e1 << endl;
    }
    return 0;
}
Upvotes: 0
Views: 3089
Reputation: 183
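Yes, cudaMalloc allocates a contiguous chunk of memory, and according to the CUDA C Programming Guide the pointer it returns is aligned to at least 256 bytes. With 32 threads each reading consecutive 4-byte words starting at that pointer, the warp's accesses fall inside a single aligned 128-byte segment. The "Matrix Transpose" example in the SDK (http://developer.nvidia.com/cuda-cc-sdk-code-samples) has a kernel called "copySharedMem" that does almost exactly what you're describing.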
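To make the contiguity and alignment point concrete, here is a minimal sketch rather than a definitive implementation. It is not the SDK's copySharedMem kernel: the names copyThroughShared, isAligned128 and copyRoundTrips are made up for this illustration. It assumes a single block of 32 threads, checks that the pointer from cudaMalloc sits on a 128-byte boundary, and writes the staged words back out so the compiler cannot drop the shared-memory copy as dead code.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

constexpr int kWords = 32;  // 32 threads x 4 bytes = 128 bytes

// Each thread stages one word in shared memory, then writes it back out so
// the load from global memory is not optimized away.
__global__ void copyThroughShared(const unsigned int* in, unsigned int* out) {
    extern __shared__ unsigned int bits[];
    int tid = threadIdx.x;
    bits[tid] = in[tid];   // thread k reads word k: consecutive addresses
    __syncthreads();
    out[tid] = bits[tid];
}

// True when p sits on a 128-byte boundary.
bool isAligned128(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 128 == 0;
}

// Hypothetical helper: returns true if the allocation is 128-byte aligned
// and the 32 words survive the trip through shared memory.
bool copyRoundTrips() {
    unsigned int host[kWords], result[kWords] = {};
    for (int i = 0; i < kWords; ++i) host[i] = i;

    unsigned int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, sizeof(host)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, sizeof(host)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }

    bool ok = isAligned128(dIn);
    ok = ok && cudaMemcpy(dIn, host, sizeof(host),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Third launch parameter: bytes of dynamic shared memory for bits[].
        copyThroughShared<<<1, kWords, sizeof(host)>>>(dIn, dOut);
        // The blocking cudaMemcpy below also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(result, dOut, sizeof(host),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < kWords; ++i) ok = (result[i] == host[i]);

    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}
```

Calling copyRoundTrips() from main and printing the result is enough to try it. Whether the load is actually served by one transaction depends on the GPU architecture and is best confirmed with a profiler; the sketch only checks the preconditions (consecutive addresses and an aligned base pointer).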
Upvotes: 2