Reputation: 33
I want to evaluate my GPU's memory bandwidth. I wrote a snippet that copies an int8_t array to another one and measures the time. I set a block of 64 threads and a grid of (<size of my array>/<block size>/<data size a thread needs to copy>) blocks. I used cudaEvent and chrono::high_resolution_clock to measure the time. The results show that it achieves only tens of GB/s, which is far from Nvidia's stated bandwidth of 4000 GB/s. So why doesn't my approach work, and how can I fix it?
My hardware is an Nvidia H20 with an Intel Xeon CPU. The OS is CentOS.
My kernel function is as follows:
__global__ void copyGpuMem(int8_t *d_B, int8_t *d_A, size_t oneThreadDataSize){
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    size_t off = (bid*blockDim.x + tid)*oneThreadDataSize;
    for(size_t i = 0; i < oneThreadDataSize; i++){
        d_B[off+i] = d_A[off+i];
    }
}
My launching kernel code is as follows:
int8_t *d_A;
int8_t *d_B;
size_t blockSize = 64;
const size_t oneThreadDataSize = 1e3;
const size_t oneBlockDataSize = blockSize*oneThreadDataSize;
size_t n = (size_t(1e9)+oneBlockDataSize-1)/oneBlockDataSize*oneBlockDataSize;
int loop = 100;
size_t gridSize = n/oneBlockDataSize;
CHECK_CUDA(cudaMalloc(&d_A, n*sizeof(int8_t)));
CHECK_CUDA(cudaMalloc(&d_B, n*sizeof(int8_t)));
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < loop; i++){
    copyGpuMem<<<int(gridSize), int(blockSize)>>>(d_B, d_A, oneThreadDataSize);
}
CHECK_CUDA(cudaDeviceSynchronize()); // wait for all kernels before stopping the timer
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
double bw = (2.0*n*loop)/elapsed.count()/1e9; // unit: GB/s
My measured bandwidth is around 200 GB/s, but the specified bandwidth is 4000 GB/s, and I have no idea why. I have tried reducing the amount of data each thread handles, which improved the result from tens of GB/s to at most 200 GB/s. When I set oneThreadDataSize=1e6, the result is about 10-20 GB/s, but lowering the size further doesn't help.
Upvotes: 0
Views: 336
Reputation: 13438
From the comments:
Your memory access pattern is not properly coalesced. Please review the relevant section in the Best Practices Guide. Also, for best per-thread performance, use int4 or other large, aligned data types. One int at a time should be the minimum.
Follow-up questions:
I followed @Homer512's advice and tried uint4 instead of int8_t to measure the bandwidth. I got nearly 3200 GB/s, which is much closer to the official specification. Is it because the GPU transfers 16 bytes to the core at once?
No, memory transfers on current CUDA hardware happen in 32-byte transactions. However, with 1-byte uncoalesced accesses you made the system do a 32-byte transfer for 1 byte of used payload. Now 16 out of 32 bytes are used. It may also have a positive effect on the cache hit rate, since the second half of that transaction is still pulled into L2 cache and may still be around when the next access occurs.
And I have another question: do the L1/L2 caches affect my measurement?
Yes, they help, but the size of your data set exceeds the cache size, which limits their effectiveness. Just use the visual profiler, Nsight Compute. It will tell you bandwidth and cache hit rates.
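For instance, an Nsight Compute command line along these lines reports DRAM throughput and L2 hit rate directly. The metric names vary between Nsight Compute versions, so treat this as illustrative:

```shell
# Illustrative ncu invocation; run `ncu --query-metrics` to see the exact
# metric names your Nsight Compute version supports.
ncu --metrics dram__bytes.sum.per_second,lts__t_sector_hit_rate.pct ./copy_benchmark
```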
How can I code to achieve coalesced memory access?
The Best Practices Guide explains it in detail, but it is rather simple: one warp (a group of 32 threads) acts as one. Memory access happens in 32-byte transactions. The largest types that can be accessed in a single memory operation are int4 and float4, with 16 bytes. So you either need groups of two threads accessing two neighbouring int4s, or 32 threads each accessing a single byte, or anything in between.
Usually you just have all 32 threads access 32 elements in a row, no matter what type they are. Memory alignment also plays a role, but if you process whole arrays at once, that is usually a given.
Here is the basic pattern applied to your code:
__global__ void copyGpuMem(int8_t *d_B, int8_t *d_A, size_t oneThreadDataSize){
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    size_t off = bid*blockDim.x + tid;
    size_t stride = blockDim.x * gridDim.x;
    for(size_t i = 0; i < oneThreadDataSize; i++){
        d_B[off + i*stride] = d_A[off + i*stride];
    }
}
Notice how in each loop iteration, neighbouring threads access neighbouring elements, offset by the thread's tid. Of course it is now a bit awkward, because that oneThreadDataSize parameter is really inconvenient. Normally you write these types of loops like this:
__global__ void copyGpuMem(int8_t *d_B, int8_t *d_A, size_t fullArraySize){
    unsigned tid = threadIdx.x;
    unsigned bid = blockIdx.x;
    size_t off = bid*blockDim.x + tid;
    size_t stride = blockDim.x * gridDim.x;
    for(size_t i = off; i < fullArraySize; i += stride){
        d_B[i] = d_A[i];
    }
}
Notice the change in meaning of the last function parameter. This pattern is known as a grid-strided loop.
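Since the question timed the kernel with cudaEvent, a matching host-side harness for this grid-stride kernel could look like the sketch below. The block/grid sizes and the 1 GiB array size are illustrative rather than tuned, and CHECK_CUDA is the error-checking macro from the question; the important part is synchronizing on the stop event before reading the timer.

```cuda
// Illustrative benchmark harness for the grid-stride copy kernel above.
size_t n = size_t(1) << 30;        // 1 GiB per array (illustrative)
int8_t *d_A, *d_B;
CHECK_CUDA(cudaMalloc(&d_A, n));
CHECK_CUDA(cudaMalloc(&d_B, n));

int block = 256;                   // threads per block (illustrative)
int grid = 1024;                   // enough blocks to keep all SMs busy
int loop = 100;

cudaEvent_t start, stop;
CHECK_CUDA(cudaEventCreate(&start));
CHECK_CUDA(cudaEventCreate(&stop));

CHECK_CUDA(cudaEventRecord(start));
for (int i = 0; i < loop; i++)
    copyGpuMem<<<grid, block>>>(d_B, d_A, n);
CHECK_CUDA(cudaEventRecord(stop));
CHECK_CUDA(cudaEventSynchronize(stop)); // wait for all kernels to finish

float ms = 0.f;
CHECK_CUDA(cudaEventElapsedTime(&ms, start, stop));
// Each kernel reads n bytes and writes n bytes.
double gbps = 2.0 * n * loop / (ms / 1e3) / 1e9;
```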
One byte at a time is still very low throughput for a single thread. You can combine the pattern with larger vector types. A more generic memcpy-style kernel may look like this:
#include <stdint.h>

__global__ void copyGpuMem(
        int8_t *d_B, const int8_t *d_A, size_t fullArraySize) {
    size_t off = blockIdx.x*blockDim.x + threadIdx.x;
    size_t stride = blockDim.x * gridDim.x;
    uintptr_t bAddress = reinterpret_cast<uintptr_t>(d_B);
    uintptr_t aAddress = reinterpret_cast<uintptr_t>(d_A);
    if(! (bAddress % alignof(int4) || aAddress % alignof(int4))) {
        /*
         * Likely case: Addresses are 16 byte aligned for int4 access
         */
        const size_t bytesAtOnce = sizeof(int4) / sizeof(int8_t);
        size_t int4ArraySize = fullArraySize / bytesAtOnce;
        int4* d_B_i4 = reinterpret_cast<int4*>(d_B);
        const int4* d_A_i4 = reinterpret_cast<const int4*>(d_A);
        /* Copy 16 byte at once */
        for(size_t i = off; i < int4ArraySize; i += stride){
            d_B_i4[i] = d_A_i4[i];
        }
        /*
         * If the full size is not divisible by 16, there is a tail.
         * We adjust the offset so that the following loop can deal with it
         */
        off += int4ArraySize * bytesAtOnce;
    }
    /*
     * Either deals with unaligned arrays or with the last few entries
     * in an aligned array case
     */
    for(size_t i = off; i < fullArraySize; i += stride){
        d_B[i] = d_A[i];
    }
}
All that stuff with reinterpret_cast and the alignment checks is only necessary because you use arrays of bytes. If it were an array of int or float, the performance gain from switching to vector types is probably not worth complicating the code for, especially in a real kernel that does a bit more work than just copying.
Upvotes: 3