Dredok

Reputation: 805

Different views of memory bandwidth between visual profiler and nsight analysis

I'm using CUDA 5.5 on Windows with VS2010, Nsight 3.1, and the bundled Visual Profiler.

I have a toy kernel that only performs stores, and Nsight and the Visual Profiler report different data for it. Which should I trust, and why do I get different views?

Nsight reports 4.21 MB of stores, while the Visual Profiler reports 71402 transactions, which represents 8.9 MB (assuming all of them are 128 B). Consequently, Nsight reports a bandwidth of 277 GB/s and the Visual Profiler reports 126.69 GB/s.
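The transaction-to-bytes conversion above depends entirely on the assumed transaction size, so it can help to make that assumption explicit. The following is a minimal sketch; the helper names transaction_bytes and to_mib are hypothetical, and the 128-byte size is the assumption stated in the question, not something the profiler output confirms.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes moved by `transactions` memory transactions,
// assuming every transaction has the same size `bytes_per_transaction`.
constexpr std::uint64_t transaction_bytes(std::uint64_t transactions,
                                          std::uint64_t bytes_per_transaction) {
    return transactions * bytes_per_transaction;
}

// Hypothetical helper: convert a byte count to mebibytes (2^20 bytes).
constexpr double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With the figures from the question, transaction_bytes(71402, 128) gives 9,139,456 bytes, roughly 8.7 MiB. If the counter were instead measured in 32-byte transactions (an assumption, not something stated here), the same count would give 2,284,864 bytes, so the result is very sensitive to that choice.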

The Nsight data looks closer to reality to me, since my dataset is 1024x1024.

EDIT

I have deleted a lot of bad assumptions from my original question. I was thinking too much in terms of CPUs and cache coherence.

Access pattern: each thread performs 4 consecutive 1-byte stores, like this (dst is char*):

for (int i = 0; i < 4; i++) {
   dst[offset+i] = 0;
}
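To reason about how many memory segments one warp touches with this pattern, a small host-side model can help. This is only a sketch under assumptions: the question does not show how offset is computed, so the code assumes offset = base + tid * 4 for each thread, and the function name segments_touched is hypothetical. It counts the memory footprint of all four stores together, not the transactions a profiler would report.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct aligned segments of `segment_bytes` bytes touched when
// each of `threads` threads writes `bytes_per_thread` consecutive bytes.
// ASSUMPTION: thread `tid` starts at base + tid * bytes_per_thread; the
// original question does not show how `offset` is actually computed.
std::size_t segments_touched(std::size_t base, std::size_t threads,
                             std::size_t bytes_per_thread,
                             std::size_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < threads; ++tid) {
        const std::size_t offset = base + tid * bytes_per_thread;
        for (std::size_t i = 0; i < bytes_per_thread; ++i) {
            // Index of the aligned segment that contains byte offset + i.
            segments.insert((offset + i) / segment_bytes);
        }
    }
    return segments.size();
}
```

For a 32-thread warp with an aligned base, segments_touched(0, 32, 4, 128) is 1 and segments_touched(0, 32, 4, 32) is 4; shifting the base by 4 bytes makes the warp straddle two 128-byte segments. Since each of the four 1-byte stores is a separate instruction in the source, a profiler that counts transactions per instruction may report more transactions than this footprint suggests.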

[Screenshot: Visual Profiler]

[Screenshot: Nsight]

Upvotes: 0

Views: 509

Answers (1)

thalie

Reputation: 11

There is a difference between device memory and global memory. The programming guide says that device memory includes "global, local, shared, constant, or texture memory" (see 5.3.2).

In your first picture, global loads and stores should appear in the first table, named L1/Shared Memory (which is not visible in your capture).

Upvotes: 1
