Bezewy

Reputation: 342

Increase utilization of GPU when using Mathematica CUDADot?

I've recently started using Mathematica's CUDALink with a GT430 and am using CUDADot to multiply a 150000x1038 matrix (encs) by a 1038x1 matrix (probe). Both encs and probe are registered with the memory manager:

mmEncs = CUDAMemoryLoad[encs];
mmProbe = CUDAMemoryLoad[probe];

I figured that a dot product of these would max out the GT430, so I tested with the following:

For[i = 0, i < 10, i++,
 CUDADot[mmEncs, mmProbe];
]

While it runs, I use MSI's "Afterburner" utility to monitor GPU usage. The following screenshot shows the result:

[Screenshot: MSI Afterburner GPU usage graph showing one distinct peak per CUDADot call, with idle gaps between peaks]

There's a distinct peak for each CUDADot operation and, overall, I'd say this picture indicates that I'm utilizing less than 1/4 of GPU capacity. Two questions:

Q1: Why do peaks max out at 50%? Seems low.

Q2: Why are there such significant periods of inactivity between the peaks?

Thanks in advance for any hints! I have no clue about Q1, but maybe Q2 is caused by unintended memory transfers between host and device?

Additional info since original posting: CUDAInformation[] reports "Core Count -> 64" but NVIDIA Control Panel reports "CUDA Cores: 96". Is there any chance that CUDALink will under-utilize the GT430 if it's operating on the false assumption that it has 64 cores?

Upvotes: 3

Views: 595

Answers (1)

talonmies

Reputation: 72349

I am going to preface this answer by noting that I have no idea what "MSI Afterburner" is really measuring, or at what frequency it samples that quantity, and I don't believe you do either. That means we don't know the units of either the x or y axis in your screenshot, which makes any quantification of performance pretty much impossible.

1. Why do peaks max out at 50%? Seems low.

I don't believe you can say it "seems low" when you don't know what is really being measured. If, for example, it measures instruction throughput, it could be that the Mathematica dot kernel is memory bandwidth limited on your device. That means the throughput bottleneck of the code would be memory bandwidth rather than SM instruction throughput. If you were to plot memory throughput instead, you would see 100%. I would expect a gemv operation to be memory bandwidth bound, so this result is probably not too surprising.
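To see why a matrix-vector product tends to be bandwidth bound, you can estimate its arithmetic intensity (FLOPs per byte of memory traffic). This is a rough sketch, assuming single-precision data and counting only the dominant cost of streaming the matrix once; the vector and result traffic are negligible by comparison:

```python
# Rough arithmetic-intensity estimate for the 150000 x 1038 gemv in the question.
rows, cols = 150000, 1038

flops = 2 * rows * cols          # one multiply + one add per matrix element
bytes_moved = 4 * rows * cols    # matrix streamed once, 4 bytes per float (single precision)

intensity = flops / bytes_moved  # FLOPs per byte of memory traffic
print(intensity)                 # 0.5
```

At roughly 0.5 FLOP/byte, the operation performs far fewer FLOPs per byte than a GPU can sustain relative to its memory bandwidth, so the memory system, not the SMs, limits throughput.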

2. Why are there such significant periods of inactivity between peaks?

The CUDA API has device and host side latency. On a WDDM platform (so Windows Vista, 7, 8, and whatever server versions are derived from them), this host side latency is rather high, and the CUDA driver batches operations to help amortise that latency. This batching can lead to "gaps" or "pauses" in GPU operations. I think that is what you are seeing here. NVIDIA have a dedicated compute driver (TCC) for Tesla cards on the Windows platform to overcome these limitations.

A much better way to evaluate the performance of this operation would be to time the loop yourself, compute an average time per call, work out the operation count (a matrix-vector product has a known lower bound you can derive from the dimensions of the matrix and vector), and compute a FLOP/s value. You can compare that to the specifications of your GPU to see how well or badly it is performing.
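The calculation above can be sketched as follows. The average time per call (`t_avg`) here is a hypothetical placeholder, not a measured value; you would substitute the number you get from timing the loop yourself (e.g. with `AbsoluteTiming` in Mathematica):

```python
# FLOP/s estimate from a measured average time per CUDADot call.
rows, cols = 150000, 1038

flops_per_call = 2 * rows * cols   # lower bound for a matrix-vector product
t_avg = 0.025                      # HYPOTHETICAL average seconds per call; use your measurement

gflops = flops_per_call / t_avg / 1e9
print(round(gflops, 3))            # 12.456
```

Comparing that GFLOP/s figure against both the peak compute rate and (given the bandwidth argument above) the peak memory bandwidth of your card tells you which limit you are actually hitting.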

Upvotes: 1
