user3687626

Reputation: 11

gpgpu: how to estimate speed gains based on gpu and cpu specifications

I am a complete beginner to GPGPU and OpenCL, and I am unable to answer the following two questions about GPGPU in general:

a) Suppose I have a piece of code that is suitable to run on a GPU (it executes the exact same set of instructions on multiple data), and assume my data is already on the GPU. Is there any way to look at the specifications of the CPU and GPU and estimate the potential speed gain? For example, how can I estimate the speed gain (excluding the time taken to transfer data to the GPU) if I ran the code on AMD's R9 295X2 GPU (http://www.amd.com/en-us/products/graphics/desktop/r9/2...) instead of an Intel i7-4770K processor (http://ark.intel.com/products/75123)?

b) Is there any way to estimate the amount of time it would take to transfer data to the GPU?

Thank you!


Thank you for the responses! Given the large number of factors influencing speed gains, trying and testing is certainly a good idea. However, I do have a question about the GFLOPS approach mentioned in some of the responses; the GFLOPS metric was what I was looking at before posting the question.

I would think that GFLOPS would be a good way to estimate potential performance gains for SIMD-type operations, given that it takes into account differences in clock speed, core count, and floating-point operations per cycle. However, when I crunch the numbers using the GFLOPS specifications, something does not seem right.

The Good:

The GFLOPS-based estimate seems to match the observed speed gain for the toy kernel below. For an input integer "n", the kernel computes the sum (1+2+3+...+n) in a brute-force way, so for large integers it is dominated by compute operations. I ran the kernel for all integers from 1000 to 60000 on the GPU and the CPU (sequentially on the CPU, without threading) and measured the timings.

__kernel void calculate(__global int* input, __global int* output){

    size_t id = get_global_id(0);      // one work-item per input integer
    int inp_num = input[id];

    // brute-force sum 1 + 2 + ... + inp_num
    int sum = 0;
    for(int i = 0; i <= inp_num; ++i)
        sum += i;

    output[id] = sum;

}
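For reference, the GPU kernel time can be measured on the host with OpenCL event profiling; a minimal sketch (not the exact harness used above), assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel with its arguments set, with placeholder variable names:

    /* Assumes <CL/cl.h> and <stdio.h> are included, `queue` was created with
       CL_QUEUE_PROFILING_ENABLE, and `kernel` has its arguments set. */
    size_t global_size = 59001;   /* one work-item per int in 1000..60000 */
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);
    printf("kernel time: %.3f s\n", (t_end - t_start) * 1e-9);   /* ns -> s */
    clReleaseEvent(evt);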

GPU on my laptop: NVIDIA NVS 5400M (www.nvidia.com/object/nvs_techspecs.html); single-precision GFLOPS: 253.44 (en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units)

CPU on my laptop: Intel i7-3720QM, 2.6 GHz; GFLOPS (assuming single precision): 83.2 (download.intel.com/support/processors/corei7/sb/core_i7-3700_m.pdf). The Intel document does not specify whether this is single or double precision.

CPU Time: 3.295 sec

GPU Time: 0.184 sec

Speed gain over a single CPU core: 3.295/0.184 ~ 18

Theoretical estimate of the speed gain if all 4 CPU cores were used: 18/4 ~ 4.5

Speed Gains based on FLOPS: (GPU FLOPS)/(CPU FLOPS) = (253.44/83.2) = 3.0

For the above example, the GFLOPS-based estimate seems to be consistent with the one obtained from experimentation, if the Intel documentation indeed specifies FLOPS for single and not double precision. I did try to search for more links on the FLOPS specification of the Intel processor in my laptop. The observed speed gain also seems good, given that I have a modest GPU.

The Problem:

The FLOPS-based approach seems to give a much lower than expected speed gain, after factoring in GPU price, when comparing AMD's R9 295X2 GPU (www.amd.com/en-us/products/graphics/desktop/r9/295x2#) with Intel's i7-4770K (ark.intel.com/products/75123):

AMD's FLOPS, single precision: 11.5 TFLOPS (from the above-mentioned link)

Intel's FLOPS, single precision: (num. of cores) x (FLOPS per cycle per core) x (clock speed) = (4) x (32 (peak) (www.pcmag.com/article2/0,2817,2419798,00.asp)) x (3.5) = 448 GFLOPS

Speed gain based on FLOPS = (11.5 TFLOPS)/(448 GFLOPS) ~ 26
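For concreteness, a tiny C program that reproduces these peak-GFLOPS figures and the resulting ratio (the core count, FLOPS per cycle, and clock speed are simply the numbers quoted above, not measurements):

    #include <stdio.h>

    /* Peak GFLOPS = cores * FLOPS per cycle per core * clock (GHz). */
    static double peak_gflops(double cores, double flops_per_cycle, double ghz)
    {
        return cores * flops_per_cycle * ghz;
    }

    int main(void)
    {
        double cpu = peak_gflops(4, 32, 3.5);   /* i7-4770K -> 448 GFLOPS   */
        double gpu = 11500.0;                   /* R9 295X2 -> 11.5 TFLOPS  */

        printf("CPU peak: %.0f GFLOPS, GPU peak: %.0f GFLOPS\n", cpu, gpu);
        printf("FLOPS-based speedup estimate: %.1f\n", gpu / cpu);  /* ~25.7 */
        return 0;
    }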

AMD GPU price: $1500

Intel CPU price: $300

For every AMD R9 295X2 GPU, I can buy 5 Intel i7-4770K CPUs, which reduces the effective speed gain to (26/5) ~ 5. However, this estimate is not at all consistent with the 100-200x increase in speed one would expect. The low speed-gain estimate from the GFLOPS approach makes me think that something is incorrect in my analysis, but I am not sure what.

Upvotes: 1

Views: 2234

Answers (5)

Krishnaraj

Reputation: 421

Workloads are usually classified into 2 categories (a rough way to tell which category a kernel falls into is sketched after this list):

  1. bandwidth bound - more time is spent on fetches from global memory, so even increasing the CPU clock frequency doesn't help; problems like sorting. Bandwidth capacity is measured in GB/s.
  2. compute bound - performance is directly proportional to compute horsepower; problems like matrix multiplication. Compute capacity is measured in GFLOPS.
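As a back-of-the-envelope sketch of that classification (a roofline-style check; all device numbers below are placeholders, not measurements):

    #include <stdio.h>

    /* A kernel is roughly compute bound if its arithmetic intensity
       (FLOPs per byte of global-memory traffic) exceeds the device's
       peak GFLOPS / bandwidth ratio; otherwise it is bandwidth bound. */
    int main(void)
    {
        double peak_gflops   = 250.0;  /* e.g. from clpeak or the spec sheet */
        double bandwidth_gbs = 25.0;   /* global-memory bandwidth, GB/s      */

        /* hypothetical kernel: reads 2 floats, writes 1 (12 bytes), ~60 FLOPs */
        double flops_per_item = 60.0;
        double bytes_per_item = 12.0;

        double intensity = flops_per_item / bytes_per_item;  /* FLOPs per byte */
        double balance   = peak_gflops / bandwidth_gbs;      /* device balance */

        printf("intensity %.1f FLOPs/byte vs device balance %.1f -> %s bound\n",
               intensity, balance, intensity > balance ? "compute" : "bandwidth");
        return 0;
    }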

There is a tool, clpeak, which tries to measure these programmatically.

It is very important to classify your problem in order to measure its performance and choose the right device (knowing their limits).

Say you compare the Intel HD 4000 and the i7-3630 (both on the same chip) in https://github.com/krrishnarraj/clpeak/tree/master/results/Intel%28R%29_OpenCL

  1. the i7 is comparatively better at bandwidth (plus there are no transfer overheads)
  2. in terms of compute, the GPU is 4-5 times faster than the i7

Upvotes: 0

huseyin tugrul buyukisik

Reputation: 11920

If the code is easy (lightweight, which is what GPU cores need) and is not memory dependent, then you can approximate the gain as:

 Sample kernel:
 Read two 32-bit floats from memory and
 do calculations on them at least 20-30 times.
 Then write to memory once.

 New: GPU
 Old: CPU

 Gain ratio = ((New/Old) - 1) * 100  (%)

 New = 5000 cores * 2 ALU-FPUs per core * 1.0 GHz frequency = 10000 GFLOPS

 Old = 10 cores * 8 ALU-FPUs per core * 4.0 GHz frequency = 320 GFLOPS

 ((New/Old) - 1) * 100  ===>  ~3000% speed gain

This is when the code mostly uses registers and local memory, rarely hitting global memory.

If the code is hard (heavy branching + fake recursion + non-uniformity), you get only a 3-5x speed gain; it can be equal to or even less than CPU performance for linear code, of course.

When the code is memory dependent, the gain will be roughly 1 TB/s (GPU) divided by 40 GB/s (CPU).

If each iteration needs to upload data to the GPU, there will be a PCIe bandwidth bottleneck too.

Upvotes: 0

Roman Arzumanyan

Reputation: 1814

  1. The potential speed gain highly depends on the algorithm implementation. It is difficult to forecast the performance level unless you are developing some very simple application (like the simplest image filter). In some cases, estimations can be done using memory-system performance as the basis, as many algorithms are bandwidth bound.
  2. You can calculate the transmission time by dividing the data amount by the GPU memory bandwidth for device-internal operations (a small sketch follows this list). Look at the hardware characteristics to get it, or calculate it if you know the memory frequency and bus width. For host-device operations, the PCI-E bus speed is usually the limit.
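A minimal sketch of that calculation, with placeholder bandwidth figures (substitute your own device's numbers):

    #include <stdio.h>

    /* Transfer-time estimate: time = bytes / bandwidth.
       The bandwidth values below are placeholders, not measurements. */
    int main(void)
    {
        double megabytes      = 256.0;
        double pcie_gbs       = 8.0;   /* host <-> device (PCI-E), GB/s    */
        double device_mem_gbs = 80.0;  /* device-internal memory bandwidth */

        double bytes = megabytes * 1024.0 * 1024.0;
        printf("host -> device:  ~%.2f ms\n", 1e3 * bytes / (pcie_gbs * 1e9));
        printf("device-internal: ~%.2f ms\n", 1e3 * bytes / (device_mem_gbs * 1e9));
        return 0;
    }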

Upvotes: 0

DarkZeros

Reputation: 8410

A) Normally this question is never answered precisely, since we are not speaking about 1.05x speed gains. When the problem is suitable, the problem is BIG enough to hide any overheads (100k work-items), and the data is already on the GPU, then we are speaking of speedups of 100-300x. Normally nobody cares whether it is 250x or 251x.

The estimate is difficult to make, since the platforms are completely different: not only in clock speeds, but also in memory latency and caches, as well as bus speeds and processing elements.

I cannot give you a clear answer on this, other than try it and measure.


B) The time to copy the memory is completely dependent on the GPU-CPU bus speed (the PCIe bus), and that is the hardware limit; in practice you will always get less than that speed when copying. Generally you can apply the rule of three (data size divided by bus bandwidth) to estimate the time needed, but there is always a small driver overhead that depends on the platform and device. So copying 100 bytes is usually very slow, but copying some MB runs at close to the bus speed.
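For what it is worth, the actual copy time (and hence the effective bus speed, driver overhead included) can be measured with OpenCL event profiling; a minimal sketch, assuming a queue created with CL_QUEUE_PROFILING_ENABLE and an existing buffer (the names `queue`, `buf`, `host_data`, and `size` are placeholders):

    /* Assumes <CL/cl.h> and <stdio.h> are included, `queue` has
       CL_QUEUE_PROFILING_ENABLE, `buf` is a cl_mem of `size` bytes,
       and `host_data` points to `size` bytes of host memory. */
    cl_event evt;
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_data, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t0 = 0, t1 = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

    double seconds = (t1 - t0) * 1e-9;                       /* ns -> s */
    printf("copied %zu bytes in %.3f ms (%.2f GB/s effective)\n",
           (size_t)size, seconds * 1e3, size / seconds / 1e9);
    clReleaseEvent(evt);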

The memory copying speed is usually not a design constraint when creating a GPGPU app, since it can be hidden in many ways (pinned memory, etc.), so that nobody will notice any speed decrease due to memory operations.


You should not decide whether a problem is suitable for the GPU just by looking at the time lost in the memory copy. Better measures are whether the problem is suitable at all, and whether you have enough data to keep the GPU busy (otherwise it is faster to do it on the CPU directly).

Upvotes: 0

csnate

Reputation: 1641

You need to examine the kernel(s). I myself am learning CUDA, so I couldn't tell you exactly what you'd do with OpenCL.

But I would figure out roughly how many floating point operations one single instance of the kernel will perform. Then find the number of floating point operations per second each device can handle.

(number of kernel instances to be launched) x (floating-point operations per instance) / (device throughput in FLOPS) = time to execute

The number of kernel instances launched will depend on your data.
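A quick sketch of that estimate in C (all inputs are placeholder values for illustration):

    #include <stdio.h>

    /* time ~= (instances launched * FLOPs per instance) / device FLOPS.
       This is only a peak-throughput lower bound; the numbers are placeholders. */
    int main(void)
    {
        double instances    = 59001;     /* e.g. one work-item per input value */
        double flops_each   = 30000;     /* rough FLOP count of one instance   */
        double device_flops = 253.44e9;  /* 253.44 GFLOPS                      */

        printf("estimated lower bound: %.4f s\n",
               instances * flops_each / device_flops);
        return 0;
    }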

Upvotes: 0
