Gregor Isack

Reputation: 1131

Why does the CPU run faster than the GPU in this code?

I'm trying to speed up my computation by using gpuArray, but the GPU version of the code below is actually slower than the CPU version.

for i = 1:10
    calltest;
end

function [t1,t2] = calltest
N = 10;
tic
u = gpuArray( rand(1,N).^(1./[N:-1:1]) );  % NB: the power is evaluated on the CPU; gpuArray() only copies the finished result to the device
t1 = toc
tic
u2 = rand(1,N).^(1./[N:-1:1]);             % plain CPU computation
t2 = toc
end

where I get

t1 =

   4.8445e-05


t2 =

   1.4369e-05

I have an NVIDIA GTX 850M graphics card. Am I using gpuArray incorrectly? This code is wrapped inside a function, and that function is called in a loop thousands of times.

Upvotes: 1

Views: 567

Answers (1)

user3666197

Reputation: 1

Why?
Because there is both
a) a small problem scale
and
b) no "mathematically dense" GPU kernel to amortise the overheads.

The method of comparison blurs the root cause of the problem.

Step 0: separate the data-set ( a vector ) creation from the section under test:

N  = 10;
R  = rand( 1, N );
E  = 1 ./ ( N : -1 : 1 );
gR = gpuArray( R );  gE = gpuArray( E );          % host-to-device transfers, kept out of the timing
tic; gC = gR.^gE; wait( gpuDevice ); GPU_t = toc  % GPU-based computing section; wait() lets the asynchronous kernel finish first
tic;  c =  R.^ E;                    CPU_t = toc  % CPU reference

Step 1: test the scaling:

Trying just 10 elements will not make the observation clear: an overhead-naive formulation of Amdahl's Law does not explicitly emphasise the added time spent on CPU-side GPU-kernel assembly & transport plus the ( CPU-to-GPU + GPU-to-CPU ) data-handling phases. These add-on phases may become negligibly small, but only when compared to
a) an indeed large-scale vector / matrix GPU-kernel processing, which N ~ 10 obviously is not,
or
b) an indeed "mathematically dense" GPU-kernel processing, which R.^() obviously is not.
So do not blame GPU computing for having to pay this must-do part ( the overheads ): it cannot start working without these prior add-ons in time, and the CPU may, during that same amount of time, already have produced the final result. Q.E.D. The scaling sketch below makes this visible.
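
A minimal scaling sketch of that point, assuming the Parallel Computing Toolbox is available. The element-wise power kernel mirrors the question's, and wait( gpuDevice ) is the documented way to block until the asynchronous kernel has actually finished, so that toc does not under-report:

% scaling sketch: the GPU should overtake the CPU only once N grows large
for N = 10.^( 1 : 7 )
    R  = rand( 1, N );
    E  = 1 ./ ( N : -1 : 1 );                 % the question's exponent vector
    gR = gpuArray( R );  gE = gpuArray( E );  % transfers kept out of the kernel timing

    tic;  c =  R.^ E;                    CPU_t = toc;
    tic; gC = gR.^gE; wait( gpuDevice ); GPU_t = toc;

    fprintf( 'N = %9d   CPU %.3e s   GPU %.3e s\n', N, CPU_t, GPU_t );
end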

Fine-grain measurement, for each section of the CPU-GPU-CPU workflow:

N = 10;                                     %% 100, 1000, 10000, 100000, ..
tic; CPU_hosted     = rand( N, 'single' );  %% 'double'
CPU_gen_RAND        = toc

tic; GPU_hosted_IN1 = gpuArray( CPU_hosted );
GPU_xfer_h2d        = toc

tic; GPU_hosted_IN2 = rand( N, 'gpuArray' );
GPU_gen__h2d        = toc

GPU_hosted_EXP      = gpuArray( 1 ./ ( N : -1 : 1 ) );  %% exponents, pre-staged on the device

tic; GPU_hosted_RES = GPU_hosted_IN1.^GPU_hosted_EXP;   %% the kernel itself ( implicit row-wise expansion, R2016b+ )
wait( gpuDevice );                                      %% gpuArray kernels run asynchronously - block before reading the clock
GPU_kernel_AssyExec = toc

tic; CPU_hosted_RES = gather( GPU_hosted_RES );
GPU_xfer_d2h        = toc
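
For less noisy numbers than raw tic/toc, MATLAB's own timeit and gputimeit helpers handle warm-up runs and device synchronisation automatically. A sketch of the same comparison, with an assumed problem size of N = 1e6:

N     = 1e6;
R     = rand( 1, N );
E     = 1 ./ ( N : -1 : 1 );
gR    = gpuArray( R );
gE    = gpuArray( E );

CPU_t = timeit(    @()  R.^ E  )   % robust CPU timing, warm-up included
GPU_t = gputimeit( @() gR.^gE  )   % synchronises the device before stopping the clock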

Upvotes: 2
