Gregor Isack

Reputation: 1131

Why does the CPU run faster than the GPU in this code?

I'm trying to speed up my computation by using gpuArray, but the GPU version of the code below is actually slower than the CPU version.

for i = 1:10
    calltest;
end

function [t1,t2] = calltest
N = 10;
tic
u = gpuArray( rand(1,N).^(1./[N:-1:1]) );  % NB: the power is evaluated on the CPU; gpuArray() only copies the finished result to the device
t1 = toc
tic
u2 = rand(1,N).^(1./[N:-1:1]);             % plain CPU computation
t2 = toc
end

where I get

t1 =

   4.8445e-05


t2 =

   1.4369e-05

I have an NVIDIA GTX 850M graphics card. Am I using gpuArray incorrectly? This code is wrapped inside a function, and that function is called in a loop thousands of times.

Upvotes: 1

Views: 567

Answers (1)

user3666197

Reputation: 1

Why?
Because there is both
a) a small problem scale
and
b) no "mathematically dense" GPU kernel to amortise the overheads.

The method of comparison blurs the root cause of the problem.

Step 0: separate the data-set ( a vector ) creation from the section under test:

N  = 10;
R  = rand( 1, N );
E  = 1 ./ ( N : -1 : 1 );
gR = gpuArray( R );  gE = gpuArray( E );          % host-to-device transfers, kept out of the timing
tic; gC = gR.^gE; wait( gpuDevice ); GPU_t = toc  % GPU-based computing section; wait() lets the asynchronous kernel finish first
tic;  c =  R.^ E;                    CPU_t = toc  % CPU reference

Step 1: test the scaling:

Trying just 10 elements will not make the observation clear: an overhead-naive formulation of Amdahl's Law does not explicitly emphasise the added time spent on CPU-side GPU-kernel assembly & transport plus the ( CPU-to-GPU + GPU-to-CPU ) data-handling phases. These add-on phases may become negligibly small, but only when compared to
a) an indeed large-scale vector / matrix GPU-kernel processing, which N ~ 10 obviously is not,
or
b) an indeed "mathematically dense" GPU-kernel processing, which R.^() obviously is not.
So do not blame GPU computing for having to pay this must-do part ( the overheads ): it cannot start working without these prior add-ons in time, and the CPU may, during that same amount of time, already have produced the final result. Q.E.D. The scaling sketch below makes this visible.
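
A minimal scaling sketch of that point, assuming the Parallel Computing Toolbox is available. The element-wise power kernel mirrors the question's, and wait( gpuDevice ) is the documented way to block until the asynchronous kernel has actually finished, so that toc does not under-report:

% scaling sketch: the GPU should overtake the CPU only once N grows large
for N = 10.^( 1 : 7 )
    R  = rand( 1, N );
    E  = 1 ./ ( N : -1 : 1 );                 % the question's exponent vector
    gR = gpuArray( R );  gE = gpuArray( E );  % transfers kept out of the kernel timing

    tic;  c =  R.^ E;                    CPU_t = toc;
    tic; gC = gR.^gE; wait( gpuDevice ); GPU_t = toc;

    fprintf( 'N = %9d   CPU %.3e s   GPU %.3e s\n', N, CPU_t, GPU_t );
end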

Fine-grain measurement, for each section of the CPU-GPU-CPU workflow:

N = 10;                                     %% 100, 1000, 10000, 100000, ..
tic; CPU_hosted     = rand( N, 'single' );  %% 'double'
CPU_gen_RAND        = toc

tic; GPU_hosted_IN1 = gpuArray( CPU_hosted );
GPU_xfer_h2d        = toc

tic; GPU_hosted_IN2 = rand( N, 'gpuArray' );
GPU_gen__h2d        = toc

GPU_hosted_EXP      = gpuArray( 1 ./ ( N : -1 : 1 ) );  %% exponents, pre-staged on the device

tic; GPU_hosted_RES = GPU_hosted_IN1.^GPU_hosted_EXP;   %% the kernel itself ( implicit row-wise expansion, R2016b+ )
wait( gpuDevice );                                      %% gpuArray kernels run asynchronously - block before reading the clock
GPU_kernel_AssyExec = toc

tic; CPU_hosted_RES = gather( GPU_hosted_RES );
GPU_xfer_d2h        = toc
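
For less noisy numbers than raw tic/toc, MATLAB's own timeit and gputimeit helpers handle warm-up runs and device synchronisation automatically. A sketch of the same comparison, with an assumed problem size of N = 1e6:

N     = 1e6;
R     = rand( 1, N );
E     = 1 ./ ( N : -1 : 1 );
gR    = gpuArray( R );
gE    = gpuArray( E );

CPU_t = timeit(    @()  R.^ E  )   % robust CPU timing, warm-up included
GPU_t = gputimeit( @() gR.^gE  )   % synchronises the device before stopping the clock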

Upvotes: 2
