MaJoR

Reputation: 1044

Mismatch in results generated by CPU and GPU

I was setting up to use Numba along with my AMD GPU. I started out with the most basic example available on their website: calculating the value of Pi using a Monte Carlo simulation.

I made some changes to the code so that it runs on the GPU first and then on the CPU, to compare the execution times and verify the results. Below is the code:

from numba import jit
import random
from timeit import default_timer as timer

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

def monte_carlo_pi_cpu(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

num = int(input())

start = timer()
random.seed(0)
print(monte_carlo_pi(num))
print("with gpu", timer()-start)

start = timer()
random.seed(0)
print(monte_carlo_pi_cpu(num))
print("without gpu", timer()-start)

I was expecting the GPU to perform better, and so it did. However, some of the results from the GPU and the CPU did not match.

1000000 # input parameter
3.140836 # gpu_out
with gpu 0.2317520289998356
3.14244 # cpu_out
without gpu 0.39849199899981613

I am aware that Python does not handle long floating-point operations particularly well, but these are only 6 decimal places, and I was not expecting such a large discrepancy. Can anyone explain why this difference comes up?

Upvotes: 2

Views: 785

Answers (2)

user3666197

Reputation: 1

Q : Can anyone explain why this difference comes up?

The root cause of all these surprises is the seeding itself: the almost pedantic, systematic care of re-setting the same state via the PRNG-of-choice's .seed( aRepeatableExperimentSeedNUMBER ) method.

Proper seeding works if and only if the same PRNG-algorithm is used - and the algorithm behind the random module's .random() method is principally different from the one behind the numpy.random module's .random().
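
A minimal sketch of the mismatch ( both generators are Mersenne Twister based, yet they initialize the state and produce floats differently, so the same seed yields different streams ):

import random
import numpy as np

# the same seed for both PRNGs ...
random.seed( 0 )
np.random.seed( 0 )

# ... yet the very first draws already differ:
print( random.random() )      # 0.8444218515250481 ( CPython's random module )
print( np.random.random() )   # 0.5488135039273248 ( numpy's legacy RandomState )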

Another sort of observed artifact ( different values of the dart-throwing pi-guesstimates ) is related to the rather tiny scale: yes, 1E6 points is a tiny amount, compared to the initial axiom of the art of statistics, which is "using infinitely and only infinitely sized populations". At such a scale, a different order of consuming the numbers that were ( thanks to the pedantic and systematic re-seed(0)-ing of the PRNG ) reproducibly generated into always the same sequence of values produces different results ( see the difference of values in the experiments below ). These artifacts, however, play a less and less important role as the size grows ( as shown in the reproducible experiment at the very bottom ):

# 1E+6: 3.138196           # block-wise generation in np.where().sum()
#       3.140936           #  pair-wise generation in monte_carlo_pi2()
# 1E+7: 3.142726           # block-wise generation in np.where().sum()
#       3.142358           #  pair-wise generation in monte_carlo_pi2()
# 3E+7: 3.1421996          # block-wise generation in np.where().sum()
#       3.1416629333333335 #  pair-wise generation in monte_carlo_pi2()
# 1E+8: 3.14178916         # block-wise generation in np.where().sum()
#       3.14167324         #  pair-wise generation in monte_carlo_pi2()
# 1E+9: -.                 # block-wise generation in np.where().sum() -x-RAM-SWAP-
#       3.141618484        #  pair-wise generation in monte_carlo_pi2()
# 1E10  -.                 # block-wise generation in np.where().sum() -x-RAM-SWAP-
#       3.1415940572       #  pair-wise generation in monte_carlo_pi2()
# 1E11  -.                 # block-wise generation in np.where().sum() -x-RAM-SWAP-
#       3.14159550084      #  pair-wise generation in monte_carlo_pi2()
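
A minimal sketch of that consumption-order effect, re-using one seeded stream in two pairings ( the 3.138196 / 3.140936 split matches the 1E+6 row above ):

import numpy as np

nsamples = 1000000

# block-wise pairing: draw i gets paired with draw nsamples+i
np.random.seed( 0 )
x = np.random.random( nsamples )
y = np.random.random( nsamples )
print( 4.0 * np.count_nonzero( x**2 + y**2 < 1.0 ) / nsamples )             # 3.138196

# pair-wise pairing: draw 2i gets paired with draw 2i+1 of the very same stream
np.random.seed( 0 )
xy = np.random.random( 2 * nsamples ).reshape( -1, 2 )
print( 4.0 * np.count_nonzero( xy[:,0]**2 + xy[:,1]**2 < 1.0 ) / nsamples ) # 3.140936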

Next, let me show another aspect:

What are the actual costs of doing so, and where do they come from?

A plain pure-numpy code computed this on localhost in about 108 [ms]:

>>> from zmq import Stopwatch; clk = Stopwatch()            # [us]-clock resolution
>>> np.random.seed(0); clk.start();x = np.random.random( 1000000 ); y = np.random.random( 1000000 ); _ = ( np.where( x**2 + y**2 < 1.0, 1, 0 ).sum() * 4.0 / 1000000 );clk.stop()
108444
>>> _
3.138196

Here most of the "costs" are related to the memory-I/O traffic ( storing twice 1E6 elements and squaring them ). A "halved" problem was "twice" as fast, at ~ 52.7 [ms] ( note that the second generator call below yields a single scalar, not a second 1E6-element array, so it saves the memory traffic but is no longer an equivalent pi-estimate ):

>>> np.random.seed(0); clk.start(); _ = ( np.where( np.random.random( 1000000 )**2
...                                               + np.random.random()**2 < 1.0,
...                                       1,
...                                       0
...                                       ).sum() * 4.0 / 1000000 ); clk.stop()
52696
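
For scale, a back-of-the-envelope check of that memory traffic ( a sketch; the sizes follow from float64 being 8 bytes per element ):

import numpy as np

x = np.random.random( 1000000 )
print( x.nbytes )   # 8000000 bytes ~ 8 MB per 1E6-element float64 array
# x**2, y**2 and their sum each materialize another ~ 8 MB temporary,
# and the < comparison adds a ~ 1 MB boolean mask - all of it memory traffic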

A numpy-code without the interim storage ( generating both operand arrays inline ) was a bit slower on localhost, at about ~ 115 [ms]:

>>> np.random.seed(0); clk.start(); _ = ( np.where( np.random.random( 1000000 )**2
...                                               + np.random.random( 1000000 )**2 < 1.0,
...                                       1,
...                                       0
...                                       ).sum() * 4.0 / 1000000 ); clk.stop(); print _
114501
3.138196

An ordinary Python loop with the numpy.random PRNG-generator computed the same, but in more than 3,937.9 [ms] ( here you see the pain of Python's for-loop iteration - 4 seconds compared to ~ 50 [ms] ). You can also detect the different order in which the sequence of PRNG-numbers was generated and pair-wise consumed ( seen in the result difference ):

>>> def monte_carlo_pi2(nsamples):
...     np.random.seed(0)
...     acc = 0
...     for i in range(nsamples):
...         if ( np.random.random()**2
...            + np.random.random()**2 ) < 1.0:
...            acc += 1
...     return 4.0 * acc / nsamples
>>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _
3937892
3.140936

A numba.jit()-compiled version computed the same in about 692 [ms], as the first call also has to bear the cost of the JIT compilation ( only the next call harvests the fruits of this one-stop cost, executing in about ~ 50 [ms] ):

>>> @jit(nopython=True)                           # COPY/PASTE
... def monte_carlo_pi2(nsamples):
...     np.random.seed(0)
...     acc = 0
...     for i in range(nsamples):
...         x = np.random.random()
...         y = np.random.random()
...         if (x ** 2 + y ** 2) < 1.0:
...             acc += 1
...     return 4.0 * acc / nsamples 
... 
>>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _
692811
3.140936
>>> np.random.seed( 0 ); clk.start(); _ = monte_carlo_pi2( 1000000 ); clk.stop(); print _
50193
3.140936

EPILOGUE:

Costs matter. Always. A jit-compiled code can help if and only if the LLVM-compiled code is re-used often enough to amortize the cost of the initial compilation.

( In case arcane gurus present a fair objection: a trick with pre-compiled code is still paying that cost somewhere, isn't it? )
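
One way to pay that one-stop cost only once per code change, rather than once per process, is numba's on-disk cache; a sketch, re-using the same kernel as above ( cache=True is a standard numba option ):

from numba import jit
import numpy as np

# cache=True persists the compiled machine code to disk, so subsequent
# Python processes skip the recompilation and pay only the ~ 50 [ms] run
@jit( nopython=True, cache=True )
def monte_carlo_pi2( nsamples ):
    np.random.seed( 0 )
    acc = 0
    for i in range( nsamples ):
        x = np.random.random()
        y = np.random.random()
        if ( x ** 2 + y ** 2 ) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples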

And the values?

Using as few as 1E6 samples is not very convincing, neither for the pi-dart-throwing experiment, nor for the performance benchmarking ( such an indeed tiny, small-scale data sample permits in-cache timing artifacts that do not scale or fail to generalize ). The larger the scale, the closer the pi-guesstimate gets, and the better data-efficient computing performs ( stream / pair-wise generation will get better than block-wise, due to data-instantiation costs and, later, the memory-swapping-related suffocation ), as shown in the online reproducible-experimentation sandbox IDE:

# 1E6:
# 1E6: 3.138196           Real time:  0.262 s  User time:  0.268 s  Sys. time: 0.110 s ---------------------------- np.where().sum()                   block-wise
#                         Real time:  0.231 s  User time:  0.237 s  Sys. time: 0.111 s
#
#                         Real time:  0.251 s  User time:  0.265 s  Sys. time: 0.103 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise
#                         Real time:  0.241 s  User time:  0.234 s  Sys. time: 0.124 s
#
#      3.140936           Real time:  1.567 s  User time:  1.575 s  Sys. time: 0.097 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise
#                         Real time:  1.556 s  User time:  1.557 s  Sys. time: 0.102 s
#
# 1E7:
# 1E7: 3.142726           Real time:  0.971 s  User time:  0.719 s  Sys. time: 0.327 s ---------------------------- np.where().sum()                   block-wise
#                         Real time:  0.762 s  User time:  0.603 s  Sys. time: 0.271 s
#
#                         Real time:  0.827 s  User time:  0.604 s  Sys. time: 0.335 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise
#                         Real time:  0.767 s  User time:  0.590 s  Sys. time: 0.288 s
#
#      3.142358           Real time: 14.756 s  User time: 14.619 s  Sys. time: 0.103 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise
#                         Real time: 14.879 s  User time: 14.740 s  Sys. time: 0.117 s
#
# 3E7:
# 3E7: 3.1421996          Real time:  1.914 s  User time:  1.370 s  Sys. time: 0.645 s ---------------------------- np.where().sum()                   block-wise
#                         Real time:  1.796 s  User time:  1.380 s  Sys. time: 0.516 s
#
#                         Real time:  2.325 s  User time:  1.615 s  Sys. time: 0.795 s ---------------------------- np.where( .reshape().sum() ).sum() block-wise
#                         Real time:  2.099 s  User time:  1.514 s  Sys. time: 0.677 s
#
#      3.1416629333333335 Real time: 50.182 s  User time: 49.680 s  Sys. time: 0.107 s ---------------------------- monte_carlo_pi2() -- -- -- -- -- -- pair-wise
#                         Real time: 47.240 s  User time: 46.711 s  Sys. time: 0.103 s
#
# 1E8:
# 1E8: 3.14178916         Real time: 12.970 s  User time:  5.296 s  Sys. time: 7.273 s ---------------------------- np.where().sum()                   block-wise
#                         Real time:  8.275 s  User time:  6.088 s  Sys. time: 2.172 s

And we did not even speak about the ultimate performance edge: read up on cython, with the option to harness OpenMP code, as the next dose of performance-boosting steroids for Python.
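
Staying within numba itself ( not the cython / OpenMP route named above ), a hedged sketch of one such steroid: parallel=True plus prange splits the counting loop across CPU threads; the coordinates are pre-generated here so the block-wise result stays reproducible:

from numba import njit, prange
import numpy as np

@njit( parallel=True )
def monte_carlo_pi_par( x, y ):
    # prange distributes iterations across threads; numba recognizes
    # acc as a reduction variable and combines the per-thread counts
    acc = 0
    for i in prange( x.size ):
        if x[i] * x[i] + y[i] * y[i] < 1.0:
            acc += 1
    return 4.0 * acc / x.size

np.random.seed( 0 )
x = np.random.random( 1000000 )
y = np.random.random( 1000000 )
print( monte_carlo_pi_par( x, y ) )   # 3.138196 - the block-wise pairing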

Upvotes: 1

JoshAdel

Reputation: 68702

I've reorganized your code a bit:

import numpy as np
from numba import jit
import random
from timeit import default_timer as timer

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    random.seed(0)
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples


num = 1000000

# run the jitted code once to remove compile time from timing
monte_carlo_pi(10)

start = timer()
print(monte_carlo_pi(num))
print("jitted code", timer()-start)

start = timer()
print(monte_carlo_pi.py_func(num))
print("non-jitted", timer()-start)

results in:

3.140936
jitted code 0.01403845699996964
3.14244
non-jitted 0.39901430800000526

Note that you are not running the jitted code on your GPU. The code is compiled, but for your CPU. The difference in the computed value of Pi is likely due to differing implementations of the underlying random number generator. Numba isn't actually using Python's random module, but has its own implementation meant to mimic it. In fact, if you look at the source code, the numba implementation appears to be designed primarily around numpy's random module, and it then simply aliases the random module to that. So if you swap out random.random for np.random.random, with the same seed, you get the same results:

@jit(nopython=True)
def monte_carlo_pi2(nsamples):
    np.random.seed(0)
    acc = 0
    for i in range(nsamples):
        x = np.random.random()
        y = np.random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples 

Results in:

3.140936
jitted code 0.013946142999998301
3.140936
non-jitted 0.9277294739999888

And just a few other notes:

  • When timing numba jitted functions, always run the function once first to compile it, so the benchmarking does not include the one-time compilation cost.
  • You can access the pure Python version of a numba jitted function via .py_func, so you don't have to duplicate the code; a minimal sketch of both tips follows.
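
A minimal self-contained sketch of both tips (f is a hypothetical toy function, just to show the pattern):

from numba import jit

@jit(nopython=True)
def f(n):
    s = 0
    for i in range(n):
        s += i
    return s

f(10)                       # warm-up call: pays the one-time compilation cost
jitted = f(1000000)         # subsequent calls measure pure execution speed
pure = f.py_func(1000000)   # the original, uncompiled Python function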

Upvotes: 3
