Harsh Vardhan

Reputation: 143

Performance doesn't improve with Ray on 4 CPU cores

I'm trying to rerun the tutorial on my machine, but I'm failing to reproduce the performance improvements shown in the tutorial.

What could be the reason for this? I have tried looking for solutions, but I still don't understand what is going wrong.

[ image : performance comparison with & without Ray ]

Upvotes: 2

Views: 1524

Answers (2)

user3666197

Reputation: 1

Q : What could be the reason for it?

... an extremely low ( almost zero ) [PARALLEL] code-execution portion.
Once the add-on overheads are added to the ( overhead-aware ) revised Amdahl's Law, "negative" speedups << 1 ( i.e. slowdowns ) become obvious.
Amdahl's Law defines the rationale WHY; next comes the WHAT :
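
For illustration only - a minimal sketch of an overhead-aware ( revised ) Amdahl's Law speedup, where the add-on setup and transfer overheads are expressed as fractions of the original, single-process runtime ( the numbers below are made up, not measured ) :

#   p  ... the [PARALLEL]-portion ( fraction ) of the original work
#   N  ... the number of processing units
#   oS ... one-off setup overheads    ( import, ray.init(), process spawning, ... )
#   oT ... per-run transfer overheads ( parameters' / results' SER/DES + transport )
def revised_amdahl_speedup( p, N, oS, oT ):
    return 1.0 / ( ( 1 - p ) + oS + oT + p / N )

print( revised_amdahl_speedup( p = 0.95, N = 4, oS = 0.0, oT = 0.0 ) )  # ~ 3.5 x ... classical, zero-overhead case
print( revised_amdahl_speedup( p = 0.95, N = 4, oS = 2.0, oT = 0.5 ) )  # << 1   ... overheads dominate, a net slowdown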

First:
never start "benchmarking" without having correctly isolated the SuT - the System-under-Test - here being the distributed form of the computation.

Here, placing start = time.time() "in front of" the import ray statement looks rather like a provocative test of the reader's concentration than a sign of a properly engineered test design - it knowingly adds to the measured time all of the disk-I/O latency, the data transfers from disk into the python session, and the TimeDOMAIN costs of syntax-checking and interpreting the imported module, none of which is present in the second, pure-python test, so the two measurements do not share the same conditions.

Next:
After shaving-off the costs of import, one may start to compare "apples to apples":

...
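#  ( a plain, undecorated  def f( x ): return x * x  is assumed to have been defined above )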
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start   = time.time()
futures = [ f(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " pure-[SERIAL] python execution" )
#----------------------------------------- ORANGES
start   = time.time()
import ray                               # costs of initial import
ray.init( num_cpus = 4 )                 # costs of parametrisation
@ray.remote                              # costs of decorated def(s)
def f( x ):
    return x * x
print( time.time() - start )
print( 60*"_" + " ray add-on overheads" )
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start   = time.time()
futures = [ f.remote(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " ray.remote-decorated python execution" )

Next comes the scaling :

For miniature scales of use - like building all the artillery of parallel/distributed code-execution for just -4- calls - the measurements are possible, yet they get skewed by many hardware-related and software-related effects ( memory allocations and cache side-effects being the most frequent performance blockers, once the SuT has been crafted well enough not to overshadow these typical HPC core troubles ).

>>> import dis
>>> def f( x ):
...     return x * x
...
>>> dis.dis( f )
  2           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                0 (x)
              6 BINARY_MULTIPLY     
              7 RETURN_VALUE        

Having "low density" of computing ( here taking just one MUL x, x in a straight RET ) will never justify all the initial setup-costs and all the per-call add-on overhead-costs, that matter in small computing-density cases, not so in complex and CPU-intensive HPC computing tasks ( for which the Amdahl's Law says,where are the principal limits for achievable speedups stay ).

The next snippet will show the average per-call costs of f.remote()-calls, spread over the 4-CPU ray.remote-processing paths, compared with a plain, monopolistic, GIL-stepped mode of processing ( for details on [min, Avg, MAX, StDev] see other benchmarking posts ) :

#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
test_4N = 1E6                            # 1E6, 1E9, ... larger values may throw exception due to a poor ( not well fused ) range()-iterator construction, workarounds possible
start   = time.time()
futures = [ f.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " ray.remote-decorated python execution per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
start   = time.time()
futures = [ f(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " pure-[SERIAL] python execution" )

Bonus part

Smoke-on :

If indeed interested in burning some more fuel to make sense of how intensive computing may benefit from just-[CONCURRENT] or True-[PARALLEL] code-execution, try adding more CPU-intensive computing, some remarkable RAM allocations that go well beyond the CPU core's L3-cache sizes, passing larger BLOBs between processes in parameter and result transfers, and living near ( if not slightly beyond ) the O/S's efficient process-switching and RAM-swapping - simply go closer towards the real-life computing problems, where latency and the resulting performance indeed matter :

import numpy as np

@ray.remote
def bigSmoke( voidPar = -1 ):
    #   - has to compute a lot                ( 100 x factorial( 2**18 ) )
    #   - has to allocate quite some RAM      ( ~ 130 MB for 100 x str( (2**18)! ), ~ 1.3 MB per str( (2**18)! ) )
    #   - has to spend add-on overhead costs  ( SER/DES-transformations & process-to-process transfer of the ~ 1.3 MB result )
    #   - set the sizes with care : above physical RAM sizes the O/S swapping may occur,
    #                               and an immense degradation of the otherwise CPU-bound processing appears then
    return [ str( np.math.factorial( i ) )        # compute + convert to a ~ 1.3 MB string
             for i in int( 1E2 ) * ( 2**18, )     # a tuple of 100 copies of 2**18
             ][-1]   # <------------------------- returning just the last element reduces the amount of
                     #                            SER/DES-transformations & process-2-process transfer costs

...
#----------------------------------------- APPLES-TO-APPLES scaled + computing
test_4N = 1E1                            # be cautious here, may start from 1E0, 1E1, 1E2, 1E3 ...
start   = time.time()
futures = [ bigSmoke.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /                    test_4N )
print( 60*"_" + " ray.remote-decorated set of numpy.math.factorial( 2**18 ) per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled + computing
start   = time.time()
futures = [ bigSmoke_plain(i) for i in range( int( test_4N ) ) ]
#           ^ a @ray.remote-decorated function cannot be called directly - bigSmoke_plain is assumed
#             to be a plain, undecorated twin of bigSmoke(), defined the same way but without @ray.remote
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " pure-[SERIAL] python execution of a set of numpy.math.factorial( 2**18 ) per one call" )

Anyway, be warned that premature optimisation efforts are prone to mislead one's focus, so feel free to read the performance-tuning stories so often presented here on Stack Overflow.

Upvotes: 4

Jo37

Reputation: 45

Multiprocessing creates time overhead - I think the base function here is so quick that the overhead takes the majority of the time. Does the tutorial really use a simple integer as input? If you use a large array as input, you should see an improvement.
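
A minimal sketch of such a test - the function name, the array size and the per-task workload below are illustrative choices ( not taken from the tutorial ), and the timings will of course depend on the machine :

import time
import numpy as np
import ray

ray.init( num_cpus = 4 )

@ray.remote
def heavy( arr ):
    # a deliberately CPU-heavy body, so the per-task Ray overhead becomes negligible
    return np.sort( arr ).sum()

data = [ np.random.rand( 5 * 10**6 ) for _ in range( 4 ) ]

start = time.time()
serial = [ np.sort( a ).sum() for a in data ]               # plain, single-process execution
print( "serial   :", time.time() - start )

start = time.time()
parallel = ray.get( [ heavy.remote( a ) for a in data ] )   # 4 tasks spread across 4 CPUs
print( "parallel :", time.time() - start )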

Upvotes: 1
