Harsh Vardhan

Reputation: 143

Performance doesn't improve with Ray on 4 CPU cores

I'm trying to rerun the tutorial on my machine, but I'm failing to reproduce the performance improvements shown in the tutorial.

What could be the reason for this? I have tried looking for solutions, but I still don't understand what is going wrong.

[ image : performance comparison with & without Ray ]

Upvotes: 2

Views: 1524

Answers (2)

user3666197

Reputation: 1

Q : What could be the reason for it?

... an extremely low ( almost zero ) [PARALLEL] code-execution portion.
Once the add-on overheads are added to the ( overhead-aware ) revised Amdahl's Law, "negative" speedups << 1 ( i.e. slowdowns ) become obvious.
Amdahl's Law defines the rationale WHY; next comes the WHAT :
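
For illustration only - a minimal sketch of an overhead-aware ( revised ) Amdahl's Law speedup, where the add-on setup and transfer overheads are expressed as fractions of the original, single-process runtime ( the numbers below are made up, not measured ) :

#   p  ... the [PARALLEL]-portion ( fraction ) of the original work
#   N  ... the number of processing units
#   oS ... one-off setup overheads    ( import, ray.init(), process spawning, ... )
#   oT ... per-run transfer overheads ( parameters' / results' SER/DES + transport )
def revised_amdahl_speedup( p, N, oS, oT ):
    return 1.0 / ( ( 1 - p ) + oS + oT + p / N )

print( revised_amdahl_speedup( p = 0.95, N = 4, oS = 0.0, oT = 0.0 ) )  # ~ 3.5 x ... classical, zero-overhead case
print( revised_amdahl_speedup( p = 0.95, N = 4, oS = 2.0, oT = 0.5 ) )  # << 1   ... overheads dominate, a net slowdown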

First:
never start "benchmarking" without having correctly isolated the SuT - the System-under-Test - here being the distributed form of the computation.

Here, placing start = time.time() "in front of" the import ray statement looks rather like a provocative test of the reader's concentration than a sign of a properly engineered test design - it knowingly adds to the measured time all of the disk-I/O latency, the data transfers from disk into the python session, and the TimeDOMAIN costs of syntax-checking and interpreting the imported module, none of which is present in the second, pure-python test, so the two measurements do not share the same conditions.

Next:
After shaving-off the costs of import, one may start to compare "apples to apples":

...
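#  ( a plain, undecorated  def f( x ): return x * x  is assumed to have been defined above )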
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start   = time.time()
futures = [ f(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " pure-[SERIAL] python execution" )
#----------------------------------------- ORANGES
start   = time.time()
import ray                               # costs of initial import
ray.init( num_cpus = 4 )                 # costs of parametrisation
@ray.remote                              # costs of decorated def(s)
def f( x ):
    return x * x
print( time.time() - start )
print( 60*"_" + " ray add-on overheads" )
#----------------------------------------- APPLES-TO-APPLES ( STILL AWFULLY NAIVE )
start   = time.time()
futures = [ f.remote(i) for i in range(4) ]
print( time.time() - start )
print( 60*"_" + " ray.remote-decorated python execution" )

Next comes the scaling :

For miniature scales of use - like building all the artillery of parallel/distributed code-execution for just -4- calls - the measurements are possible, yet they get skewed by many hardware-related and software-related effects ( memory allocations and cache side-effects being the most frequent performance blockers, once the SuT has been crafted well enough not to overshadow these typical HPC core troubles ).

>>> import dis
>>> def f( x ):
...     return x * x
...
>>> dis.dis( f )
  2           0 LOAD_FAST                0 (x)
              3 LOAD_FAST                0 (x)
              6 BINARY_MULTIPLY     
              7 RETURN_VALUE        

Having "low density" of computing ( here taking just one MUL x, x in a straight RET ) will never justify all the initial setup-costs and all the per-call add-on overhead-costs, that matter in small computing-density cases, not so in complex and CPU-intensive HPC computing tasks ( for which the Amdahl's Law says,where are the principal limits for achievable speedups stay ).

The next snippet will show the average per-call costs of f.remote()-calls, spread over the 4-CPU ray.remote-processing paths, compared with a plain, monopolistic, GIL-stepped mode of processing ( for details on [min, Avg, MAX, StDev] see other benchmarking posts ) :

#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
test_4N = 1E6                            # 1E6, 1E9, ... larger values may throw exception due to a poor ( not well fused ) range()-iterator construction, workarounds possible
start   = time.time()
futures = [ f.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " ray.remote-decorated python execution per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled ( STILL NAIVE )
start   = time.time()
futures = [ f(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " pure-[SERIAL] python execution" )

Bonus part

Smoke-on :

If indeed interested in burning some more fuel to make sense of how intensive computing may benefit from just-[CONCURRENT] or True-[PARALLEL] code-execution, try adding more CPU-intensive computing, some remarkable RAM allocations that go well beyond the CPU core's L3-cache sizes, passing larger BLOBs between processes in parameter and result transfers, and living near ( if not slightly beyond ) the O/S's efficient process-switching and RAM-swapping - simply go closer towards the real-life computing problems, where latency and the resulting performance indeed matter :

import numpy as np

@ray.remote
def bigSmoke( voidPar = -1 ):
    #   - has to compute a lot                ( 100 x factorial( 2**18 ) )
    #   - has to allocate quite some RAM      ( ~ 130 MB for 100 x str( (2**18)! ), ~ 1.3 MB per str( (2**18)! ) )
    #   - has to spend add-on overhead costs  ( SER/DES-transformations & process-to-process transfer of the ~ 1.3 MB result )
    #   - set the sizes with care : above physical RAM sizes the O/S swapping may occur,
    #                               and an immense degradation of the otherwise CPU-bound processing appears then
    return [ str( np.math.factorial( i ) )        # compute + convert to a ~ 1.3 MB string
             for i in int( 1E2 ) * ( 2**18, )     # a tuple of 100 copies of 2**18
             ][-1]   # <------------------------- returning just the last element reduces the amount of
                     #                            SER/DES-transformations & process-2-process transfer costs

...
#----------------------------------------- APPLES-TO-APPLES scaled + computing
test_4N = 1E1                            # be cautious here, may start from 1E0, 1E1, 1E2, 1E3 ...
start   = time.time()
futures = [ bigSmoke.remote(i) for i in range( int( test_4N ) ) ]
print( ( time.time() - start ) /                    test_4N )
print( 60*"_" + " ray.remote-decorated set of numpy.math.factorial( 2**18 ) per one call" )
#----------------------------------------- APPLES-TO-APPLES scaled + computing
start   = time.time()
futures = [ bigSmoke_plain(i) for i in range( int( test_4N ) ) ]
#           ^ a @ray.remote-decorated function cannot be called directly - bigSmoke_plain is assumed
#             to be a plain, undecorated twin of bigSmoke(), defined the same way but without @ray.remote
print( ( time.time() - start ) /             test_4N )
print( 60*"_" + " pure-[SERIAL] python execution of a set of numpy.math.factorial( 2**18 ) per one call" )

Anyway, be warned that premature optimisation efforts are prone to mislead one's focus, so feel free to read the performance-tuning stories so often presented here on Stack Overflow.

Upvotes: 4

Jo37

Reputation: 45

Multiprocessing creates time overhead - I think the base function here is so quick that the overhead takes the majority of the time. Does the tutorial really use a simple integer as input? If you use a large array as input, you should see an improvement.
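
A minimal sketch of such a test - the function name, the array size and the per-task workload below are illustrative choices ( not taken from the tutorial ), and the timings will of course depend on the machine :

import time
import numpy as np
import ray

ray.init( num_cpus = 4 )

@ray.remote
def heavy( arr ):
    # a deliberately CPU-heavy body, so the per-task Ray overhead becomes negligible
    return np.sort( arr ).sum()

data = [ np.random.rand( 5 * 10**6 ) for _ in range( 4 ) ]

start = time.time()
serial = [ np.sort( a ).sum() for a in data ]               # plain, single-process execution
print( "serial   :", time.time() - start )

start = time.time()
parallel = ray.get( [ heavy.remote( a ) for a in data ] )   # 4 tasks spread across 4 CPUs
print( "parallel :", time.time() - start )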

Upvotes: 1
