pythonnumpyparallel-processingmultiprocessingjoblib

Reputation: 31

Why does parallelization over more CPUs/cores scale so poorly with Python?

Background

I am trying to speed-up computation by parallelization (via joblib) using more available cores in Python 3.8, but observed that it does scale poorly.

Trials

I wrote a little script to test and demonstrate the behavior which can be found later. The script (see later) is designed to have a completely independent task doing some iterations of dummy operations using NumPy and Pandas. There is no input and no output to the task, no disc or other I/O, nor any communication or shared memory, just plain CPU and RAM usage. The processes do not use any other resources either other than the occasional request for the current time. Amdahl's Law should not apply to the code here, since there is no common code at all except for process setup.

I ran some experiments with increased workloads by duplicating the tasks using sequential vs. parallelization processing and measured the time it takes for each iteration and the whole (parallel) processes to complete. I ran the script on my Windows 10 laptop, and two AWS EC2 Linux (Amazon Linux 2) machines. The number of parallel processed never exceeded the number of available cores.

Observation

I observed the following (see results later for details, duration in seconds):

In case the number of parallel processed was less than the number of available cores, the total average CPUs utilization (user) never was more than 93%, system calls did not exceed 4%, and no iowait (measured with iostat -hxm 10)
- The workload seems to be distributed equally over the available cores, though, which might be an indication for frequent switches between processes even though there are plenty of cores available
- Interestingly, for sequential processing, the CPU utilization (user) was around 48%
The summed duration of all iterations is only slightly less than the total duration of a process, hence the process setup does not seem to be a major factor
For each doubling of the number of parallel processes there is a decrease in speed per each iteration/process of 50%
Whereas the duration for sequential processing approx. doubles as expected with doubling the workload (total number of iterations), the duration for the parallel processing also increased significantly by approx. 50% per each doubling

These findings in this magnitude are unexpected to me.

Questions

What is the cause for this beavior?

Am I missing something?

How can it be remedied in order to utilize the full prospect of using more cores?

Detailed results

Windows 10

6 CPUs, 12 cores Call: python .\time_parallel_processing.py 1,2,4,8 10

                     Duration/Iter            Duration total TotalIterCount
                              mean        std           mean           mean
Mode   ParallelCount
Joblib 1                  4.363902   0.195268      43.673971             10
       2                  6.322100   0.140654      63.870973             20
       4                  9.270582   0.464706      93.631790             40
       8                 15.489000   0.222859     156.670544             80
Seq    1                  4.409772   0.126686      44.133441             10
       2                  4.465326   0.113183      89.377296             20
       4                  4.534959   0.125097     181.528372             40
       8                  4.444790   0.083315     355.849860             80

AWS c5.4xlarge

8 CPUs, 16 cores Call: python time_parallel_processing.py 1,2,4,8,16 10

                Duration/Iter           Duration total TotalIterCount
                         mean       std           mean           mean
Mode   ParCount
Joblib 1             2.196086  0.009798      21.987626             10
       2             3.392873  0.010025      34.297323             20
       4             4.519174  0.126054      45.967140             40
       8             6.888763  0.676024      71.815990             80
       16           12.191278  0.156941     123.287779            160
Seq    1             2.192089  0.010873      21.945536             10
       2             2.184294  0.008955      43.735713             20
       4             2.201437  0.027537      88.156621             40
       8             2.145312  0.009631     171.805374             80
       16            2.137723  0.018985     342.393953            160

AWS c5.9xlarge

18 CPUs, 36 cores Call: python time_parallel_processing.py 1,2,4,8,16,32 10

                Duration/Iter           Duration total TotalIterCount
                         mean       std           mean           mean
Mode   ParCount
Joblib 1             1.888071  0.023799      18.905295             10
       2             2.797132  0.009859      28.307708             20
       4             3.349333  0.106755      34.199839             40
       8             4.273267  0.705345      45.998927             80
       16            6.383214  1.455857      70.469109            160
       32           10.974141  4.220783     129.671016            320
Seq    1             1.891170  0.030131      18.934494             10
       2             1.866365  0.007283      37.373133             20
       4             1.893082  0.041085      75.813468             40
       8             1.855832  0.007025     148.643725             80
       16            1.896622  0.007573     303.828529            160
       32            1.864366  0.009142     597.301383            320

Script code

import argparse
import sys
import time
from argparse import Namespace
from typing import List
import numpy as np
import pandas as pd
from joblib import delayed
from joblib import Parallel
from tqdm import tqdm

RESULT_COLUMNS = {"Mode": str, "ParCount": int, "ProcessId": int, "IterId": int, "Duration": float}

def _create_empty_data_frame() -> pd.DataFrame:
    return pd.DataFrame({key: [] for key, _ in RESULT_COLUMNS.items()}).astype(RESULT_COLUMNS)

def _do_task() -> None:
    for _ in range(10):
        array: np.ndarray = np.random.rand(2500, 2500)
        _ = np.matmul(array, array)
        data_frame: pd.DataFrame = pd.DataFrame(np.random.rand(250, 250), columns=list(map(str, list(range(250)))))
        _ = data_frame.merge(data_frame)

def _process(process_id: int, iter_count: int) -> pd.DataFrame:
    durations: pd.DataFrame = _create_empty_data_frame()
    for i in tqdm(range(iter_count)):
        iter_start_time: float = time.time()
        _do_task()
        durations = durations.append(
            {
                "Mode": "",
                "ParCount": 0,
                "ProcessId": process_id,
                "IterId": i,
                "Duration": time.time() - iter_start_time,
            },
            ignore_index=True,
        )
    return durations

def main(args: Namespace) -> None:
    """Execute main script."""
    iter_durations: List[pd.DataFrame] = []
    mode_durations: List[pd.DataFrame] = []

    for par_count in list(map(int, args.par_counts.split(","))):
        total_iter_count: int = par_count * int(args.iter_count)

        print(f"\nRunning {par_count} processes in parallel and {total_iter_count} iterations in total")
        start_time_joblib: float = time.time()
        with Parallel(n_jobs=par_count) as parallel:
            joblib_durations: List[pd.DataFrame] = parallel(
                delayed(_process)(process_id, int(args.iter_count)) for process_id in range(par_count)
            )
        iter_durations.append(pd.concat(joblib_durations).assign(**{"Mode": "Joblib", "ParCount": par_count}))
        end_time_joblib: float = time.time()

        print(f"\nRunning {par_count} processes sequentially with {total_iter_count} iterations in total")
        start_time_seq: float = time.time()
        seq_durations: List[pd.DataFrame] = []
        for process_id in range(par_count):
            seq_durations.append(_process(process_id, int(args.iter_count)))
        iter_durations.append(pd.concat(seq_durations).assign(**{"Mode": "Seq", "ParCount": par_count}))
        end_time_seq: float = time.time()

        mode_durations.append(
            pd.DataFrame(
                {
                    "Mode": ["Joblib", "Seq"],
                    "ParCount": [par_count] * 2,
                    "Duration": [end_time_joblib - start_time_joblib, end_time_seq - start_time_seq],
                    "TotalIterCount": [total_iter_count] * 2,
                }
            )
        )

    print("\nDuration in seconds")
    grouping_columns: List[str] = ["Mode", "ParCount"]
    print(
        pd.concat(iter_durations)
        .groupby(grouping_columns)
        .agg({"Duration": ["mean", "std"]})
        .merge(
            pd.concat(mode_durations).groupby(grouping_columns).agg({"Duration": ["mean"], "TotalIterCount": "mean"}),
            on=grouping_columns,
            suffixes=["/Iter", " total"],
            how="inner",
        )
    )

if __name__ == "__main__":
    print(f"Command line: {sys.argv}")
    parser: argparse.ArgumentParser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "par_counts",
        help="Comma separated list of parallel processes counts to start trials for (e.g. '1,2,4,8,16,32')",
    )
    parser.add_argument("iter_count", help="Number of iterations per parallel process to carry out")
    args: argparse.Namespace = parser.parse_args()

    start_time: float = time.time()
    main(args)
    print(f"\nTotal elapsed time: {time.time() - start_time:.2f} seconds")

Environment

Created with' conda env create -f environment.time_parallel.yaml environment.time_parallel.yaml:

name: time_parallel
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8.5
  - pip=20.3.3
  - pandas=1.2.0
  - numpy=1.19.2
  - joblib=1.0.0
  - tqdm=4.55.1

Update 1

Thanks to the coment of @sholderbach I investigated into the NumPy/Pandas usage and found out a couple of things.

1)

NumPy uses a linear algebra backend which automatically will run some commands (including matrix multiplication) in parallel threads which results in too many threads altogether clogging the system, the more parallel processes, the more, hence the increasing duration per iteration. I tested this hypthesis by removing NumPy and Pandas operations in method _do_task adn replacing it by simple math operations only:

def _do_task() -> None:
    for _ in range(10):
        for i in range(10000000):
            _ = 1000 ^ 2 % 200

The results are exactly as expected in that the duration of an iteration does not change when increasing the number of processes (beyond the number of cores available).

Windows 10

Call python time_parallel_processing.py 1,2,4,8 5

Duration in seconds
                  Duration/Iter           Duration total TotalIterCount
                           mean       std           mean           mean
Mode     ParCount
Joblib   1             2.562570  0.015496      13.468393              5
         2             2.556241  0.021074      13.781174             10
         4             2.565614  0.054754      16.171828             20
         8             2.630463  0.258474      20.328055             40
Seq      2             2.576542  0.033270      25.874965             10

AWS c5.9xlarge

Call python time_parallel_processing.py 1,2,4,8,16,32 10

                Duration/Iter           Duration total TotalIterCount
                         mean       std           mean           mean
Mode   ParCount
Joblib 1             2.082849  0.022352      20.854512             10
       2             2.126195  0.034078      21.596860             20
       4             2.287874  0.254493      27.420978             40
       8             2.141553  0.030316      21.912917             80
       16            2.156828  0.137937      24.483243            160
       32            3.581366  1.197282      42.884399            320
Seq    2             2.076256  0.004231      41.571033             20

2)

Following the hint of @sholderbach I found a number of other links which cover the topic of linear algebra backends using multiple threads automatically and how to turn this off:

NumPy issue (from @sholderbach)
threadpoolctl package
- Nice article
Pinning process to a specific CPU with Python (and package psutil)

Add to _process:

proc = psutil.Process()
proc.cpu_affinity([process_id])
with threadpool_limits(limits=1):   
   ...

Add to environment:

- threadpoolctl=2.1.0
- psutil=5.8.0

Note: I had to replace joblib by multiprocessing, since pinning did not work properly with joblib (only one half of the processes got spawned at a time on Linux).

I did some tests with mixed results. Monitoring shows that pinnng and restricting to one thread per process works for both Windows 10 and Linux/AWS c5.9xlarge. Unfortunately, the absolute duration per iteration increases by these "fixes". Also, the duration per iteration still begins to increase at some point of parallelization.

Here are the results:

Windows 10 Call: python time_parallel_processing.py 1,2,4,8 5

                Duration/Iter           Duration total TotalIterCount
                         mean       std           mean           mean
Mode   ParCount
Joblib 1             9.502184  0.046554      47.542230              5
       2             9.557120  0.092897      49.488612             10
       4             9.602235  0.078271      50.249238             20
       8            10.518716  0.422020      60.186707             40
Seq    2             9.493682  0.062105      95.083382             10

AWS c5.9xlarge Call python time_parallel_processing.py 1,2,4,8,16,20,24,28,32 5

                  Duration/Iter           Duration total TotalIterCount
                           mean       std           mean           mean
Mode     ParCount
Parallel 1             5.271010  0.008730      15.862883              3
         2             5.400430  0.016094      16.271649              6
         4             5.708021  0.069001      17.428172             12
         8             6.088623  0.179789      18.745922             24
         16            8.330902  0.177772      25.566504             48
         20           10.515132  3.081697      47.895538             60
         24           13.506221  4.589382      53.348917             72
         28           16.318631  4.961513      57.536180             84            
         32           19.800182  4.435462      64.717435             96
Seq      2             5.212529  0.037129      31.332297              6

Upvotes: 3

Answers (2)

fumito

Reputation: 324

When I had a scaling issue with ipyparallel, it was caused by garbage collection during the run. The source code of timeit shows how to disable gc properly.

Upvotes: 0

ti7

Reputation: 18866

What is the cause for this behavior?

Very generally, this type of slowdown usually indicates some combination of being blocked by the GIL, context-switching between cores, or doing a lot of pickling

Am I missing something?

You may be missing some small issues - try profiling (some sampling profiler may be much more performant than cProfile) to see where the time is spent!
However, there's still a finite limit to how fast this can be made before you are reimplementing the suggestions below

How can it be remedied in order to utilize the full prospect of using more cores?

Take a look at numba and dask, which can allow you to get tremendous speedups on numpy and pandas code through parallelization that steps outside of the GIL

numba compiles numpy code and caches it for greater speed and practical processor operations
dask is a framework full of good tricks for efficient parallelization on a single and multiple systems