Reputation: 39
I have the following data processing pipeline in Python 3.8:
My computer has 12 cores / 24 threads. Ideally, I'd want 24 instances of the program running concurrently, one on each thread, each one exporting data from 1 category sequentially, as fast as possible.
If I only need to export, for example, 3 categories, I'd want the program to run on 24 threads, and each instance could use up to 8 threads.
First, I made a script that contains the 3 classes, and a main that runs everything. If I run this by itself, it'll successfully export 1 category of data, although slowly. We'll call this script.py.
Then, I made a function (which we'll call parallelize()) that runs script.py using:
p = subprocess.Popen([mydir + "/script.py"] + [myargs], stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False, preexec_fn=os.setsid)
p.wait()
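Roughly, the whole wrapper boils down to something like this (the interpreter call and the script_path/my_args names are simplified placeholders, not my exact code):

import os
import subprocess
import sys

def parallelize(script_path, my_args):
    # launch one script.py instance in its own process group
    # (stdout/stderr are captured into pipes but never read before wait())
    p = subprocess.Popen(
        [sys.executable, script_path] + my_args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=False,
        preexec_fn=os.setsid,
    )
    p.wait()
    return p.returncode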
Using those, I've tried the following methods all with the same mediocre results:
No matter how I try to do it, the result is always the same: when I start the program, all CPU cores and threads go to 100%, it starts exporting all categories at once, and it does so fast enough for my needs. Even if I'm only exporting 3 categories, it uses all 24 threads at 100%, indicating that it's making good use of the multithreading. BUT after just 5-10 minutes it suddenly slows down: 1 thread stays at 100%, the other 23 drop to about 10-20% usage, only 1 category keeps being processed, and if I look at the processes in the Ubuntu System Monitor I see all the Python instances sitting at 0% CPU, except for one running between 10 and 16%.
If I stop the export (it saves its progress up to that point) and resume it, the same thing happens. It would, paradoxically, be faster to run, stop and rerun the script every 5 minutes than to just let it run for days. What is stopping my CPU from running at 100% all the time instead of just for the first 5 minutes?
I'm not using any async, threading, multithreading or multiprocessing inside my 3 classes in the script for the time being, and the slowest part of said script is iterating the csv rows.
Upvotes: 0
Views: 685
Reputation: 1567
Functions like joblib.Parallel or multiprocessing.Pool.map offer an easy way to process a list of tasks on multiple cores/threads. They usually take only the script/computation and an iterable as arguments. However, both functions distribute the tasks as they like. Pool.map inspects the iterable and divides it across the number of cores/threads, but not necessarily into equal-sized chunks. So you could end up with core 1 having 100 tasks while the rest of your cores only have 10 tasks each. It depends on your iterable and on what both functions assume to be an adequate split.
The splitting of the iterable can even take more time than the actual computation of the tasks when the iterable is huge. Sometimes you run out of memory before any task has been started at all, just because of the splitting process.
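For what it's worth, you can pin the split down with Pool.map's chunksize argument; a small sketch (export_category and the category list are placeholders for your own work):

from multiprocessing import Pool

def export_category(category):
    ...  # run the per-category export here
    return category

if __name__ == "__main__":
    categories = ["cat_a", "cat_b", "cat_c"]  # placeholder category list
    with Pool(processes=3) as pool:
        # chunksize=1 hands the workers one category at a time instead of
        # letting Pool.map choose its own (possibly uneven) chunk size
        done = pool.map(export_category, categories, chunksize=1)
    print(done)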
Because of this, I migrated to always using Queues and doing the splitting manually. That way you have full control over the splitting and the memory consumption, and you can debug the whole process.
So, in your case it would look similar to this:
def script(in_queue, out_queue):
    for task in iter(in_queue.get, 'STOP'):
        # do stuff with your task
        out_queue.put(result)
And in your main thread:
import multiprocessing

if __name__ == "__main__":
    in_queue = multiprocessing.Queue()
    out_queue = multiprocessing.Queue()
    numProc = 4  # number of cores you like
    process = [multiprocessing.Process(target=script,
                                       args=(in_queue, out_queue))
               for x in range(numProc)]
    for p in process:
        p.start()
    for category in categories:
        in_queue.put(category)
    for p in process:
        in_queue.put('STOP')
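To round the sketch off, you would usually also read the results back and then join the workers. Draining the result queue first matters, because a process that has put items on a queue will not terminate until those items have been consumed:

    results = [out_queue.get() for _ in categories]  # one result per submitted task
    for p in process:
        p.join()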
With this scheme, all processes do exactly the same thing: take a task from the queue, do the computation and put the result into the other queue. If your cores all had exactly the same speed, the tasks would be done "chronologically", one core after the other, like:
task1 -> core1
task2 -> core2
task3 -> core1
task4 -> core2
A situation like yours, 100% on the first core and 10% on all others, would only arise at the very end, when nearly all tasks are done.
Upvotes: 2