Reputation: 33
For performance reasons, I split a big DataFrame into several smaller DataFrames, iterate through each of them and do some calculations. Now I am trying to create a global progress bar that shows all the iterations made so far. With multiprocessing, of course, a separate progress bar is created for each process. Is there a way to update one "overall" progress bar? I have not found an answer in other forum posts, because I do not want to see the progress of each individual subprocess, or how many processes are finished, but the total number of iterations performed in the "do_calculations" function. My code is:
import multiprocessing as mp
import numpy as np
import pandas as pd
from tqdm import tqdm
# load the initial dataframe
initial_df = pd.read_csv(r"...") # let's assume len(initial_df)=60
# create a "global" progress bar
pbar = tqdm(total=len(initial_df))
def do_calculations(sub_df):
    """Function that calculates some things for each row of a sub_dataframe."""
    # iterate through the sub_dataframe
    for index, row in sub_df.iterrows():
        # do some calculations
        # here I want to update the "global" progress bar across all parallel
        # processes
        global pbar
        pbar.update(1)
    return sub_df
def execute():
    """Function that executes the 'do_calculations' function using multiprocessing."""
    num_processes = mp.cpu_count() - 2  # let's assume num_processes=6
    pool = mp.Pool(processes=num_processes)
    # split the initial dataframe
    divided_df = np.array_split(initial_df, num_processes)
    # execute the 'do_calculations' function using multiprocessing and re-join
    # the dataframe
    new_df = pd.concat(pool.map(do_calculations, divided_df))
    pool.close()
    pool.join()
    return new_df
if __name__ == "__main__":
    new_data = execute()
My result so far is:
17%|█▋ | 10/60 [00:05<00:25, 1.99it/s]
17%|█▋ | 10/60 [00:05<00:25, 1.97it/s]
17%|█▋ | 10/60 [00:05<00:25, 1.98it/s]
17%|█▋ | 10/60 [00:05<00:25, 1.96it/s]
17%|█▋ | 10/60 [00:05<00:25, 1.98it/s]
17%|█▋ | 10/60 [00:05<00:25, 1.98it/s]
0%| | 0/60 [00:08<?, ?it/s]
My desired result is the number of iterations done in the do_calculations function:
100%|██████████| 60/60 [00:13<00:00, 4.42it/s]
I am not trying to show at which step the 'map' function is:
100%|██████████| 6/6 [00:04<00:00, 1.28it/s]
I am grateful for any help!! Thanks in advance.
Upvotes: 1
Views: 1752
Reputation: 540
Here is an example of using tqdm with multiprocessing.pool.imap (source):
import multiprocessing as mp
import numpy as np
import pandas as pd
from tqdm import tqdm

def do_calculations(sub_df):
    """Function that calculates some things for each row of a sub_dataframe."""
    # iterate through the sub_dataframe
    for index, row in sub_df.iterrows():
        # do some calculations
        pass
    return sub_df
def execute():
    """Function that executes the 'do_calculations' function using multiprocessing."""
    num_processes = mp.cpu_count() - 2
    # Split the initial dataframe.
    # Create 4 times more splits than processes being used, so the bar shows progress.
    divided_df = np.array_split(initial_df, num_processes * 4)
    with mp.Pool(processes=num_processes) as pool:
        # Inspiration: https://stackoverflow.com/a/45276885/4856719
        results = list(tqdm(pool.imap(do_calculations, divided_df), total=len(divided_df)))
    new_df = pd.concat(results, axis=0, ignore_index=True)
    return new_df
if __name__ == "__main__":
    # load the initial dataframe (replaced)
    initial_df = pd.DataFrame(np.random.randint(0, 100, size=(10_000_000, 4)), columns=list('ABCD'))
    new_data = execute()
Output (using 18 out of 20 threads):
100%|██████████| 72/72 [00:35<00:00, 2.05it/s]
This solution is only useful if more dataframe splits are created than processes are used. Otherwise (when all processes take roughly the same amount of time), the progress bar only "moves" once, at the end.
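If the bar should advance per row rather than per chunk, one possible variant (not part of the answer above) is to let each worker report finished rows through a multiprocessing.Manager queue, which the parent drains into a single tqdm bar. A minimal sketch, assuming the same do_calculations structure; the queue and helper names are illustrative:

import multiprocessing as mp
import threading
import numpy as np
import pandas as pd
from tqdm import tqdm

def do_calculations(args):
    """Same per-row work as before, but reports each finished row to a queue."""
    sub_df, progress_queue = args
    for index, row in sub_df.iterrows():
        # do some calculations
        progress_queue.put(1)  # one token per finished row
    return sub_df

def execute(initial_df):
    num_processes = max(mp.cpu_count() - 2, 1)
    divided_df = np.array_split(initial_df, num_processes)
    # Manager queue proxies can be pickled and passed to pool workers.
    progress_queue = mp.Manager().Queue()
    pbar = tqdm(total=len(initial_df))

    def drain():
        # Runs in the parent: one pbar.update per token put by any worker.
        for _ in range(len(initial_df)):
            progress_queue.get()
            pbar.update(1)

    listener = threading.Thread(target=drain, daemon=True)
    listener.start()
    with mp.Pool(processes=num_processes) as pool:
        results = pool.map(do_calculations, [(chunk, progress_queue) for chunk in divided_df])
    listener.join()
    pbar.close()
    return pd.concat(results)

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randint(0, 100, size=(600, 4)), columns=list('ABCD'))
    new_data = execute(df)

The per-row queue traffic adds some IPC overhead, so this mainly pays off when each row takes noticeably longer than a queue round trip.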
Upvotes: 1