koh-ding

Reputation: 135

How do I map a function over a list of dataframes in parallel in Python?

My code has a list of dataframes, and I want to apply a function to each of them. The dataframes all share the same format; here is an example of one:

[Image: example dataframe]

The code I have for regular (serial) mapping is this:

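import numpy as np

def return_stat(df):
    # Draw one value from column 1, weighted by the probabilities in column 0
    return np.random.choice(df.iloc[:, 1], p=df.iloc[:, 0])


weather_df_list = [weather_df1, weather_df2, weather_df3, weather_df4]

# Passing the function directly is equivalent to wrapping it in a lambda
expected_values = list(map(return_stat, weather_df_list))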

However, I have 16 cores on my computer and want to use them to make this code run much faster.

How would I implement this same code using parallel computing in Python?

Thanks!

Upvotes: 1

Views: 467

Answers (1)

Arty

Reputation: 16737

multiprocessing.Pool can spread the work across all of your cores:

import pandas as pd, numpy as np, multiprocessing

def return_stat(df):
    return np.random.choice(df.iloc[:, 1], p = df.iloc[:, 0])

if __name__ == '__main__':
    weather_df = pd.DataFrame({'rain_probability': [0.1,0.2,0.7], 'rain_inches': [1,2,3]})
    weather_df_list = [weather_df, weather_df, weather_df, weather_df]
    with multiprocessing.Pool() as pool:
        expected_values = pool.map(return_stat, weather_df_list)
    print(expected_values)
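One thing to watch with the pool version: when worker processes are created by forking, each one can inherit a copy of the parent's global NumPy random state, so different workers may produce the same draws. A minimal sketch of one way around this, assuming you are free to change the worker's signature, is to give every task its own generator. The names return_stat_seeded, make_tasks and root_seed are my own, not part of the question's code.

```python
import multiprocessing

import numpy as np
import pandas as pd


def return_stat_seeded(args):
    """Draw one value from column 1 of df, weighted by column 0, using a per-task RNG."""
    df, seed_seq = args
    rng = np.random.default_rng(seed_seq)  # independent generator for this task
    return rng.choice(df.iloc[:, 1].to_numpy(), p=df.iloc[:, 0].to_numpy())


def make_tasks(df_list, root_seed=None):
    """Pair every dataframe with a child SeedSequence spawned from one root seed."""
    children = np.random.SeedSequence(root_seed).spawn(len(df_list))
    return list(zip(df_list, children))


if __name__ == '__main__':
    weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.7], 'rain_inches': [1, 2, 3]})
    weather_df_list = [weather_df, weather_df, weather_df, weather_df]
    with multiprocessing.Pool() as pool:
        # root_seed=12345 is an arbitrary illustrative value; use None for fresh entropy
        expected_values = pool.map(return_stat_seeded, make_tasks(weather_df_list, root_seed=12345))
    print(expected_values)
```

With a fixed root_seed the results are reproducible from run to run, which also makes the parallel code easier to debug.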

Another efficient way to solve the problem is Numba. It compiles Python into efficient machine code and also has a parallelization feature. However, its version of np.random.choice() does not support a probabilities array, so I had to implement choice() myself. You need to install Numba once with python -m pip install numba.

import pandas as pd, numpy as np
from numba import njit

@njit(parallel = True, fastmath = True)
def choices(l):
    rnds = np.random.random((len(l),))
    def choice(i, a, p):
        assert p.shape == a.shape
        p = p.cumsum()
        p = p / p[-1]
        r = rnds[i]
        i = np.sum((p <= r).astype(np.int64))
        return a[i]
    res = np.empty((len(l),), dtype = np.float64)
    for i in range(len(l)):
        res[i] = choice(i, l[i][:, 1], l[i][:, 0])
    return res

weather_df = pd.DataFrame({'rain_probability': [0.1, 0.2, 0.3, 0.4], 'rain_inches': [0, 1, 2, 3]})
weather_df_list = [weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df, weather_df]
weather_df_arrays = [e.values[:, :2] for e in weather_df_list]
print(choices(weather_df_arrays))
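To make the hand-written choice() easier to follow, here is the same idea in plain NumPy, without Numba: build the normalized cumulative probabilities, then count how many of them are at or below a uniform random number r. This is only an illustrative sketch; weighted_choice is a name I made up, and r is passed in explicitly so the behaviour can be checked by hand.

```python
import numpy as np


def weighted_choice(values, probs, r):
    """Return values[i], where i is the number of cumulative probabilities <= r."""
    cdf = np.cumsum(probs)
    cdf = cdf / cdf[-1]         # normalize so the last entry is exactly 1.0
    i = int(np.sum(cdf <= r))   # r in [0, 1) keeps i within range
    return values[i]


values = np.array([0.0, 1.0, 2.0, 3.0])
probs = np.array([0.1, 0.2, 0.3, 0.4])

# The cumulative probabilities are roughly [0.1, 0.3, 0.6, 1.0]
print(weighted_choice(values, probs, 0.05))  # falls in the first bin
print(weighted_choice(values, probs, 0.95))  # falls in the last bin
```

The counting step gives the same index as np.searchsorted(cdf, r, side='right'), which may be a better fit for long probability arrays.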

Try the Numba variant on your side and tell me how fast it is. If it is not faster than the multiprocessing variant, I have some extra ideas for improving its speed.
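For the comparison itself, a small timing helper like the following could be used. It is a rough sketch, not a rigorous benchmark; best_time is a hypothetical helper and the workload shown is only a stand-in.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over several calls to fn()."""
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == '__main__':
    # Stand-in workload; swap in e.g. lambda: choices(weather_df_arrays)
    elapsed = best_time(lambda: sum(range(100_000)))
    print(f'best of 5: {elapsed:.6f} s')
```

When timing the Numba variant, call the function once before measuring, since the first call includes compilation time. With only a handful of small dataframes, process start-up overhead may well outweigh any gain from the pool, so it is worth timing the plain map() version too.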

Upvotes: 1
