Apply function to each cell in DataFrame multithreadedly in pandas

Question

Is it possible to apply function to each cell in a DataFrame multithreadedly in pandas?

I'm aware of pandas.DataFrame.applymap but it doesn't seem to allow multithreading natively:

import numpy as np
import pandas as pd
np.random.seed(1)
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), 
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
format = lambda x: '%.2f' % x
frame = frame.applymap(format)
print(frame)

returns:

               b         d         e
Utah    1.624345 -0.611756 -0.528172
Ohio   -1.072969  0.865408 -2.301539
Texas   1.744812 -0.761207  0.319039
Oregon -0.249370  1.462108 -2.060141

            b      d      e
Utah     1.62  -0.61  -0.53
Ohio    -1.07   0.87  -2.30
Texas    1.74  -0.76   0.32
Oregon  -0.25   1.46  -2.06

Instead, I would like to use more than one core to perform the operation, since the applied function may be complex.

SayPy · Accepted Answer

Split by columns:

from multiprocessing import Pool

def format(col):
    return col.apply(lambda x: '%.2f' % x)

cores = 5
pool = Pool(cores)
for out_col in pool.imap(format, [frame[i] for i in frame]):
    frame[out_col.name] = out_col
pool.close()
pool.join()

Or split by partitions size as mentioned in comments:

size = 10
frame_split = np.array_split(frame, size)
frame = pd.concat(pool.imap(func, frame_split))

Apply function to each cell in DataFrame multithreadedly in pandas

Answers (2)

Related Questions