Reputation: 777
I have the following functions to apply a bunch of regexes to each element in a DataFrame. The DataFrame that I am applying the regexes to is a 5 MB chunk.
import re
from functools import partial

def apply_all_regexes(data, regexes):
    # applymap runs the regex search on every cell of the DataFrame
    new_df = data.applymap(partial(apply_re_to_cell, regexes))
    return new_df

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches
Due to the serial execution of applymap, the time taken to process is roughly ~ elements * (time to run the regexes on 1 element). Is there any way to invoke parallelism? I tried ProcessPoolExecutor, but that appeared to take longer than executing serially.
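For reference, a minimal runnable sketch of the serial baseline, with made-up sample data and patterns. The patterns are precompiled with re.compile (the re module caches compiled patterns internally anyway, so this mainly makes the intent explicit):

```python
import re
from functools import partial

import pandas as pd

def apply_re_to_cell(regexes, cell):
    cell = str(cell)
    matches = []
    for regex in regexes:
        # regex is a precompiled pattern object, so call findall on it directly
        matches.extend(regex.findall(cell))
    return matches

# Hypothetical sample data and patterns, purely for illustration
df = pd.DataFrame({"a": ["foo1 bar2", "baz3"], "b": ["qux4", "no digits"]})
regexes = [re.compile(p) for p in [r"\d+", r"[a-z]+"]]

result = df.applymap(partial(apply_re_to_cell, regexes))
```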
Upvotes: 3
Views: 1285
Reputation: 2639
A slightly more modern version:
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from tqdm.auto import tqdm

tqdm.pandas()

def parallel_applymap(df, func, worker_count):
    def _apply(shard):
        # progress_applymap is applymap with a tqdm progress bar
        return shard.progress_applymap(func)

    shards = np.array_split(df, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as e:
        futures = e.map(_apply, shards)
    return pd.concat(list(futures))
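A usage sketch with made-up data (plain applymap here, so the sketch does not depend on tqdm being installed). Note that threads only help if the per-cell work releases the GIL; for CPU-bound pure-Python work, a process pool over chunks (as in the other answer) may be needed:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def parallel_applymap(df, func, worker_count):
    def _apply(shard):
        return shard.applymap(func)

    # Split the DataFrame row-wise into one shard per worker
    shards = np.array_split(df, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as e:
        futures = e.map(_apply, shards)
    # Reassemble the processed shards in their original order
    return pd.concat(list(futures))

# Hypothetical example: count the digits in every cell
df = pd.DataFrame({"a": ["x1", "y22"], "b": ["z333", "w"]})
out = parallel_applymap(df, lambda cell: sum(c.isdigit() for c in str(cell)), worker_count=2)
```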
Upvotes: 0
Reputation: 298
Have you tried splitting your one big dataframe into as many small dataframes as you have workers, applying the regex map to each in parallel, and sticking the small dfs back together?
I was able to do something similar with a dataframe about gene expression. I would run it small scale first and check that you get the expected output.
Unfortunately I don't have enough reputation to comment.
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4  # number of chunks to split the dataframe into
num_cores = 4       # number of worker processes

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
This is the general function I used.
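A usage sketch with made-up data. Note that func here receives a whole chunk, so it should operate on a DataFrame (e.g. by calling applymap inside it), and it must be a top-level function so multiprocessing can pickle it:

```python
import numpy as np
import pandas as pd
from multiprocessing import Pool

num_partitions = 4  # number of chunks to split the dataframe into
num_cores = 4       # number of worker processes

def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

# Hypothetical chunk-level function: count the digits in every cell
def count_digits_chunk(chunk):
    return chunk.applymap(lambda cell: sum(c.isdigit() for c in str(cell)))

if __name__ == "__main__":
    df = pd.DataFrame({"a": ["x1", "y22", "z", "w333"]})
    result = parallelize_dataframe(df, count_digits_chunk)
```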
Upvotes: 3