Reputation: 679
I'm trying to compute various metrics on a Pandas DataFrame using the apply method. Since the DataFrame I'm working with is quite big (1 million rows x 20 columns), I decided to parallelize the computation.
In order to reproduce the issue I'm having, I'm going to use the iris dataset. Here are the steps:
# Step 1: Import all required modules + load iris dataset to Pandas DataFrame
import pandas as pd
import numpy as np
import seaborn as sns
from multiprocessing import Pool
iris = pd.DataFrame(sns.load_dataset('iris'))
# Step 2: Define function that adds some metric to initial iris DataFrame
def add_metrics(data):
    data['x_1'] = data['species'].apply(lambda x: len(x))
    return data
# Step 3: Define parallelization function
num_partitions = 10 # number of partitions to split dataframe
num_cores = 4
def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
# Step 4: Add metrics to initial iris DataFrame using parallelization function
iris = parallelize_dataframe(iris, add_metrics)
The above process works perfectly well as it is, but I want to be able to pass additional positional and/or optional arguments to my add_metrics function. For example, my add_metrics function might look like the following:
def add_metrics(data, num, keep=False):
    data['x_1'] = data['species'].apply(lambda x: len(x))
    data['x_2'] = data['sepal_length'].apply(lambda x: x * num)
    if keep:
        data['x_3'] = data['sepal_width'].apply(lambda x: x * num)
    return data
Now, no matter how I try to call the parallelize_dataframe function, I get an error. For example:
iris = parallelize_dataframe(iris, add_metrics(iris, 2, keep = True))
throws a TypeError: 'DataFrame' object is not callable.
I'm fairly new to Python, so I don't know what is going wrong here or how to fix it. I know the example I chose does not require parallel processing, as the iris dataset only contains 150 observations. I used it to make the problem easy to reproduce.
Any help would be appreciated.
Upvotes: 0
Views: 428
Reputation: 11100
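You can use functools.partial to bind the extra arguments of your function before passing it to map. In your call, add_metrics(iris, 2, keep = True) is executed immediately, so a DataFrame rather than a function is handed to pool.map, which is why you get the TypeError.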
import functools

def add(x, y):
    return x + y

a = [1, 2, 3]
# map is lazy in Python 3, so wrap it in list() to see the values
list(map(functools.partial(add, y=2), a))  # [3, 4, 5]
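Applied to the code in the question, this might look like the sketch below. It is only an illustration under a few assumptions: the small hand-made frame stands in for the iris data, add_metrics copies its chunk before adding columns, the frame is split by row position with iloc instead of passing the DataFrame itself to np.array_split, and num_partitions and num_cores are turned into keyword parameters. None of these choices are required for functools.partial to work.

```python
import functools
from multiprocessing import Pool

import numpy as np
import pandas as pd


def add_metrics(data, num, keep=False):
    # Work on a copy so the chunk passed in by the caller is not modified.
    data = data.copy()
    data['x_1'] = data['species'].apply(len)
    data['x_2'] = data['sepal_length'] * num
    if keep:
        data['x_3'] = data['sepal_width'] * num
    return data


def parallelize_dataframe(df, func, num_partitions=10, num_cores=4):
    # Assumption: split the row positions, then slice each chunk with iloc.
    positions = np.array_split(np.arange(len(df)), num_partitions)
    df_split = [df.iloc[pos] for pos in positions]
    with Pool(num_cores) as pool:
        # func must accept a single argument: one chunk of the frame.
        return pd.concat(pool.map(func, df_split))


if __name__ == '__main__':
    # Hypothetical stand-in for the iris data, using the same column names.
    demo = pd.DataFrame({
        'species': ['setosa', 'setosa', 'virginica', 'versicolor'],
        'sepal_length': [5.0, 4.5, 6.0, 5.5],
        'sepal_width': [3.0, 3.5, 2.5, 2.0],
    })
    # Bind num and keep now; the pool supplies each chunk as `data` later.
    bound = functools.partial(add_metrics, num=2, keep=True)
    result = parallelize_dataframe(demo, bound, num_partitions=2, num_cores=2)
    print(result)
```

The partial object is a callable that still expects one argument, which is what pool.map needs. Since add_metrics is defined at module level, the partial can be pickled and sent to the worker processes; a lambda in its place generally could not be.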
Upvotes: 1