Reputation: 1351
I am preprocessing data for a machine-learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and one column per unique category in the original column.
I need to do this for a DataFrame of shape (3,000,000 x 16), which outputs a binary matrix of shape (3,000,000 x 600).
During the process, the step of converting to a binary matrix with pd.get_dummies() is very quick, but assigning into the output matrix using pd.DataFrame.loc[] was much slower. I have since switched to saving straight into an np.ndarray, which is much faster, and I just wonder why? (Please see terminal output at the bottom of the question for a time comparison.)
n.b. As pointed out in comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle contains a string of tags (separated by `,` or `, `), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from all_cols), as they might not have the same features present once they are broken into a matrix.
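To illustrate the tag-column handling and the shared-column requirement together, here is a minimal sketch (the tag values and the all_cols list are made up for the example): spaces are stripped, the string is split on commas into indicator columns, and reindex aligns the result to the full column set so train and test get identical layouts.

```python
import pandas as pd

# Toy frame with a tag column; tags may be written "a,b" or "a, b"
df = pd.DataFrame({'tags': ['red, blue', 'blue,green', 'red']})

# Strip spaces, then split on ',' into a binary indicator matrix
dummies = df['tags'].str.replace(' ', '').str.get_dummies(sep=',')

# Align to a shared column set (hypothetical); unseen features become 0
all_cols = ['blue', 'green', 'red', 'yellow']
dummies = dummies.reindex(columns=all_cols, fill_value=0)
print(dummies)
```

The same reindex call applied to both the training and test matrices guarantees they end up with identical columns even when a category appears in only one of them.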
Please see code below for each version
DataFrame version:
def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    # output is a DataFrame here so that .loc[] assignment works
    x = pd.DataFrame(0, index=df.index, columns=cols)
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]
        print "Assigned: ", col, datetime.datetime.now()
    return x.values
np version:
def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)
    x = np.zeros(shape=(len(df), len(cols)))
    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))
        print "Processed: ", col, datetime.datetime.now()
        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values
        print "Assigned: ", col, datetime.datetime.now()
    return x
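One further micro-optimization worth noting: the list comprehension above scans the whole column list for every dummy column, which is O(n) per lookup. Precomputing a name-to-index dict makes each lookup O(1). A small self-contained sketch (the cols list and toy data are hypothetical, not from the real pipeline):

```python
import numpy as np
import pandas as pd

# Hypothetical column layout: build the name -> index map once
cols = ['Weekday_0', 'Weekday_1', 'Hour_0', 'Hour_1']
col_idx = {c: i for i, c in enumerate(cols)}  # O(1) lookups

df = pd.DataFrame({'Weekday': [0, 1, 0], 'Hour': [1, 1, 0]})
x = np.zeros((len(df), len(cols)))

for col in df.columns:
    df_col = pd.get_dummies(df[col], prefix=str(col))
    for dummy_col in df_col.columns:
        # dict lookup instead of scanning the column list each time
        x[:, col_idx[dummy_col]] = df_col[dummy_col].values
print(x)
```

With 600-plus output columns this removes a quadratic factor from the assignment loop.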
Timed outputs (10,000 examples)
DataFrame version:
Processed: Weekday
Assigned: Weekday 0.437081
Processed: Hour 0.002366
Assigned: Hour 1.33815
np version:
Processed: Weekday
Assigned: Weekday 0.006992
Processed: Hour 0.002632
Assigned: Hour 0.008989
Is there a different approach to further optimize this? I am interested because at the moment I am discarding a potentially useful feature, as it is too slow to process an extra 15,000 columns into the output.
Any general advice on the approach I am taking is also appreciated!
Thank you
Upvotes: 7
Views: 1699
Reputation: 1276
One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. If the input is a Series, pandas checks the alignment of the indices for each assignment. Assigning with an ndarray turns that off when it's unnecessary, and that should improve performance.
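A minimal sketch of the difference (toy data, not the asker's frame): both assignments produce the same result here, but the ndarray version skips the per-assignment index alignment.

```python
import pandas as pd

df_col = pd.DataFrame({'a': [1, 0, 1]})
x = pd.DataFrame(0, index=range(3), columns=['a', 'b'])

# Series assignment: pandas aligns df_col's index with x's index first
# x.loc[:, 'a'] = df_col['a']

# ndarray assignment: values are written positionally, no alignment step
x.loc[:, 'a'] = df_col['a'].values
print(x['a'].tolist())
```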
Upvotes: 1