densekernel

Reputation: 1351

Python Pandas: Why is numpy so much faster than Pandas for column assignment? Can I optimize further?

I am preprocessing data for a machine learning classification task by converting categorical variables to a binary matrix, primarily using pd.get_dummies(). This is applied to a single Pandas DataFrame column and outputs a new DataFrame with the same number of rows as the original and a width equal to the number of unique categories in that column.
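For context, a minimal illustration of that behaviour (toy data, not the actual dataset):

```python
import pandas as pd

# Toy column with 3 unique categories; the real data has ~600 across 16 columns.
df = pd.DataFrame({'Weekday': ['Mon', 'Tue', 'Mon', 'Wed']})

dummies = pd.get_dummies(df['Weekday'], prefix='Weekday')
print(dummies.shape)  # same number of rows, one column per unique category
```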

I need to complete this for a DataFrame of shape: (3,000,000 x 16) which outputs a binary matrix of shape: (3,000,000 x 600).

During the process, the conversion to a binary matrix with pd.get_dummies() is very quick, but the assignment into the output matrix using pd.DataFrame.loc[] was much slower. I have since switched to writing straight into an np.ndarray, which is much faster, and I just wonder why? (Please see terminal output at the bottom of the question for a time comparison.)

n.b. As pointed out in the comments, I could just call pd.get_dummies() on the entire frame. However, some of the columns require tailored preprocessing, i.e. putting values into buckets. The most difficult column to handle is a column containing a string of tags (separated by `,` or `, `), which must be processed like this: df[col].str.replace(' ','').str.get_dummies(sep=','). Also, the preprocessed training set and test set need the same set of columns (inherited from the pickled cols list) as they might not have the same features present once they are broken into a matrix.
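A toy illustration of that tag-column step (the tag values here are made up):

```python
import pandas as pd

tags = pd.Series(['red, blue', 'blue', 'red,green'])

# Strip stray spaces, then split on ',' and one-hot encode each tag.
dummies = tags.str.replace(' ', '').str.get_dummies(sep=',')
print(sorted(dummies.columns))  # ['blue', 'green', 'red']
```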

Please see code below for each version

DataFrame version:

import datetime
import pickle

import numpy as np
import pandas as pd

def preprocess_df(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    # x = np.zeros(shape=(len(df), len(cols)))
    x = pd.DataFrame(columns=cols)

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print "Processed: ", col,  datetime.datetime.now()

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            x.loc[:, dummy_col] = df_col[dummy_col]

        print "Assigned: ", col,  datetime.datetime.now()

    return x.values

np version:

def preprocess_np(df):
    with open(PICKLE_PATH + 'cols.pkl', 'rb') as handle:
        cols = pickle.load(handle)

    x = np.zeros(shape=(len(df),len(cols)))

    for col in df.columns:
        # 1. make binary matrix
        df_col = pd.get_dummies(df[col], prefix=str(col))

        print "Processed: ", col,  datetime.datetime.now()

        # 2. assign each value in binary matrix to col in output
        for dummy_col in df_col.columns:
            idx = [i for i, j in enumerate(cols) if j == dummy_col][0]
            x[:, idx] = df_col[dummy_col].values.T

        print "Assigned: ", col,  datetime.datetime.now()

    return x

Timed outputs (10,000 examples)

DataFrame version:

Processed:  Weekday 
Assigned:  Weekday 0.437081  
Processed:  Hour 0.002366
Assigned:  Hour 1.33815

np version:

Processed:  Weekday   
Assigned:  Weekday 0.006992
Processed:  Hour 0.002632
Assigned:  Hour 0.008989

Is there a different approach to optimize this further? I am interested because at the moment I am discarding a potentially useful feature, as it is too slow to add an extra 15,000 columns to the output.

Any general advice on the approach I am taking is also appreciated!

Thank you

Upvotes: 7

Views: 1699

Answers (1)

Cyrus

Reputation: 1276

One experiment would be to change over to x.loc[:, dummy_col] = df_col[dummy_col].values. If the input is a Series, pandas checks the alignment of the indices for each assignment. Assigning an ndarray turns that off when it's unnecessary, and that should improve performance.
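A rough sketch of the difference (illustrative only): assigning a Series triggers index alignment, while assigning the underlying ndarray copies positionally:

```python
import numpy as np
import pandas as pd

# Output frame whose index is deliberately in a different order.
out = pd.DataFrame(np.zeros((4, 1)), columns=['a'], index=[3, 2, 1, 0])
src = pd.Series([10.0, 20.0, 30.0, 40.0], index=[0, 1, 2, 3])

# Series assignment aligns on the index labels, reordering the values.
out['aligned'] = src

# ndarray assignment skips the alignment step and copies positionally.
out['raw'] = src.values

print(out['aligned'].tolist())  # [40.0, 30.0, 20.0, 10.0]
print(out['raw'].tolist())      # [10.0, 20.0, 30.0, 40.0]
```

That per-assignment alignment check is the overhead being paid in the inner loop of the DataFrame version.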

Upvotes: 1
