Reputation: 17724
When I have a data frame in pandas like:
import pandas as pd

raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
'target': [0,0,0,1,1],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')
Currently, I use:
FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'
def quotients_slow(df_a):
    # parallelism = 8
    # original = dd.from_pandas(df.copy())
    original = df_a.copy()
    output_df = original
    ratio_weights = {}
    for colname in columnsToBias_keep.union(columnsToDrop):
        # group only a single time
        grouped = original.groupby([colname, target]).size()
        # calculate first ratio: count(colname == value & target == 1) / count(target == 1)
        df = grouped / original[target].sum()
        nameCol = "pre_" + colname
        grouped_res = df.reset_index(name=nameCol)
        grouped_res = grouped_res[grouped_res[target] == 1]
        grouped_res = grouped_res.drop(columns=target)
        result_1 = grouped_res
        # calculate second ratio: share of target == 1 within each value of colname
        df = (grouped / grouped.groupby(level=0).sum())
        nameCol_2 = "pre2_" + colname
        grouped = df.reset_index(name=nameCol_2)
        grouped_res = grouped[grouped[target] == 1]
        grouped_res = grouped_res.drop(columns=target)
        result_2 = grouped_res
        # persist the results in a dict for the transformer;
        # this is required to separate fit and transform stages (later on in a sklearn transformer)
        ratio_weights[nameCol] = result_1
        ratio_weights[nameCol_2] = result_2
        # retrieve results
        res_1 = ratio_weights['pre_' + colname]
        res_2 = ratio_weights['pre2_' + colname]
        # merge ratio weights with the original dataframe
        output_df = pd.merge(output_df, res_1, on=colname, how='left')
        output_df = pd.merge(output_df, res_2, on=colname, how='left')
        output_df.loc[output_df[nameCol].isnull(), nameCol] = 0
        output_df.loc[output_df[nameCol_2].isnull(), nameCol_2] = 0
        if colname in columnsToDrop:
            output_df = output_df.drop(columns=colname)
    return output_df
quotients_slow(df_a)
to calculate the ratio of each group with respect to target: 1 for each (categorical) column, in two ways. As I want to perform this operation for multiple columns, I naively iterate over all of them, but this is very slow. On the sample here: 10 loops, best of 3: 37 ms per loop. For my real dataset of around 500,000 rows and around 100 columns this really takes a while.
Shouldn't it be possible to speed this up (column-parallel, trivially parallelizable) in either dask or pandas? Is there a way to implement it more efficiently in plain pandas? Is it possible to reduce the number of passes over the data for computing the quotients?
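For the last point, one idea I had (untested for speed, so treat it as a sketch) is to melt all categorical columns into long format, so that a single groupby produces the counts for every column at once:
# Sketch: compute all (column, value, target) counts in a single pass
# by melting the categorical columns into long format first.
value_cols = list(columnsToBias_keep.union(columnsToDrop))
long_df = df_a.melt(id_vars=[target], value_vars=value_cols)
counts = long_df.groupby(['variable', 'value', target]).size()
# 'counts' holds the numerators of both quotients for every column;
# the denominators are original[target].sum() and the per-value group sizes.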
When trying to use dask.delayed in the for loop to achieve parallelism over the columns, I can't figure out how to build the graph over the columns, as I need to call compute to get the tuples:
delayed_res_name = delayed(compute_weights)(df_a, 'name')
a,b,c,d = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[c] = a
ratio_weights[d] = b
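What I am aiming for, I think, is to build all the delayed objects first and only then trigger a single compute, roughly like this (a sketch; compute_weights is assumed to return the two result frames and their two column names, as in the snippet above):
import dask
from dask import delayed

# Build one delayed task per column; nothing is computed yet.
cols = columnsToBias_keep.union(columnsToDrop)
tasks = [delayed(compute_weights)(df_a, colname) for colname in cols]

# A single compute call lets dask schedule all columns in parallel.
results = dask.compute(*tasks)

ratio_weights = {}
for res_1, res_2, name_col, name_col_2 in results:
    ratio_weights[name_col] = res_1
    ratio_weights[name_col_2] = res_2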
Upvotes: 0
Views: 1040
Reputation: 21284
Here's a reasonably fast solution for your first quotient, using Pandas. It assumes you are not interested in computing proportions for subject_id. I also added some data to your example to cover more edge cases.
First, generate sample data:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '6','7'],
'name': ['A', 'B', 'C', 'D', 'E', 'A','A'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US', 'DE','DE'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a','x','z'],
'target': [0,0,0,1,1,0,1],
'age_group' : [1, 2, 1, 3, 1, 2,1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
Now compute proportions and measure speed:
def compute_prop(group):
    return group.sum() / float(group.count())

def build_master(df):
    master = df.copy()
    fields = df.drop(columns=['subject_id', 'target']).columns
    for field in fields:
        master = (pd.merge(master, df.groupby(field, as_index=False)
                                     .agg({'target': compute_prop})
                                     .rename(columns={'target': 'pre_{}'.format(field)}),
                           on=field)
                  )
    master = master.sort_values('subject_id')
    return master
%timeit master = build_master(df_a)
10 loops, best of 3: 17.1 ms per loop
Output:
subject_id name nationality alotdifferent target age_group pre_name \
0 1 A DE x 0 1 0.333333
5 2 B AUT y 0 2 0.000000
2 3 C US z 0 1 0.000000
6 4 D US x 1 3 1.000000
3 5 E US a 1 1 1.000000
4 6 A DE x 0 2 0.333333
1 7 A DE z 1 1 0.333333
pre_nationality pre_alotdifferent pre_age_group
0 0.333333 0.333333 0.5
5 0.000000 0.000000 0.0
2 0.666667 0.500000 0.5
6 0.666667 0.333333 1.0
3 0.666667 1.000000 0.5
4 0.333333 0.333333 0.0
1 0.333333 0.500000 0.5
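If the merges themselves become a bottleneck at 500,000 rows, one variant that may be worth benchmarking (a sketch, not timed here) attaches the proportions with groupby(...).transform('mean') instead of a merge; for a 0/1 target, the group mean equals the proportion of target == 1:
def build_master_transform(df):
    master = df.copy()
    fields = df.drop(columns=['subject_id', 'target']).columns
    for field in fields:
        # group mean of the 0/1 target == proportion of target == 1 per value
        master['pre_{}'.format(field)] = df.groupby(field)['target'].transform('mean')
    return master.sort_values('subject_id')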
Upvotes: 1