Georg Heiler

Reputation: 17724

pandas / dask calculate percentages for multiple columns - column-parallel operation

When I have a data frame in pandas like:

import pandas as pd

raw_data = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'name': ['A', 'B', 'C', 'D', 'E'],
        'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
        'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
        'target': [0,0,0,1,1],
        'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')

Currently, I use:

FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'

def quotients_slow(df_a):
    # parallelism = 8
    # original = dd.from_pandas(df.copy())
    original = df_a.copy()
    output_df = original
    ratio_weights = {}

    for colname in columnsToBias_keep.union(columnsToDrop):
        # group only a single time
        grouped = original.groupby([colname, target]).size() 
        # calculate first ratio
        df = grouped / original[target].sum() 
        nameCol = "pre_" + colname 
        grouped_res = df.reset_index(name=nameCol) 
        grouped_res = grouped_res[grouped_res[target] == 1] 
        grouped_res = grouped_res.drop(columns=target)
        # todo persist the result in dict for transformer
        result_1 = grouped_res

        # calculate second ratio
        df = (grouped / grouped.groupby(level=0).sum()) 
        nameCol_2 = "pre2_" + colname 
        grouped = df.reset_index(name=nameCol_2) 
        grouped_res = grouped[grouped[target] == 1] 
        grouped_res = grouped_res.drop(columns=target)
        result_2 = grouped_res 

        # persist the result in dict for transformer
        # this is required to separate fit and transform stage (later on in a sklearn transformer)
        ratio_weights[nameCol] = result_1
        ratio_weights[nameCol_2] = result_2

        # retrieve results
        res_1 = ratio_weights['pre_' + colname]
        res_2 = ratio_weights['pre2_' + colname]
        # merge ratio_weight with original dataframe
        output_df = pd.merge(output_df, res_1, on=colname, how='left') 
        output_df = pd.merge(output_df, res_2, on=colname, how='left') 
        output_df.loc[(output_df[nameCol].isnull()), nameCol] = 0 
        output_df.loc[(output_df[nameCol_2].isnull()), nameCol_2] = 0 

        if colname in columnsToDrop:
            output_df = output_df.drop(columns=colname)

    return output_df


quotients_slow(df_a)

to calculate the ratio of each group to target:1 for each (categorical) column in two ways. Since I want to perform this operation for multiple columns, I naively iterate over all of them, but this operation is very slow. On the sample above: 10 loops, best of 3: 37 ms per loop. For my real dataset of around 500,000 rows and around 100 columns this really takes a while.
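In other words, for a single column such as nationality, the two quotients boil down to (a condensed sketch of one loop iteration above):

grouped = df_a.groupby(['nationality', 'target']).size()

# first quotient: count per (value, target == 1) relative to the total number of target == 1 rows
pre_nationality = (grouped / df_a['target'].sum()).xs(1, level='target')

# second quotient: count per (value, target == 1) relative to all rows with that value
pre2_nationality = (grouped / grouped.groupby(level=0).sum()).xs(1, level='target')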

Shouldn't it be possible to speed this up (in a column-parallel manner, i.e. trivial parallelization) in either dask or pandas? Is there a way to implement it more efficiently in plain pandas? Is it possible to reduce the number of passes over the data when computing the quotients?

edit

When trying to use dask.delayed in the for loop to achieve parallelism over the columns, I can't figure out how to build the graph over all the columns, because I need to call compute to get the tuples:

delayed_res_name = delayed(compute_weights)(df_a, 'name')
a,b,c,d = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[c] = a
ratio_weights[d] = b
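Presumably, the graph could be built by collecting one delayed object per column and triggering a single compute at the end, roughly like this (untested sketch, assuming compute_weights(df, colname) returns the tuple (result_1, result_2, nameCol, nameCol_2) as in the loop above):

import dask
from dask import delayed

# one lazy task per column; nothing is executed yet
delayed_results = [delayed(compute_weights)(df_a, col)
                   for col in columnsToBias_keep.union(columnsToDrop)]

# execute the whole graph in a single call (columns can run in parallel)
results = dask.compute(*delayed_results)

# unpack the tuples afterwards, outside the graph
ratio_weights = {}
for result_1, result_2, name_col, name_col_2 in results:
    ratio_weights[name_col] = result_1
    ratio_weights[name_col_2] = result_2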

Upvotes: 0

Views: 1040

Answers (1)

andrew_reece

Reputation: 21284

Here's a reasonably fast solution for one of your quotients (the proportion of target == 1 within each group), using Pandas. It assumes you are not interested in computing proportions for subject_id. I also added some data to your example to cover more edge cases.

First, generate sample data:

raw_data = {
    'subject_id': ['1', '2', '3', '4', '5', '6','7'],
    'name': ['A', 'B', 'C', 'D', 'E', 'A','A'],
    'nationality': ['DE', 'AUT', 'US', 'US', 'US', 'DE','DE'],
    'alotdifferent': ['x', 'y', 'z', 'x', 'a','x','z'],
    'target': [0,0,0,1,1,0,1],
    'age_group' : [1, 2, 1, 3, 1, 2,1]}

df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])

Now compute proportions and measure speed:

def compute_prop(group):
    # proportion of target == 1 within the group (mean of a 0/1 column)
    return group.sum() / float(group.count())

def build_master(df):
    master = df.copy()
    fields = df.drop(columns=['subject_id', 'target']).columns

    for field in fields:
        master = (pd.merge(master, df.groupby(field, as_index=False)
                                     .agg({'target':compute_prop})
                                     .rename(columns={'target':'pre_{}'.format(field)}), 
                           on=field)
             )

    master = master.sort_values('subject_id')
    return master

%timeit master = build_master(df_a)
10 loops, best of 3: 17.1 ms per loop

Output:

  subject_id name nationality alotdifferent  target  age_group  pre_name  \
0          1    A          DE             x       0          1  0.333333   
5          2    B         AUT             y       0          2  0.000000   
2          3    C          US             z       0          1  0.000000   
6          4    D          US             x       1          3  1.000000   
3          5    E          US             a       1          1  1.000000   
4          6    A          DE             x       0          2  0.333333   
1          7    A          DE             z       1          1  0.333333   

   pre_nationality  pre_alotdifferent  pre_age_group  
0         0.333333           0.333333            0.5  
5         0.000000           0.000000            0.0  
2         0.666667           0.500000            0.5  
6         0.666667           0.333333            1.0  
3         0.666667           1.000000            0.5  
4         0.333333           0.333333            0.0  
1         0.333333           0.500000            0.5  
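The other quotient from the question, the count of target == 1 per value divided by the total number of target == 1 rows, could be merged in the same loop. A rough sketch; the helper name add_global_ratio and the column prefix global_ are made up here, not taken from the question:

def add_global_ratio(master, df, field):
    # count(field == value & target == 1) / count(target == 1), merged onto master
    name = 'global_{}'.format(field)
    ratio = (df[df['target'] == 1]
             .groupby(field)
             .size()
             .div(df['target'].sum())
             .rename(name)
             .reset_index())
    out = pd.merge(master, ratio, on=field, how='left')
    out[name] = out[name].fillna(0)
    return out

# e.g. master = add_global_ratio(master, df_a, 'nationality')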

Upvotes: 1
