user79913

Reputation: 57

Optimizing complex Pandas operation

I have a Pandas DataFrame containing non-negative integers and a metadata dictionary. The dictionary keys correspond to DataFrame columns, and each dictionary value is 1 greater than the maximum possible value in the corresponding column. Here's an example of a possible DataFrame and metadata dictionary:

import numpy as np
import pandas as pd

def create_random_df(schema, nrows=10):
    # Each column k holds uniform random integers in [0, schema[k]).
    df = pd.DataFrame()
    for k, v in schema.items():
        df[k] = np.random.randint(0, v, nrows)
    return df

schema_dict = {0: 3, 
               1: 2,
               2: 4
              }

df = create_random_df(schema_dict)
print(df)

    0   1   2
0   1   1   1
1   1   1   3
2   0   1   1
3   0   0   0
4   1   1   0
5   1   1   0
6   1   0   2
7   0   1   1
8   0   0   3
9   0   1   2

Notice that in column 0, the dictionary indicates that possible column values are 0, 1, 2 but there are only 0's and 1's.

I am trying to efficiently create a dictionary that gives the proportion of each possible value in each column of the DataFrame. The output for the current example would be

{'0_0': 0.5,
 '0_1': 0.5,
 '0_2': 0,
 '1_0': 0.3,
 '1_1': 0.7,
 '2_0': 0.3,
 '2_1': 0.3,
 '2_2': 0.2,
 '2_3': 0.2}

since in column 0, there are 50% 0's and 50% 1's, with 0% 2's. The way I'm currently doing it involves creating a nested dictionary and flattening it:

import collections.abc

def create_prop_dict(df, schema):
    # Start every possible value at proportion 0, then overwrite with the
    # observed proportions from value_counts(normalize=True).
    prop_dict = {idx: {k: 0 for k in range(possible_values)}
                 for idx, possible_values in schema.items()}
    for i in range(len(prop_dict)):
        prop_dict[i].update(df[i].value_counts(normalize=True))
    return prop_dict

def flatten(d, parent_key='', sep='_'):
    # Flatten the nested {column: {value: proportion}} dict into
    # {'column_value': proportion}.
    items = []
    for k, v in d.items():
        new_key = str(parent_key) + sep + str(k) if parent_key else str(k)
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

prop_dict = flatten(create_prop_dict(df, schema_dict))
print(prop_dict)

{'0_0': 0.5,
 '0_1': 0.5,
 '0_2': 0,
 '1_0': 0.3,
 '1_1': 0.7,
 '2_0': 0.3,
 '2_1': 0.3,
 '2_2': 0.2,
 '2_3': 0.2}

It works fine for this example, but when the DataFrame is large, it's quite slow. For instance, with a larger DataFrame:

schema_dict_large = {k: np.random.randint(1, 50) for k in range(5000)}
df_large = create_random_df(schema_dict_large, 1000) # df_large.shape==(1000,5000)

%timeit prop_dict_large = flatten(create_prop_dict(df_large, schema_dict_large))
5.2 s ± 73.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

How can I create the dictionary of proportions as efficiently as possible? I have looked at Cython and Numba but they don't seem to work on nested dictionaries.

Upvotes: 0

Views: 51

Answers (1)

Andrej Kesely

Reputation: 195553

Another solution:

out = {}
for c in df.columns:
    # Proportion of each observed value in the column.
    g = df.groupby(c)[c].count() / len(df[c])
    d = g.to_dict()
    # Fill in every possible value from the schema, with 0 for values
    # that never occur.
    for v in range(schema_dict[c]):
        out['{}_{}'.format(c, v)] = d.get(v, 0)

from pprint import pprint
pprint(out)

Prints (for example):

{'0_0': 0.5,
 '0_1': 0.5,
 '0_2': 0,
 '1_0': 0.5,
 '1_1': 0.5,
 '2_0': 0.4,
 '2_1': 0.1,
 '2_2': 0.3,
 '2_3': 0.2}
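
Since each column only holds small non-negative integers bounded by the schema (as the question states), the same counting can also be sketched with plain NumPy's np.bincount. This is only a sketch: the names out_bc and n are placeholders, and missing values come out as 0.0 rather than the int 0 above.

import numpy as np

out_bc = {}
n = len(df)
for c in df.columns:
    # Count occurrences of 0 .. schema_dict[c]-1, including values that
    # never appear, then divide by the number of rows.
    counts = np.bincount(df[c].to_numpy(), minlength=schema_dict[c])
    for v, cnt in enumerate(counts):
        out_bc['{}_{}'.format(c, v)] = cnt / n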

EDIT: timeit results for large DataFrame:

3.5940384549976443 # my solution
5.389242456003558  # original solution
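
For reference, a minimal way to reproduce that kind of comparison with timeit; the wrapper groupby_solution below is just a placeholder name that repeats the loop shown above so it can be called.

import timeit

def groupby_solution(df, schema):
    # The loop from above, wrapped in a function so it can be timed.
    out = {}
    for c in df.columns:
        g = df.groupby(c)[c].count() / len(df[c])
        d = g.to_dict()
        for v in range(schema[c]):
            out['{}_{}'.format(c, v)] = d.get(v, 0)
    return out

print(timeit.timeit(lambda: groupby_solution(df_large, schema_dict_large), number=1))
print(timeit.timeit(lambda: flatten(create_prop_dict(df_large, schema_dict_large)), number=1))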

Upvotes: 2
