Reputation: 57
I have a Pandas DataFrame containing positive integers and a metadata dictionary. The dictionary keys correspond to DataFrame columns, and each dictionary value is 1 greater than the maximum possible value in that column. Here's an example of a possible DataFrame and metadata dictionary:
import numpy as np
import pandas as pd

def create_random_df(schema, nrows=10):
    df = pd.DataFrame()
    for k, v in schema.items():
        # column k holds uniform random integers in [0, v)
        df[k] = np.random.randint(0, v, nrows)
    return df

schema_dict = {0: 3,
               1: 2,
               2: 4}
df = create_random_df(schema_dict)
print(df)
0 1 2
0 1 1 1
1 1 1 3
2 0 1 1
3 0 0 0
4 1 1 0
5 1 1 0
6 1 0 2
7 0 1 1
8 0 0 3
9 0 1 2
Notice that in column 0, the dictionary indicates that possible column values are 0, 1, 2 but there are only 0's and 1's.
I am trying to create a dictionary in an efficient manner that gets the proportions of possible values in each column of the DataFrame. The output for the current example would be
{'0_0': 0.5,
'0_1': 0.5,
'0_2': 0,
'1_0': 0.3,
'1_1': 0.7,
'2_0': 0.3,
'2_1': 0.3,
'2_2': 0.2,
'2_3': 0.2}
since in column 0, there are 50% 0's and 50% 1's, with 0% 2's. The way I'm currently doing it involves creating a nested dictionary and flattening it:
import collections.abc

def create_prop_dict(df, schema):
    # start every possible value at proportion 0, then overwrite
    # with the proportions actually observed in each column
    prop_dict = {idx: {k: 0 for k in range(possible_values)}
                 for idx, possible_values in schema.items()}
    for i in range(len(prop_dict)):
        prop_dict[i].update(df[i].value_counts(normalize=True))
    return prop_dict
def flatten(d, parent_key='', sep='_'):
    # flatten {col: {value: prop}} into {'col_value': prop}
    items = []
    for k, v in d.items():
        new_key = str(parent_key) + sep + str(k) if parent_key else str(k)
        if isinstance(v, collections.abc.MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)
prop_dict = flatten(create_prop_dict(df, schema_dict))
print(prop_dict)
{'0_0': 0.5,
'0_1': 0.5,
'0_2': 0,
'1_0': 0.3,
'1_1': 0.7,
'2_0': 0.3,
'2_1': 0.3,
'2_2': 0.2,
'2_3': 0.2}
It works fine for this example, but when the DataFrame is large, it's quite slow. For instance, with a larger DataFrame:
schema_dict_large = {k: np.random.randint(1, 50) for k in range(5000)}
df_large = create_random_df(schema_dict_large, 1000)  # df_large.shape == (1000, 5000)
%timeit prop_dict_large = flatten(create_prop_dict(df_large, schema_dict_large))
5.2 s ± 73.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
How can I create the dictionary of proportions as efficiently as possible? I have looked at Cython and Numba but they don't seem to work on nested dictionaries.
Upvotes: 0
Views: 51
Reputation: 195553
Another solution:
out = {}
for c in df.columns:
    # proportion of each value that actually occurs in column c
    g = df.groupby(c)[c].count() / len(df[c])
    d = g.to_dict()
    for v in range(schema_dict[c]):
        # fall back to 0 for possible values that never occur
        out['{}_{}'.format(c, v)] = d.get(v, 0)

from pprint import pprint
pprint(out)
Prints (for example):
{'0_0': 0.5,
'0_1': 0.5,
'0_2': 0,
'1_0': 0.5,
'1_1': 0.5,
'2_0': 0.4,
'2_1': 0.1,
'2_2': 0.3,
'2_3': 0.2}
EDIT: timeit results for the large DataFrame:
3.5940384549976443 # my solution
5.389242456003558 # original solution
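For reference, a further variant (an untested sketch, not part of either post above) pushes the counting into NumPy with np.bincount. The helper name create_prop_dict_bincount is hypothetical, and the sketch assumes, as the question states, that every column holds non-negative integers strictly below its schema bound:

import numpy as np

def create_prop_dict_bincount(df, schema):
    # hypothetical sketch: np.bincount with minlength pads never-seen values
    # with zero counts, so missing categories get proportion 0.0 for free
    n = len(df)
    out = {}
    for c, bound in schema.items():
        counts = np.bincount(df[c].to_numpy(), minlength=bound)
        for v in range(bound):
            out['{}_{}'.format(c, v)] = counts[v] / n
    return out

One difference from the question's expected output: proportions for missing values come out as the float 0.0 rather than the int 0.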
Upvotes: 2