Reputation: 17724
When I have a data frame in pandas like:
import pandas as pd

raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
'target': [0,0,0,1,1],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')
Currently, I use:
FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'
def quotients_slow(df_a):
    # parallelism = 8
    # original = dd.from_pandas(df.copy())
    original = df_a.copy()
    output_df = original
    ratio_weights = {}
    for colname in columnsToBias_keep.union(columnsToDrop):
        # group only a single time
        grouped = original.groupby([colname, target]).size()
        # calculate first ratio: count(colname == value & target == 1) / count(target == 1)
        df = grouped / original[target].sum()
        nameCol = "pre_" + colname
        grouped_res = df.reset_index(name=nameCol)
        grouped_res = grouped_res[grouped_res[target] == 1]
        grouped_res = grouped_res.drop(columns=target)
        result_1 = grouped_res
        # calculate second ratio: share of target == 1 within each value of colname
        df = (grouped / grouped.groupby(level=0).sum())
        nameCol_2 = "pre2_" + colname
        grouped = df.reset_index(name=nameCol_2)
        grouped_res = grouped[grouped[target] == 1]
        grouped_res = grouped_res.drop(columns=target)
        result_2 = grouped_res
        # persist the results in a dict for the transformer;
        # this is required to separate fit and transform stages (later on in a sklearn transformer)
        ratio_weights[nameCol] = result_1
        ratio_weights[nameCol_2] = result_2
        # retrieve results
        res_1 = ratio_weights['pre_' + colname]
        res_2 = ratio_weights['pre2_' + colname]
        # merge ratio weights with the original dataframe
        output_df = pd.merge(output_df, res_1, on=colname, how='left')
        output_df = pd.merge(output_df, res_2, on=colname, how='left')
        output_df.loc[output_df[nameCol].isnull(), nameCol] = 0
        output_df.loc[output_df[nameCol_2].isnull(), nameCol_2] = 0
        if colname in columnsToDrop:
            output_df = output_df.drop(columns=colname)
    return output_df
quotients_slow(df_a)
to calculate the ratio of each group with respect to target: 1 for each (categorical) column, in two ways. As I want to perform this operation for multiple columns, I naively iterate over all of them, but this is very slow. On the sample here: 10 loops, best of 3: 37 ms per loop. For my real dataset of around 500,000 rows and around 100 columns this really takes a while.
Shouldn't it be possible to speed this up (column-parallel, trivially parallelizable) in either dask or pandas? Is there a way to implement it more efficiently in plain pandas? Is it possible to reduce the number of passes over the data for computing the quotients?
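For the last point, one idea I had (untested for speed, so treat it as a sketch) is to melt all categorical columns into long format, so that a single groupby produces the counts for every column at once:
# Sketch: compute all (column, value, target) counts in a single pass
# by melting the categorical columns into long format first.
value_cols = list(columnsToBias_keep.union(columnsToDrop))
long_df = df_a.melt(id_vars=[target], value_vars=value_cols)
counts = long_df.groupby(['variable', 'value', target]).size()
# 'counts' holds the numerators of both quotients for every column;
# the denominators are original[target].sum() and the per-value group sizes.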
When trying to use dask.delayed in the for loop to achieve parallelism over the columns, I can't figure out how to build the graph over the columns, as I need to call compute to get the tuples:
delayed_res_name = delayed(compute_weights)(df_a, 'name')
a,b,c,d = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[c] = a
ratio_weights[d] = b
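What I am aiming for, I think, is to build all the delayed objects first and only then trigger a single compute, roughly like this (a sketch; compute_weights is assumed to return the two result frames and their two column names, as in the snippet above):
import dask
from dask import delayed

# Build one delayed task per column; nothing is computed yet.
cols = columnsToBias_keep.union(columnsToDrop)
tasks = [delayed(compute_weights)(df_a, colname) for colname in cols]

# A single compute call lets dask schedule all columns in parallel.
results = dask.compute(*tasks)

ratio_weights = {}
for res_1, res_2, name_col, name_col_2 in results:
    ratio_weights[name_col] = res_1
    ratio_weights[name_col_2] = res_2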
Upvotes: 0
Views: 1040
Reputation: 21284
Here's a reasonably fast solution for your first quotient, using Pandas. It assumes you are not interested in computing proportions for subject_id. I also added some data to your example to cover more edge cases.
First, generate sample data:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '6','7'],
'name': ['A', 'B', 'C', 'D', 'E', 'A','A'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US', 'DE','DE'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a','x','z'],
'target': [0,0,0,1,1,0,1],
'age_group' : [1, 2, 1, 3, 1, 2,1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
Now compute proportions and measure speed:
def compute_prop(group):
    return group.sum() / float(group.count())

def build_master(df):
    master = df.copy()
    fields = df.drop(columns=['subject_id', 'target']).columns
    for field in fields:
        master = (pd.merge(master, df.groupby(field, as_index=False)
                                     .agg({'target': compute_prop})
                                     .rename(columns={'target': 'pre_{}'.format(field)}),
                           on=field)
                  )
    master = master.sort_values('subject_id')
    return master
%timeit master = build_master(df_a)
10 loops, best of 3: 17.1 ms per loop
Output:
subject_id name nationality alotdifferent target age_group pre_name \
0 1 A DE x 0 1 0.333333
5 2 B AUT y 0 2 0.000000
2 3 C US z 0 1 0.000000
6 4 D US x 1 3 1.000000
3 5 E US a 1 1 1.000000
4 6 A DE x 0 2 0.333333
1 7 A DE z 1 1 0.333333
pre_nationality pre_alotdifferent pre_age_group
0 0.333333 0.333333 0.5
5 0.000000 0.000000 0.0
2 0.666667 0.500000 0.5
6 0.666667 0.333333 1.0
3 0.666667 1.000000 0.5
4 0.333333 0.333333 0.0
1 0.333333 0.500000 0.5
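If the merges themselves become a bottleneck at 500,000 rows, one variant that may be worth benchmarking (a sketch, not timed here) attaches the proportions with groupby(...).transform('mean') instead of a merge; for a 0/1 target, the group mean equals the proportion of target == 1:
def build_master_transform(df):
    master = df.copy()
    fields = df.drop(columns=['subject_id', 'target']).columns
    for field in fields:
        # group mean of the 0/1 target == proportion of target == 1 per value
        master['pre_{}'.format(field)] = df.groupby(field)['target'].transform('mean')
    return master.sort_values('subject_id')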
Upvotes: 1