kpax

Reputation: 661

Pandas: conditional group-specific computations

Let's say I have a table with a key (e.g. a customer ID) and two numeric columns, C1 and C2. I would like to group rows by the key (customer) and run aggregators like sum and mean on the columns. After computing the group aggregators, I would like to assign the results back to each customer's rows in the DataFrame (as customer-wide features added to each row).

I can see that I can do something like
df['F1'] = df.groupby(['Key'])['C1'].transform(np.sum)
if I want to aggregate just one column and be able to add the result back to the DataFrame.

Can I make it conditional: can I sum the C1 column within a group only for rows whose C2 column equals some number X, and still add the results back to the DataFrame?

How can I run an aggregator on a combination of columns, like:
np.sum(C1 + C2)?

What would be the simplest and most elegant way to implement this? What is the most efficient way to do it? Can those aggregations be done in one pass?

Thank you in advance.

Upvotes: 5

Views: 6449

Answers (2)

kpax

Reputation: 661

I found another approach that uses apply() instead of transform(), but it requires joining the result table back to the input DataFrame, and I haven't figured out that part yet. I would appreciate help finishing the join, or any better alternatives.

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

# Group g will be given as a DataFrame
def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()

    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()

    return pd.Series([feature_1, feature_2], index = ['F1', 'F2'])

# Group once
group = df.groupby(['Key'])

# Extract features from each group
group_features = group.apply(group_feature_extractor)

#
# Join with the input data frame ...
#
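One way to finish the join (a sketch, assuming a merge of the apply() result, which ends up indexed by 'Key', back onto the original frame):

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

def group_feature_extractor(g):
    feature_1 = (g['C1'] + g['C2']).sum()
    even_C1_filter = g['C1'] % 2 == 0
    feature_2 = g[even_C1_filter]['C2'].sum()
    return pd.Series([feature_1, feature_2], index=['F1', 'F2'])

group_features = df.groupby('Key').apply(group_feature_extractor)

# The apply() result is indexed by 'Key', so merging the original frame's
# 'Key' column against that index broadcasts each group's features to
# every row of the group.
df = df.merge(group_features, left_on='Key', right_index=True)
```

merge() with left_on/right_index keeps the original row index, so the result lines up with the input rows.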

Upvotes: 1

chrisb

Reputation: 52236

Here's some setup of some dummy data.

In [81]: df = pd.DataFrame({'Key': ['a','a','b','b','c','c'], 
                            'C1': [1,2,3,4,5,6],  
                            'C2': [7,8,9,10,11,12]})
In [82]: df['F1'] = df.groupby('Key')['C1'].transform(np.sum)

In [83]: df
Out[83]: 
   C1  C2 Key  F1
0   1   7   a   3
1   2   8   a   3
2   3   9   b   7
3   4  10   b   7
4   5  11   c  11
5   6  12   c  11

If you want to do a conditional GroupBy, you can just filter the dataframe as it's passed to .groupby(). For example, suppose you want the group sum of 'C1' for rows where 'C2' is less than 8 or greater than 9.

In [87]: cond = (df['C2'] < 8) | (df['C2'] > 9)

In [88]: df['F2'] = df[cond].groupby('Key')['C1'].transform(np.sum)

In [89]: df
Out[89]: 
   C1  C2 Key  F1  F2
0   1   7   a   3   1
1   2   8   a   3 NaN
2   3   9   b   7 NaN
3   4  10   b   7   4
4   5  11   c  11  11
5   6  12   c  11  11

This works because the transform operation preserves the index, so it will still align with the original dataframe correctly.
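If you'd rather have every row receive its group's conditional total (instead of NaN on the non-matching rows), one variation is to mask the column with where() before transforming. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

cond = (df['C2'] < 8) | (df['C2'] > 9)

# where() sets C1 to NaN wherever the condition fails; sum() skips NaN,
# so each group's conditional total is broadcast to all of its rows,
# including the rows that failed the condition.
df['F2'] = df['C1'].where(cond).groupby(df['Key']).transform('sum')
```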

If you want to sum the group totals for two columns, the easiest approach is probably something like this. Someone may have something more clever.

In [93]: gb = df.groupby('Key')

In [94]: df['C1+C2'] = gb['C1'].transform(np.sum) + gb['C2'].transform(np.sum)
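Since the sum of group totals equals the group total of the row-wise sums, an equivalent one-pass variant (a sketch) is to add the columns first and group the resulting Series by the key:

```python
import pandas as pd

df = pd.DataFrame({'Key': ['a','a','b','b','c','c'],
                   'C1': [1,2,3,4,5,6],
                   'C2': [7,8,9,10,11,12]})

# Add the columns first, then group the combined Series by the key
# column and transform, so only a single aggregation runs per group.
df['C1+C2'] = (df['C1'] + df['C2']).groupby(df['Key']).transform('sum')
```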

Edit: Here's one other way to get group totals for multiple columns. The syntax isn't really any cleaner, but may be more convenient for a large number of columns.

df['C1_C2'] = gb[['C1','C2']].apply(lambda x: pd.DataFrame(x.sum().sum(), index=x.index, columns=['']))

Upvotes: 6
