learner

Reputation: 656

How to use numpy.where in a pipe function for pandas dataframe groupby?

Here is a script to simulate the issue I am facing:

import pandas as pd
import numpy as np

data = {
        'a':[1,2,1,1,2,1,1],
        'b':[10,40,20,10,40,10,20],
        'c':[0.3, 0.2, 0.6, 0.4, 0.5, 0.2, 0.8],
        'd':[3, 1, 5, 1, 7, 2., 2.],
        }

df = pd.DataFrame.from_dict(data)

# I apply some custom function to populate column 'e'.
# For demonstration, I am using a very simple function here.
df['e']=df.apply(lambda x: x['c']<=0.3, axis=1)

# This is the column I need to obtain using groupby and pipe/transform
df['f']=[2., 1., 0., 2., 1., 2., 0.]

print(df)

Output:

   a   b    c    d      e    f
0  1  10  0.3  3.0   True  2.0
1  2  40  0.2  1.0   True  1.0
2  1  20  0.6  5.0  False  0.0
3  1  10  0.4  1.0  False  2.0
4  2  40  0.5  7.0  False  1.0
5  1  10  0.2  2.0   True  2.0
6  1  20  0.8  2.0  False  0.0

The logic used to find column f is as follows:

For each group of df.groupby(['a', 'b']): if the group has any rows where 'e' is True, f is the minimum of 'd' over those rows; otherwise f is 0. Every row of the group gets that value.
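
For example, in the (a, b) == (1, 10) group (rows 0, 3 and 5), 'e' is True on the rows where 'd' is 3 and 2, so every row of that group gets f = min(3, 2) = 2. A quick check of that logic on plain numpy arrays:

import numpy as np

# 'd' and 'e' for the (a, b) == (1, 10) group
d = np.array([3.0, 1.0, 2.0])
e = np.array([True, False, True])

selected = d[e]                               # 'd' values where 'e' is True -> [3., 2.]
f = selected.min() if selected.size else 0.0  # fall back to 0 when nothing is True
print(f)                                      # 2.0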

What I have tried:

def func(x):
    print(type(x))
    print(x)
    print('-'*50)

    ind=np.where(x['e']) #<--- How can I implement this?

    if len(ind)>0:
        ind_min=np.argmin(x.iloc[ind]['d'])
        return x.iloc[ind[ind_min]]['d']
    else:
        return 0
    
df['g']=df.groupby(['a', 'b']).pipe(func)

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B348735550>
--------------------------------------------------
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.

I get the error above on the line ind=np.where(x['e']).

So, how can I apply np.where to a pandas.core.groupby.generic.DataFrameGroupBy object?

Upvotes: 2

Views: 82

Answers (2)

rehaqds

Reputation: 2045

One method with a pivot table:

df2 = df[list('abde')].copy() # take only the 4 columns needed
df2['e'] = df2['e'].astype(int)  # transform True/False to 1/0
pt = df2.pivot_table(index=['a', 'b'], columns=['e'], values='d', aggfunc='min').fillna(0)
display(pt) 

e       0    1
a b
1 10  1.0  2.0
  20  2.0  0.0
2 40  7.0  1.0

pt[1] (the e == 1 column) holds the value of column f for each ('a', 'b') pair, which can then be looked up row by row:

df['f_calc'] = df.apply(lambda row: pt[1].loc[row['a'], row['b']], axis=1)
display(df)

   a   b    c    d      e    f  f_calc
0  1  10  0.3  3.0   True  2.0     2.0
1  2  40  0.2  1.0   True  1.0     1.0
2  1  20  0.6  5.0  False  0.0     0.0
3  1  10  0.4  1.0  False  2.0     2.0
4  2  40  0.5  7.0  False  1.0     1.0
5  1  10  0.2  2.0   True  2.0     2.0
6  1  20  0.8  2.0  False  0.0     0.0

Upvotes: 1

ouroboros1

Reputation: 14184

You're presenting the XY problem. Here's one approach:

cond = df['c'] <= 0.3

df['f'] = (
    df.assign(filtered_d=df['d'].where(cond))
    .groupby(['a', 'b'])['filtered_d']
    .transform('min')
    .fillna(0)
    )

Output:

   a   b    c    d      e    f
0  1  10  0.3  3.0   True  2.0
1  2  40  0.2  1.0   True  1.0
2  1  20  0.6  5.0  False  0.0
3  1  10  0.4  1.0  False  2.0
4  2  40  0.5  7.0  False  1.0
5  1  10  0.2  2.0   True  2.0
6  1  20  0.8  2.0  False  0.0

Explanation / Intermediate

  • First, apply Series.where to column 'd' based on the boolean series to keep only the values that we want to consider for min:
# df['d'].where(cond)

0    3.0
1    1.0
2    NaN
3    NaN
4    NaN
5    2.0
6    NaN
Name: d, dtype: float64
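
  • Next, GroupBy.transform('min') computes each group's minimum of 'filtered_d' and broadcasts it back to every row of that group:

# .groupby(['a', 'b'])['filtered_d'].transform('min')

0    2.0
1    1.0
2    NaN
3    2.0
4    1.0
5    2.0
6    NaN
Name: filtered_d, dtype: float64

  • Finally, Series.fillna(0) replaces the NaN of the groups that contain no rows with cond True, which yields column 'f'.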

An alternative way to do this could be:

cond = df['c'] <= 0.3

# note: run this on a df that does not yet have an 'f' column,
# otherwise the merge will suffix the clashing names
df['f'] = (
    df.merge(
        df[cond]
        .groupby(['a', 'b'], as_index=False)
        .agg(f=('d', 'min')),
        on=['a', 'b'],
        how='left'
    )['f']          # merge preserves the row order of df, so this aligns
    .fillna(0)      # groups with no qualifying rows get 0
)

Explanation

df[cond] keeps only the rows whose 'd' values may enter the minimum; groupby/agg computes that minimum per ('a', 'b') group under the name f; the left merge attaches each group's value to every original row; and fillna(0) covers the groups that had no qualifying rows.
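
The intermediate aggregation has one row per group that contains at least one qualifying row:

# df[cond].groupby(['a', 'b'], as_index=False).agg(f=('d', 'min'))

   a   b    f
0  1  10  2.0
1  2  40  1.0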

Upvotes: 2
