learner

Reputation: 656

How to use numpy.where in a pipe function for pandas dataframe groupby?

Here is a script to simulate the issue I am facing:

import pandas as pd
import numpy as np

data = {
        'a':[1,2,1,1,2,1,1],
        'b':[10,40,20,10,40,10,20],
        'c':[0.3, 0.2, 0.6, 0.4, 0.5, 0.2, 0.8],
        'd':[3, 1, 5, 1, 7, 2., 2.],
        }

df = pd.DataFrame.from_dict(data)

# I apply some custom function to populate column 'e'.
# For demonstration, I am using a very simple function here.
df['e']=df.apply(lambda x: x['c']<=0.3, axis=1)

# This is the column I need to obtain using groupby and pipe/transform
df['f']=[2., 1., 0., 2., 1., 2., 0.]

print(df)

Output:

   a   b    c    d      e    f
0  1  10  0.3  3.0   True  2.0
1  2  40  0.2  1.0   True  1.0
2  1  20  0.6  5.0  False  0.0
3  1  10  0.4  1.0  False  2.0
4  2  40  0.5  7.0  False  1.0
5  1  10  0.2  2.0   True  2.0
6  1  20  0.8  2.0  False  0.0

The logic used to find column f is as follows:

For each group of df.groupby(['a', 'b']): if the group has any rows where 'e' is True, f is the minimum of 'd' over those rows; otherwise f is 0. Every row of the group gets that value.
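
For example, in the (a, b) == (1, 10) group (rows 0, 3 and 5), 'e' is True on the rows where 'd' is 3 and 2, so every row of that group gets f = min(3, 2) = 2. A quick check of that logic on plain numpy arrays:

import numpy as np

# 'd' and 'e' for the (a, b) == (1, 10) group
d = np.array([3.0, 1.0, 2.0])
e = np.array([True, False, True])

selected = d[e]                               # 'd' values where 'e' is True -> [3., 2.]
f = selected.min() if selected.size else 0.0  # fall back to 0 when nothing is True
print(f)                                      # 2.0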

What I have tried:

def func(x):
    print(type(x))
    print(x)
    print('-'*50)

    ind=np.where(x['e']) #<--- How can I implement this?

    if len(ind)>0:
        ind_min=np.argmin(x.iloc[ind]['d'])
        return x.iloc[ind[ind_min]]['d']
    else:
        return 0
    
df['g']=df.groupby(['a', 'b']).pipe(func)

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B348735550>
--------------------------------------------------
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.

I get the error above on the line ind=np.where(x['e']).

So, how can I apply np.where to a pandas.core.groupby.generic.DataFrameGroupBy object?

Upvotes: 2

Views: 82

Answers (2)

rehaqds

Reputation: 2045

One method with a pivot table:

df2 = df[list('abde')].copy() # take only the 4 columns needed
df2['e'] = df2['e'].astype(int)  # transform True/False to 1/0
pt = df2.pivot_table(index=['a', 'b'], columns=['e'], values='d', aggfunc='min').fillna(0)
display(pt) 

e       0    1
a b
1 10  1.0  2.0
  20  2.0  0.0
2 40  7.0  1.0

pt[1] (the e == 1 column) holds the value of column f for each ('a', 'b') pair, which can then be looked up row by row:

df['f_calc'] = df.apply(lambda row: pt[1].loc[row['a'], row['b']], axis=1)
display(df)

   a   b    c    d      e    f  f_calc
0  1  10  0.3  3.0   True  2.0     2.0
1  2  40  0.2  1.0   True  1.0     1.0
2  1  20  0.6  5.0  False  0.0     0.0
3  1  10  0.4  1.0  False  2.0     2.0
4  2  40  0.5  7.0  False  1.0     1.0
5  1  10  0.2  2.0   True  2.0     2.0
6  1  20  0.8  2.0  False  0.0     0.0

Upvotes: 1

ouroboros1

Reputation: 14184

You're presenting the XY problem. Here's one approach:

cond = df['c'] <= 0.3

df['f'] = (
    df.assign(filtered_d=df['d'].where(cond))
    .groupby(['a', 'b'])['filtered_d']
    .transform('min')
    .fillna(0)
    )

Output:

   a   b    c    d      e    f
0  1  10  0.3  3.0   True  2.0
1  2  40  0.2  1.0   True  1.0
2  1  20  0.6  5.0  False  0.0
3  1  10  0.4  1.0  False  2.0
4  2  40  0.5  7.0  False  1.0
5  1  10  0.2  2.0   True  2.0
6  1  20  0.8  2.0  False  0.0

Explanation / Intermediate

  • First, apply Series.where to column 'd' based on the boolean series to keep only the values that we want to consider for min:
# df['d'].where(cond)

0    3.0
1    1.0
2    NaN
3    NaN
4    NaN
5    2.0
6    NaN
Name: d, dtype: float64
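
  • Next, GroupBy.transform('min') computes each group's minimum of 'filtered_d' and broadcasts it back to every row of that group:

# .groupby(['a', 'b'])['filtered_d'].transform('min')

0    2.0
1    1.0
2    NaN
3    2.0
4    1.0
5    2.0
6    NaN
Name: filtered_d, dtype: float64

  • Finally, Series.fillna(0) replaces the NaN of the groups that contain no rows with cond True, which yields column 'f'.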

An alternative way to do this could be:

cond = df['c'] <= 0.3

# note: run this on a df that does not yet have an 'f' column,
# otherwise the merge will suffix the clashing names
df['f'] = (
    df.merge(
        df[cond]
        .groupby(['a', 'b'], as_index=False)
        .agg(f=('d', 'min')),
        on=['a', 'b'],
        how='left'
    )['f']          # merge preserves the row order of df, so this aligns
    .fillna(0)      # groups with no qualifying rows get 0
)

Explanation

df[cond] keeps only the rows whose 'd' values may enter the minimum; groupby/agg computes that minimum per ('a', 'b') group under the name f; the left merge attaches each group's value to every original row; and fillna(0) covers the groups that had no qualifying rows.
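
The intermediate aggregation has one row per group that contains at least one qualifying row:

# df[cond].groupby(['a', 'b'], as_index=False).agg(f=('d', 'min'))

   a   b    f
0  1  10  2.0
1  2  40  1.0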

Upvotes: 2
