Aly

Reputation: 367

How to mask values in column based on a condition per group

I have a pandas DataFrame like this:

import numpy as np
import pandas as pd

data = {'ID_1':['A', 'A','A', 'B', 'B', 'B'],
        'ID_2':[1, 2, 2, 1, 1, 2],
        'DATE':['2021-11-21', '2021-12-19', '2021-09-05', '2021-11-07', '2021-12-05','2021-12-26'],
        'VALUE': [0.5, 0.5, 0.5, 0.6, 0.6, 0.6]}
 
df = pd.DataFrame(data)

I would like to keep the value in column 'VALUE' only for the row with the lowest date in column 'DATE' within each subset of 'ID_1' and 'ID_2', and set the rest to NaN.
The desired output looks like this:

data = {'ID_1':['A', 'A','A', 'B', 'B', 'B'],
        'ID_2':[1, 2, 2, 1, 1, 2],
        'DATE':['2021-11-21', '2021-12-19', '2021-09-05', '2021-11-07', '2021-12-05','2021-12-26'],
        'VALUE': [0.5, np.NaN, 0.5, 0.6, np.NaN, 0.6]}
 
df = pd.DataFrame(data)

What I tried is to create a function that groups this DataFrame twice, but I end up with ValueError: Length of values (2) does not match length of index (1).

My function:

def foo(val):
    
    def add_mask(val):
        val.reset_index(inplace=True)
        min_date = val['DATE'].min()
        mask = val.DATE == min_date
        return val[mask]
    
    return val.groupby('ID_1').apply(add_mask)

test = df.groupby('ID_2').apply(foo)

Upvotes: 3

Views: 1378

Answers (4)

user2314737

Reputation: 29397

Just another way to do it:

df['VALUE'] = df['VALUE'].where(df.groupby(['ID_1','ID_2']).DATE.transform(lambda x: x==min(x)))

Using the boolean (x==min(x)):

df['is_min'] = df.groupby(['ID_1','ID_2']).DATE.transform(lambda x: x==min(x))
#
#   ID_1  ID_2        DATE  VALUE  is_min
# 0    A     1  2021-11-21    0.5    True
# 1    A     2  2021-12-19    0.5   False
# 2    A     2  2021-09-05    0.5    True
# 3    B     1  2021-11-07    0.6    True
# 4    B     1  2021-12-05    0.6   False
# 5    B     2  2021-12-26    0.6    True
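
If the helper column is kept, it can also be used to mask VALUE in place (a small sketch building on the is_min column above):

df.loc[~df['is_min'], 'VALUE'] = np.NaN   # blank VALUE wherever the date is not the group minimum
df = df.drop(columns='is_min')            # drop the helper column once it has served its purpose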
    

Upvotes: 0

user7864386

Reputation:

You can group by "ID_1" and "ID_2" and transform the min of "DATE" back onto every row of the DataFrame. Then use eq to flag the rows whose date equals their group minimum. Finally, use where to assign NaN to the "VALUE"s on all other rows:

df['VALUE'] = df['VALUE'].where(df.groupby(['ID_1','ID_2'])['DATE'].transform('min').eq(df['DATE']))

Output:

  ID_1  ID_2        DATE  VALUE
0    A     1  2021-11-21    0.5
1    A     2  2021-12-19    NaN
2    A     2  2021-09-05    0.5
3    B     1  2021-11-07    0.6
4    B     1  2021-12-05    NaN
5    B     2  2021-12-26    0.6

The function foo doesn't work because you never use the mask you create in it to modify "VALUE" in each group. If you replace

return val[mask]

with

val['VALUE'] = val['VALUE'].where(mask)
return val

it will produce the expected outcome (you'll need to fix the index but the general structure will be what you expect).
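
Putting it together, a corrected version of foo could look like this sketch (group_keys=False and the final sort_index are my additions, one way to restore the original index, not the only one):

def foo(val):

    def add_mask(val):
        val = val.copy()                          # work on a copy of the group to avoid chained-assignment issues
        min_date = val['DATE'].min()
        mask = val.DATE == min_date
        val['VALUE'] = val['VALUE'].where(mask)   # keep VALUE only on the earliest date
        return val

    # group_keys=False keeps the original index instead of adding the group label
    return val.groupby('ID_1', group_keys=False).apply(add_mask)

test = df.groupby('ID_2', group_keys=False).apply(foo).sort_index()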

Upvotes: 2

quest

Reputation: 3936

Many elegant answers already, but here is how I would go about it:

grp = df.groupby(["ID_1", "ID_2"])

def change(df):
    df.loc[df.DATE != df.DATE.min(), 'VALUE'] = np.nan
    return df

grp.apply(change)

Results in:

  ID_1  ID_2        DATE  VALUE
0    A     1  2021-11-21    0.5
1    A     2  2021-12-19    NaN
2    A     2  2021-09-05    0.5
3    B     1  2021-11-07    0.6
4    B     1  2021-12-05    NaN
5    B     2  2021-12-26    0.6
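
One small note: grp.apply(change) returns a new DataFrame and generally does not modify df in place, so to keep the masked values the result would be assigned back:

df = grp.apply(change)   # assign the result back so the masked VALUEs are kept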

Upvotes: 0

mozway

Reputation: 262124

You can groupby your two ID columns and compute the min with transform('min'), then mask the data where the value is not equal:

df['VALUE'] = df['VALUE'].mask(df['DATE'].ne(df.groupby(['ID_1', 'ID_2'])['DATE'].transform('min')))

Output:

  ID_1  ID_2        DATE  VALUE
0    A     1  2021-11-21    0.5
1    A     2  2021-12-19    NaN
2    A     2  2021-09-05    0.5
3    B     1  2021-11-07    0.6
4    B     1  2021-12-05    NaN
5    B     2  2021-12-26    0.6
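
As a side note (not part of the answer above): DATE holds ISO-formatted strings, which happen to compare correctly as text; converting them with pd.to_datetime first makes the group minimum robust to other date formats, for example:

df['DATE'] = pd.to_datetime(df['DATE'])   # compare as real dates rather than strings
df['VALUE'] = df['VALUE'].mask(df['DATE'].ne(df.groupby(['ID_1', 'ID_2'])['DATE'].transform('min')))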

Upvotes: 4
