marrowgari
marrowgari

Reputation: 427

How to groupby and update values in pandas?

I have a pandas DataFrame that looks similar to the following...

>>> df = pd.DataFrame({
...     'col1':['A','C','B','A','B','C','A'],
...     'col2':[np.nan,1.,np.nan,1.,1.,np.nan,np.nan],
...     'col3':[0,1,9,4,2,3,5],
...     })
>>> df
  col1  col2  col3
0    A   NaN     0
1    C   1.0     1
2    B   NaN     9
3    A   1.0     4
4    B   1.0     2
5    C   NaN     3
6    A   NaN     5

What I would like to do is group the rows of col1 by value and then update any NaN values in col2 to increment in value by 1 based on the last highest value of that group in col1.

So that my expected results would look like the following...

>>> df
  col1  col2  col3
0    A   1.0     4
1    A   2.0     0
2    A   3.0     5
3    B   1.0     2
4    B   2.0     9
5    C   1.0     1
6    C   2.0     3

I believe I can use something like groupby on col1 though I'm unsure how to increment the value in col2 based on the last highest value of the group from col1. I've tried the following, but instead of incrementing the value of col1 it updates the value to all 1.0 and adds an additional column...

>>> df1 = df.groupby(['col1'], as_index=False).agg({'col2': 'min'})
>>> df = pd.merge(df1, df, how='left', left_on=['col1'], right_on=['col1'])
>>> df
  col1  col2_x  col2_y  col3
0    A     1.0     NaN     0
1    A     1.0     1.0     1
2    A     1.0     NaN     5
3    B     1.0     NaN     9
4    B     1.0     1.0     4
5    C     1.0     1.0     2
6    C     1.0     NaN     3

Upvotes: 1

Views: 2774

Answers (2)

jezrael
jezrael

Reputation: 862431

Use GroupBy.cumcount only for rows with missing values, add maximum value per group with GroupBy.transform and max and last replace by original values by fillna:

df = pd.DataFrame({
    'col1':['A','C','B','A','B','B','B'],
     'col2':[np.nan,1.,np.nan,1.,3.,np.nan, 0],
     'col3':[0,1,9,4,2,3,4],
     })
print (df)
  col1  col2  col3
0    A   NaN     0
1    C   1.0     1
2    B   NaN     9
3    A   1.0     4
4    B   3.0     2
5    B   NaN     3
6    B   0.0     4

df = df.sort_values(['col1','col2'], na_position='last')
s = df.groupby('col1')['col2'].transform('max')
df['new'] = (df[df['col2'].isna()]
                   .groupby('col1')
                   .cumcount()
                   .add(1)
                   .add(s)
                   .fillna(df['col2']).astype(int))


print (df)
  col1  col2  col3  new
3    A   1.0     4    1
0    A   NaN     0    2
6    B   0.0     4    0
4    B   3.0     2    3
2    B   NaN     9    4
5    B   NaN     3    5
1    C   1.0     1    1

Upvotes: 3

Joe
Joe

Reputation: 12417

Another way:

df['col2_new'] = df.groupby('col1')['col2'].apply(lambda x: x.replace(np.nan, x.value_counts().index[0]+1))
df = df.sort_values('col1')

Upvotes: 2

Related Questions