Coel Locandro
Coel Locandro

Reputation: 13

Pandas groupby transform max solution?

Context:

My dataframe columns are: ID's (with duplicates) in A, time values B, and string values C

I need to update all C values where C corresponds to the most recent time B for each ID A.

Shows the latest time for ID (or group of IDs), but doesn't include C values:

df.groupby(['A'], as_index=False)['B'].max()

My failed attempt after stumbling onto .transform():

df['C'] = df.groupby('A')['B'].transform(['C'].max())
AttributeError: 'list' object has no attribute 'max'

Data

'a':['zx5','zx5','ab1','ab1','mn3','mn3'],
'b':['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/4/2021'],
'c':['aaa','bbb','ccc','ddd','eee','fff']

Desired result:

'a':['zx5','zx5','ab1','ab1','mn3','mn3'],
'b':['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/4/2021'],
'c':['bbb','bbb','ddd','ddd','eee','eee']

If applicable, I am looking for an efficient solution as this consists of csv's with 100,000+ rows of data.

Edit: Due to comments, I've updated IDs (A) to be less simplistic, consider them as random alphanumeric values where sorting doesn't provide benefit

Upvotes: 1

Views: 1085

Answers (2)

Henry Ecker
Henry Ecker

Reputation: 35686

We can convert b to_datetime if not already (for correct sort behaviour), then sort_values so that the maximal b value is last, then groupby transform last and let index alignment associate the values correctly:

# df['b'] = pd.to_datetime(df['b'])  # Convert to_datetime if not already

df['c'] = df.sort_values('b').groupby('a')['c'].transform('last')

df:

     a          b    c
0  zx5 2021-01-01  bbb
1  zx5 2021-01-02  bbb
2  ab1 2021-01-03  ddd
3  ab1 2021-01-04  ddd
4  mn3 2021-01-05  eee
5  mn3 2021-01-04  eee

Some timing via %timeit:

%timeit df.sort_values('b').groupby('a')['c'].transform('last')
275 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['a'].map(df.sort_values(by='b').groupby('a')['c'].last())
322 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 0

mozway
mozway

Reputation: 262149

You can sort by 'a'+'b', keep the most recent 'b' per 'a' group and use this to map your new column 'c' data:

keys = df.sort_values(by='b').groupby('a')['c'].last()
df['c'] = df['a'].map(keys)

output:

    a          b    c
0  A1 2021-01-01  bbb
1  A1 2021-01-02  bbb
2  A2 2021-01-03  ddd
3  A2 2021-01-04  ddd
4  A3 2021-01-05  eee
5  A3 2021-01-04  eee

Upvotes: 1

Related Questions