Reputation: 13
Context:
My dataframe columns are: ID's (with duplicates) in A
, time values B
, and string values C
I need to update all C
values where C
corresponds to the most recent time B
for each ID A
.
Shows the latest time for ID (or group of IDs), but doesn't include C values:
df.groupby(['A'], as_index=False)['B'].max()
My failed attempt after stumbling onto .transform()
:
df['C'] = df.groupby('A')['B'].transform(['C'].max())
AttributeError: 'list' object has no attribute 'max'
Data
'a':['zx5','zx5','ab1','ab1','mn3','mn3'],
'b':['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/4/2021'],
'c':['aaa','bbb','ccc','ddd','eee','fff']
Desired result:
'a':['zx5','zx5','ab1','ab1','mn3','mn3'],
'b':['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/4/2021'],
'c':['bbb','bbb','ddd','ddd','eee','eee']
If applicable, I am looking for an efficient solution as this consists of csv's with 100,000+ rows of data.
Edit: Due to comments, I've updated IDs (A) to be less simplistic, consider them as random alphanumeric values where sorting doesn't provide benefit
Upvotes: 1
Views: 1085
Reputation: 35686
We can convert b
to_datetime
if not already (for correct sort behaviour), then sort_values
so that the maximal b
value is last, then groupby transform
last and let index alignment associate the values correctly:
# df['b'] = pd.to_datetime(df['b']) # Convert to_datetime if not already
df['c'] = df.sort_values('b').groupby('a')['c'].transform('last')
df
:
a b c
0 zx5 2021-01-01 bbb
1 zx5 2021-01-02 bbb
2 ab1 2021-01-03 ddd
3 ab1 2021-01-04 ddd
4 mn3 2021-01-05 eee
5 mn3 2021-01-04 eee
Some timing via %timeit
:
%timeit df.sort_values('b').groupby('a')['c'].transform('last')
275 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['a'].map(df.sort_values(by='b').groupby('a')['c'].last())
322 µs ± 8.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 0
Reputation: 262149
You can sort by 'a'+'b', keep the most recent 'b' per 'a' group and use this to map your new column 'c' data:
keys = df.sort_values(by='b').groupby('a')['c'].last()
df['c'] = df['a'].map(keys)
output:
a b c
0 A1 2021-01-01 bbb
1 A1 2021-01-02 bbb
2 A2 2021-01-03 ddd
3 A2 2021-01-04 ddd
4 A3 2021-01-05 eee
5 A3 2021-01-04 eee
Upvotes: 1