Reputation: 471
I'm trying to keep rows with a condition. There are two columns in my sample dataframe. Name column contains duplicated names and the Day column contains days in timedelta64[ns].
df
Name Day
Andrew 23 days
Andrew 5 days
Andrew -345 days
Andrew 84 days
Bess 2 days
Bess -83 days
Bess -123 days
Derek 344 days
Derek 238 days
Leslie 47 days
Leslie 543 days
Leslie 134 days
Leslie -3 days
The operation would keep the element Day column that is closest to 0 by each name in the Name column.
Desired df
Name Day
Andrew 5 days
Bess 2 days
Derek 238 days
Leslie -3 days
Any help would be appreciated! Thanks in advance!!
Upvotes: 1
Views: 125
Reputation: 26676
Here is another way. groupby()
name, find absolute difference, and select smallest using nsmallest(1)
df.groupby('Name')['Day'].apply(lambda x: abs(x).nsmallest(1))
Name
Andrew 1 5 days
Bess 4 2 days
Derek 8 238 days
Leslie 12 3 days
Upvotes: 1
Reputation: 71687
Take the abs
value of column Day
and use Series.groupby
to group the column on Name
and use aggregation function idxmin
to get the indices of values closest to zero
, finally use Series.map
to map the indices to thie corresponding timedelta
values:
s = df['Day'].abs().groupby(df['Name']).idxmin().map(df['Day'])
Result:
# print(s)
Name
Andrew 5 days
Bess 2 days
Derek 238 days
Leslie -3 days
Name: Day, dtype: timedelta64[ns]
Upvotes: 1
Reputation: 28709
Since the aim is to look for values closed to 0, get the absolute value of each entry in the Day
column, look for the minimum index and get the values on those indices. The idea is from a solution by @Ch3ster
df.groupby("Name").agg(lambda x: x[x.abs().idxmin()])
Name
Andrew 5
Bess 2
Derek 238
Leslie -3
dtype: int64
Upvotes: 1
Reputation: 98
I'd go for something like this, assuming that your data frame is called df
. The solution is still valid, even if you have more columns in your data frame.
import numpy as np
(df
.assign(Diff_from_0=np.abs(df['Day']-0)) # assign a temporary column for the calculations
.sort_values('Diff_from_0') # sort values by our temporary column (ascending)
.groupby(level=0) # group by the index (names)
.head(1) # get first row (lowest diff from 0) for each unique index value
.sort_index() # sort the data frame by index, so that you get the same order of names
.drop(columns='Diff_from_0') # drop the temporary column, we do not need it anymore
)
For the sake of completeness, I created df
in the following way:
df = pd.DataFrame({'Day': [23, 5, -345, 84, 2, -83, -123, 344, 238, 47, 543, 134, -3]},
index=['Andrew', 'Andrew', 'Andrew', 'Andrew', 'Bess', 'Bess', 'Bess', 'Derek', 'Derek', 'Leslie', 'Leslie', 'Leslie', 'Leslie'])
Upvotes: 0