Matthias Gallagher
Matthias Gallagher

Reputation: 471

Keep row that the element is closest to 0 in specific column

I'm trying to keep rows with a condition. There are two columns in my sample dataframe. Name column contains duplicated names and the Day column contains days in timedelta64[ns].

df

Name     Day
Andrew   23 days
Andrew   5 days
Andrew   -345 days
Andrew   84 days
Bess     2 days
Bess     -83 days
Bess     -123 days
Derek    344 days
Derek    238 days
Leslie   47 days
Leslie   543 days
Leslie   134 days
Leslie   -3 days

The operation would keep the element Day column that is closest to 0 by each name in the Name column.

Desired df

Name     Day
Andrew   5 days
Bess     2 days
Derek    238 days
Leslie   -3 days

Any help would be appreciated! Thanks in advance!!

Upvotes: 1

Views: 125

Answers (4)

wwnde
wwnde

Reputation: 26676

Here is another way. groupby() name, find absolute difference, and select smallest using nsmallest(1)

 df.groupby('Name')['Day'].apply(lambda x: abs(x).nsmallest(1))



Name      
Andrew  1      5 days
Bess    4      2 days
Derek   8    238 days
Leslie  12     3 days

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71687

Take the abs value of column Day and use Series.groupby to group the column on Name and use aggregation function idxmin to get the indices of values closest to zero, finally use Series.map to map the indices to thie corresponding timedelta values:

s = df['Day'].abs().groupby(df['Name']).idxmin().map(df['Day'])

Result:

# print(s)

Name
Andrew     5 days
Bess       2 days
Derek    238 days
Leslie    -3 days
Name: Day, dtype: timedelta64[ns]

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28709

Since the aim is to look for values closed to 0, get the absolute value of each entry in the Day column, look for the minimum index and get the values on those indices. The idea is from a solution by @Ch3ster

df.groupby("Name").agg(lambda x: x[x.abs().idxmin()])

Name
Andrew      5
Bess        2
Derek     238
Leslie     -3
dtype: int64

Upvotes: 1

Garrus990
Garrus990

Reputation: 98

I'd go for something like this, assuming that your data frame is called df. The solution is still valid, even if you have more columns in your data frame.

import numpy as np

(df
 .assign(Diff_from_0=np.abs(df['Day']-0))  # assign a temporary column for the calculations
 .sort_values('Diff_from_0')  # sort values by our temporary column (ascending)
 .groupby(level=0)  # group by the index (names)
 .head(1)  # get first row (lowest diff from 0) for each unique index value
 .sort_index()  # sort the data frame by index, so that you get the same order of names
 .drop(columns='Diff_from_0')  # drop the temporary column, we do not need it anymore
)

For the sake of completeness, I created df in the following way:

df = pd.DataFrame({'Day': [23, 5, -345, 84, 2, -83, -123, 344, 238, 47, 543, 134, -3]}, 
                 index=['Andrew', 'Andrew', 'Andrew', 'Andrew', 'Bess', 'Bess', 'Bess', 'Derek', 'Derek', 'Leslie', 'Leslie', 'Leslie', 'Leslie'])

Upvotes: 0

Related Questions