Reputation: 11
I have a pandas dataframe that contains a start date and a measurement date.
    Start Date  change Individual measured_date
0   2004-11-23    3341        Bob    2007-07-26
1   2006-06-29    3398        Bob    2007-07-26
2   1997-07-21    2277       Greg    2005-04-21
3   2000-04-11    3380      Nancy    2005-10-14
4   2000-04-11    3380      Nancy    2007-06-28
5   2005-03-29    3115      Nancy    2005-10-14
6   2005-03-29    3115      Nancy    2007-06-28
7   2005-10-15    4294      Nancy    2007-06-28
8   2007-03-16    2163      Nancy    2007-06-28
9   2006-02-18    2299       Jose    2009-12-23
10  2008-11-16    1983       Jose    2009-12-23
11  2009-04-07    2112       Jose    2009-12-23
12  2009-11-14    2036       Jose    2009-12-23
13  2009-11-24    2556       Jose    2009-12-23
For each value in the "Individual" column, I want to select the row with the shortest time between the start date and the measurement date. I created a "diff" column for viewing purposes. For instance, this dataframe should be filtered to:
Start Date  change Individual measured_date  diff
6/29/2006     3398        Bob     7/26/2007  392 days 00:00:00.000000000
7/21/1997     2277       Greg     4/21/2005  2831 days 00:00:00.000000000
3/16/2007     2163      Nancy     6/28/2007  104 days 00:00:00.000000000
11/24/2009    2556       Jose    12/23/2009  29 days 00:00:00.000000000
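For reference, a minimal sketch of how such a "diff" column might be built, assuming both date columns can be parsed with pd.to_datetime. It uses only the first three rows of the data above to keep it short:

```python
import pandas as pd

# First three rows of the dataframe from the question
df = pd.DataFrame({
    "Start Date": ["2004-11-23", "2006-06-29", "1997-07-21"],
    "change": [3341, 3398, 2277],
    "Individual": ["Bob", "Bob", "Greg"],
    "measured_date": ["2007-07-26", "2007-07-26", "2005-04-21"],
})

# Parse the date strings so they can be subtracted
for col in ["Start Date", "measured_date"]:
    df[col] = pd.to_datetime(df[col])

# Subtracting two datetime columns yields a Timedelta column
df["diff"] = df["measured_date"] - df["Start Date"]
print(df)
```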
Upvotes: 1
Views: 186
Reputation: 59264
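You can groupby "Individual" and use nsmallest on the "diff" column. Select the column with brackets, because df.groupby(...).diff refers to the GroupBy diff method rather than the column. Note that this returns only the smallest diff value per individual, not the full rows:
df.groupby("Individual")["diff"].nsmallest(1)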
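If the whole rows are wanted rather than just the diff values, one option is to take the index label of the minimum per group with idxmin and pass it to loc. This is a sketch on a five-row subset of the question's data, and it assumes the dataframe has a unique index:

```python
import pandas as pd

# Subset of the question's data: two rows for Bob, one for Greg, two for Jose
df = pd.DataFrame({
    "Start Date": pd.to_datetime(
        ["2004-11-23", "2006-06-29", "1997-07-21", "2009-11-14", "2009-11-24"]
    ),
    "change": [3341, 3398, 2277, 2036, 2556],
    "Individual": ["Bob", "Bob", "Greg", "Jose", "Jose"],
    "measured_date": pd.to_datetime(
        ["2007-07-26", "2007-07-26", "2005-04-21", "2009-12-23", "2009-12-23"]
    ),
})
df["diff"] = df["measured_date"] - df["Start Date"]

# For each Individual, idxmin returns the index label of the smallest diff
idx = df.groupby("Individual")["diff"].idxmin()

# Use those labels to pull the complete rows
result = df.loc[idx]
print(result)
```

If several rows tie for the smallest diff within a group, idxmin picks only the first of them.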
Upvotes: 0
Reputation: 323226
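Use sort_values, then drop_duplicates. Sort in ascending order (the default) so that the first row kept for each individual is the one with the smallest diff:
df.sort_values('diff').drop_duplicates('Individual')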
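A small runnable sketch of the sort-then-deduplicate approach, using four rows from the question. drop_duplicates keeps the first occurrence by default, so the sort direction decides whether the shortest or the longest diff survives. The final sort_index is optional and only restores the original row order:

```python
import pandas as pd

# Two rows each for Nancy and Jose, taken from the question's data
df = pd.DataFrame({
    "Start Date": pd.to_datetime(
        ["2005-10-15", "2007-03-16", "2009-11-14", "2009-11-24"]
    ),
    "change": [4294, 2163, 2036, 2556],
    "Individual": ["Nancy", "Nancy", "Jose", "Jose"],
    "measured_date": pd.to_datetime(
        ["2007-06-28", "2007-06-28", "2009-12-23", "2009-12-23"]
    ),
})
df["diff"] = df["measured_date"] - df["Start Date"]

shortest = (
    df.sort_values("diff")            # ascending: smallest diff first
      .drop_duplicates("Individual")  # keep="first" is the default
      .sort_index()                   # optional: back to original order
)
print(shortest)
```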
Upvotes: 1