Ben
Ben

Reputation: 153

Collapsing list to unique IDs with a range of dates

I have a large list of IDs that repeat with different ranges of dates. I need to create a unique list of IDs with just one range of dates that includes the earliest start date and latest end date from the uncollapsed list.

this is an example of what I have:

    id  start_date  end_date
    1   9/25/2015   10/12/2015
    1   9/16/2015   11/1/2015
    1   8/25/2015   9/21/2015
    2   9/2/2015    10/29/2015
    3   9/18/2015   10/15/2015
    3   9/19/2015   9/30/2015
    4   8/27/2015   9/15/2015

And this is what I need.

   id   start_date  end_date
   1    8/25/2015   11/1/2015
   2    9/2/2015    10/29/2015
   3    9/18/2015   10/15/2015
   4    8/27/2015   9/15/2015 

I'm trying to get this in Python, but not having much luck. Thanks!

Upvotes: 1

Views: 54

Answers (1)

unutbu
unutbu

Reputation: 880269

Use groupby/aggregate:

In [12]: df.groupby('id').agg({'start_date':min, 'end_date':max})
Out[12]: 
   start_date   end_date
id                      
1  2015-08-25 2015-11-01
2  2015-09-02 2015-10-29
3  2015-09-18 2015-10-15
4  2015-08-27 2015-09-15

Note that it is important that start_date and end_date be parsed as dates, so that min and max return the minimum and maximum dates for each id. If the values are merely string representations of dates, then min and max would give the string min or max which depends on string lexicographic order. If the date-strings were in YYYY/MM/DD format, then lexicographic order would correspond to parsed-date order, but date-strings in the MM/DD/YYYY format do not have this property.

If start_date and end_date have string values, then

for col in ['start_date', 'end_date']:
    df[col] = pd.to_datetime(df[col])

would convert the strings into dates.

If you are loading the DataFrame from a file using pd.read_table (or pd.read_csv), then

df = pd.read_table(filename, ..., parse_dates=[1, 2])

would parse the strings in the second and third columns of the file as dates. [1, 2] corresponds to the second and third columns since Python uses 0-based indexing.

Upvotes: 2

Related Questions