Reputation: 153
I have a large list of IDs that repeat with different ranges of dates. I need to create a unique list of IDs with just one range of dates that includes the earliest start date and latest end date from the uncollapsed list.
this is an example of what I have:
id start_date end_date
1 9/25/2015 10/12/2015
1 9/16/2015 11/1/2015
1 8/25/2015 9/21/2015
2 9/2/2015 10/29/2015
3 9/18/2015 10/15/2015
3 9/19/2015 9/30/2015
4 8/27/2015 9/15/2015
And this is what I need.
id start_date end_date
1 8/25/2015 11/1/2015
2 9/2/2015 10/29/2015
3 9/18/2015 10/15/2015
4 8/27/2015 9/15/2015
I'm trying to get this in Python, but not having much luck. Thanks!
Upvotes: 1
Views: 54
Reputation: 880269
Use groupby/aggregate
:
In [12]: df.groupby('id').agg({'start_date':min, 'end_date':max})
Out[12]:
start_date end_date
id
1 2015-08-25 2015-11-01
2 2015-09-02 2015-10-29
3 2015-09-18 2015-10-15
4 2015-08-27 2015-09-15
Note that it is important that start_date
and end_date
be parsed as dates, so that min
and max
return the minimum and maximum dates for each id
. If the values are merely string representations of dates, then min
and max
would give the string min or max which depends on string lexicographic order. If the date-strings were in YYYY/MM/DD
format, then lexicographic order would correspond to parsed-date order, but date-strings in the MM/DD/YYYY
format do not have this property.
If start_date
and end_date
have string values, then
for col in ['start_date', 'end_date']:
df[col] = pd.to_datetime(df[col])
would convert the strings into dates.
If you are loading the DataFrame from a file using pd.read_table
(or pd.read_csv
), then
df = pd.read_table(filename, ..., parse_dates=[1, 2])
would parse the strings in the second and third columns of the file as dates. [1, 2]
corresponds to the second and third columns since Python uses 0-based indexing.
Upvotes: 2