Reputation: 125
I have a csv file with multiple columns of which I want to return these 4 columns. There exist duplicates in the Case_ID column with a different start and or completion time. The goal is to return the case_id with the min start_time and max completion_time and the corresponding casetype.
Case_ID Start_time Completion_time Casetype
BZ-06-0501-001 2006-10-24 2007-06-15 'string'
BZ-06-0507-001 2006-10-30 2007-06-18 'string'
BZ-06-0508-001 2006-10-30 2008-09-08 'string'
BZ-06-0509-001 2006-10-30 2008-09-08 'string'
This is the code I use
def f():
appeals = pd.read_csv('appeals')
min_max = appeals[['Case_ID','Start_time','Completion_time']].groupby('Case_ID').agg({'Start_time' : ['min'], 'Completion_time' : ['max']}).assign(Casetype = appeals['Casetype'])
return min_max
The problem is that the casetype column returns as nan. So the column the values in the column are not seen as a string anymore. Did I use assign() wrongly?
Upvotes: 1
Views: 38
Reputation: 1249
One possible solution:
min_max = appeals.groupby('Case_ID').agg({'Start_time' : 'min', 'Completion_time' : 'max', 'Casetype': 'first'}).reset_index()
I removed your selection of columns ([['Case_ID','Start_time','Completion_time']]
) in order to keep all the columns, including Casetype
. Then, I added Casetype
in the aggregation, with the selection of the first occurence, as you said all Casetype
are the same for a given Case_ID
. Then, I removed the .assign()
.
Upvotes: 2