Reputation: 451
There are several columns in the data, three are named "candidate_id", "enddate", "TitleLevel".
Within the same id, if the enddate is the same, I will delete the lower level record.
For example, given:
candidate_id startdate enddate TitleLevel
1 2012.1.1 2013.5.1 2
1 2011.1.1 2013.5.1 4
1 2008.12.1 2010.1.1 3
2 2010.10.1 2012.12.1 2
What I want is:
candidate_id startdate enddate TitleLevel
1 2011.1.1 2013.5.1 4
1 2008.12.1 2010.1.1 3
2 2010.10.1 2012.12.1 2
I will delete candidate_id=1
, enddate=2013.5.1
, and titlelevel=2
.
I have come up with a loop.
for i in range(nrow-2,-1, -1):
if (JobData['enddate'][i] == JobData['enddate'][i+1]
and JobData['candidate_id'][i] == JobData['candidate_id'][i+1]
and pd.notnull(JobData['enddate'][i]):
if JobData['TitleLevel'][i] > JobData['TitleLevel'][i+1]:
JobData= JobData.drop(i+1)
else:
JobData= JobData.drop(i)
The loop really takes some time to delete redundant rows. Is there a faster method?
Upvotes: 5
Views: 380
Reputation: 375445
Assuming that data is sorted by startdate (at least within each group), you can use groupby last:
In [11]: df.groupby(['candidate_id', 'enddate'], as_index=False).last()
Out[11]:
candidate_id enddate startdate TitleLevel
0 1 2010.1.1 2008.12.1 3
1 1 2013.5.1 2011.1.1 4
2 2 2012.12.1 2010.10.1 2
Upvotes: 1
Reputation: 48317
If you data structure is exactly as you describe, you can use groupby
/max
:
>>> df
candidate_id enddate TitleLevel
0 1 2013.5.1 2
1 1 2013.5.1 4
2 1 2010.1.1 3
3 2 2012.12.1 2
>>> df.groupby(['candidate_id','enddate']).max().reset_index()
candidate_id enddate TitleLevel
0 1 2010.1.1 3
1 1 2013.5.1 4
2 2 2012.12.1 2
Here groupby groups rows with equal candidate_id
and enddate
, and max()
evaluates maximum TitleLevel
within each group. Result is the same as if rows with all other values being dropped.
In case you have more columns,
>>> df
candidate_id enddate TitleLevel other_column
0 1 2013.5.1 2 foo
1 1 2013.5.1 4 bar
2 1 2010.1.1 3 foobar
3 2 2012.12.1 2 barfoo
you can get idexes of rows with max values, without sorting if rows order has to be preserved:
>>> idx = df.groupby(['candidate_id','enddate'], sort=False)['TitleLevel'].agg(lambda x: x.idxmax())
and filter needed rows with ix
:
>>> df.ix[idx]
candidate_id enddate TitleLevel other_column
1 1 2013.5.1 4 bar
2 1 2010.1.1 3 foobar
3 2 2012.12.1 2 barfoo
Upvotes: 2