kishore
kishore

Reputation: 3

keep rows with latest date for matching ids. Remove all rows with old dates for matching ids

From data frame ignore hosts only with column:STATUS in "decommissioned".

For same hosts with status in both "decommissioned" and "active" state select rows with LAST_MODIFIED greater than others.

Input.csv

LAST_MODIFIED,       HOST,   STATUS
7/01/2019 10:20:00,  host1,  decommissioned
6/01/2019 02:10:02,  host1,  active
6/01/2019 02:10:02,  host1,  active
5/01/2019 02:10:02,  host1,  decommissioned
6/20/2019 10:20:00,  host2,  active
6/10/2019 01:20:02,  host3,  decommissioned
6/01/2019 02:10:00,  host3,  decommissioned

output.csv

LAST_MODIFIED,       HOST,   STATUS
7/01/2019 10:20:00,  host1,  decommissioned
6/20/2019 10:20:00,  host2,  active

Upvotes: 0

Views: 84

Answers (1)

ALollz
ALollz

Reputation: 59549

Create a mask using transform + any to find groups that have at least one active element. Mask then sort + groupby + tail to get the greatest 'LAST_MODIFIED' row.

#df['LAST_MODIFIED'] = pd.to_datetime(df.LAST_MODIFIED)

m = df.STATUS.eq('active').groupby(df.HOST).transform('any')
res = df[m].sort_values('LAST_MODIFIED').groupby('HOST').tail(1)

Ouput: res:

        LAST_MODIFIED   HOST          STATUS
4 2019-06-20 10:20:00  host2          active
0 2019-07-01 10:20:00  host1  decommissioned

Upvotes: 1

Related Questions