Reputation: 197
I have a data frame where I need to identify entries which are repeating from previous years.
Input:
df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst2','cst1','cst2','cst3','cst2','cst1','cst2','cst4','cst5','cst3'],
'year': [2017, 2017, 2017,2018,2018,2018,2018,2019,2019,2019,2019,2020]})
type year
0 cst1 2017
1 cst1 2017
2 cst2 2017
3 cst1 2018
4 cst2 2018
5 cst3 2018
6 cst2 2018
7 cst1 2019
8 cst2 2019
9 cst4 2019
10 cst5 2019
11 cst3 2020
From the above data fame compare type year wise and identify entries which are not new.
ex: first 2017 since it is starting year all entries are considered new, when identifying duplicates in 2018 need to compare with all entries of 2017 cst1 and cst2 are duplicates. 2019 should include all entries of 2018 and 2017 to identify duplicates.
output:
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst3 2018 0
6 cst2 2018 1
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
in the output for 2020 cst3 is identified as duplicate even though 2019 doesn't not contain type cst3. while comparing each increasing year need to consider all the passing years to identify duplicates here 2018 have type cst3 so that it is identified as duplicate and labeled as 1
Upvotes: 0
Views: 260
Reputation: 4761
You can get the minimum year per group and then check if the rows in your data frame are in those minimums:
pd.merge(df1, df1.groupby("type").min().reset_index(), "outer", indicator = "status")\
.replace({"status": {"both": 0, "left_only": 1}})
Output
type year status
0 cst1 2017 0
1 cst1 2017 0
2 cst2 2017 0
3 cst1 2018 1
4 cst2 2018 1
5 cst2 2018 1
6 cst3 2018 0
7 cst1 2019 1
8 cst2 2019 1
9 cst4 2019 0
10 cst5 2019 0
11 cst3 2020 1
DOCS
Upvotes: 2