naveen kumar
naveen kumar

Reputation: 197

Find duplicate values and label value by comparing with previous years in pandas data frame

I have a data frame where I need to identify entries which are repeating from previous years.

Input:

df1 = pd.DataFrame({'type': ['cst1', 'cst1', 'cst2','cst1','cst2','cst3','cst2','cst1','cst2','cst4','cst5','cst3'],
                   'year': [2017, 2017, 2017,2018,2018,2018,2018,2019,2019,2019,2019,2020]})

    type  year
0   cst1  2017
1   cst1  2017
2   cst2  2017
3   cst1  2018
4   cst2  2018
5   cst3  2018
6   cst2  2018
7   cst1  2019
8   cst2  2019
9   cst4  2019
10  cst5  2019
11  cst3  2020

From the above data fame compare type year wise and identify entries which are not new.

ex: first 2017 since it is starting year all entries are considered new, when identifying duplicates in 2018 need to compare with all entries of 2017 cst1 and cst2 are duplicates. 2019 should include all entries of 2018 and 2017 to identify duplicates.

output:
    type  year  status
0   cst1  2017       0
1   cst1  2017       0
2   cst2  2017       0
3   cst1  2018       1
4   cst2  2018       1
5   cst3  2018       0
6   cst2  2018       1
7   cst1  2019       1
8   cst2  2019       1
9   cst4  2019       0
10  cst5  2019       0
11  cst3  2020       1

in the output for 2020 cst3 is identified as duplicate even though 2019 doesn't not contain type cst3. while comparing each increasing year need to consider all the passing years to identify duplicates here 2018 have type cst3 so that it is identified as duplicate and labeled as 1

Upvotes: 0

Views: 260

Answers (1)

Pablo C
Pablo C

Reputation: 4761

You can get the minimum year per group and then check if the rows in your data frame are in those minimums:

pd.merge(df1, df1.groupby("type").min().reset_index(), "outer", indicator = "status")\
    .replace({"status": {"both": 0, "left_only": 1}}) 

Output

    type  year  status
0   cst1  2017       0
1   cst1  2017       0
2   cst2  2017       0
3   cst1  2018       1
4   cst2  2018       1
5   cst2  2018       1
6   cst3  2018       0
7   cst1  2019       1
8   cst2  2019       1
9   cst4  2019       0
10  cst5  2019       0
11  cst3  2020       1

DOCS

  1. pandas.DataFrame.groupby
  2. pandas.merge
  3. pandas.DataFrame.replace

Upvotes: 2

Related Questions