seorin_ak
seorin_ak

Reputation: 13

Pandas Identify duplicate records, create a new column and add the ID of first occurrence

I am a newbie in python, so please be mercy with me :)

Let's say, that there is a dataframe like this

ID       B        C       D        E        isDuplicated
1       Blue     Green   Blue     Pink           false
2       Red      Green   Red      Green          false
3       Red      Orange  Yellow   Green          false
4       Blue     Pink    Blue     Pink           false
5       Blue     Orange  Pink     Green          false
6       Blue     Orange  Pink     Green          true
7       Red      Orange  Yellow   Green          true
8       Red      Orange  Yellow   Green          true

If I have duplicates in the rows with the subset= B,C,D,E. Then I would like to add an other column 'firstOccurred', which should have the ID of the first occurrence. My desired dataframe should look like this:

ID       B        C       D        E        isDuplicated        firstOccurred
1       Blue     Green   Blue     Pink           false                         
2       Red      Green   Red      Green          false
3       Red      Orange  Yellow   Green          false
4       Blue     Pink    Blue     Pink           false
5       Blue     Orange  Pink     Green          false
6       Blue     Orange  Pink     Green          true               5
7       Red      Orange  Yellow   Green          true               3
8       Red      Orange  Yellow   Green          true               3

I would be grateful for any help! Thank you in advance!

Upvotes: 1

Views: 501

Answers (1)

jezrael
jezrael

Reputation: 862551

Use GroupBy.transform with first only for roww with True passed in numpy.where:

df['firstOccurred'] = np.where(df['isDuplicated'], 
                               df.groupby(['B','C','D','E'])['ID'].transform('first'), 
                               np.nan)
print (df)
   ID     B       C       D      E  isDuplicated  firstOccurred
0   1  Blue   Green    Blue   Pink         False            NaN
1   2   Red   Green     Red  Green         False            NaN
2   3   Red  Orange  Yellow  Green         False            NaN
3   4  Blue    Pink    Blue   Pink         False            NaN
4   5  Blue  Orange    Pink  Green         False            NaN
5   6  Blue  Orange    Pink  Green          True            5.0
6   7   Red  Orange  Yellow  Green          True            3.0
7   8   Red  Orange  Yellow  Green          True            3.0

Upvotes: 3

Related Questions