Simone Di Claudio
Simone Di Claudio

Reputation: 131

Pandas replacing values of duplicated ID with values of the first occurrence

let's say I have the following dataframe:

df = pd.DataFrame({"ID":['1','4','2','3','4','3','4'],"Color":['Green','Org','Yellow','B','Orange','Blue','Orange']})

what I am trying to do is to replace the value in the col ['Color'] for the records with a duplicated ['ID']. The ['Color'] value has to be replaced with ['Color'] value in the first occurrence of the ['ID']. So that the above dataframe will result in the following:

df = pd.DataFrame({"ID":['1','4','2','3','4','3','4'],"Color":['Green','Org','Yellow','B','Org','B','Org']})

As you can see, for istance, 'Orange' has been replaced with 'Org' since this one was the ['Color'] value at the first occurrence of ['ID']= '4'

The original datset is (8M,140) so I can't drop the duplicates as I will lose important data. I struggle to find a logic that basically links duplicated() and replace(). I have searched the internet but all I can find are solution to replace duplicates with NaN or the same value for all. If I missed the post where this problem has been addressed please direct me to it.

Thank you in advance for you time and help.

Upvotes: 0

Views: 429

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150735

You can use duplicated to extract the first occurrences and then map:

s = ~df.ID.duplicated()
df['Color'] = df.ID.map(df[s].set_index('ID')['Color'])

Output:

  ID   Color
0  1   Green
1  4     Org
2  2  Yellow
3  3       B
4  4     Org
5  3       B
6  4     Org

Upvotes: 2

Related Questions