Reputation: 131
let's say I have the following dataframe:
df = pd.DataFrame({"ID":['1','4','2','3','4','3','4'],"Color":['Green','Org','Yellow','B','Orange','Blue','Orange']})
what I am trying to do is to replace the value in the col ['Color'] for the records with a duplicated ['ID']. The ['Color'] value has to be replaced with ['Color'] value in the first occurrence of the ['ID']. So that the above dataframe will result in the following:
df = pd.DataFrame({"ID":['1','4','2','3','4','3','4'],"Color":['Green','Org','Yellow','B','Org','B','Org']})
As you can see, for istance, 'Orange' has been replaced with 'Org' since this one was the ['Color'] value at the first occurrence of ['ID']= '4'
The original datset is (8M,140) so I can't drop the duplicates as I will lose important data. I struggle to find a logic that basically links duplicated() and replace(). I have searched the internet but all I can find are solution to replace duplicates with NaN or the same value for all. If I missed the post where this problem has been addressed please direct me to it.
Thank you in advance for you time and help.
Upvotes: 0
Views: 429
Reputation: 150735
You can use duplicated
to extract the first occurrences and then map
:
s = ~df.ID.duplicated()
df['Color'] = df.ID.map(df[s].set_index('ID')['Color'])
Output:
ID Color
0 1 Green
1 4 Org
2 2 Yellow
3 3 B
4 4 Org
5 3 B
6 4 Org
Upvotes: 2