Reputation: 951
I'm cleaning up a dataframe to train a machine learning model and I found that some entries have two different values in one column. For example:
A | B |
---|---|
1234 | foo |
1234 | bar |
Since the value in column A is 1234 for both entries, the value in column B should be foo (or bar) in both cases.
I tried a brute force approach to this:
for index1, row1 in df.iterrows():
    for index2, row2 in df.iterrows():
        if row1['A'] == row2['A'] and row1['B'] != row2['B']:
            print('Found duplicated A with different B!')
            # iterrows() yields copies, so assign (=) through df.at rather than compare (==) on the row
            df.at[index1, 'B'] = row2['B']
            df.at[index1, 'C'] = df.at[index2, 'C'] = False
There is probably an easier way to do this that I can't see. Does pandas have a method for this?
Upvotes: 0
Views: 84
Reputation: 150745
You can use groupby.transform('first') (or 'last'):
df['B'] = df.groupby('A')['B'].transform('first')
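For illustration, here is a minimal sketch of the one-liner above applied to data shaped like the table in the question. The sample frame, the extra 5678 row, and the names conflicts and conflicting_keys are assumptions made up for this example; the nunique check is optional and only shows which A values disagree before they are overwritten.

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question,
# plus one extra row (5678) that has no conflict.
df = pd.DataFrame({"A": [1234, 1234, 5678], "B": ["foo", "bar", "baz"]})

# Optional check: count the distinct B values per A and keep the
# A values that map to more than one B.
conflicts = df.groupby("A")["B"].nunique()
conflicting_keys = conflicts[conflicts > 1].index.tolist()

# Overwrite B with the first B seen in each A group.
# transform returns a Series aligned with df's index, so it can be
# assigned straight back to the column.
df["B"] = df.groupby("A")["B"].transform("first")

print(conflicting_keys)
print(df)
```

Note that 'first' takes the first non-null value in each group, so a missing B is filled from another row with the same A if one exists. Swap in 'last' to keep the last value instead.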
Upvotes: 2