Reputation: 59
I have a dataframe with thousands of rows; in each row, some values are duplicated. I'm looking to remove these duplicates and only keep unique values.
To illustrate, this is the raw data:
  Column 1 Column 2 Column 3
0        A        B        A
1        D        C        C
2        E        E        E
3        F        G        H
into:
  Column 1 Column 2 Column 3
0        A        B
1        D        C
2        E
3        F        G        H
I've tried df.drop_duplicates, but that removes entire duplicate rows rather than duplicate values within each row.
Upvotes: 2
Views: 542
Reputation: 2110
You can apply drop_duplicates row-wise:
df.apply(lambda x: x.drop_duplicates(), axis=1)
  Column_1 Column_2 Column_3
0        A        B      NaN
1        D        C      NaN
2        E      NaN      NaN
3        F        G        H
EDIT:
Based on the suggestion from DeepSpace, I did some timings to see whether passing pd.Series.drop_duplicates directly would be faster than using a lambda.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column_1': np.random.randint(1, 10, 10**4),
                   'Column_2': np.random.randint(1, 10, 10**4),
                   'Column_3': np.random.randint(1, 10, 10**4)})
%timeit df.apply(lambda x: x.drop_duplicates(), axis=1)
435 ms ± 8.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.apply(pd.Series.drop_duplicates, axis=1)
443 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Unfortunately both solutions are rather slow (roughly half a second for just 10,000 rows).
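If speed matters, one workaround (not part of the answer above, so treat it as a sketch) is to loop over the few columns instead of the many rows, using NumPy broadcasting to mark every value that repeats an earlier value in the same row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column_1': list('ADEF'),
                   'Column_2': list('BCEG'),
                   'Column_3': list('ACEH')})

# Compare each column against all columns to its left; a value is a
# within-row duplicate if it matches any earlier value in that row.
arr = df.to_numpy()
mask = np.zeros(arr.shape, dtype=bool)
for j in range(1, arr.shape[1]):
    mask[:, j] = (arr[:, :j] == arr[:, [j]]).any(axis=1)

result = df.mask(mask)  # duplicated values become NaN
```

Because the Python-level loop runs once per column rather than once per row, this should scale much better on tall frames, though I have not benchmarked it against the timings above.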
Upvotes: 11