user3468356
user3468356

Reputation: 59

How to remove duplicate values in a row?

I have a dataframe with thousands of rows; in each row, some values are duplicated. I'm looking to remove these duplicates and only keep unique values.

To illustrate, this is the raw data:

  Column 1 Column 2 Column 3
0        A        B        A
1        D        C        C
2        E        E        E
3        F        G        H

into:

  Column 1 Column 2 Column 3
0        A        B        
1        D        C        
2        E                
3        F        G        H

I've tried applying df.drop_duplicates but it drops duplicate values across columns rather than rows.

Upvotes: 2

Views: 542

Answers (1)

P.Tillmann
P.Tillmann

Reputation: 2110

You can apply drop_duplicates row-wise:

df.apply(lambda x: x.drop_duplicates(), axis=1)

  Column_1 Column_2 Column_3
0        A        B      NaN
1        D        C      NaN
2        E      NaN      NaN
3        F        G        H

EDIT:

Based on the suggestion from DeepSpace I did some timings to see if applying pd.Series.drop_duplicates would be faster than using a lambda.

df = pd.DataFrame({'Column_1':np.random.randint(1,10,10**4),
                   'Column_2':np.random.randint(1,10,10**4),
                   'Column_3':np.random.randint(1,10,10**4)})

%timeit df.apply(lambda x: x.drop_duplicates(), axis=1)
435 ms ± 8.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.apply(pd.Series.drop_duplicates, axis=1)
443 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Unfortunately both solutions are rather slow (1000 rows already taking half a second).

Upvotes: 11

Related Questions