Akash Deep Biswas
Akash Deep Biswas

Reputation: 31

Remove duplicated rows irrespective of item order pandas

I have 4 columns of data A, B, C, D. All the data are repeating for example row 1: P1 XX P1 LU is repeating in row 11: P1 LU P1 XX. Can anyone help me to remove the repeating units from Pandas dataframe?

Input:

A    B    C    D

P1   XX   P1   LU

P1   XX   P9   LU

P1   XX   P3   LU

P10  XX   P10  LU

P2   XX   P8   LU

P4   XX   P1   LU

P6   XX   P8   LU

P7   XX   P10  LU

P9   XX   P3   LU

P1   LU   P4   XX

P1   LU   P1   XX

P10  LU   P7   XX

P10  LU   P10  XX

P3   LU   P1   XX

P3   LU   P9   XX

P8   LU   P2   XX

P8   LU   P6   XX

P9   LU   P1   XX

The Output should look like (or vice-versa):

A    B    C    D

P1   XX   P1   LU

P1   XX   P9   LU

P2   XX   P8   LU

P7   XX   P10  LU

P9   XX   P3   LU

P1   LU   P4   XX

P10  LU   P7   XX

P3   LU   P1   XX

P8   LU   P6   XX

Thanks a ton in advance

Upvotes: 2

Views: 118

Answers (2)

Umar.H
Umar.H

Reputation: 23099

It sounds like you have rows of data that can come in any order, and you need to de-dupe irrespective of that order.

To achieve this, we can use pd.factorize and turn the objects into numerical variables, then apply a sum row-wise.

Using apply here would be wasteful, so let's re-shape your dataframe a little.

s = df.stack().to_frame(0)

df1 = df.assign(
            ky=df.index.map(
                s.assign(ky=s[0].factorize()[0]
                        ).groupby(level=[0]).sum()['ky']))

print(df1)

    A   B    C   D  ky
0    P1  XX   P1  LU   3
1    P1  XX   P9  LU   6
2    P1  XX   P3  LU   7
3   P10  XX  P10  LU  13
4    P2  XX   P8  LU  16
5    P4  XX   P1  LU  11
6    P6  XX   P8  LU  19
7    P7  XX  P10  LU  18
8    P9  XX   P3  LU  10
9    P1  LU   P4  XX  11
10   P1  LU   P1  XX   3
11  P10  LU   P7  XX  18
12  P10  LU  P10  XX  13
13   P3  LU   P1  XX   7
14   P3  LU   P9  XX  10
15   P8  LU   P2  XX  16
16   P8  LU   P6  XX  19
17   P9  LU   P1  XX   6

you can now de-dupe based on your ky column.

print(df1.drop_duplicates(subset=['ky'],keep='first'))


  A   B    C   D  ky
0   P1  XX   P1  LU   3
1   P1  XX   P9  LU   6
2   P1  XX   P3  LU   7
3  P10  XX  P10  LU  13
4   P2  XX   P8  LU  16
5   P4  XX   P1  LU  11
6   P6  XX   P8  LU  19
7   P7  XX  P10  LU  18
8   P9  XX   P3  LU  10

or inspect the duplicates.

print(df1[df1.duplicated(subset='ky',keep=False)].sort_values('ky'))

      A   B    C   D  ky
0    P1  XX   P1  LU   3
10   P1  LU   P1  XX   3
17   P9  LU   P1  XX   6
1    P1  XX   P9  LU   6
13   P3  LU   P1  XX   7
2    P1  XX   P3  LU   7
14   P3  LU   P9  XX  10
8    P9  XX   P3  LU  10
5    P4  XX   P1  LU  11
9    P1  LU   P4  XX  11
3   P10  XX  P10  LU  13
12  P10  LU  P10  XX  13
4    P2  XX   P8  LU  16
15   P8  LU   P2  XX  16
7    P7  XX  P10  LU  18
11  P10  LU   P7  XX  18
6    P6  XX   P8  LU  19
16   P8  LU   P6  XX  19

Upvotes: 3

Andreas
Andreas

Reputation: 9207

Normally you would use:

df_1.drop_duplicates()

Which doesn't work here, because the values you consider duplicated are not ordered in the row, so you have to order them first:

df_1[df_1.apply(sorted, axis=1).map(str).duplicated(keep='last')]

     A   B    C   D
0   P1  XX   P1  LU
1   P1  XX   P9  LU
2   P1  XX   P3  LU
3  P10  XX  P10  LU
4   P2  XX   P8  LU
5   P4  XX   P1  LU
6   P6  XX   P8  LU
7   P7  XX  P10  LU
8   P9  XX   P3  LU

In case you only want some of the rows to you for dupliacte check:

df_1[df_1[['A', 'C']].apply(sorted, axis=1).map(str).duplicated()]

      A   B    C   D
9    P1  LU   P4  XX
10   P1  LU   P1  XX
11  P10  LU   P7  XX
12  P10  LU  P10  XX
13   P3  LU   P1  XX
14   P3  LU   P9  XX
15   P8  LU   P2  XX
16   P8  LU   P6  XX
17   P9  LU   P1  XX

Upvotes: 2

Related Questions