Reputation: 31
I have 4 columns of data A, B, C, D. All the data are repeating for example row 1: P1 XX P1 LU is repeating in row 11: P1 LU P1 XX. Can anyone help me to remove the repeating units from Pandas dataframe?
Input:
A B C D
P1 XX P1 LU
P1 XX P9 LU
P1 XX P3 LU
P10 XX P10 LU
P2 XX P8 LU
P4 XX P1 LU
P6 XX P8 LU
P7 XX P10 LU
P9 XX P3 LU
P1 LU P4 XX
P1 LU P1 XX
P10 LU P7 XX
P10 LU P10 XX
P3 LU P1 XX
P3 LU P9 XX
P8 LU P2 XX
P8 LU P6 XX
P9 LU P1 XX
The Output should look like (or vice-versa):
A B C D
P1 XX P1 LU
P1 XX P9 LU
P2 XX P8 LU
P7 XX P10 LU
P9 XX P3 LU
P1 LU P4 XX
P10 LU P7 XX
P3 LU P1 XX
P8 LU P6 XX
Thanks a ton in advance
Upvotes: 2
Views: 118
Reputation: 23099
It sounds like you have rows of data that can come in any order, and you need to de-dupe irrespective of that order.
To achieve this, we can use pd.factorize
and turn the objects into numerical variables, then apply a sum row-wise.
Using apply
here would be wasteful, so let's re-shape your dataframe a little.
s = df.stack().to_frame(0)
df1 = df.assign(
ky=df.index.map(
s.assign(ky=s[0].factorize()[0]
).groupby(level=[0]).sum()['ky']))
print(df1)
A B C D ky
0 P1 XX P1 LU 3
1 P1 XX P9 LU 6
2 P1 XX P3 LU 7
3 P10 XX P10 LU 13
4 P2 XX P8 LU 16
5 P4 XX P1 LU 11
6 P6 XX P8 LU 19
7 P7 XX P10 LU 18
8 P9 XX P3 LU 10
9 P1 LU P4 XX 11
10 P1 LU P1 XX 3
11 P10 LU P7 XX 18
12 P10 LU P10 XX 13
13 P3 LU P1 XX 7
14 P3 LU P9 XX 10
15 P8 LU P2 XX 16
16 P8 LU P6 XX 19
17 P9 LU P1 XX 6
you can now de-dupe based on your ky
column.
print(df1.drop_duplicates(subset=['ky'],keep='first'))
A B C D ky
0 P1 XX P1 LU 3
1 P1 XX P9 LU 6
2 P1 XX P3 LU 7
3 P10 XX P10 LU 13
4 P2 XX P8 LU 16
5 P4 XX P1 LU 11
6 P6 XX P8 LU 19
7 P7 XX P10 LU 18
8 P9 XX P3 LU 10
or inspect the duplicates.
print(df1[df1.duplicated(subset='ky',keep=False)].sort_values('ky'))
A B C D ky
0 P1 XX P1 LU 3
10 P1 LU P1 XX 3
17 P9 LU P1 XX 6
1 P1 XX P9 LU 6
13 P3 LU P1 XX 7
2 P1 XX P3 LU 7
14 P3 LU P9 XX 10
8 P9 XX P3 LU 10
5 P4 XX P1 LU 11
9 P1 LU P4 XX 11
3 P10 XX P10 LU 13
12 P10 LU P10 XX 13
4 P2 XX P8 LU 16
15 P8 LU P2 XX 16
7 P7 XX P10 LU 18
11 P10 LU P7 XX 18
6 P6 XX P8 LU 19
16 P8 LU P6 XX 19
Upvotes: 3
Reputation: 9207
Normally you would use:
df_1.drop_duplicates()
Which doesn't work here, because the values you consider duplicated are not ordered in the row, so you have to order them first:
df_1[df_1.apply(sorted, axis=1).map(str).duplicated(keep='last')]
A B C D
0 P1 XX P1 LU
1 P1 XX P9 LU
2 P1 XX P3 LU
3 P10 XX P10 LU
4 P2 XX P8 LU
5 P4 XX P1 LU
6 P6 XX P8 LU
7 P7 XX P10 LU
8 P9 XX P3 LU
In case you only want some of the rows to you for dupliacte check:
df_1[df_1[['A', 'C']].apply(sorted, axis=1).map(str).duplicated()]
A B C D
9 P1 LU P4 XX
10 P1 LU P1 XX
11 P10 LU P7 XX
12 P10 LU P10 XX
13 P3 LU P1 XX
14 P3 LU P9 XX
15 P8 LU P2 XX
16 P8 LU P6 XX
17 P9 LU P1 XX
Upvotes: 2