Finding duplicates and creating sets in pandas

Question

Input:

Part no	A	B	C	D
A1	0.25	0.2	0.3	0.4
A2	0.26	0.3	0.3	0.4
A3	0.3	0.3	0.3	0.3
A4	0.7	0.3	0.3	0.3
A5	0.8	0.4	0.45	0.46

I need to assign a set number to rows whose values in column A are duplicates within a tolerance of +/-0.1.

Expected output

Part no	A	B	C	D	Set
A1	0.25	0.2	0.3	0.4	1
A2	0.26	0.3	0.3	0.4	1
A3	0.3	0.3	0.3	0.3	1
A4	0.7	0.3	0.3	0.3	2
A5	0.8	0.4	0.45	0.46	2

Corralien · Accepted Answer

Use cumsum to create groups:

# If your dataframe is not already sorted by column 'A'
df = df.sort_values('A')

df['Set'] = df['A'].sub(df['A'].shift()).abs().ge(0.1000000001).cumsum().add(1)

>>> df
  Part no     A    B     C     D  Set
0      A1  0.25  0.2  0.30  0.40    1
1      A2  0.26  0.3  0.30  0.40    1
2      A3  0.30  0.3  0.30  0.30    1
3      A4  0.70  0.3  0.30  0.30    2
4      A5  0.80  0.4  0.45  0.46    2
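
For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the same idea. The DataFrame is rebuilt by hand from the table in the question, and the intermediate names gap and new_set are only illustrative; they are not part of the original one-liner.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Part no": ["A1", "A2", "A3", "A4", "A5"],
    "A": [0.25, 0.26, 0.3, 0.7, 0.8],
    "B": [0.2, 0.3, 0.3, 0.3, 0.4],
    "C": [0.3, 0.3, 0.3, 0.3, 0.45],
    "D": [0.4, 0.4, 0.3, 0.3, 0.46],
})

# The grouping relies on 'A' being in ascending order
df = df.sort_values("A")

# Absolute gap to the previous row (NaN for the first row)
gap = df["A"].sub(df["A"].shift()).abs()

# True wherever the gap exceeds the tolerance; NaN compares as False
new_set = gap.ge(0.1000000001)

# Each True starts a new set; add 1 so numbering starts at 1
df["Set"] = new_set.cumsum().add(1)

print(df)
```

Note that each row is compared only with its immediate neighbour, so a run such as 0.10, 0.18, 0.26 would end up in a single set even though the first and last values differ by more than 0.1.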

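The threshold is 0.1000000001 rather than 0.1 because of floating-point precision: 0.8 - 0.7 evaluates to slightly more than 0.1, so a threshold of exactly 0.1 would split A4 and A5 into separate sets. You can also use np.isclose, whose default relative tolerance (rtol=1e-05) absorbs that rounding error.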
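
The rounding issue can be checked in plain Python. This is only an illustration using the A values of rows A4 and A5; the variable names are made up for the example.

```python
# Values of column 'A' for rows A4 and A5
a4, a5 = 0.7, 0.8

diff = a5 - a4

# The binary floating-point result is slightly larger than 0.1
print(repr(diff))

# With a threshold of exactly 0.1 the gap counts as "too large"
print(diff >= 0.1)

# With the padded threshold the two rows stay in the same set
print(diff >= 0.1000000001)
```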

With np.isclose:

>>> df['Set'] = np.cumsum(~np.isclose(df['A'], df['A'].shift(), atol=0.1))
>>> df
  Part no     A    B     C     D  Set
0      A1  0.25  0.2  0.30  0.40    1
1      A2  0.26  0.3  0.30  0.40    1
2      A3  0.30  0.3  0.30  0.30    1
3      A4  0.70  0.3  0.30  0.30    2
4      A5  0.80  0.4  0.45  0.46    2
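
A runnable sketch of the np.isclose variant, again on a hand-built copy of the question's data (only the columns needed here). The names prev and close are illustrative. It assumes the default rtol is left in place; with rtol=0 the 0.7 to 0.8 gap would no longer count as close, for the float-precision reason described above.

```python
import numpy as np
import pandas as pd

# Only the columns needed for the grouping, copied from the question
df = pd.DataFrame({
    "Part no": ["A1", "A2", "A3", "A4", "A5"],
    "A": [0.25, 0.26, 0.3, 0.7, 0.8],
})

df = df.sort_values("A")

# Previous row's value; NaN for the first row
prev = df["A"].shift()

# True where a value is within the tolerance of the previous one.
# The first row is compared with NaN, which is never "close".
close = np.isclose(df["A"].to_numpy(), prev.to_numpy(), atol=0.1)

# Every "not close" row starts a new set; the first row yields set 1
df["Set"] = np.cumsum(~close)

print(df)
```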
