Gokul Prasanth
Gokul Prasanth

Reputation: 81

Finding duplicates and creating sets in pandas

Input:

Part no A B C D
A1 0.25 0.2 0.3 0.4
A2 0.26 0.3 0.3 0.4
A3 0.3 0.3 0.3 0.3
A4 0.7 0.3 0.3 0.3
A5 0.8 0.4 0.45 0.46

I have to create set for duplicates on the column A with the tolerance of +/-0.1

Expected output

Part no A B C D Set
A1 0.25 0.2 0.3 0.4 1
A2 0.26 0.3 0.3 0.4 1
A3 0.3 0.3 0.3 0.3 1
A4 0.7 0.3 0.3 0.3 2
A5 0.8 0.4 0.45 0.46 2

Upvotes: 1

Views: 71

Answers (2)

Amuoeba
Amuoeba

Reputation: 786

Would something like this work for you? The last example of 0.7 and 0.8 is a bit strange anyways. Why would you like them in the same group?

Reference to my_ciel answer since nor python nor numpy has the builtin functionality to do a cieling round on arbitrary float positions: my_ciel answer

import numpy as np
import pandas as pd


# Some boilerplate to make your sample data reusable (I suggest u post your questions in such way
# that data can be easily coppied)
a = np.array([0.25,0.2,0.3,0.4,0.26,0.3,0.3,0.4,0.3,0.3,0.3,0.3,0.7,0.3,0.3,0.3,0.8,0.4,0.45,0.46,])
a = a.reshape(5,4)

df = pd.DataFrame(a)
df.columns = ["A","B","C","D"]
df["PartNo"] = ["A1","A2","A3","A4","A5"]
df = df[["PartNo","A","B","C","D"]]


# This does a cieling round at the desired float position 
def my_ceil(a, precision=0):
    return np.true_divide(np.ceil(a * 10**precision), 10**precision)


# Tolerance can be set here:
# 1 = 0.1 , 2 = 0.01 , 3 = 0.001 etc.
df["set"] = my_ceil(df["A"],1)
df["set"] = df["set"].astype('category').cat.codes
df

Upvotes: 1

Corralien
Corralien

Reputation: 120401

Use cumsum to create groups:

# If your dataframe is not sorted by 'A' columns
df = df.sort_values('A')

df['Set'] = df['A'].sub(df['A'].shift()).abs().ge(0.1000000001).cumsum().add(1)
>>> df
  Part no     A    B     C     D  Set
0      A1  0.25  0.2  0.30  0.40    1
1      A2  0.26  0.3  0.30  0.40    1
2      A3  0.30  0.3  0.30  0.30    1
3      A4  0.70  0.3  0.30  0.30    2
4      A5  0.80  0.4  0.45  0.46    2

0.1000000001 is due to float precision. You can also use np.isclose.

With np.close:

>>> df['Set'] = np.cumsum(~np.isclose(df['A'], df['A'].shift(), atol=0.1))

  Part no     A    B     C     D  Set
0      A1  0.25  0.2  0.30  0.40    1
1      A2  0.26  0.3  0.30  0.40    1
2      A3  0.30  0.3  0.30  0.30    1
3      A4  0.70  0.3  0.30  0.30    2
4      A5  0.80  0.4  0.45  0.46    2

Upvotes: 2

Related Questions