Eli S

Reputation: 1463

finding duplicate rows in pandas based on approximate match or formula

I have a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
                   "name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df)
print(df.duplicated(subset=["x","y"]))  # all False: no rows match exactly

      x      y  name
0  1.00  10.00  0ndx
1  1.00  11.00  1ndx
2  2.00  12.00  2ndx
3  3.00  12.95  3ndx
4  3.01  13.00  4ndx
5  4.00  11.00  5ndx
6  5.00  10.00  6ndx

I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance, with a tolerance of say 0.1. A row would be considered a duplicate if it is within a distance of 0.1 of another row (or, roughly equivalently, if both x and y are within the tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of total spread, since 1.1 is close to 1.18, which is close to 1.22. That might affect some of the things you can do, but I would still define any row that is within the tolerance of another as duplicated.

This is a toy example; my current problem is of modest size, but I foresee problems large enough (250,000 rows) that the outer product of pairwise distances might be expensive to construct.
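For concreteness, a minimal brute-force sketch of the check I have in mind (this is the O(n²) comparison whose full distance matrix I would like to avoid building; the eps value and the Euclidean-distance test are just assumptions for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1., 1., 2., 3., 3.01, 4., 5.],
                   "y": [10., 11., 12., 12.95, 13.0, 11., 10.],
                   "name": ["0ndx", "1ndx", "2ndx", "3ndx", "4ndx", "5ndx", "6ndx"]})

eps = 0.1
pts = df[["x", "y"]].to_numpy()

# Full (n, n) matrix of pairwise Euclidean distances; fine for a toy frame,
# but prohibitively large at ~250,000 rows.
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))

# A row is "duplicated" if some *other* row lies within eps of it.
np.fill_diagonal(dist, np.inf)
df["near_dup"] = (dist <= eps).any(axis=1)
print(df[df["near_dup"]])  # rows 3 and 4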

Is there a way to do this?

Upvotes: 0

Views: 420

Answers (1)

ruby-lemon917

Reputation: 86

You can compare with pandas.DataFrame.shift (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html).

Then, if you want to compare each row to the previous one and add a column marking rows that are within some threshold of each other, say 0.1, it would look like this:

eps = 0.1
df['duplicated'] = 0
# Sort so that nearly equal x values land in adjacent rows.
df.sort_values(by=['x'], inplace=True)
# Flag a row when its x is within eps of the previous row's x.
df.loc[abs(df['x'] - df['x'].shift()) <= eps, 'duplicated'] = 1

Rows with a 1 would then be those that are duplicated within your threshold.
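For example, running this on the frame from the question flags rows 1 and 4. Note that it only compares x; the short extension at the end, which also requires y to be within the threshold, is my own addition rather than part of the snippet above:

import pandas as pd

df = pd.DataFrame({"x": [1., 1., 2., 3., 3.01, 4., 5.],
                   "y": [10., 11., 12., 12.95, 13.0, 11., 10.],
                   "name": ["0ndx", "1ndx", "2ndx", "3ndx", "4ndx", "5ndx", "6ndx"]})

eps = 0.1
df['duplicated'] = 0
df.sort_values(by=['x'], inplace=True)
df.loc[abs(df['x'] - df['x'].shift()) <= eps, 'duplicated'] = 1
print(df[df['duplicated'] == 1])
#       x      y  name  duplicated
# 1  1.00  11.00  1ndx           1
# 4  3.01  13.00  4ndx           1

# Extension (my assumption): also require y to be close to the previous row's y,
# which leaves only row 4 flagged as a near-duplicate of row 3.
close_x = abs(df['x'] - df['x'].shift()) <= eps
close_y = abs(df['y'] - df['y'].shift()) <= eps
df['duplicated'] = (close_x & close_y).astype(int)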

Upvotes: 1
