Eli S

Reputation: 1463

finding duplicate rows in pandas based on approximate match or formula

I have a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
                   "name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df)
print(df.duplicated(subset=["x","y"]))  # all False: no rows match exactly

      x      y  name
0  1.00  10.00  0ndx
1  1.00  11.00  1ndx
2  2.00  12.00  2ndx
3  3.00  12.95  3ndx
4  3.01  13.00  4ndx
5  4.00  11.00  5ndx
6  5.00  10.00  6ndx

I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance, with a tolerance of say 0.1. A row would be considered a duplicate if it is within a distance of 0.1 of another row (or, roughly equivalently, if both x and y are within the tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of total spread, since 1.1 is close to 1.18, which is close to 1.22. That might affect some of the things you can do, but I would still define any row that is within the tolerance of another as duplicated.

This is a toy example; my current problem is of modest size, but I foresee problems large enough (250,000 rows) that the outer product of pairwise distances might be expensive to construct.
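For concreteness, a minimal brute-force sketch of the check I have in mind (this is the O(n²) comparison whose full distance matrix I would like to avoid building; the eps value and the Euclidean-distance test are just assumptions for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1., 1., 2., 3., 3.01, 4., 5.],
                   "y": [10., 11., 12., 12.95, 13.0, 11., 10.],
                   "name": ["0ndx", "1ndx", "2ndx", "3ndx", "4ndx", "5ndx", "6ndx"]})

eps = 0.1
pts = df[["x", "y"]].to_numpy()

# Full (n, n) matrix of pairwise Euclidean distances; fine for a toy frame,
# but prohibitively large at ~250,000 rows.
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))

# A row is "duplicated" if some *other* row lies within eps of it.
np.fill_diagonal(dist, np.inf)
df["near_dup"] = (dist <= eps).any(axis=1)
print(df[df["near_dup"]])  # rows 3 and 4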

Is there a way to do this?

Upvotes: 0

Views: 420

Answers (1)

ruby-lemon917

Reputation: 86

You can compare with pandas.DataFrame.shift (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html).

Then, if you want to compare each row to the previous one and add a column marking rows that are within some threshold of each other, say 0.1, it would look like this:

eps = 0.1
df['duplicated'] = 0
# Sort so that nearly equal x values land in adjacent rows.
df.sort_values(by=['x'], inplace=True)
# Flag a row when its x is within eps of the previous row's x.
df.loc[abs(df['x'] - df['x'].shift()) <= eps, 'duplicated'] = 1

Rows with a 1 would then be those that are duplicated within your threshold.
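For example, running this on the frame from the question flags rows 1 and 4. Note that it only compares x; the short extension at the end, which also requires y to be within the threshold, is my own addition rather than part of the snippet above:

import pandas as pd

df = pd.DataFrame({"x": [1., 1., 2., 3., 3.01, 4., 5.],
                   "y": [10., 11., 12., 12.95, 13.0, 11., 10.],
                   "name": ["0ndx", "1ndx", "2ndx", "3ndx", "4ndx", "5ndx", "6ndx"]})

eps = 0.1
df['duplicated'] = 0
df.sort_values(by=['x'], inplace=True)
df.loc[abs(df['x'] - df['x'].shift()) <= eps, 'duplicated'] = 1
print(df[df['duplicated'] == 1])
#       x      y  name  duplicated
# 1  1.00  11.00  1ndx           1
# 4  3.01  13.00  4ndx           1

# Extension (my assumption): also require y to be close to the previous row's y,
# which leaves only row 4 flagged as a near-duplicate of row 3.
close_x = abs(df['x'] - df['x'].shift()) <= eps
close_y = abs(df['y'] - df['y'].shift()) <= eps
df['duplicated'] = (close_x & close_y).astype(int)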

Upvotes: 1
