Reputation: 1463
I have a pandas data frame
import pandas as pd
df = pd.DataFrame({"x" : [1.,1.,2.,3.,3.01,4.,5.],"y":[10.,11.,12.,12.95,13.0,11.,10.],
"name":["0ndx","1ndx","2ndx","3ndx","4ndx","5ndx","6ndx"]})
print(df)
      x      y  name
0  1.00  10.00  0ndx
1  1.00  11.00  1ndx
2  2.00  12.00  2ndx
3  3.00  12.95  3ndx
4  3.01  13.00  4ndx
5  4.00  11.00  5ndx
6  5.00  10.00  6ndx
I would like to find duplicate rows (in this case rows 3 and 4) using a formula based on distance with a tolerance of, say, 0.1. A row would be a duplicate if it is within a distance of 0.1 of another row (or, equivalently, if both x and y are within a tolerance). As one commenter pointed out, this could lead to a cluster of values with more than 0.1 of spread, since 1.1 is close to 1.18, which is close to 1.22. That might affect some of the things you can do, but I would still define any row that is within the tolerance of another as a duplicate.
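For concreteness, this is roughly the check I have in mind, done by brute force against the full pairwise distance matrix (Euclidean distance here; fine for the toy frame, but this matrix is the outer product I mention below):
import numpy as np
eps = 0.1
xy = df[["x", "y"]].to_numpy()
# n x n matrix of pairwise Euclidean distances
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)   # ignore each row's distance to itself
dup = (dist <= eps).any(axis=1)  # True if any other row is within eps
print(df[dup])                   # rows 3 and 4 on the toy data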
This is a toy problem. My actual problem is of modest size right now, but I foresee problems large enough (250,000 rows) that the outer product (the full pairwise distance matrix) might be expensive to construct.
Is there a way to do this?
Upvotes: 0
Views: 420
Reputation: 86
You can compare rows with pandas.DataFrame.shift: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html.
If you want to compare each row to the previous one and add a column marking rows that are within some threshold of each other, say 0.1, it would look like this:
eps = 0.1
df['duplicated'] = 0
# sort so that nearby x values end up in adjacent rows
df.sort_values(by=['x'], inplace=True)
# flag a row whose x is within eps of the previous row's x
df.loc[abs(df['x'] - df['x'].shift()) <= eps, 'duplicated'] = 1
Rows with a 1 in the duplicated column are those that are within your threshold of the previous row.
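If you need both x and y to be within the tolerance, as the question asks, the same shift-based idea can be applied to each column and the conditions combined; a rough sketch (still only comparing each row to its immediate neighbour after sorting):
eps = 0.1
df.sort_values(by=['x', 'y'], inplace=True)
# a row is flagged only if both x and y are within eps of the previous row
close_x = abs(df['x'] - df['x'].shift()) <= eps
close_y = abs(df['y'] - df['y'].shift()) <= eps
df['duplicated'] = (close_x & close_y).astype(int)
As with the snippet above, only the later row of each close pair gets flagged, much like DataFrame.duplicated with keep='first'.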
Upvotes: 1