Reputation: 85
I am writing a Python program for finding areas of interest on a page. The positions on the page of all values of interest are given to me, but some values (typically only one or two) are far away from the others and I'd like to remove these. The data set is not huge, less than 100 data points but I will need to do this many times.
I have a Cartesian coordinate system on two axes (x and y) in the first quadrant, so only positive values.
My data points represent boxes drawn on this coordinate system, each stored as a tuple holding two coordinate pairs. A box can be drawn from two coordinate pairs since all lines are straight. Example: (8, 2, 15, 10) would draw a box with corners (x, y) = (8, 2), (8, 10), (15, 10) and (15, 2).
I am trying to remove the outliers in this set but am having a hard time figuring out a good approach. I have thought about finding the IQR and removing all points that fall outside these bounds:
below Q1 - 1.5 * IQR, or
above Q3 + 1.5 * IQR
The problem is that I am having a hard time figuring out how to apply this, because the values are not just coordinates but areas, if you will. They also overlap, so they don't fit well in a histogram either.
First I thought I might add a point for each whole value that the box spans, the example box would in that case create 56 points. It seems to me as if this solution is quite bad. Does anyone have any alternative solutions?
Upvotes: 1
Views: 5214
Reputation: 6528
There are two main approaches: either you fix the threshold value yourself, or you let machine learning infer it for you.
For machine learning, you can use Isolation Forest.
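As a rough sketch of the Isolation Forest route, assuming scikit-learn is installed: the sample boxes below are made up, reducing each box to its centre point is my own simplification, and the `contamination` value of 0.1 is only a guess at the share of outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical sample data: boxes as (x1, y1, x2, y2); the last one is far away.
boxes = np.array([
    (8, 2, 15, 10),
    (9, 3, 14, 9),
    (10, 4, 16, 11),
    (7, 1, 13, 8),
    (11, 5, 17, 12),
    (8, 4, 12, 9),
    (9, 2, 15, 7),
    (10, 3, 14, 10),
    (12, 6, 18, 11),
    (90, 95, 99, 100),
], dtype=float)

# Reduce each box to its centre point (an assumption; width and height
# could be added as extra feature columns if box size matters too).
centers = np.column_stack([
    (boxes[:, 0] + boxes[:, 2]) / 2,
    (boxes[:, 1] + boxes[:, 3]) / 2,
])

# contamination is the expected fraction of outliers; 0.1 is a guess here.
model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(centers)  # 1 = inlier, -1 = outlier

kept_boxes = boxes[labels == 1]
```

Since you usually expect only one or two stray boxes, you would probably want to tune `contamination` to your data rather than keep the value used here.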
If you don't want ML, then you have to fix the threshold yourself, and for that you can use a norm as the distance measure. There is np.linalg.norm(p1 - p2)
or, if you want more control over the metric, there is cdist:
scipy.spatial.distance.cdist(p1, p2)
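One way the fixed-threshold route might look, as a minimal sketch and not the only option: each box is reduced to its centre, `cdist` gives all pairwise distances, and the IQR rule from the question is applied to each box's median distance to the others. The function name `remove_far_boxes`, the factor `k` and the sample data are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def remove_far_boxes(boxes, k=1.5):
    """Drop boxes whose centre lies unusually far from the other centres.

    boxes: sequence of (x1, y1, x2, y2) tuples.
    k: multiplier for the IQR rule (1.5 is the conventional choice).
    Returns the kept boxes and the boolean mask that selected them.
    """
    boxes = np.asarray(boxes, dtype=float)
    # Centre of each box (an assumption about what "position" means here).
    centers = np.column_stack([
        (boxes[:, 0] + boxes[:, 2]) / 2,
        (boxes[:, 1] + boxes[:, 3]) / 2,
    ])
    # Pairwise Euclidean distances; pass metric="cityblock" etc. to change it.
    dists = cdist(centers, centers)
    # Median distance from each box to all boxes (its own zero included).
    typical = np.median(dists, axis=1)
    q1, q3 = np.percentile(typical, [25, 75])
    threshold = q3 + k * (q3 - q1)
    keep = typical <= threshold
    return boxes[keep], keep

# Made-up example: nine boxes close together and one far away.
boxes = [
    (8, 2, 15, 10), (9, 3, 14, 9), (10, 4, 16, 11),
    (7, 1, 13, 8), (11, 5, 17, 12), (8, 4, 12, 9),
    (9, 2, 15, 7), (10, 3, 14, 10), (12, 6, 18, 11),
    (90, 95, 99, 100),
]
kept, mask = remove_far_boxes(boxes)
```

With fewer than 100 boxes the full distance matrix is tiny, so repeating this many times should not be a problem.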
Upvotes: 0