Reputation: 112

Remove outliers from the lists of coordinates

I have 2 lists X and y with ten coordinate values.

I also have 2 additional lists for point outliers: outlier_x and outlier_y.

I want to go over my X and y lists, compare their coordinates with the outliers, and create 2 new lists (X_new and y_new) that contain only the points that are not outliers.

Here are my nested loops, where I got stuck. The code records only the first point and does not seem to move on to the next one. Can you please help me fix it?

X = dataset.iloc[:, 1].values
X = X.reshape(len(X),1)
y = dataset.iloc[:, 2].values

X_new = []
y_new = []
i = 0
n = 0
while i < len(X):
    while n < len(outlier_x):
        if (X[i] == outlier_x[n] and y[i] == outlier_y[n]):
            continue
        X_new.append(X[i])
        y_new.append(y[i])
        n+=1
    i+=1

Here is my dataset:

      x          y
0   0.0   0.998440
1   1.0   2.188544
2   4.0   7.572174
3   7.0   6.138442
4  11.0  11.737930
5   0.0   1.043314
6   1.0   1.733181
7   4.0   7.424136
8   7.0   6.138442
9  11.0   9.737930

And these points, which have been previously identified as outliers:

      x          y
0   4.0   7.572174
1   7.0   6.138442
2  11.0  11.737930
3   4.0   7.424136
4   7.0   6.138442

Upvotes: 0

Answers (3)

ddejohn

Reputation: 8962

Solution

data[~np.isin(data, outliers).all(axis=1)]

Steps

Starting with these two DataFrames:

In [3]: data
Out[3]:
      x          y
0   0.0   0.998440
1   1.0   2.188544
2   4.0   7.572174
3   7.0   6.138442
4  11.0  11.737930
5   0.0   1.043314
6   1.0   1.733181
7   4.0   7.424136
8   7.0   6.138442
9  11.0   9.737930

In [4]: outliers
Out[4]:
      x          y
0   4.0   7.572174
1   7.0   6.138442
2  11.0  11.737930
3   4.0   7.424136
4   7.0   6.138442

We can use the np.isin() function to check, element by element, whether each value in data appears anywhere in outliers:

In [5]: np.isin(data, outliers)
Out[5]:
array([[False, False],
       [False, False],
       [ True,  True],
       [ True,  True],
       [ True,  True],
       [False, False],
       [False, False],
       [ True,  True],
       [ True,  True],
       [ True, False]])
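
To see why the last row comes out as [True, False], it may help to try np.isin on a tiny made-up array (the names values and reference below are illustrative only, not part of the question). Each element is tested on its own against every value in the second argument, regardless of which row or column it sits in:

```python
import numpy as np

# Hypothetical toy arrays, used only to illustrate element-wise membership
values = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
reference = np.array([[2.0, 9.0],
                      [9.0, 3.0]])

# Every element of `values` is checked against all elements of `reference`
mask = np.isin(values, reference)
print(mask)
```

Here 2.0 and 3.0 are flagged even though they sit in different positions in reference, which is the same reason x = 11.0 in row 9 is flagged above.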

Since we want a full match (both x- and y-coordinates), use all() with axis=1, which reduces across the columns of each row:

In [6]: np.isin(data, outliers).all(axis=1)
Out[6]:
array([False, False,  True,  True,  True, False, False,  True,  True,
       False])

This boolean mask tells us which rows have both coordinates present in outliers. All we need to do is invert the mask (since we want to filter out the outliers), and index into data with that mask:

In [7]: data[~np.isin(data, outliers).all(axis=1)]
Out[7]:
      x         y
0   0.0  0.998440
1   1.0  2.188544
5   0.0  1.043314
6   1.0  1.733181
9  11.0  9.737930

From there, you can do whatever you like with the x and y columns.
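
One caveat worth keeping in mind: np.isin followed by all(axis=1) flags a row whenever its x appears somewhere in outliers and its y appears somewhere in outliers, even if those two values come from different outlier rows. For the data in the question the result is the same, but if that matters, a possible alternative is an anti-join with DataFrame.merge and indicator=True. This is only a sketch; it assumes both frames have columns named x and y, that data has the default integer index, and that exact float equality is what you want:

```python
import pandas as pd

data = pd.DataFrame({
    "x": [0.0, 1.0, 4.0, 7.0, 11.0, 0.0, 1.0, 4.0, 7.0, 11.0],
    "y": [0.998440, 2.188544, 7.572174, 6.138442, 11.737930,
          1.043314, 1.733181, 7.424136, 6.138442, 9.737930],
})
outliers = pd.DataFrame({
    "x": [4.0, 7.0, 11.0, 4.0, 7.0],
    "y": [7.572174, 6.138442, 11.737930, 7.424136, 6.138442],
})

# Left-merge on both columns; drop duplicate outliers so no data row is repeated
merged = data.merge(outliers.drop_duplicates(), on=["x", "y"],
                    how="left", indicator=True)

# "left_only" marks rows of data with no matching (x, y) pair in outliers
keep = (merged["_merge"] == "left_only").to_numpy()
filtered = data[keep]

# Back to plain lists, as asked in the question
X_new = filtered["x"].tolist()
y_new = filtered["y"].tolist()
print(filtered)
```

Because the match is on whole (x, y) pairs, a point such as (11.0, 6.138442) would be kept here, whereas the np.isin mask would drop it.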

Alternate solution, starting with 1D arrays

If you have separate 1D arrays, X and y, and the same for the outliers, you can zip them into tuples, put them in a set, and then subtract the set of outliers. Note that sets do not preserve the original order and collapse duplicate points:

points = set(zip(X, y))
outliers = set(zip(outlier_x, outlier_y))
X_new, y_new = zip(*(points - outliers))
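
If the original order or duplicate points matter, a variation on the same idea is to keep only the outliers in a set and filter the points in sequence. This is a minimal sketch, assuming plain 1D sequences of hashable numbers (the values are copied from the question):

```python
X = [0.0, 1.0, 4.0, 7.0, 11.0, 0.0, 1.0, 4.0, 7.0, 11.0]
y = [0.998440, 2.188544, 7.572174, 6.138442, 11.737930,
     1.043314, 1.733181, 7.424136, 6.138442, 9.737930]
outlier_x = [4.0, 7.0, 11.0, 4.0, 7.0]
outlier_y = [7.572174, 6.138442, 11.737930, 7.424136, 6.138442]

# Set of (x, y) pairs gives fast membership tests
outlier_points = set(zip(outlier_x, outlier_y))

# Walk the points in their original order and keep the non-outliers
kept = [(xi, yi) for xi, yi in zip(X, y) if (xi, yi) not in outlier_points]

X_new = [xi for xi, _ in kept]
y_new = [yi for _, yi in kept]
print(X_new)
print(y_new)
```

Building the two lists with comprehensions rather than zip(*kept) also avoids an unpacking error when every point turns out to be an outlier.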

Upvotes: 2

sin tribu

Reputation: 1180

Edit: Apologies, not only did I misunderstand your question, I also gave you a bad solution. Since you're checking your x, y coordinates for equality with the outliers (as opposed to using < or >), the following should work:

X = [0, 1, 4, 7, 11, 0, 1, 4, 7, 11]
Y = [0.99844039, 2.188544418, 7.572173987, 6.138441957, 11.73792995, 1.043313797, 1.733181475, 7.424136351, 6.138441957, 9.73792995]
outlier_X = [4, 7, 11, 4, 7]
outlier_Y = [7.572173987, 6.138441957, 11.73792995, 7.424136351, 6.138441957]
final_X = []
final_Y = []
for xi, yi in zip(X, Y):
    # A point is valid if it matches no (x, y) outlier pair
    is_valid = not any(xi == ox and yi == oy for ox, oy in zip(outlier_X, outlier_Y))
    if is_valid:
        final_X.append(xi)
        final_Y.append(yi)

print(final_X)
print(final_Y)
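
Exact == comparison on floats can miss a match if the coordinates were rounded or recomputed somewhere along the way. As a hedged variation, the same pairwise check can be sketched with NumPy broadcasting and np.isclose; the default tolerances of np.isclose are an assumption here and may need tuning for your data:

```python
import numpy as np

X = np.array([0, 1, 4, 7, 11, 0, 1, 4, 7, 11])
Y = np.array([0.99844039, 2.188544418, 7.572173987, 6.138441957, 11.73792995,
              1.043313797, 1.733181475, 7.424136351, 6.138441957, 9.73792995])
outlier_X = np.array([4, 7, 11, 4, 7])
outlier_Y = np.array([7.572173987, 6.138441957, 11.73792995,
                      7.424136351, 6.138441957])

# Compare every point with every outlier: shape (n_points, n_outliers)
x_match = np.isclose(X[:, None], outlier_X[None, :])
y_match = np.isclose(Y[:, None], outlier_Y[None, :])

# A point is an outlier if both coordinates match the same outlier pair
is_outlier = (x_match & y_match).any(axis=1)

final_X = X[~is_outlier].tolist()
final_Y = Y[~is_outlier].tolist()
print(final_X)
print(final_Y)
```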

Upvotes: 0

Viki Liu

Reputation: 112

Here is the working solution based on @sin tribu's answer:

X_new = []
y_new = []

for xi, yi in zip(X, y):
    # A point is an outlier only if both coordinates match the same outlier pair
    is_outlier = any(xi == ox and yi == oy for ox, oy in zip(outlier_x, outlier_y))
    if not is_outlier:
        X_new.append(xi)
        y_new.append(yi)

Upvotes: 0
