Reputation: 112
I have 2 lists, X and y, with ten coordinate values.
I also have 2 additional lists for the outlier points: outlier_x and outlier_y.
I want to go over my X and y lists, compare their coordinates with the outliers, and create 2 new lists (X_new and y_new) that contain only the points that are not outliers.
Here are my nested loops, where I got stuck. They record only the first point and don't seem to move on to the next one. Can you please help me fix it?
X = dataset.iloc[:, 1].values
X = X.reshape(len(X),1)
y = dataset.iloc[:, 2].values
X_new = []
y_new = []
i = 0
n = 0
while i < len(X):
    while n < len(outlier_x):
        if (X[i] == outlier_x[n] and y[i] == outlier_y[n]):
            continue
        X_new.append(X[i])
        y_new.append(y[i])
        n += 1
    i += 1
Here is my dataset:
x y
0 0.0 0.998440
1 1.0 2.188544
2 4.0 7.572174
3 7.0 6.138442
4 11.0 11.737930
5 0.0 1.043314
6 1.0 1.733181
7 4.0 7.424136
8 7.0 6.138442
9 11.0 9.737930
And these points, which have been previously identified as outliers:
x y
0 4.0 7.572174
1 7.0 6.138442
2 11.0 11.737930
3 4.0 7.424136
4 7.0 6.138442
Upvotes: 0
Views: 974
Reputation: 8962
data[~np.isin(data, outliers).all(axis=1)]
Starting with these two DataFrames:
In [3]: data
Out[3]:
x y
0 0.0 0.998440
1 1.0 2.188544
2 4.0 7.572174
3 7.0 6.138442
4 11.0 11.737930
5 0.0 1.043314
6 1.0 1.733181
7 4.0 7.424136
8 7.0 6.138442
9 11.0 9.737930
In [4]: outliers
Out[4]:
x y
0 4.0 7.572174
1 7.0 6.138442
2 11.0 11.737930
3 4.0 7.424136
4 7.0 6.138442
We can use the np.isin() function to check, value by value, whether each entry in data appears anywhere in outliers:
In [5]: np.isin(data, outliers)
Out[5]:
array([[False, False],
[False, False],
[ True, True],
[ True, True],
[ True, True],
[False, False],
[False, False],
[ True, True],
[ True, True],
[ True, False]])
Since we want a full match (both x- and y-coordinates), use all() along axis 1 (across the columns):
In [6]: np.isin(data, outliers).all(axis=1)
Out[6]:
array([False, False, True, True, True, False, False, True, True,
False])
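In this dataset, the boolean mask marks exactly the rows that match an outlier. Note that np.isin() tests each value against every value in outliers, not row against row, so a row whose x and y each appear somewhere in outliers would also be flagged, even if that pair is not itself an outlier. All we need to do is invert the mask (since we want to filter out the outliers) and index into data with it: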
In [7]: data[~np.isin(data, outliers).all(axis=1)]
Out[7]:
x y
0 0.0 0.998440
1 1.0 2.188544
5 0.0 1.043314
6 1.0 1.733181
9 11.0 9.737930
From there, you can do whatever you like with the x and y columns.
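If a strict row-by-row comparison is needed, one possible alternative is a left merge with pandas' indicator column. This is only a sketch, assuming the same data and outliers DataFrames as above (rebuilt here from the printed values); the names merged, keep, and clean are illustrative.

```python
import pandas as pd

# Rebuilt from the values printed above.
data = pd.DataFrame({
    "x": [0.0, 1.0, 4.0, 7.0, 11.0, 0.0, 1.0, 4.0, 7.0, 11.0],
    "y": [0.998440, 2.188544, 7.572174, 6.138442, 11.737930,
          1.043314, 1.733181, 7.424136, 6.138442, 9.737930],
})
outliers = pd.DataFrame({
    "x": [4.0, 7.0, 11.0, 4.0, 7.0],
    "y": [7.572174, 6.138442, 11.737930, 7.424136, 6.138442],
})

# Left-join data against the de-duplicated outliers on both columns.
# indicator=True adds a "_merge" column: "both" means the (x, y) pair
# is an outlier, "left_only" means it is not.
merged = data.merge(outliers.drop_duplicates(), on=["x", "y"],
                    how="left", indicator=True)

# With unique outlier rows, the left merge yields one row per row of data,
# in the same order, so the mask can be applied by position.
keep = (merged["_merge"] == "left_only").to_numpy()
clean = data[keep]
print(clean)
```

Unlike the isin() mask, this only drops a row when both coordinates match the same outlier row.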
If you have separate 1D arrays, X and y, and the same for the outliers, you can zip them into tuples, put them in sets, and subtract the outlier set from the point set:
points = set(zip(X, y))
outliers = set(zip(outlier_x, outlier_y))
X_new, y_new = zip(*(points - outliers))
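Keep in mind that sets are unordered and drop duplicate points, so X_new and y_new may come back in a different order than the input. If the original order matters, a variant along these lines might help. It is a sketch that assumes plain 1D sequences (values copied from the question); outlier_points and kept are illustrative names.

```python
X = [0.0, 1.0, 4.0, 7.0, 11.0, 0.0, 1.0, 4.0, 7.0, 11.0]
y = [0.998440, 2.188544, 7.572174, 6.138442, 11.737930,
     1.043314, 1.733181, 7.424136, 6.138442, 9.737930]
outlier_x = [4.0, 7.0, 11.0, 4.0, 7.0]
outlier_y = [7.572174, 6.138442, 11.737930, 7.424136, 6.138442]

# Only the outliers go into a set, for fast membership tests.
# Note: if X is a column-shaped NumPy array (as in the question, after
# reshape), flatten it first with X.ravel(), because its rows are arrays
# and cannot be hashed.
outlier_points = set(zip(outlier_x, outlier_y))

# Walk the points in their original order and keep the non-outliers.
kept = [(xi, yi) for xi, yi in zip(X, y) if (xi, yi) not in outlier_points]

# Unzip back into two sequences; guard against an empty result.
X_new, y_new = zip(*kept) if kept else ((), ())
print(X_new)
print(y_new)
```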
Upvotes: 2
Reputation: 1180
[...], then the following should work:
X = [0, 1, 4, 7, 11, 0, 1, 4, 7, 11]
Y = [0.99844039, 2.188544418, 7.572173987, 6.138441957, 11.73792995, 1.043313797, 1.733181475, 7.424136351, 6.138441957, 9.73792995]
outlier_X = [4, 7, 11, 4, 7]
outlier_Y = [7.572173987, 6.138441957, 11.73792995, 7.424136351, 6.138441957]
final_X = []
final_Y = []
for xi, yi in zip(X, Y):
    # keep the point only if no outlier has the same (x, y) pair
    is_valid = not any(xi == ox and yi == oy for ox, oy in zip(outlier_X, outlier_Y))
    if is_valid:
        final_X.append(xi)
        final_Y.append(yi)
print(final_X)
print(final_Y)
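Exact == on floats only works when the outlier coordinates are bit-for-bit the same as the data. If the outliers were stored rounded, for example to the six decimals shown in the question, a tolerance-based comparison may be safer. The following is a rough sketch using math.isclose; the helper name is_outlier and the tolerance of 1e-6 are assumptions, not part of the original answer.

```python
import math

X = [0, 1, 4, 7, 11, 0, 1, 4, 7, 11]
Y = [0.99844039, 2.188544418, 7.572173987, 6.138441957, 11.73792995,
     1.043313797, 1.733181475, 7.424136351, 6.138441957, 9.73792995]
# Outliers rounded to six decimals, as printed in the question.
outlier_X = [4, 7, 11, 4, 7]
outlier_Y = [7.572174, 6.138442, 11.737930, 7.424136, 6.138442]

def is_outlier(xi, yi, tol=1e-6):
    """Return True if (xi, yi) is within tol of any outlier pair.

    tol is an assumed absolute tolerance; choose it to suit your data.
    """
    return any(
        math.isclose(xi, ox, abs_tol=tol) and math.isclose(yi, oy, abs_tol=tol)
        for ox, oy in zip(outlier_X, outlier_Y)
    )

final_X = []
final_Y = []
for xi, yi in zip(X, Y):
    if not is_outlier(xi, yi):
        final_X.append(xi)
        final_Y.append(yi)

print(final_X)
print(final_Y)
```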
Upvotes: 0
Reputation: 112
Here is the working solution, based on @sin tribu's answer:
X_new = []
y_new = []
for xi, yi in zip(X, y):
    # a point is valid only if it differs from every outlier in x or in y
    is_valid = all(xi != ox or yi != oy for ox, oy in zip(outlier_x, outlier_y))
    if is_valid:
        X_new.append(xi)
        y_new.append(yi)
Upvotes: 0