Reputation: 1840
Little intro
I have data (link at the bottom) with the score on the y-axis and the position on the x-axis, for different labels. I want to know whether any one label is "significantly" different from the others and from the "background". I have been playing with this for the last few weeks but can't figure it out (I tried watershedding, DBSCAN, LOF, and a couple more algorithms). I'm pretty sure there is a smart way to do this :).
Note that this is just one of many kinds of plots, and we can't always assume a k, as some have outliers and others don't.
Let's take a look at the plot to get an idea:
Here we can see that this olive color deviates (top score point circled in red):
Using DBSCAN
I thought of using DBSCAN, which does quite well, but the data seems to have some clear horizontal clustering into "bands" of lines, and I can't find a way to cluster such a pattern.
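For reference, a minimal DBSCAN sketch on raw (position, score) points; the data below is a synthetic stand-in (one flat band plus a single high point, since the real file isn't embedded here), and `eps`/`min_samples` are guesses you would need to tune:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-in for the (position, score) points:
# a dense horizontal band plus one clearly higher point
band = np.column_stack([rng.uniform(0, 300, 200), rng.normal(0.15, 0.01, 200)])
outlier = np.array([[150.0, 0.4]])
points = np.vstack([band, outlier])

# scale both axes so eps means the same thing in x and y
X = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("noise point indices:", np.where(labels == -1)[0])
```

Scaling first matters here: positions span hundreds of units while scores span fractions, so without it `eps` is dominated by the x-axis.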
Description of band clusters
I thought it would be possible to cluster into something like the image below. I should note that since there are so many points, the plots only show the top 200 points per label, so x-y coordinates are not present at all positions and perhaps we can't call them "lines" anymore. For the outlier I can then probably just check:
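One simple way to get such band clusters, sketched under the assumption that each label hovers around one level: summarize each label by its median score and group those medians in 1-D, starting a new band whenever the gap between consecutive sorted medians is large. `band_clusters` and the `gap` value are made up for illustration:

```python
import numpy as np

def band_clusters(medians, gap=0.05):
    """Group 1-D values into bands: a new band starts whenever the
    gap to the previous (sorted) value exceeds `gap`."""
    order = np.argsort(medians)
    bands = np.empty(len(medians), dtype=int)
    band = 0
    bands[order[0]] = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if medians[cur] - medians[prev] > gap:
            band += 1
        bands[cur] = band
    return bands

# hypothetical per-label median scores: two bands plus one stray label
medians = np.array([0.15, 0.16, 0.14, 0.30, 0.31, 0.64])
print(band_clusters(medians))  # → [0 0 0 1 1 2]
```

A label that ends up alone in its own band (like the last one here) would be the outlier candidate.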
Data
I put the data for the plot shown on pastebin; part of it is here:
28 1 0.16
17 1 0.14705882352941177
12 1 0.16
54 1 0.16666666666666666
2 1 0.18
8 1 0.11
42 1 0.14705882352941177
16 1 0.14705882352941177
44 1 0.19607843137254902
1 1 0.4
36 1 0.16
55 1 0.12745098039215685
50 1 0.12745098039215685
22 1 0.16666666666666666
46 1 0.1568627450980392
5 1 0.13
...
where the first column is the label (color), the second column the position (x-axis), and the last column the score (y-axis).
Thanks a lot, I'm really curious whether there are some cool ideas for this. I've been breaking my head over this for the last couple of weeks :)
Upvotes: 2
Views: 174
Reputation: 1419
A baseline approach would be to use standard signal processing algorithms. For example, one approach is to calculate the signal height (level) at its peak width. Conveniently, the signal you identified as an outlier has a level of 0.64 at its peak width, which is above the maximum values of all the other signals:
Here is how you can do this:
Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, peak_prominences, peak_widths
Data Loading:
(I've downloaded the data in csv format)
df = pd.read_csv('data.csv', sep='\t', header=None)
df = df.rename(columns={0: 'label', 1: 'x', 2: 'y'})
# sort by x just to be safe
df_sorted = df.sort_values(by=['x'])
Approach:
def get_peak_width_height(y, prominence_threshold=0.2):
    peaks, _ = find_peaks(y)
    prominences = peak_prominences(y, peaks)[0]
    # filter out small peaks; the default prominence threshold is 0.2, but you can play with it
    indices = np.where(prominences > prominence_threshold)[0]
    if len(indices) == 0:
        return 0  # if no peak above the threshold was found, return 0
    widths, width_heights, left_ips, right_ips = peak_widths(y, peaks[indices])
    # return the max signal level at which the width is measured
    return width_heights.max()
outliers = []
# iterate over each label
for i in df_sorted.label.unique():
    # get the reference data
    y = df_sorted[df_sorted.label == i].y.values
    # get the rest of the data (without the reference)
    y_rest = df_sorted[df_sorted.label != i].y.values
    # calculate the peak width level of the reference data
    peak_width = get_peak_width_height(y)
    # if all of the rest of the data lies below this level, the reference is an outlier
    if np.all(y_rest < peak_width):
        outliers.append(i)
print(f"Outlier labels: {outliers}")
Alternatively, we can check whether, say, 90% of each signal in "the rest" lies below the reference peak width level; if so, the reference is an outlier. This gives you some control for fine-tuning the threshold.
outlier_threshold = 0.9
outliers = []
for i in df_sorted.label.unique():
    y = df_sorted[df_sorted.label == i].y.values
    peak_width = get_peak_width_height(y)
    criteria_met = True
    for _, df_group in df_sorted[df_sorted.label != i].groupby('label'):
        y_rest = df_group.y.values
        # calculate the fraction of data points that lie
        # below the reference level
        if len(y_rest[y_rest < peak_width]) / len(y_rest) < outlier_threshold:
            criteria_met = False
            break
    # if at least 90% (outlier_threshold) of every other signal's points
    # lie below the reference level, flag the reference as an outlier
    if criteria_met:
        outliers.append(i)
print(f"Outlier labels: {outliers}")
Another baseline approach is to calculate the pairwise differences between the signals and the areas under the curves of those differences, then find some optimal threshold: if all of a signal's pairwise areas are above it, that signal is an outlier.
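A sketch of that idea, using scipy's trapezoid rule on toy equal-length signals (the signals and the threshold of 30 are made up for illustration):

```python
import numpy as np
from scipy.integrate import trapezoid

def pairwise_diff_areas(signals, x):
    """Area under |s_i - s_j| for every pair of equal-length signals."""
    n = len(signals)
    areas = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            areas[i, j] = trapezoid(np.abs(signals[i] - signals[j]), x)
    return areas

# toy signals on a common grid; the last one sits far from the others
x = np.linspace(0, 300, 50)
signals = np.array([np.full(50, 0.15), np.full(50, 0.16), np.full(50, 0.60)])
areas = pairwise_diff_areas(signals, x)

# flag a signal when ALL its pairwise areas exceed the threshold
threshold = 30.0
mask = ~np.eye(len(signals), dtype=bool)
outliers = [i for i in range(len(signals)) if np.all(areas[i][mask[i]] > threshold)]
print(outliers)  # → [2]
```

The real signals would first need to be put on a common grid, e.g. via the interpolation step described below for LocalOutlierFactor.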
Another approach is to use LocalOutlierFactor, IsolationForest, or another unsupervised outlier or anomaly detection model. But first, all the signals need to be transformed to the same length. For that purpose, you can do piecewise linear interpolation per label and fit a LocalOutlierFactor model, where each sample represents one unique label. However, this approach is very sensitive to the n_neighbors parameter. Heuristically I got 18, but I recommend you play with it to see how it affects the results. In most cases you will get 43 and 1 as the outlier labels; in some sense label=1 can also be considered an outlier.
import numpy as np
from scipy.interpolate import interp1d
from sklearn.neighbors import LocalOutlierFactor

def interpolate(x, y, number_of_points=1000, x_min=0, x_max=300):
    f = interp1d(x, y, fill_value='extrapolate', kind='linear')
    x_new = np.linspace(x_min, x_max, number_of_points)
    return f(x_new)

# interpolate so that all signals have the same length
signals = []
x_min = df.x.min()
x_max = df.x.max()
for i in df_sorted.label.unique():
    x = df_sorted[df_sorted.label == i].x.values
    y = df_sorted[df_sorted.label == i].y.values
    signals.append(interpolate(x, y, x_min=x_min, x_max=x_max))
X = np.array(signals)

model = LocalOutlierFactor(n_neighbors=18)
# fit with each signal as one data point
pred = model.fit_predict(X)
print(f"Outlier labels: {df_sorted.label.unique()[pred == -1]}")
It's also worth trying some time series anomaly or outlier detection approaches.
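As one such baseline (a sketch, not the method above): after the interpolation step, compute a pointwise robust z-score across labels (median/MAD per position) and rank labels by their worst deviation. The toy signals below stand in for the interpolated X:

```python
import numpy as np

# toy equal-length signals (rows = labels), standing in for the
# interpolated X above; the last row gets an anomalous bump
signals = np.tile(np.linspace(0.1, 0.2, 100), (5, 1))
signals += np.random.default_rng(1).normal(0, 0.005, signals.shape)
signals[-1, 40:45] += 0.4

# pointwise robust z-score across labels (median/MAD per position)
med = np.median(signals, axis=0)
mad = np.median(np.abs(signals - med), axis=0) + 1e-9
z = np.abs(signals - med) / mad
scores = z.max(axis=1)
print("most anomalous label:", int(np.argmax(scores)))
```

Using the median and MAD instead of mean and standard deviation keeps the baseline itself from being dragged toward the outlier.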
Upvotes: 1