CodeNoob

Reputation: 1840

Clustering lines in bands

Little intro

I have data (link at the bottom) with the score on the y-axis and the position on the x-axis, for different labels. I want to know whether there is one label that is "significantly" different from the others and from the "background". I've been playing with this for the last few weeks but can't seem to figure it out (I've tried watershed segmentation, DBSCAN, LOF, and a couple more algorithms). I'm pretty sure there is a smart way to do this :).

Note that this is just one of many kinds of plots, so we can't always assume a fixed k: some plots have outliers and others don't.

Let's take a look at the plot to get an idea: [plot of score (y) vs. position (x), one color per label]

Here we can see that the olive color deviates (top score point circled in red): [plot with the olive line's top point circled]

Using DBSCAN

I thought of using DBSCAN, which does quite well: [DBSCAN clustering result] However, the data seems to have clear horizontal "bands" of lines, and I can't find a way to cluster that pattern.
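For reference, the DBSCAN pass was roughly along these lines (the `eps`/`min_samples` values and the toy data here are just stand-ins, not the exact setup from the plot):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# toy stand-in for the (position, score) points from the real data
rng = np.random.default_rng(0)
background = np.column_stack([rng.uniform(0, 300, 500), rng.uniform(0.1, 0.2, 500)])
outlier = np.column_stack([rng.uniform(0, 300, 50), rng.uniform(0.5, 0.6, 50)])
X = np.vstack([background, outlier])

# scores live on [0, 1] while positions span hundreds of units, so scale first
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# -1 marks noise points; the sparse high-score band mostly ends up as noise
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_scaled)
```

The scaling step matters: without it, `eps` is dominated by the x-axis and the vertical band structure is invisible to the distance metric.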

Description of band clusters

I thought it would be possible to cluster into something like the image below. I should note that since there are so many points, the plots only show the top 200 points per label, so not every x-y coordinate is present at all positions; perhaps we can't call them "lines" anymore. For the outlier I can then probably just check the top band: [sketch of the desired band clusters]

Data

I put the data for the plot shown on pastebin; part of it here:

28  1   0.16
17  1   0.14705882352941177
12  1   0.16
54  1   0.16666666666666666
2   1   0.18
8   1   0.11
42  1   0.14705882352941177
16  1   0.14705882352941177
44  1   0.19607843137254902
1   1   0.4
36  1   0.16
55  1   0.12745098039215685
50  1   0.12745098039215685
22  1   0.16666666666666666
46  1   0.1568627450980392
5   1   0.13
...

where the first column is the label (color), the second column the position (x-axis), and the last column the score (y-axis).
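As a quick sanity check, the excerpt above can be loaded like this (column names are my own; note that label 1's score of 0.4 already stands out at this position):

```python
import pandas as pd
from io import StringIO

# a few rows of the excerpt above (tab-separated: label, position, score)
sample = """28\t1\t0.16
17\t1\t0.14705882352941177
2\t1\t0.18
8\t1\t0.11
44\t1\t0.19607843137254902
1\t1\t0.4
"""
df = pd.read_csv(StringIO(sample), sep='\t', header=None,
                 names=['label', 'x', 'y'])

# which label has the highest score at this position?
print(df.loc[df.y.idxmax(), 'label'])  # -> 1
```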

Thanks a lot, really curious whether there are some cool ideas for this. I've been breaking my head over this for the last couple of weeks :)

Upvotes: 2

Views: 174

Answers (1)

Vardan Grigoryants

Reputation: 1419

A baseline approach would be to use standard signal-processing algorithms. For example, one approach is to calculate the signal height (level) at its peak width. Conveniently, the signal you identified as an outlier has a peak-width level of 0.64, which is above the maximum values of all the other signals:

Outlier Signal

Here is how you can do this:

Imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, peak_prominences, peak_widths

Data loading (I've downloaded the data in CSV format):

df = pd.read_csv('data.csv', sep='\t', header=None)
df = df.rename(columns={0: 'label', 1: 'x', 2: 'y'})
# sort by x just to be safe
df_sorted = df.sort_values(by=['x'])

Approach:

def get_peak_width_height(y, prominence_threshold=0.2):
    peaks, _ = find_peaks(y)
    prominences = peak_prominences(y, peaks)[0]

    # filter out small peaks; default prominence is 0.2, but you can play with it
    indices = np.where(prominences > prominence_threshold)[0]

    if len(indices) == 0:
        return 0  # no peak above the threshold was found

    widths, width_heights, left_ips, right_ips = peak_widths(y, peaks[indices])

    # return the max signal level at which a width (period) is measured
    return width_heights.max()

outliers = []

# iterate over each label
for i in df_sorted.label.unique():
    # get reference data
    y = df_sorted[df_sorted.label==i].y.values
    
    # get the rest data (without reference)
    y_rest = df_sorted[df_sorted.label!=i].y.values
    
    # calculate peak width level of reference data 
    peak_width = get_peak_width_height(y)

    # if all of the rest of the data is below the reference signal's
    # peak-width height, then it is an outlier
    if np.all(y_rest < peak_width):
        outliers.append(i)

print(f"Outlier labels: {outliers}")

Alternatively, we can check whether, say, 90% of each signal in "the rest" is below the reference signal's peak-width level; if so, the reference is an outlier. This gives you some control for fine-tuning the threshold.

outlier_threshold = 0.9
outliers = []

for i in df_sorted.label.unique():
    
    y = df_sorted[df_sorted.label==i].y.values
    peak_width = get_peak_width_height(y)
    
    criteria_met = True
    for _, df_group in df_sorted[df_sorted.label!=i].groupby('label'):
        y_rest = df_group.y.values

        # calculate the portion of data points that lie
        # below the reference threshold
        if len(y_rest[y_rest<peak_width])/len(y_rest) < outlier_threshold:
            criteria_met = False
            break

    # if 90% (outlier_threshold) of every other signal's data points
    # are below the reference threshold, flag the reference as outlier
    if criteria_met:
        outliers.append(i)

print(f"Outlier labels: {outliers}")

Another baseline approach is to calculate the pairwise differences between the signals and the areas under the curves of those differences, then find some optimal threshold: if all of a signal's pairwise areas are above it, that signal is an outlier.
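A rough sketch of that pairwise-area idea (the function name and threshold are made up, and it assumes the signals have already been resampled onto a common x grid, as in the interpolation step further down):

```python
import numpy as np

def pairwise_area_outliers(signals, labels, threshold=5.0):
    """signals: 2D array, one row per label, all on the same x grid."""
    n = len(signals)
    outliers = []
    for i in range(n):
        # rectangle-rule area under |signal_i - signal_j| (unit spacing)
        areas = [np.abs(signals[i] - signals[j]).sum()
                 for j in range(n) if j != i]
        # if signal i is far from *all* others, flag it
        if min(areas) > threshold:
            outliers.append(labels[i])
    return outliers

# toy example: three similar flat signals plus one shifted up
sigs = np.array([np.full(100, 0.15),
                 np.full(100, 0.16),
                 np.full(100, 0.14),
                 np.full(100, 0.60)])
result = pairwise_area_outliers(sigs, labels=[1, 2, 3, 4])
print(result)  # -> [4]
```

Using the *minimum* pairwise area is the key design choice: a signal is only flagged when it is far from every other signal, so two similar outliers would shield each other.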

Another approach is to use LocalOutlierFactor, IsolationForest, or another unsupervised outlier/anomaly-detection model. First, though, all the signals must be transformed to have the same length. For that purpose you can do piecewise linear interpolation per label and then fit a LocalOutlierFactor model, where each sample represents a unique label. However, this approach is very sensitive to the n_neighbors parameter. Heuristically I got 18, but I recommend playing with it to see how it affects the results. In most cases you will get 43 and 1 as outlier labels; in some sense label=1 can also be considered an outlier.

import numpy as np
from scipy.interpolate import interp1d
from sklearn.neighbors import LocalOutlierFactor

def interpolate(x, y, number_of_points=1000, x_min=0, x_max=300):
    f = interp1d(x, y, fill_value='extrapolate', kind='linear')
    x_new = np.linspace(x_min, x_max, number_of_points)
    return f(x_new)

# interpolate so that all signals have the same length
signals = []
x_min = df.x.min()
x_max = df.x.max()
for i in df_sorted.label.unique():
    x = df_sorted[df_sorted.label==i].x.values
    y = df_sorted[df_sorted.label==i].y.values
    signals.append(interpolate(x, y, x_min=x_min, x_max=x_max))

X = np.array(signals)

model = LocalOutlierFactor(n_neighbors=18)

# Fit each signal as data point
pred = model.fit_predict(X)

print(f"Outlier labels: {df_sorted.label.unique()[pred == -1]}")

It's also worth trying some time-series anomaly- or outlier-detection approaches.
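For instance, the IsolationForest variant mentioned above can be dropped in for LocalOutlierFactor on the same interpolated matrix (toy data here, and the `contamination` value is a guess you'd need to tune):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# toy stand-in for the interpolated signal matrix X above:
# 9 near-identical background signals plus one elevated signal
rng = np.random.default_rng(42)
X = rng.normal(0.15, 0.01, size=(9, 100))
X = np.vstack([X, rng.normal(0.6, 0.01, size=(1, 100))])

# contamination sets the expected fraction of outliers (here 1 in 10)
model = IsolationForest(contamination=0.1, random_state=0)
pred = model.fit_predict(X)  # -1 marks outliers

print(np.where(pred == -1)[0])
```

Unlike LocalOutlierFactor there is no `n_neighbors` to tune, but `contamination` plays a similar role: it fixes how many signals get flagged, so it shifts the problem rather than removing it.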

Upvotes: 1
