Reputation: 1840
Little intro
I have data (link at the bottom) with the score on the y-axis and the position on the x-axis, for different labels. I want to know whether any one label is "significantly" different from the others and from the "background". I have been playing with this for the last few weeks but can't figure it out (I tried watershedding, DBSCAN, LOF, and a couple more algorithms). I'm pretty sure there is a smart way to do this :).
Note that this is just one of many kinds of plots, and we can't always assume a k, as some have outliers and others don't.
Let's take a look at the plot to get an idea:
Here we can see that this olive color deviates (top score point circled in red):
Using DBSCAN
I thought of using DBSCAN, which does quite well, but the data seems to have some clear horizontal clustering into "bands" of lines, and I can't find a way to cluster such a pattern.
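For reference, a minimal DBSCAN sketch on raw (position, score) points; the data below is a synthetic stand-in (one flat band plus a single high point, since the real file isn't embedded here), and `eps`/`min_samples` are guesses you would need to tune:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic stand-in for the (position, score) points:
# a dense horizontal band plus one clearly higher point
band = np.column_stack([rng.uniform(0, 300, 200), rng.normal(0.15, 0.01, 200)])
outlier = np.array([[150.0, 0.4]])
points = np.vstack([band, outlier])

# scale both axes so eps means the same thing in x and y
X = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("noise point indices:", np.where(labels == -1)[0])
```

Scaling first matters here: positions span hundreds of units while scores span fractions, so without it `eps` is dominated by the x-axis.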
Description of band clusters
I thought it would be possible to cluster into something like the image below. I should note that since there are so many points, the plots only show the top 200 points per label, so x-y coordinates are not present at all positions and perhaps we can't call them "lines" anymore. For the outlier I can then probably just check:
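One simple way to get such band clusters, sketched under the assumption that each label hovers around one level: summarize each label by its median score and group those medians in 1-D, starting a new band whenever the gap between consecutive sorted medians is large. `band_clusters` and the `gap` value are made up for illustration:

```python
import numpy as np

def band_clusters(medians, gap=0.05):
    """Group 1-D values into bands: a new band starts whenever the
    gap to the previous (sorted) value exceeds `gap`."""
    order = np.argsort(medians)
    bands = np.empty(len(medians), dtype=int)
    band = 0
    bands[order[0]] = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if medians[cur] - medians[prev] > gap:
            band += 1
        bands[cur] = band
    return bands

# hypothetical per-label median scores: two bands plus one stray label
medians = np.array([0.15, 0.16, 0.14, 0.30, 0.31, 0.64])
print(band_clusters(medians))  # → [0 0 0 1 1 2]
```

A label that ends up alone in its own band (like the last one here) would be the outlier candidate.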
Data
I put the data for the plot shown on pastebin; part of it is here:
28 1 0.16
17 1 0.14705882352941177
12 1 0.16
54 1 0.16666666666666666
2 1 0.18
8 1 0.11
42 1 0.14705882352941177
16 1 0.14705882352941177
44 1 0.19607843137254902
1 1 0.4
36 1 0.16
55 1 0.12745098039215685
50 1 0.12745098039215685
22 1 0.16666666666666666
46 1 0.1568627450980392
5 1 0.13
...
where the first column is the label (color), the second column the position (x-axis), and the last column the score (y-axis).
Thanks a lot, I'm really curious whether there are some cool ideas for this. I've been breaking my head over this for the last couple of weeks :)
Upvotes: 2
Views: 174
Reputation: 1419
A baseline approach would be to use standard signal processing algorithms. For example, one approach is to calculate the signal height (level) at its peak width. Conveniently, the signal you identified as an outlier has a level of 0.64 at its peak width, which is above the maximum values of all the other signals:
Here is how you can do this:
Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import find_peaks, peak_prominences, peak_widths
Data Loading:
(I've downloaded the data in csv format)
df = pd.read_csv('data.csv', sep='\t', header=None)
df = df.rename(columns={0: 'label', 1: 'x', 2: 'y'})
# sort by x just to be safe
df_sorted = df.sort_values(by=['x'])
Approach:
def get_peak_width_height(y, prominence_threshold=0.2):
    peaks, _ = find_peaks(y)
    prominences = peak_prominences(y, peaks)[0]
    # filter out small peaks; the default prominence threshold is 0.2, but you can play with it
    indices = np.where(prominences > prominence_threshold)[0]
    if len(indices) == 0:
        return 0  # if no peak above the threshold was found, return 0
    widths, width_heights, left_ips, right_ips = peak_widths(y, peaks[indices])
    # return the max signal level at which the width is measured
    return width_heights.max()
outliers = []
# iterate over each label
for i in df_sorted.label.unique():
    # get the reference data
    y = df_sorted[df_sorted.label == i].y.values
    # get the rest of the data (without the reference)
    y_rest = df_sorted[df_sorted.label != i].y.values
    # calculate the peak width level of the reference data
    peak_width = get_peak_width_height(y)
    # if all of the rest of the data lies below this level, the reference is an outlier
    if np.all(y_rest < peak_width):
        outliers.append(i)
print(f"Outlier labels: {outliers}")
Alternatively, we can check whether, say, 90% of each signal in "the rest" lies below the reference peak width level; if so, the reference is an outlier. This gives you some control for fine-tuning the threshold.
outlier_threshold = 0.9
outliers = []
for i in df_sorted.label.unique():
    y = df_sorted[df_sorted.label == i].y.values
    peak_width = get_peak_width_height(y)
    criteria_met = True
    for _, df_group in df_sorted[df_sorted.label != i].groupby('label'):
        y_rest = df_group.y.values
        # calculate the fraction of data points that lie
        # below the reference level
        if len(y_rest[y_rest < peak_width]) / len(y_rest) < outlier_threshold:
            criteria_met = False
            break
    # if at least 90% (outlier_threshold) of every other signal's points
    # lie below the reference level, flag the reference as an outlier
    if criteria_met:
        outliers.append(i)
print(f"Outlier labels: {outliers}")
Another baseline approach is to calculate the pairwise differences between the signals and the areas under the curves of those differences, then find some optimal threshold: if all of a signal's pairwise areas are above it, that signal is an outlier.
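A sketch of that idea, using scipy's trapezoid rule on toy equal-length signals (the signals and the threshold of 30 are made up for illustration):

```python
import numpy as np
from scipy.integrate import trapezoid

def pairwise_diff_areas(signals, x):
    """Area under |s_i - s_j| for every pair of equal-length signals."""
    n = len(signals)
    areas = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            areas[i, j] = trapezoid(np.abs(signals[i] - signals[j]), x)
    return areas

# toy signals on a common grid; the last one sits far from the others
x = np.linspace(0, 300, 50)
signals = np.array([np.full(50, 0.15), np.full(50, 0.16), np.full(50, 0.60)])
areas = pairwise_diff_areas(signals, x)

# flag a signal when ALL its pairwise areas exceed the threshold
threshold = 30.0
mask = ~np.eye(len(signals), dtype=bool)
outliers = [i for i in range(len(signals)) if np.all(areas[i][mask[i]] > threshold)]
print(outliers)  # → [2]
```

The real signals would first need to be put on a common grid, e.g. via the interpolation step described below for LocalOutlierFactor.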
Another approach is to use LocalOutlierFactor, IsolationForest, or another unsupervised outlier or anomaly detection model. But first, all the signals need to be transformed to the same length. For that purpose, you can do piecewise linear interpolation per label and fit a LocalOutlierFactor model, where each sample represents one unique label. However, this approach is very sensitive to the n_neighbors parameter. Heuristically I got 18, but I recommend you play with it to see how it affects the results. In most cases you will get 43 and 1 as the outlier labels; in some sense label=1 can also be considered an outlier.
import numpy as np
from scipy.interpolate import interp1d
from sklearn.neighbors import LocalOutlierFactor

def interpolate(x, y, number_of_points=1000, x_min=0, x_max=300):
    f = interp1d(x, y, fill_value='extrapolate', kind='linear')
    x_new = np.linspace(x_min, x_max, number_of_points)
    return f(x_new)

# interpolate so that all signals have the same length
signals = []
x_min = df.x.min()
x_max = df.x.max()
for i in df_sorted.label.unique():
    x = df_sorted[df_sorted.label == i].x.values
    y = df_sorted[df_sorted.label == i].y.values
    signals.append(interpolate(x, y, x_min=x_min, x_max=x_max))
X = np.array(signals)

model = LocalOutlierFactor(n_neighbors=18)
# fit with each signal as one data point
pred = model.fit_predict(X)
print(f"Outlier labels: {df_sorted.label.unique()[pred == -1]}")
It's also worth trying some time series anomaly or outlier detection approaches.
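As one such baseline (a sketch, not the method above): after the interpolation step, compute a pointwise robust z-score across labels (median/MAD per position) and rank labels by their worst deviation. The toy signals below stand in for the interpolated X:

```python
import numpy as np

# toy equal-length signals (rows = labels), standing in for the
# interpolated X above; the last row gets an anomalous bump
signals = np.tile(np.linspace(0.1, 0.2, 100), (5, 1))
signals += np.random.default_rng(1).normal(0, 0.005, signals.shape)
signals[-1, 40:45] += 0.4

# pointwise robust z-score across labels (median/MAD per position)
med = np.median(signals, axis=0)
mad = np.median(np.abs(signals - med), axis=0) + 1e-9
z = np.abs(signals - med) / mad
scores = z.max(axis=1)
print("most anomalous label:", int(np.argmax(scores)))
```

Using the median and MAD instead of mean and standard deviation keeps the baseline itself from being dragged toward the outlier.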
Upvotes: 1