Victor Juliet

Reputation: 1092

Finding very large jumps in data

I need to find only the very large jumps, so that I can identify clusters and, later, the noise as well. Sample data:

0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.000000
0.000500
0.001500
0.003000
0.005500
0.008700
0.012400
0.012400

I need to do this in python, but any generic algorithm would be welcome as well.

I have already tried

  1. Finding the distance between each consecutive pair of points.
  2. Finding the ratio of consecutive distances.
  3. Finding the closeness of consecutive ratios.

The problem I face is that the comparison function numpy.allclose() uses a static tolerance; for varying degrees of jump it stops working and produces false positives and false negatives.
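One way around the static tolerance of numpy.allclose() is to derive the threshold from the step sizes themselves, for example with the median absolute deviation. A minimal sketch (the two-cluster sample and the factor 5 are made-up choices, not from the question):

```python
import numpy as np

# hypothetical two-cluster sample shaped like the question's data
data = np.array([0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124,
                 0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124])

steps = np.abs(np.diff(data))              # consecutive distances
med = np.median(steps)
mad = np.median(np.abs(steps - med))       # median absolute deviation
threshold = med + 5 * mad                  # scales with the data, unlike a fixed tolerance
jump_idx = np.where(steps > threshold)[0]  # index i means a jump between data[i] and data[i + 1]
print(jump_idx)                            # -> [6], the single large drop between the clusters
```

Because the threshold is computed from the distances themselves, it adapts when the overall scale of the jumps changes, which is exactly where a fixed tolerance breaks down.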

Some graphs for data visualization (the bottom graph in each shows the total number of points).

Upvotes: 1

Views: 7584

Answers (3)

Imanol Luengo

Reputation: 15889

This is nothing fancy, but you could try it: using forward and backward differences you can detect single outliers. It will fail in more complex cases where multiple outliers lie together, but for simple cases it might work:

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(20)
# Synthetic data
sample = np.random.randn(20)
# Synthetic noise: large offsets at 5 random positions
sample[np.random.randint(0, 20, 5)] += np.random.randn(5) * 100

plt.plot(x, sample, 'o')


Get the forward and backward derivatives (their absolute values, since we care about the magnitude of the differences, not their direction):

d1 = np.r_[0, np.abs(sample[1:] - sample[:-1])]
d2 = np.r_[np.abs(sample[1:] - sample[:-1]), 0]

Mask the inliers with a threshold (each inlier has at least one other node close by):

mask = (d1 < 5) | (d2 < 5)

Show results:

plt.plot(x[mask], sample[mask], 'o')

The hardcoded 5 in the example could be replaced by the mean, or the median plus some multiple of the standard deviation, or something similar.
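A sketch of that replacement, using the median of the differences plus a multiple of their median absolute deviation (the factor 3 and the toy data are arbitrary choices):

```python
import numpy as np

# toy series with two isolated outliers at indices 4 and 8
sample = np.array([1.0, 1.2, 0.9, 1.1, 50.0, 1.0, 0.8, 1.1, -40.0, 0.9, 1.2, 1.0])

d1 = np.r_[0, np.abs(np.diff(sample))]       # backward differences
d2 = np.r_[np.abs(np.diff(sample)), 0]       # forward differences
med = np.median(d1)
mad = np.median(np.abs(d1 - med))            # robust spread of the differences
thresh = med + 3 * mad                       # data-driven replacement for the hardcoded 5
mask = (d1 < thresh) | (d2 < thresh)         # inliers have at least one close neighbour
print(np.where(~mask)[0])                    # -> [4 8]
```

Using the median and MAD rather than the mean and standard deviation keeps the threshold from being inflated by the very outliers you are trying to find.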

Again, it is nothing fancy: as @septi pointed out, there is a lot of theory in outlier detection and no single easy approach solves every problem. For more information you can look into density-based outlier detection, which I think is suitable for your problem.
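As a toy illustration of the density-based idea in plain NumPy (the radius and data are made up; real density-based methods such as LOF or DBSCAN are more principled):

```python
import numpy as np

vals = np.array([1.0, 1.2, 0.9, 1.1, 50.0, 1.0, 0.8, 1.1, -40.0, 0.9, 1.2, 1.0])

# pairwise distances; a point with no neighbour within the radius is an outlier
radius = 5.0
dist = np.abs(vals[:, None] - vals[None, :])
neighbours = (dist < radius).sum(axis=1) - 1   # subtract 1 to exclude the point itself
outliers = np.where(neighbours == 0)[0]
print(outliers)                                # -> [4 8]
```

Unlike the difference-based mask above, this looks at values rather than positions, so it also handles several outliers sitting next to each other, as long as they are not close to each other in value.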


Upvotes: 0

farhawa

Reputation: 10398

Your approach works if you compute the closeness "manually" this way:

import numpy as np
data = np.array([0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
        0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
        0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
        0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
        0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
        0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
        0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 
        0.012400])
steps = data[1:] - data[:-1]
ratios = 1. * steps[1:] / steps[:-1]
jumps = ratios[1:] - ratios[:-1]
largest_jumps = np.max(jumps)
print(largest_jumps)

>> 3.31102877071
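If the goal is the positions of the jumps rather than just the size of the largest ratio change, the negative steps already mark them in data shaped like the question's (a sketch on a shortened made-up copy of that data):

```python
import numpy as np

data = np.array([0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124] * 3)
steps = np.diff(data)
# each large negative step is the reset at the start of a new cluster
reset_idx = np.where(steps < 0)[0] + 1     # +1: first index of the following cluster
print(reset_idx)                           # -> [ 7 14]
```

Splitting with np.split(data, reset_idx) would then recover the individual clusters.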

Upvotes: 2

tamasgal

Reputation: 26259

First, you should visualise your problem to get a better understanding of what's going on:

import matplotlib.pyplot as plt
data = (0.000000, 0.000500, 0.001500, 0.003000, 0.005500, 0.008700,
        0.012400, 0.000000, 0.000500, 0.001500, 0.003000, 0.005500,
        0.008700, 0.012400, 0.000000, 0.000500, 0.001500, 0.003000,
        0.005500, 0.008700, 0.012400, 0.000000, 0.000500, 0.001500,
        0.003000, 0.005500, 0.008700, 0.012400, 0.000000, 0.000500,
        0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 0.000000,
        0.000500, 0.001500, 0.003000, 0.005500, 0.008700, 0.012400, 
        0.012400)
plt.scatter(range(len(data)), data)
plt.show()

Scatter plot of data

Second, you need to implement step detection, which is well described on Wikipedia: http://en.wikipedia.org/wiki/Step_detection
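One of the simplest schemes described there is a sliding-window mean difference; a sketch on a synthetic signal (the window size w and the signal are arbitrary choices):

```python
import numpy as np

signal = np.r_[np.zeros(20), np.ones(20) * 5.0]   # one clean step at index 20
w = 5

# score each position by how much the mean of the next w samples
# differs from the mean of the previous w samples
score = np.array([abs(signal[i:i + w].mean() - signal[i - w:i].mean())
                  for i in range(w, len(signal) - w)])
step_at = np.argmax(score) + w                    # undo the offset of the scanned range
print(step_at)                                    # -> 20
```

Larger windows average out noise but blur steps that sit close together, which is the usual trade-off in this family of methods.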

Choose a method you think would fit best and play around with it.

UPDATE

Just a thought: if all your data look similar to your example, you could also simply try a least-squares fit (http://en.wikipedia.org/wiki/Least_squares) of a sawtooth wave (http://en.wikipedia.org/wiki/Sawtooth_wave) to find the "jumps". This could be a starting point for further analysis.
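A rough sketch of that idea, scanning candidate periods with a linear sawtooth model instead of a full nonlinear fit (the period range and the min-to-max ramp are simplifying assumptions; the data is a made-up copy of the question's pattern):

```python
import numpy as np

data = np.array([0.0, 0.0005, 0.0015, 0.003, 0.0055, 0.0087, 0.0124] * 6)
n = np.arange(len(data))

# for each candidate period, compare the data against a sawtooth
# ramping linearly from the data minimum to the data maximum
errors = {}
for period in range(2, 15):
    phase = n % period
    model = data.min() + (data.max() - data.min()) * phase / (period - 1)
    errors[period] = np.sum((data - model) ** 2)

best_period = min(errors, key=errors.get)
print(best_period)                         # -> 7
```

Only the correct period lines the model's resets up with the data's, so its residual is far smaller than any other candidate's; the recovered period then gives the jump positions directly.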

Upvotes: 7
