shantanuo

Reputation: 32286

How to find values below (or above) average

As you can see from the following summary, the count for 1 Sep (1542677) is way below the average count per month.

from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
import pandas as pd

myst = """01/01/2016  8781262
01/02/2016  8958598
01/03/2016  8787628
01/04/2016  9770861
01/05/2016  8409410
01/06/2016  8924784
01/07/2016  8597500
01/08/2016  6436862
01/09/2016  1542677
"""
u_cols = ['month', 'count']

# read the tab-separated text into a two-column DataFrame
df = pd.read_csv(StringIO(myst), sep='\t', names=u_cols)

Is there a mathematical formula that can define this "way below or too high" (ambiguous) concept?

This is easy if I define a fixed limit (e.g. 9 or 10%). But I want the script to decide that for me, and return the value if, say, the difference between the lowest and the second-lowest count is more than 5% of the overall total. In this case the September count should be returned.
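For illustration, one possible reading of that rule could look like this (the 5% is taken relative to the overall total, which is just an assumption on my part):

# flag the lowest month if the gap to the second-lowest exceeds 5% of the total
lowest, second_lowest = df['count'].nsmallest(2)
if (second_lowest - lowest) > 0.05 * df['count'].sum():
    print(df.loc[df['count'].idxmin()])   # for this data, the September row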

Upvotes: 6

Views: 5411

Answers (3)

fr_andres

Reputation: 6687

First of all, the "way below or too high" concept you refer to is known as an outlier, and quoting Wikipedia (not the best source):

There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

But on the other hand:

In general, if the nature of the population distribution is known a priori, it is possible to test if the number of outliers deviate significantly from what can be expected.

So in my opinion this boils down to the question of whether it is possible to make assumptions about the nature of your data, in order to automate such decisions.

STRAIGHTFORWARD APPROACH

If you are lucky enough to have a relatively big sample size, and your different samples aren't correlated, you can apply the central limit theorem, which suggests that your values (each being the aggregate of many independent events) will approximately follow a normal distribution (see this for a python-related explanation).

In this context, you can quickly get the mean value and standard deviation of the given dataset. By applying the corresponding normal distribution (with these two parameters) to each given value you can calculate its probability of belonging to the "cluster" (see this stackoverflow post for a possible python solution).

Then you do have to put a lower bound, since this distribution assigns 0% probability only to a point infinitely far away from the mean value. But the good thing is that (if the assumptions hold) this bound adapts nicely to each different dataset, because of its exponential, normalized nature. This bound is typically expressed in sigma units, and is widely used in science and statistics. As a matter of fact, the 2013 Nobel Prize in Physics, tied to the discovery of the Higgs boson, was only awarded after a 5-sigma significance had been reached; quoting the link:

High-energy physics requires even lower p-values to announce evidence or discoveries. The threshold for "evidence of a particle," corresponds to p=0.003, and the standard for "discovery" is p=0.0000003.
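To make the above concrete, here is a minimal sketch (not the linked solution) that assumes the counts are roughly normal and uses scipy's normal distribution; the 2-sigma cutoff is an arbitrary illustrative choice:

from scipy import stats

mu = df['count'].mean()
sigma = df['count'].std(ddof=0)            # population standard deviation

sigmas = (df['count'] - mu).abs() / sigma  # distance from the mean, in sigma units
p_values = 2 * stats.norm.sf(sigmas)       # two-sided tail probability under N(mu, sigma)

print(df[sigmas > 2])                      # months more than 2 sigma from the mean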

ALTERNATIVES

If you cannot make such simple assumptions about how your data should look, you can always let a program infer them. This approach is a core feature of most machine learning algorithms, which can adapt nicely to strongly correlated and even skewed data if fine-tuned properly. If this is what you need, Python has many good libraries for that purpose, which can even fit in a small script (the one I know best is TensorFlow from Google).

In this case I would consider two different approaches, depending again on how your data looks:

  • Supervised learning: If you have a training set at your disposal that states which samples belong and which don't (known as labeled data), there are algorithms like the support vector machine that, although lightweight, can adapt to highly non-linear boundaries surprisingly well.

  • Unsupervised learning: This is probably what I would try first, when you simply have an unlabeled dataset. The "straightforward approach" I mentioned before is the simplest case of an anomaly detector, and it can be heavily tweaked and customized to also capture correlations in an arbitrarily large number of dimensions, thanks to the kernel trick (see the sketch right after this list). To understand the motivations and approach of an ML-based anomaly detector, I suggest taking a look at Andrew Ng's videos on the matter.
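As a minimal sketch of the unsupervised route, scikit-learn's OneClassSVM with an RBF kernel would look roughly like this (the estimator and its parameters are my illustrative choices, not something the question prescribes):

from sklearn.svm import OneClassSVM

X = df[['count']].values                  # a single feature: the monthly count

# nu roughly bounds the fraction of points treated as anomalies
clf = OneClassSVM(kernel='rbf', gamma='scale', nu=0.1)
labels = clf.fit_predict(X)               # +1 = inlier, -1 = outlier

print(df[labels == -1])                   # with only nine points this is purely illustrative

With a sample this small a kernel method is of course overkill; the value of this approach shows up when there are many, possibly correlated, dimensions.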

I hope it helps! Cheers

Upvotes: 3

bn2302

Reputation: 76

One way to filter outliers is the interquartile range (IQR, Wikipedia), which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).

Data points are treated as outliers if they fall below Q1 - k * IQR or above Q3 + k * IQR.

You can select the constant k based on your domain knowledge (a common choice is 1.5).

Given the data, a filter in pandas could look like this:

# 25% and 75% quantiles of the counts, kept as a one-row DataFrame
iqr_filter = pd.DataFrame(df["count"].quantile([0.25, 0.75])).T
# interquartile range and the lower/upper fences with k = 1.5
iqr_filter["iqr"] = iqr_filter[0.75] - iqr_filter[0.25]
iqr_filter["lo"] = iqr_filter[0.25] - 1.5*iqr_filter["iqr"]
iqr_filter["up"] = iqr_filter[0.75] + 1.5*iqr_filter["iqr"]
# keep only the rows whose count lies inside the fences
df_filtered = df.loc[(df["count"] > iqr_filter["lo"].iloc[0]) & (df["count"] < iqr_filter["up"].iloc[0]), :]

Upvotes: 2

piRSquared

Reputation: 294508

A very common approach to filtering outliers is to use the standard deviation. In this case, we calculate a z-score, which quickly tells us how many standard deviations away from the mean each observation is. We can then filter out observations that are more than 2 standard deviations from the mean. For normally distributed random variables, this should happen only about 5% of the time.

Define a zscore function

import numpy as np

def zscore(s):
    return (s - np.mean(s)) / np.std(s)

Apply it to the count column

zscore(df['count'])

0    0.414005
1    0.488906
2    0.416694
3    0.831981
4    0.256946
5    0.474624
6    0.336390
7   -0.576197
8   -2.643349
Name: count, dtype: float64

Notice that the September observation is 2.6 standard deviations away.

Use abs and gt to identify outliers

zscore(df['count']).abs().gt(2)

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
Name: count, dtype: bool

Again, September comes back true.

Tie it all together to filter your original dataframe

df[zscore(df['count']).abs().gt(2)]

[output: the filtered dataframe, containing only the September row]

Filter the other way

df[zscore(df['count']).abs().le(2)]

[output: the filtered dataframe, containing every month except September]

Upvotes: 5
