Krzysztof Słowiński
Krzysztof Słowiński

Reputation: 7207

Detecting outliers in a DataFrame column with small value changes in pandas

I am working with a column which values should have small changes between the rows. The values are physical measurements, and due to the environment factors, the measurement values can be incorrect, with a very high increments between the consecutive samples. Rate of change is a quantity that is considered as an input to the problem, as it can be changed to adapt to the precision needs of this outliers detection.

The detection method could either calculate the mean of the values seen so far and mark outliers as values that are above it by the given rate of change or check the value changes between the rows and mark the index value where the distance was greater than the rate of change and the index value where the values returned below the accepted rate of change with respect to the first value before the ones marked as outliers. The first approach could be harder, since the mean should be calculated from values that are correct, that is the values marked as outliers should not be considered into the calculation of the mean.

The correct solution should return the list of indices that indicate the outliers, which would be then used to set the corresponding values to f.e. NaN or use an interpolation method to fill in those values.

Example

df = pd.DataFrame({'small_changing': [5.14, 5.18, 5.22, 5.18, 5.20, 5.17, 5.25, 5.55, 5.62, 5.78, 6.21, 6.13, 5.71, 5.35, 5.29, 5.24, 5.16, 5.18, 5.20, 5.15, 5.17, 5.00, 4.96, 4.88, 4.71, 4.65, 4.73, 4.79, 4.89, 4.92, 5.05, 5.11, 5.14, 5.17, 5.22, 5.24, 5.18, 5.20]})

enter image description here

Assuming the rate of change of 0.15 there are two outliers groups to detect assuming the second approach of detection where the difference between the rows is taken into account.

The first group corresponds to the index values [7, 12], because the difference between the rows 6 and 7 is 0.3, which is higher than the 0.15 limit, and the difference between the rows 6 and 13 is 0.1, row 13 being the first row with the difference within the 0.15 limit.

The second group corresponds to the index values [21, 29], because the difference between the rows 20 and 21 is 0.17, which is higher than the 0.15 limit, and the difference between the rows 20 and 30 is 0.12, row 30 being the first row with the difference within the 0.15 limit.

Result for this example: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]

Upvotes: 0

Views: 503

Answers (2)

Sadidul Islam
Sadidul Islam

Reputation: 1348

This might save time on sparsely distributed outliers on a big dataset -

def df_outlier(df, threshold=0.15):
    column = df.columns[0]
    df["outlier"] = False
    df_difference = df.copy()
    df_difference["difference"] = abs(df[column] - df[column].shift(1)).shift(-1)
    df_difference = df_difference.loc[df_difference["difference"] > threshold]
    for index in df_difference.index:
        row = df.loc[index]
        if not row["outlier"]:
            df_check = df[index+1:].copy()
            df_check["a_difference"] = abs(df_check[column] - row[column])
            df_check.loc[df_check["a_difference"] > threshold, "outlier"] = True
            df.loc[((df.index >= df_check.index[0]) & (df.index < df_check["outlier"].ne(True).idxmax())), "outlier"] = True

    return list(df.loc[df["outlier"] == True].index)

I am using this.

Upvotes: 0

I hope it will be helpful.

I think it isn't pythonic, but it works:

def outlier_detection(points, limit):
    outliers_index = list()
    k=0
    for i in range(0,len(points)-1):
        if abs(points[i-k] - points[i+1]) >= limit:
            k+=1
            outliers_index.append(i+1)
        else:
            k=0
    return outliers_index

outlier_detection(df['small_changing'].values, 0.15)

OUT: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]

Upvotes: 3

Related Questions