Reputation: 7207
I am working with a column whose values should change only slightly between consecutive rows. The values are physical measurements, and due to environmental factors a measurement can be incorrect, producing very large jumps between consecutive samples. The rate of change is an input to the problem, so it can be tuned to the precision needs of this outlier detection.
The detection method could either compute the mean of the values seen so far and mark as outliers the values that exceed it by the given rate of change, or check the changes between consecutive rows, marking the index where the difference first exceeds the rate of change and the index where the values return to within the accepted rate of change of the last value before the marked outliers. The first approach could be harder, since the mean should be computed only from correct values, i.e. values already marked as outliers must be excluded from the mean.
The correct solution should return the list of indices that indicate the outliers, which would then be used to set the corresponding values to e.g. NaN, or to fill them in with an interpolation method.
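The first (running-mean) approach could be sketched as follows — a minimal sketch, assuming a simple online mean that skips values already marked as outliers (the name mean_based_outliers is hypothetical):

```python
def mean_based_outliers(values, rate):
    """Mark values deviating from the running mean of accepted values
    by more than `rate`; outliers are excluded from the mean."""
    outliers = []
    total, count = 0.0, 0
    for i, value in enumerate(values):
        if count and abs(value - total / count) > rate:
            outliers.append(i)  # too far from the mean of accepted values
        else:
            total += value      # accepted values update the running mean
            count += 1
    return outliers
```

Note that on the example data below this variant also flags index 13 (5.35 is still more than 0.15 away from the mean of the accepted values up to that point), so the two approaches can disagree — which illustrates why the question calls this one harder.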
Example
import pandas as pd

df = pd.DataFrame({'small_changing': [5.14, 5.18, 5.22, 5.18, 5.20, 5.17,
                                      5.25, 5.55, 5.62, 5.78, 6.21, 6.13,
                                      5.71, 5.35, 5.29, 5.24, 5.16, 5.18,
                                      5.20, 5.15, 5.17, 5.00, 4.96, 4.88,
                                      4.71, 4.65, 4.73, 4.79, 4.89, 4.92,
                                      5.05, 5.11, 5.14, 5.17, 5.22, 5.24,
                                      5.18, 5.20]})
Assuming a rate of change of 0.15, there are two outlier groups to detect with the second approach, where the difference between rows is taken into account.
The first group corresponds to the index values [7, 12], because the difference between rows 6 and 7 is 0.3, which is higher than the 0.15 limit, and the difference between rows 6 and 13 is 0.1, row 13 being the first row whose difference is back within the 0.15 limit.
The second group corresponds to the index values [21, 29], because the difference between rows 20 and 21 is 0.17, which is higher than the 0.15 limit, and the difference between rows 20 and 30 is 0.12, row 30 being the first row whose difference is back within the 0.15 limit.
Result for this example: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]
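Given such a list of indices, the cleanup described above could look like this (a sketch; outliers is the expected result from above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'small_changing': [5.14, 5.18, 5.22, 5.18, 5.20, 5.17,
                                      5.25, 5.55, 5.62, 5.78, 6.21, 6.13,
                                      5.71, 5.35, 5.29, 5.24, 5.16, 5.18,
                                      5.20, 5.15, 5.17, 5.00, 4.96, 4.88,
                                      4.71, 4.65, 4.73, 4.79, 4.89, 4.92,
                                      5.05, 5.11, 5.14, 5.17, 5.22, 5.24,
                                      5.18, 5.20]})

outliers = [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]

# Mask the outliers, then fill the gaps by linear interpolation
df.loc[outliers, 'small_changing'] = np.nan
df['small_changing'] = df['small_changing'].interpolate()
```

After this, the marked stretches are replaced by values interpolated linearly between the last good value before each group and the first good value after it.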
Upvotes: 0
Views: 503
Reputation: 1348
This might save time on sparsely distributed outliers in a big dataset:
def df_outlier(df, threshold=0.15):
    column = df.columns[0]
    df["outlier"] = False
    # Difference between consecutive rows; row i gets |x[i+1] - x[i]|
    df_difference = df.copy()
    df_difference["difference"] = abs(df[column] - df[column].shift(1)).shift(-1)
    # Only rows where a jump starts need to be inspected
    df_difference = df_difference.loc[df_difference["difference"] > threshold]
    for index in df_difference.index:
        row = df.loc[index]
        if not row["outlier"]:
            # Compare every following value against the last good value
            df_check = df[index + 1:].copy()
            df_check["a_difference"] = abs(df_check[column] - row[column])
            df_check.loc[df_check["a_difference"] > threshold, "outlier"] = True
            # Mark everything up to the first value back within the threshold
            df.loc[(df.index >= df_check.index[0])
                   & (df.index < df_check["outlier"].ne(True).idxmax()), "outlier"] = True
    return list(df.loc[df["outlier"] == True].index)
I am using this.
Upvotes: 0
Reputation: 1433
I hope it will be helpful. I don't think it is very pythonic, but it works:
def outlier_detection(points, limit):
    outliers_index = list()
    # k counts how many consecutive outliers precede position i + 1,
    # so points[i - k] is always the last accepted (non-outlier) value
    k = 0
    for i in range(0, len(points) - 1):
        if abs(points[i - k] - points[i + 1]) >= limit:
            k += 1
            outliers_index.append(i + 1)
        else:
            k = 0
    return outliers_index

outlier_detection(df['small_changing'].values, 0.15)
OUT: [7, 8, 9, 10, 11, 12, 21, 22, 23, 24, 25, 26, 27, 28, 29]
Upvotes: 3