Reputation: 4564

Python: Detect isolated edges in a histogram for outlier detection in time-series data

I am trying to detect outliers my own way: plot the histogram and search for isolated edges (bins) that have only a few counts and whose neighboring bins have zero counts. These usually sit at the far ends of the histogram and could be outliers, so I want to detect and drop them.

The data is time-series data coming from the field. Sometimes, when the sensors fail to communicate data in time, the data logger stores weird numbers: while normal sensor data is around 50-100, the outliers may be -10000 or 1000. They are momentary, may occur a few times in a year of data, and make up less than 1% of the total samples.

My code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained is:
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])

# Repeat the last count so that vals has the same length as edges
# (np.histogram returns one more edge than counts)
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace all zero counts with NaN so that empty bins are not considered
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Identify the isolated edges by looking at the number of samples, say, < 50
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# Plot the histogram
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()

Present output:

This output is not correct. There is only one isolated edge, at the beginning (edge value 0). However, my code also flagged the edges at about 43.67 and 46.58 as isolated just because they have low counts.

vedf = 

      edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     True
16  46.578640   11.0    True
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False

Expected output:

vedf = 

      edges     vals    IsolatedEdge?
0   0.000000    38.0    True
1   2.911165    NaN     False
2   5.822330    NaN     False
3   8.733495    NaN     False
4   11.644660   NaN     False
5   14.555825   NaN     False
6   17.466990   NaN     False
7   20.378155   NaN     False
8   23.289320   NaN     False
9   26.200485   NaN     False
10  29.111650   NaN     False
11  32.022815   NaN     False
12  34.933980   NaN     False
13  37.845145   NaN     False
14  40.756310   NaN     False
15  43.667475   1.0     False
16  46.578640   11.0    False
17  49.489805   126664.0    False
18  52.400970   13853.0     False
19  55.312135   4536.0  False
20  58.223300   4536.0  False

Once I know that a specific edge is isolated, I can drop all the samples that fall in that edge.

Upvotes: 0

Answers (2)

Mainland

Reputation: 4564

I am posting my complete solution here, following @Mark's suggestion:

# Index of normal edges: non-empty bins that have at least one non-empty neighbor
normal_edge_idx = vedf[~vedf['vals'].isna() & ~(vedf['vals'].shift(1).isna() & vedf['vals'].shift(-1).isna())].index
# Index of outlier edges: non-empty bins that are not normal edges
out_edge_idx = vedf[(~vedf.index.isin(normal_edge_idx)) & (~vedf['vals'].isna())].index
# Iterate through each outlier edge and drop the samples that fall inside it
# (the loop simply does nothing when no outlier edge was found)
for iso_idx in out_edge_idx:
    df1 = df1[~((df1[col] >= vedf['edges'].iloc[iso_idx]) & (df1[col] <= vedf['edges'].iloc[iso_idx + 1]))]
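
For readers who want to try the idea end to end, here is a minimal, self-contained sketch of the same steps (histogram, isolated-bin check, row filtering). The function name `drop_isolated_bins`, the `max_count` threshold, and the toy data are assumptions made for illustration; they are not part of the solution above, which does not apply a count threshold.

```python
import numpy as np
import pandas as pd

def drop_isolated_bins(df, col, bins=20, max_count=50):
    """Return df without the rows that fall into isolated histogram bins.

    A bin is treated as isolated when its count is non-zero but below
    max_count (an assumed threshold) and both neighbouring bins are empty.
    A missing neighbour (first or last bin) counts as empty.
    """
    counts, edges = np.histogram(df[col].dropna(), bins=bins)
    empty = counts == 0
    # Pad with True so the first and last bins see an "empty" outer neighbour
    padded = np.concatenate(([True], empty, [True]))
    isolated = (counts > 0) & (counts < max_count) & padded[:-2] & padded[2:]

    keep = pd.Series(True, index=df.index)
    for i in np.flatnonzero(isolated):
        # between() includes both edges, like the >= / <= filter above
        keep &= ~df[col].between(edges[i], edges[i + 1])
    return df[keep]

# Toy data (made up): three logger glitches at 0 among readings of 50 to 58.9
df = pd.DataFrame({"x": [0.0] * 3 + [50 + 0.1 * i for i in range(90)]})
cleaned = drop_isolated_bins(df, "x")
print(len(df), len(cleaned))  # → 93 90
```

Note that the result depends on the number of bins: an outlier that lands in a bin next to a populated bin is not flagged by this check.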

Impact of this solution before and after dropping the outliers:

[Figure: before detecting and filtering the outliers]

[Figure: after detecting and filtering the outliers]

Upvotes: 1

MuhammedYunus

Reputation: 5010

This approach uses a for loop. For each bin, it checks three criteria: (1) the current bin has a count > 0 and < 50, (2) the bin to the left is empty (or there is no left bin), and (3) the bin to the right is also empty (or there is no right bin). If all three conditions are met, it flags the current bin as isolated.

import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# The result obtained is:
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]

edges = [0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
         20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
         40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233]

plt.stem(edges[:-1], vals)
is_isolated = []
for bin_idx in range(len(vals)):
    # A missing neighbor (first or last bin) counts as empty
    left_empty = bin_idx == 0 or vals[bin_idx - 1] == 0
    right_empty = bin_idx == len(vals) - 1 or vals[bin_idx + 1] == 0

    # Isolated: a small non-zero count with empty bins on both sides
    is_isolated.append(0 < vals[bin_idx] < 50 and left_empty and right_empty)


vdef = pd.DataFrame({'vals': vals, 'edges': edges[:-1], 'is_isolated': is_isolated})
vdef
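
The same three checks can also be written without an explicit loop. The following is a minimal sketch using NumPy boolean arrays; the helper name `find_isolated_bins` and the `max_count` parameter are illustrative assumptions that mirror the `< 50` threshold used above.

```python
import numpy as np

def find_isolated_bins(vals, max_count=50):
    """Flag bins with a small non-zero count whose neighbours are both empty.

    A missing neighbour (first or last bin) is treated as empty.
    max_count is an assumed threshold, mirroring the `< 50` check above.
    """
    vals = np.asarray(vals)
    empty = vals == 0
    # Pad with True on both sides so edge bins see an "empty" outer neighbour
    padded = np.concatenate(([True], empty, [True]))
    left_empty = padded[:-2]    # emptiness of the bin to the left of each bin
    right_empty = padded[2:]    # emptiness of the bin to the right of each bin
    small = (vals > 0) & (vals < max_count)
    return small & left_empty & right_empty

vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
is_isolated = find_isolated_bins(vals)
print(np.flatnonzero(is_isolated))  # → [0]
```

Only the first bin is flagged; the bins with counts 1 and 11 are not, because each has at least one non-empty neighbor.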

Upvotes: 1
