Reputation: 4564
I am trying to detect outliers my own way. How? Plot the histogram and look for isolated edges (bins) that hold only a few counts and whose neighboring bins have zero counts. These usually sit at the far ends of the histogram and could be outliers, so I want to detect and drop them. What kind of data is it? Time series coming from the field. Sometimes you see weird numbers (sensor data is normally around 50-100, while outliers may be -10000 or 1000) when the sensors fail to communicate data in time and the data logger stores these values instead. They are momentary, occur only a few times in a year of data, and make up less than 1% of the total samples.
My code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = np.array([38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536])
edges = np.array([0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
                  20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
                  40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233])
# Repeat the last count once. Why: vals always has one element fewer than edges
vals = np.append(vals, vals[-1])
vedf = pd.DataFrame(data={'edges': edges, 'vals': vals})
# Replace all zero counts with NaN so that empty bins are not flagged
vedf['vals'] = vedf['vals'].replace(0, np.nan)
# Identify the isolated edges by looking at the number of samples, say, < 50
vedf['IsolatedEdge?'] = vedf['vals'] < 50
# plot histogram
plt.plot(vedf['edges'], vedf['vals'], 'o')
plt.show()
Present output:
This is not the correct output. Why? There is only one isolated edge, at the beginning (value 0). However, my code also flagged the edges at 43 and 46 as isolated, just because they have low counts.
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 True
16 46.578640 11.0 True
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Expected output:
vedf =
edges vals IsolatedEdge?
0 0.000000 38.0 True
1 2.911165 NaN False
2 5.822330 NaN False
3 8.733495 NaN False
4 11.644660 NaN False
5 14.555825 NaN False
6 17.466990 NaN False
7 20.378155 NaN False
8 23.289320 NaN False
9 26.200485 NaN False
10 29.111650 NaN False
11 32.022815 NaN False
12 34.933980 NaN False
13 37.845145 NaN False
14 40.756310 NaN False
15 43.667475 1.0 False
16 46.578640 11.0 False
17 49.489805 126664.0 False
18 52.400970 13853.0 False
19 55.312135 4536.0 False
20 58.223300 4536.0 False
Once I know that a specific edge is isolated, I can drop all the samples that fall in it.
Upvotes: 0
Views: 81
Reputation: 4564
I am posting my complete solution here, following @Mark's suggestion:
# index of normal edges: non-empty bins with at least one non-empty neighbor
normal_edge_idx = vedf[~vedf['vals'].isna() & ~(vedf['vals'].shift(1).isna() & vedf['vals'].shift(-1).isna())].index
# index of outlier edges: non-empty bins that are not normal edges
out_edge_idx = vedf[(~vedf.index.isin(normal_edge_idx)) & (~vedf['vals'].isna())].index
# check whether there is at least one outlier edge
if len(out_edge_idx) > 0:
    # iterate through each outlier edge and drop the samples that fall inside it
    for iso_idx in out_edge_idx:
        df1 = df1[~((df1[col] >= vedf['edges'].iloc[iso_idx]) & (df1[col] <= vedf['edges'].iloc[iso_idx + 1]))]
Impact of this solution before and after dropping the outliers:
Before detecting and filtering the outliers:
After detecting and filtering the outliers:
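As a possible alternative to the loop above, the rows can be dropped in one step by mapping every sample to its bin index. The sketch below is only an illustration: `drop_isolated_samples`, the `demo` frame, and the hand-written `is_isolated` mask are hypothetical names and data, not part of the code above. It assumes `edges` has the layout returned by `np.histogram`, that `is_isolated` holds one boolean per bin, and that the column contains no NaN values. Unlike the loop, which uses inclusive bounds on both sides of a bin, `np.digitize` assigns each sample to exactly one bin.

```python
import numpy as np
import pandas as pd

def drop_isolated_samples(df, col, edges, is_isolated):
    """Return df without the rows whose value in `col` falls in a flagged bin."""
    edges = np.asarray(edges)
    is_isolated = np.asarray(is_isolated, dtype=bool)
    # Using only the interior edges yields bin indices 0..n_bins-1,
    # with the last bin closed on the right, as in np.histogram.
    bin_idx = np.digitize(df[col].to_numpy(), edges[1:-1])
    return df[~is_isolated[bin_idx]]

# Hypothetical demo data: two stray readings near 0, the rest around 50-56.
demo = pd.DataFrame({'column': [0.5, 50.0, 51.2, 53.0, 1.0, 56.0]})
edges = np.linspace(0, 60, 21)          # 20 bins of width 3
is_isolated = np.zeros(20, dtype=bool)
is_isolated[0] = True                   # pretend the first bin was flagged

cleaned = drop_isolated_samples(demo, 'column', edges, is_isolated)
print(cleaned)
```

Here the two readings in the first bin are removed and the other four rows are kept.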
Upvotes: 1
Reputation: 5010
This approach uses a for loop. For each bin, it checks three criteria: (1) the current bin has a count > 0 and < 50, (2) the bin to the left is empty (or there is no left bin), and (3) the bin to the right is also empty (or there is no right bin). If all three conditions are met, it flags the current bin as isolated.
import pandas as pd
import matplotlib.pyplot as plt

# vals, edges = np.histogram(df['column'], bins=20)
# obtained result is
vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
edges = [0., 2.911165, 5.82233, 8.733495, 11.64466, 14.555825, 17.46699,
         20.378155, 23.28932, 26.200485, 29.11165, 32.022815, 34.93398, 37.845145,
         40.75631, 43.667475, 46.57864, 49.489805, 52.40097, 55.312135, 58.2233]
plt.stem(edges[:-1], vals)
is_isolated = []
for bin_idx in range(len(vals)):
    has_left_bin = bin_idx > 0
    has_right_bin = bin_idx < len(vals) - 1
    # a missing neighbor counts as empty
    left_empty = (not has_left_bin) or vals[bin_idx - 1] == 0
    right_empty = (not has_right_bin) or vals[bin_idx + 1] == 0
    # isolated: a few samples in this bin and nothing on either side
    if (0 < vals[bin_idx] < 50) and left_empty and right_empty:
        is_isolated.append(True)
    else:
        is_isolated.append(False)
vdef = pd.DataFrame({'vals': vals, 'edges': edges[:-1], 'is_isolated': is_isolated})
vdef
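The same three checks can also be written without an explicit loop. This is a minimal sketch, assuming NumPy is acceptable here; `isolated_bins` and `max_count` are hypothetical names introduced for illustration, and the counts are the ones from the question.

```python
import numpy as np

def isolated_bins(vals, max_count=50):
    """Flag bins that are non-empty, hold fewer than max_count samples,
    and have an empty (or missing) neighbor on both sides."""
    vals = np.asarray(vals)
    # Pad with one zero on each side so a missing neighbor counts as empty.
    padded = np.pad(vals, 1, constant_values=0)
    left_empty = padded[:-2] == 0
    right_empty = padded[2:] == 0
    return (vals > 0) & (vals < max_count) & left_empty & right_empty

vals = [38, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 11, 126664, 13853, 4536]
flags = isolated_bins(vals)
print(np.flatnonzero(flags))  # indices of the bins flagged as isolated
```

For these counts only the first bin is flagged. The bins with counts 1 and 11 are not, because each has a non-empty neighbor.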
Upvotes: 1