Reputation: 9
I am using Isolation Forest to identify anomalies in a very large DataFrame. The data is noisy, so I apply several filtering steps to smooth it so that the true anomalies stand out. I then call .diff() on the smoothed data to get a signal that stays flat and spikes when an anomaly occurs, and use Isolation Forest to flag those spikes.
My issue is that Isolation Forest flags each anomaly at the earliest point where it becomes detectable, but I need it flagged at the peak of the difference.
df["Ref Wt. Denoised"] = denoise(df["Ref Wt."].values, level=2)
df["Ref Wt. Savgol"] = apply_savgol_filter(df["Ref Wt. Denoised"], window_length=101, polyorder=3)
df["Ref Wt. Smoothed"] = df["Ref Wt. Savgol"].rolling(window=indexer).mean()
df["Ref Wt. Diff"] = df["Ref Wt. Smoothed"].diff(periods=300).fillna(0)
df["WOB Anomaly"] = detect_wob.predict(df["Ref Wt. Diff"].values.reshape(-1, 1))
df["WOB Zero Event"] = df["WOB Anomaly"] == -1
I have tried using .shift() to correct this, but a fixed manual offset works for some events and not for others. I really want to avoid changing the window size used for smoothing, because that severely affects accuracy.
[Image: the issue and the fix I'm looking for]
Upvotes: 0
Views: 62
Reputation: 347
If you can define a threshold, you could find the peaks first and then test only those peaks for anomalies:
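import numpy as np
from scipy.signal import find_peaks

# threshold: minimum spike height that counts as a peak; you need to choose it for your data
peaks, _ = find_peaks(df["Ref Wt. Diff"].to_numpy(), height=threshold)
peak_index = df.index[peaks]  # find_peaks returns positions, so map them to index labels
df["Peak Indicator"] = 0  # init the Peak Indicator column
df.loc[peak_index, "Peak Indicator"] = 1  # mark the peaks
peak_data = df[df["Peak Indicator"] == 1]
df["WOB Anomaly"] = np.nan  # init anomaly column; rows that are not peaks stay NaN
if not peak_data.empty:
    # run the fitted model on the peak rows only
    df.loc[peak_data.index, "WOB Anomaly"] = detect_wob.predict(peak_data["Ref Wt. Diff"].values.reshape(-1, 1))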
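To see the peak-marking step on its own, here is a minimal sketch on synthetic data. The signal, the offset index, and the threshold of 1.0 are all made up for illustration, and the Isolation Forest step is left out since it depends on your fitted model. Taking the absolute value first is my assumption that downward spikes matter too; drop it if you only care about positive ones.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Synthetic stand-in for df["Ref Wt. Diff"]: flat, with two smooth spikes.
n = 1000
x = np.arange(n)
diff = (
    5.0 * np.exp(-0.5 * ((x - 300) / 20.0) ** 2)    # upward spike centred at position 300
    - 4.0 * np.exp(-0.5 * ((x - 700) / 20.0) ** 2)  # downward spike centred at position 700
)

# Non-default index, to show why positions must be mapped to labels.
df = pd.DataFrame({"Ref Wt. Diff": diff}, index=x + 10_000)

threshold = 1.0  # hypothetical minimum spike height

# find_peaks only finds local maxima, so abs() lets one call catch both directions.
peaks, _ = find_peaks(df["Ref Wt. Diff"].abs().to_numpy(), height=threshold)

df["Peak Indicator"] = 0
df.loc[df.index[peaks], "Peak Indicator"] = 1  # positions -> index labels

peak_labels = df.index[df["Peak Indicator"] == 1].tolist()
print(peak_labels)  # [10300, 10700]
```

Each spike gets exactly one flag, at its maximum rather than at its onset. If your real signal still has ripple near the top of a spike, the distance or prominence arguments of find_peaks may help avoid several flags per event.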
Upvotes: 0