Reputation: 681
I'm working on anomaly detection using machine learning techniques. My specific challenge is a timeseries dataset with cumulative values: each data point is a running sum, so if the value is 100 today it might be 150 tomorrow, and it never decreases. My objective is to identify anomalous values within this dataset and pinpoint exactly when they occur.
Given the nature of timeseries data, I didn't split the data randomly; instead, I used the initial 80% of the data for training and the remainder for testing. Because the values are cumulative, every value in the test set is higher than any value in the training set; that is, the test values fall entirely outside the range seen during training.
To reach a solution, I planned to use two algorithms: IsolationForest and LOF (Local Outlier Factor). Unfortunately, the outcomes haven't met my expectations, mainly because all test values lie outside the training range. For example, if the training values run from 1 to 100, the test values run from 101 to 120. As a result, the prediction is a uniform -1 for every entry in the test set.
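To make the failure mode concrete, here is a minimal sketch (the ranges mirror the example above; the exact threshold behaviour depends on the contamination setting):

import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.arange(1, 101).reshape(-1, 1)   # training values 1..100
X_test = np.arange(101, 121).reshape(-1, 1)  # test values 101..120, all beyond the training range

clf = IsolationForest(random_state=42).fit(X_train)
print(clf.predict(X_test))  # values beyond the training range tend to be scored -1 (anomalous)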
It's worth mentioning that I'm unable to directly convert the cumulative values back to their per-period values. The conversion is feasible during development, but the situation changes when the model moves into production: there, I only have access to a single record at prediction time, and recovering the actual value would also require the previous record.
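For illustration, this is the conversion that is feasible in development but not in production (a minimal sketch with made-up values; diff() needs the previous row, which is exactly what production lacks):

import pandas as pd

df = pd.DataFrame({'value': [100, 150, 180, 260]})  # cumulative values
df['delta'] = df['value'].diff()  # per-period values; NaN for the first row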
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Generate example data (same as before)
np.random.seed(42)
num_entries = 100
timestamps = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
cumulative_values = np.sort(np.random.randint(0, 1000, num_entries))
cumulative_values_with_anomalies = cumulative_values.copy()
cumulative_values_with_anomalies[20] = 1500 # Introduce an anomaly (For training dataset)
cumulative_values_with_anomalies[88] = 2000 # Introduce another anomaly (For test dataset)
# Create DataFrame
data = {
'timestamp': timestamps,
'value': cumulative_values_with_anomalies
}
df = pd.DataFrame(data)
# Ground truth: indices where an anomaly was injected (they differ from the original sorted values)
anomaly_indices = np.where(cumulative_values_with_anomalies > cumulative_values)[0]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['value'], marker='.', label='Data')
plt.scatter(df['timestamp'][anomaly_indices], df['value'][anomaly_indices], color='red', label='Anomaly')
plt.title('Cumulative Values and Injected Anomalies')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# Manually set the cutoff index for training and test data
cutoff_index = int(len(df) * 0.8) # Use 80% of the data for training
train_df = df.iloc[:cutoff_index].copy()
test_df = df.iloc[cutoff_index:].copy()  # .copy() avoids SettingWithCopyWarning when adding the 'anomaly' column
# Reshape data for Isolation Forest
X_train = train_df['value'].values.reshape(-1, 1)
# Train Isolation Forest
clf = IsolationForest(contamination=0.1, random_state=42) # Adjust contamination based on your data
clf.fit(X_train)
# Predict anomalies on test data
X_test = test_df['value'].values.reshape(-1, 1)
test_df['anomaly'] = clf.predict(X_test)
anomalies = test_df[test_df['anomaly'] == -1]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(test_df['timestamp'], test_df['value'], marker='.', label='Data')
plt.scatter(anomalies['timestamp'], anomalies['value'], color='red', label='Anomaly')
plt.title('Cumulative Values and Anomalies (Isolation Forest)')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
The snippet above builds a DataFrame with timestamps and cumulative values, and injects anomalies into it for testing. As I mentioned earlier, I used the IsolationForest algorithm to find the anomalies, but it predicts every value in the test set as an anomaly.
Can anyone suggest potential solutions?
Upvotes: 0
Views: 439
Reputation: 5399
What you could try is detrending the data. Using the training data from your development environment, fit the trend line:
trend = LinearRegression().fit(train_df[['timestamp']].astype('int64'), train_df.value)
This learns the linear relationship between the timestamp (converted to a unix timestamp, i.e. nanoseconds since the epoch, via astype('int64')) and the value. We'll assume this trend continues into the future, so it also holds for the test data.
We can now remove the (learned) trend from both the training and test data:
train_df['value_detrended'] = train_df['value'] - trend.predict(train_df[['timestamp']].astype('int64'))
test_df['value_detrended'] = test_df['value'] - trend.predict(test_df[['timestamp']].astype('int64'))
Then we can proceed with the anomaly detection as usual. Below is your code, modified to include the trend fitting:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
# Generate example data (same as before)
np.random.seed(42)
num_entries = 100
timestamps = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
cumulative_values = np.sort(np.random.randint(0, 1000, num_entries))
cumulative_values_with_anomalies = cumulative_values.copy()
cumulative_values_with_anomalies[20] = 1500 # Introduce an anomaly (For training dataset)
cumulative_values_with_anomalies[88] = 2000 # Introduce another anomaly (For test dataset)
# Create DataFrame
data = {
'timestamp': timestamps,
'value': cumulative_values_with_anomalies
}
df = pd.DataFrame(data)
# Ground truth: indices where an anomaly was injected (they differ from the original sorted values)
anomaly_indices = np.where(cumulative_values_with_anomalies > cumulative_values)[0]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['value'], marker='.', label='Data')
plt.scatter(df['timestamp'][anomaly_indices], df['value'][anomaly_indices], color='red', label='Anomaly')
plt.title('Cumulative Values and Injected Anomalies')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# Manually set the cutoff index for training and test data
cutoff_index = int(len(df) * 0.8) # Use 80% of the data for training
train_df = df.iloc[:cutoff_index].copy()
test_df = df.iloc[cutoff_index:].copy()
trend = LinearRegression().fit(train_df[['timestamp']].astype('int64'), train_df.value)
train_df['value_detrended'] = train_df['value'] - trend.predict(train_df[['timestamp']].astype('int64'))
# Reshape data for Isolation Forest
X_train = train_df['value_detrended'].values.reshape(-1, 1)
# Train Isolation Forest
clf = IsolationForest(contamination=0.05, random_state=42) # Adjust contamination based on your data
clf.fit(X_train)
# Predict anomalies on test data
test_df['value_detrended'] = test_df['value'] - trend.predict(test_df[['timestamp']].astype('int64'))
X_test = test_df['value_detrended'].values.reshape(-1, 1)
test_df['anomaly'] = clf.predict(X_test)
anomalies = test_df[test_df['anomaly'] == -1]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(test_df['timestamp'], test_df['value_detrended'], marker='.', label='Data')
plt.scatter(anomalies['timestamp'], anomalies['value_detrended'], color='red', label='Anomaly')
plt.title('Detrended Values and Anomalies (Isolation Forest)')
plt.xlabel('Timestamp')
plt.ylabel('Detrended Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
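Note that this also works with your single-record constraint in production: detrending a new record only needs the fitted trend model (plus clf for scoring), not the previous record. A minimal sketch, with a made-up record:

# Hypothetical single record arriving in production
record = pd.DataFrame({'timestamp': [pd.Timestamp('2023-04-15')], 'value': [1950]})

# Detrend with the stored trend model, then score with the fitted Isolation Forest
detrended = record['value'] - trend.predict(record[['timestamp']].astype('int64'))
print(clf.predict(detrended.values.reshape(-1, 1)))  # -1 flags an anomaly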
In your example, the trend was indeed linear; in your real data it may not be. In that case, you can try fitting a non-linear trend line, for example as sketched below.
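A quadratic trend, for instance, can be fitted with a small pipeline and used as a drop-in replacement for the LinearRegression above (a sketch; degree 2 is an arbitrary choice, and rescaling the nanosecond timestamps first may help numerically):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Quadratic trend: timestamp -> [1, t, t^2] -> least-squares fit
trend = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
trend.fit(train_df[['timestamp']].astype('int64'), train_df['value'])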
Upvotes: 1