Reputation: 681
I'm working on anomaly detection using machine learning techniques. My specific challenge is a timeseries dataset with cumulative values: each data point is a running sum, so if the value is 100 today it might be 150 tomorrow, and it never decreases. My objective is to identify anomalous values within this dataset and pinpoint exactly when they occur.
Given the nature of timeseries data, I didn't split the data randomly; instead, I used the initial 80% of the data for training and the remainder for testing. Because the values are cumulative, every value in the test set is higher than any value in the training set; that is, the test values fall entirely outside the range seen during training.
To reach a solution, I planned to use two algorithms: IsolationForest and LOF (Local Outlier Factor). Unfortunately, the outcomes haven't met my expectations, mainly because all test values lie outside the training range. For example, if the training values run from 1 to 100, the test values run from 101 to 120. As a result, the prediction is a uniform -1 for every entry in the test set.
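To make the failure mode concrete, here is a minimal sketch (the ranges mirror the example above; the exact threshold behaviour depends on the contamination setting):

import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.arange(1, 101).reshape(-1, 1)   # training values 1..100
X_test = np.arange(101, 121).reshape(-1, 1)  # test values 101..120, all beyond the training range

clf = IsolationForest(random_state=42).fit(X_train)
print(clf.predict(X_test))  # values beyond the training range tend to be scored -1 (anomalous)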
It's worth mentioning that I'm unable to directly convert the cumulative values back to their per-period values. The conversion is feasible during development, but the situation changes when the model moves into production: there, I only have access to a single record at prediction time, and recovering the actual value would also require the previous record.
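For illustration, this is the conversion that is feasible in development but not in production (a minimal sketch with made-up values; diff() needs the previous row, which is exactly what production lacks):

import pandas as pd

df = pd.DataFrame({'value': [100, 150, 180, 260]})  # cumulative values
df['delta'] = df['value'].diff()  # per-period values; NaN for the first row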
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
# Generate example data (same as before)
np.random.seed(42)
num_entries = 100
timestamps = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
cumulative_values = np.sort(np.random.randint(0, 1000, num_entries))
cumulative_values_with_anomalies = cumulative_values.copy()
cumulative_values_with_anomalies[20] = 1500 # Introduce an anomaly (For training dataset)
cumulative_values_with_anomalies[88] = 2000 # Introduce another anomaly (For test dataset)
# Create DataFrame
data = {
'timestamp': timestamps,
'value': cumulative_values_with_anomalies
}
df = pd.DataFrame(data)
# Ground truth: indices where an anomaly was injected (they differ from the original sorted values)
anomaly_indices = np.where(cumulative_values_with_anomalies > cumulative_values)[0]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['value'], marker='.', label='Data')
plt.scatter(df['timestamp'][anomaly_indices], df['value'][anomaly_indices], color='red', label='Anomaly')
plt.title('Cumulative Values and Injected Anomalies')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# Manually set the cutoff index for training and test data
cutoff_index = int(len(df) * 0.8) # Use 80% of the data for training
train_df = df.iloc[:cutoff_index].copy()
test_df = df.iloc[cutoff_index:].copy()  # .copy() avoids SettingWithCopyWarning when adding the 'anomaly' column
# Reshape data for Isolation Forest
X_train = train_df['value'].values.reshape(-1, 1)
# Train Isolation Forest
clf = IsolationForest(contamination=0.1, random_state=42) # Adjust contamination based on your data
clf.fit(X_train)
# Predict anomalies on test data
X_test = test_df['value'].values.reshape(-1, 1)
test_df['anomaly'] = clf.predict(X_test)
anomalies = test_df[test_df['anomaly'] == -1]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(test_df['timestamp'], test_df['value'], marker='.', label='Data')
plt.scatter(anomalies['timestamp'], anomalies['value'], color='red', label='Anomaly')
plt.title('Cumulative Values and Anomalies (Isolation Forest)')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
The snippet above builds a DataFrame with timestamps and cumulative values, and injects anomalies into it for testing. As I mentioned earlier, I used the IsolationForest algorithm to find the anomalies, but it predicts every value in the test set as an anomaly.
Can anyone suggest potential solutions?
Upvotes: 0
Views: 439
Reputation: 5399
What you could try is detrending the data. Using the training data from your development environment, fit the trend line:
trend = LinearRegression().fit(train_df[['timestamp']].astype('int64'), train_df.value)
This learns the linear relationship between the timestamp (converted to a unix timestamp, i.e. nanoseconds since the epoch, via astype('int64')) and the value. We'll assume this trend continues into the future, so it also holds for the test data.
We can now remove the (learned) trend from both the training and test data:
train_df['value_detrended'] = train_df['value'] - trend.predict(train_df[['timestamp']].astype('int64'))
test_df['value_detrended'] = test_df['value'] - trend.predict(test_df[['timestamp']].astype('int64'))
Then we can proceed with the anomaly detection as usual. Below is your code, modified to include the trend fitting:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
# Generate example data (same as before)
np.random.seed(42)
num_entries = 100
timestamps = pd.date_range(start='2023-01-01', periods=num_entries, freq='D')
cumulative_values = np.sort(np.random.randint(0, 1000, num_entries))
cumulative_values_with_anomalies = cumulative_values.copy()
cumulative_values_with_anomalies[20] = 1500 # Introduce an anomaly (For training dataset)
cumulative_values_with_anomalies[88] = 2000 # Introduce another anomaly (For test dataset)
# Create DataFrame
data = {
'timestamp': timestamps,
'value': cumulative_values_with_anomalies
}
df = pd.DataFrame(data)
# Ground truth: indices where an anomaly was injected (they differ from the original sorted values)
anomaly_indices = np.where(cumulative_values_with_anomalies > cumulative_values)[0]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(df['timestamp'], df['value'], marker='.', label='Data')
plt.scatter(df['timestamp'][anomaly_indices], df['value'][anomaly_indices], color='red', label='Anomaly')
plt.title('Cumulative Values and Injected Anomalies')
plt.xlabel('Timestamp')
plt.ylabel('Cumulative Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
# Manually set the cutoff index for training and test data
cutoff_index = int(len(df) * 0.8) # Use 80% of the data for training
train_df = df.iloc[:cutoff_index].copy()
test_df = df.iloc[cutoff_index:].copy()
trend = LinearRegression().fit(train_df[['timestamp']].astype('int64'), train_df.value)
train_df['value_detrended'] = train_df['value'] - trend.predict(train_df[['timestamp']].astype('int64'))
# Reshape data for Isolation Forest
X_train = train_df['value_detrended'].values.reshape(-1, 1)
# Train Isolation Forest
clf = IsolationForest(contamination=0.05, random_state=42) # Adjust contamination based on your data
clf.fit(X_train)
# Predict anomalies on test data
test_df['value_detrended'] = test_df['value'] - trend.predict(test_df[['timestamp']].astype('int64'))
X_test = test_df['value_detrended'].values.reshape(-1, 1)
test_df['anomaly'] = clf.predict(X_test)
anomalies = test_df[test_df['anomaly'] == -1]
# Plot the data
plt.figure(figsize=(10, 6))
plt.plot(test_df['timestamp'], test_df['value_detrended'], marker='.', label='Data')
plt.scatter(anomalies['timestamp'], anomalies['value_detrended'], color='red', label='Anomaly')
plt.title('Detrended Values and Anomalies (Isolation Forest)')
plt.xlabel('Timestamp')
plt.ylabel('Detrended Value')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
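Note that this also works with your single-record constraint in production: detrending a new record only needs the fitted trend model (plus clf for scoring), not the previous record. A minimal sketch, with a made-up record:

# Hypothetical single record arriving in production
record = pd.DataFrame({'timestamp': [pd.Timestamp('2023-04-15')], 'value': [1950]})

# Detrend with the stored trend model, then score with the fitted Isolation Forest
detrended = record['value'] - trend.predict(record[['timestamp']].astype('int64'))
print(clf.predict(detrended.values.reshape(-1, 1)))  # -1 flags an anomaly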
In your example, the trend was indeed linear; in your real data it may not be. In that case, you can try fitting a non-linear trend line, for example as sketched below.
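A quadratic trend, for instance, can be fitted with a small pipeline and used as a drop-in replacement for the LinearRegression above (a sketch; degree 2 is an arbitrary choice, and rescaling the nanosecond timestamps first may help numerically):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Quadratic trend: timestamp -> [1, t, t^2] -> least-squares fit
trend = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
trend.fit(train_df[['timestamp']].astype('int64'), train_df['value'])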
Upvotes: 1