Reputation: 11
I have AWS CPU-utilization data from NAB, which I used to build anomaly detection with AWS SageMaker Random Cut Forest. I am able to execute it, but I need a deeper understanding of the hyperparameter tuning. I have gone through the AWS documentation, but I still need to understand how the hyperparameters are selected: are they an educated guess, or do we need to calculate the mean and standard deviation of the codisp scores in order to infer them?
Thanks in advance.
I have tried 100 trees and a tree_size of 512/256 to detect anomalies, but how do I infer these parameters?
import rrcf

# Set tree parameters
num_trees = 50
shingle_size = 48
tree_size = 512

# Create a forest of empty trees
forest = []
for _ in range(num_trees):
    tree = rrcf.RCTree()
    forest.append(tree)

# Use the "shingle" generator to create a rolling window
# (temp_data represents my aws_cpuutilization data)
points = rrcf.shingle(temp_data, size=shingle_size)

# Create a dict to store the anomaly score of each point
avg_codisp = {}

# For each shingle...
for index, point in enumerate(points):
    # For each tree in the forest...
    for tree in forest:
        # If tree is above permitted size, drop the oldest point (FIFO)
        if len(tree.leaves) > tree_size:
            tree.forget_point(index - tree_size)
        # Insert the new point into the tree
        tree.insert_point(point, index=index)
        # Compute codisp on the new point and take the average among all trees
        if index not in avg_codisp:
            avg_codisp[index] = 0
        avg_codisp[index] += tree.codisp(index) / num_trees

# Collect the averaged scores in order
values = list(avg_codisp.values())
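To illustrate the mean/standard-deviation idea from my question: one way to turn the averaged codisp scores into anomaly flags is a simple statistical threshold. This is only a sketch — it uses synthetic scores in place of avg_codisp, and the multiplier k = 3 is an assumption, not anything prescribed by rrcf:

```python
import numpy as np

# Synthetic stand-in for the avg_codisp values computed above.
rng = np.random.default_rng(0)
scores = rng.normal(loc=10.0, scale=2.0, size=500)
scores[100] = 30.0  # one injected outlier

# Flag points whose score exceeds mean + k * std (k = 3 is an assumption).
k = 3
threshold = scores.mean() + k * scores.std()
anomalies = np.flatnonzero(scores > threshold)
print(threshold)
print(anomalies)
```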
Upvotes: 1
Views: 474
Reputation: 181
Thanks for your interest in RandomCutForest. If you have labeled anomalies, we recommend using SageMaker Automatic Model Tuning (https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) and letting SageMaker find the combination that works best.
Heuristically, if you know that your data has, for example, 0.4% anomalies, you would set the number of samples per tree to N = 1 / (0.4 / 100) = 250. The idea behind this is that each tree represents a sample of your data, and each data point in a tree is considered "normal". If your trees have too few points, e.g. 10, then most points will look different from these "normal" ones, i.e. they will have a high anomaly score.
The relationship between the number of trees and the underlying data is more complex. As the range of "normal" points grows, you would want more trees.
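The samples-per-tree heuristic above can be written out directly. The variable name num_samples_per_tree mirrors the SageMaker RCF hyperparameter of that name; the 0.4% rate is the example from the answer:

```python
# Heuristic: each tree holds a sample of the data, and a tree should be
# large enough that an anomaly is rare within it. With a 0.4% anomaly
# rate, size each tree so roughly one point in it is anomalous.
anomaly_rate = 0.4 / 100
num_samples_per_tree = round(1 / anomaly_rate)
print(num_samples_per_tree)  # 250
```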
Upvotes: 2