Reputation: 141
We are an operational Data Warehouse. We have a Batch Control table that holds the following information
We are thinking of building an alerting mechanism that would basically raise an alert in any of the following anomalous situations(by analyzing past data):
What would be the best way to solve this problem. Is this a candidate for machine learning?
Upvotes: 1
Views: 251
Reputation: 21085
1) Start with human learning first.
Plot the time series of the processed files (row number as a function of the processing time).
Plot the distribution (or histogram) of the row numbers
Plot the distribution (or histogram) of the time interval between the processing
Plot the function of the processing time dependent on the row count.
Do you see some patterns in time series? Is there a time of a day resp. weekday periodicity? Do you see a trend?
Is the function of the processing time a continuous function or simple a scatter plot of points?
The clearly you see some pattern and thresholds, the easily you will be able to implement the checks.
2) collect the history of failures
Extend your metadata with the status of processing - OK or cause of failure. This will enable to draw your graphics in two colors (positive / negative cases) and help you to decide if you can draw threshold lines (curves) between them.
You will also be able to apply supervised learning algorithms
Upvotes: 0
Reputation: 51553
Is this a candidate for machine learning?
Not really. You have to store certain information about the files processed into the data warehouse.
A file did not arrive.
You have to have a schedule of files listing when the next file is due to arrive. If the file is more than 5 days late, the scheduler raises an alarm, sends an email, or whatever.
A file took more than the average time to process.
Keep the average processing time in your schedule of files. When the current processing time is more than 2 standard deviations from the average processing time, raise an alarm.
A file has very low or very high row row counts
Keep the average row count in your schedule of files. When the current row count is more than 2 standard deviations from the average row count, raise an alarm.
I chose 2 standard deviations from the mean as an exceptional case. Feel free to adjust this to be more or less sensitive to deviations.
Upvotes: 1