user3853906

Reputation: 9

Predicting from a highly skewed dataset

I would like to find the factors that contribute to a particular event happening. However, that event occurs only about 1% of the time: I have a class attribute called event_happened, whose value is 0 99% of the time and 1 only 1% of the time. Traditional data mining prediction techniques (decision trees, naive Bayes, etc.) don't seem to work in this case. Any suggestions as to how I should go about mining this dataset? Thanks.

Upvotes: 0

Views: 1201

Answers (3)

Has QUIT--Anony-Mousse

Reputation: 77475

This is an unbalanced classification problem.

I'm pretty sure I have seen some surveys and overview articles on methods that can handle unbalanced data well. You should research this term ("skew" is a bit broad, and may not get you the results you are looking for).
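To make the term concrete, here is a minimal sketch of one standard remedy for unbalanced classification: re-weighting the rare class so the classifier can't win by always predicting the majority label. The use of scikit-learn and the synthetic data are my assumptions, not part of the original answer.

```python
# A hedged sketch: class weighting on a ~1% positive class
# (scikit-learn is an assumption; the answer only names the problem).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positives, mimicking the question's skew.
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' scales the loss inversely to class frequency,
# so errors on the rare class cost ~99x more than errors on the common one.
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

Other techniques you will find under the same search term include oversampling the minority class (e.g. SMOTE) and undersampling the majority class.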

Upvotes: 0

Mateva

Reputation: 812

Let's say my attributes are hour_of_the_day, day_of_the_week, state, customer_age, customer_gender, etc., and I want to find out which of these factors contribute to my event occurring.

Based on this clarification, I believe you need classification, but your result will be the model itself.

So you perform, say, logistic regression, but your features are the data attributes themselves (some literature doesn't even distinguish features from attributes).

You have to somehow normalize this data, which can be tricky. I would go for boolean features (say hour_of_event==00, hour_of_event==01, hour_of_event==02, ...).

Then you apply a classification model and end up with a weight for each attribute. The attributes with the highest weights will be the factors that you need.
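As a hedged sketch of that last step: fit a logistic regression on boolean features and rank them by the magnitude of their coefficients. The feature names and the synthetic dependence on the first feature are illustrative assumptions, not from the original answer.

```python
# Rank boolean features by the size of their logistic-regression weights
# (feature names here are hypothetical examples from the question).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))  # three boolean features
# Make the event depend mostly on the first feature.
y = (X[:, 0] & (rng.random(1000) < 0.9)).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
names = ["hour_of_event_00", "day_of_the_week_Mon", "customer_gender_F"]
ranked = sorted(zip(names, clf.coef_[0]), key=lambda t: -abs(t[1]))
print(ranked[0][0])  # the attribute with the largest weight
```

One caveat: coefficient magnitudes are only comparable like this because all features are on the same 0/1 scale, which is another argument for the boolean encoding above.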

Upvotes: 0

Mateva

Reputation: 812

This is the typical description of an anomaly detection task, which defines its own group of algorithms:

In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.

And a statement about the possible approaches:

Three broad categories of anomaly detection techniques exist:

- Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set.
- Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection).
- Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learned model.

Which one you choose is a matter of personal taste.

These approaches will "learn" to find outlier events; the model that "predicts" them will then define the factors you are interested in.
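As a minimal sketch of the unsupervised route, here is an isolation-forest example. The choice of scikit-learn's IsolationForest and the synthetic data are my assumptions; the quote above only names the categories.

```python
# Unsupervised anomaly detection with an isolation forest
# (scikit-learn is an assumption, not part of the original answer).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(990, 2))    # ~99% "normal" events
outliers = rng.normal(6, 1, size=(10, 2))   # ~1% anomalous events
X = np.vstack([normal, outliers])

# contamination tells the model roughly what fraction is anomalous,
# which matches the ~1% rate in the question.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)                     # -1 = anomaly, 1 = normal
print((labels == -1).sum())                 # roughly 10 points flagged
```

Inspecting which attribute values the flagged instances share is then one way to get at the contributing factors.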

Upvotes: 3
