Reputation: 221
I have a very large dataset in a CSV file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data is a mix of numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed: the number of positive responses is low.
Now I am required to apply a regression model and any other machine learning algorithm to this data.
I'm new to this and I need help.
- How do I deal with categorical data in a regression model? And do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- Which program do you advise me to work with? I tried Weka, but it can't even open that much data (memory failure). I know that MATLAB can open either a numeric CSV or a categorical CSV, but not a mixed one; besides, the missing values have to be imputed before it will open the file. I know a little bit of R.
Thank you in advance for your help.
Upvotes: 2
Views: 1179
Reputation: 28492
First of all, you are talking about classification, not regression: classification predicts a value from a fixed set (e.g. 0 or 1), while regression produces a real-valued output (e.g. 0, 0.5, 10.1543, etc.). Also, don't be confused by so-called logistic regression: it is a classifier too, and its name just shows that it is based on linear regression (it passes the output of a linear model through a logistic function to get a class probability).
To process such a large amount of data you need an incremental (updatable) model. In Weka in particular, there are a number of such algorithms in the classification section (e.g. Naive Bayes Updatable, Neural Networks Updatable and others). With an incremental model you can load the data portion by portion and update the model accordingly (for Weka, see the Knowledge Flow interface for an easier way to do this).
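To make the "portion by portion" idea concrete, here is a rough sketch in plain Python rather than Weka. It assumes an all-numeric CSV with a header row and the class in the last column; the `OnlineLogisticRegression` class, the `read_chunks` helper and the tiny in-memory file are all made up for illustration, and a real library implementation would be a better choice in practice.

```python
import csv
import io
import math
from itertools import islice


class OnlineLogisticRegression:
    """Logistic regression trained by stochastic gradient descent, row by row."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, rows, labels):
        """Update the model with one chunk; earlier chunks are not kept."""
        for x, y in zip(rows, labels):
            error = self.predict_proba(x) - y
            self.bias -= self.learning_rate * error
            for i, xi in enumerate(x):
                if xi != 0.0:  # sparse rows: zero features need no update
                    self.weights[i] -= self.learning_rate * error * xi


def read_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows from an open CSV file object."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Tiny in-memory stand-in for the real file (hypothetical data).
demo_csv = io.StringIO(
    "f1,f2,label\n"
    "1,0,1\n0,1,0\n1,0,1\n0,1,0\n1,1,1\n0,0,0\n"
)

model = OnlineLogisticRegression(n_features=2)
for _ in range(200):  # many passes only because the demo file is tiny
    demo_csv.seek(0)
    for chunk in read_chunks(demo_csv, chunk_size=2):
        rows = [[float(v) for v in r[:-1]] for r in chunk]
        labels = [int(r[-1]) for r in chunk]
        model.partial_fit(rows, labels)
```

Only one chunk is held in memory at a time, so the file size stops being the limiting factor; with the real file you would pass `open(path, newline="")` instead of the `StringIO` object.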
Some classifiers can work with categorical data, but I can't recall any updatable ones among them, so most probably you will still need to transform categorical data to numeric. The standard solution here is to use indicator attributes, i.e. to substitute every categorical attribute with several binary indicators. E.g. if you have an attribute day-of-week with 7 possible values, you may substitute it with 7 binary attributes: Sunday, Monday, etc. Of course, in each particular instance only one of the 7 attributes may hold the value 1, and all the others have to be 0.
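As an illustration of indicator attributes, a minimal Python sketch for the day-of-week example might look like this. The `one_hot` helper and the `DAYS` list are names invented here, and returning all zeros for an unknown or missing value is just one possible convention, not something the approach prescribes.

```python
def one_hot(value, categories):
    """Return one 0/1 indicator per known category; at most one is 1."""
    return [1 if value == category else 0 for category in categories]


DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# "Monday" becomes seven binary attributes with a single 1.
encoded = one_hot("Monday", DAYS)

# A missing or unseen value maps to all zeros under this convention.
encoded_missing = one_hot("", DAYS)
```

Since most indicators are 0, the encoded data stays sparse, which fits the dataset described in the question.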
The importance of missing values depends on the nature of your data. Sometimes it is worth replacing them with some neutral value beforehand; sometimes the classifier implementation does this itself (check the manual for the algorithm for details).
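For a numeric column, one common choice of "neutral value" is the median of the observed entries. The sketch below assumes that choice and assumes missing cells are empty strings; both are assumptions for illustration, and whether the median is appropriate depends on your data.

```python
from statistics import median


def impute_median(column, missing=""):
    """Replace missing entries with the median of the observed values."""
    observed = [float(v) for v in column if v != missing]
    fill = median(observed)
    return [fill if v == missing else float(v) for v in column]


# Hypothetical column as read from a CSV, with two missing cells.
ages = ["20", "", "40", "30", ""]
filled = impute_median(ages)
```

For a file too large for memory, the fill value would have to be computed in a separate first pass (or estimated from a sample) before the rows are streamed to the model.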
And, finally, for highly skewed data use the F1 measure (or just precision and recall) instead of accuracy, since a classifier that always predicts the majority class already gets high accuracy.
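To see why accuracy is misleading here, the following small Python sketch computes precision, recall and F1 by hand. The label lists are invented for illustration (5 positives out of 100), and the `precision_recall_f1` helper is a hypothetical name, not a library function.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Skewed toy data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# A useless classifier that always answers 0 still looks good on accuracy.
always_negative = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
useless_scores = precision_recall_f1(y_true, always_negative)

# A classifier that finds 3 of the 5 positives and raises 1 false alarm.
some_hits = [1, 1, 1, 0, 0] + [1] + [0] * 94
useful_scores = precision_recall_f1(y_true, some_hits)
```

The always-negative classifier reaches 95% accuracy with an F1 of 0, while the second one is scored on how well it actually finds the rare positive class.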
Upvotes: 2