Regression Model for categorical data

Question

I have a very large dataset in a CSV file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data is a mix of numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed: the number of positive responses is low.

What I am required to do is apply a regression model, and any other machine learning algorithm, to this data.

I'm new to this and I need help.
- How do I deal with categorical data in a regression model? And do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- What program do you advise me to work with? I tried Weka, but it can't even open that much data (memory failure). I know that MATLAB can open either a numeric CSV or a categorical CSV, but not a mixed one; besides, the missing values have to be imputed before it will open the file. I know a little bit of R.

I'm trying to manipulate the data using Excel, Access and a Perl script, and that's really hard with this amount of data. Excel can't open more than about 1M records, and Access can't open more than 255 columns. Any suggestions?

Thank you in advance for your help.

Answers (1)
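
One possible approach to the three problems in the question (mixed categorical and numeric columns, missing values, and a file too large for memory) is to stream the CSV row by row and fit a logistic regression online, encoding each row with the hashing trick. The sketch below is only an illustration under stated assumptions, not a recommended final pipeline: it uses nothing but the Python standard library, and the names `N_BUCKETS`, `MISSING`, `hash_features`, `predict`, `train_one_pass`, the column name `label`, and the tiny inline dataset are all hypothetical. Categorical values become one-hot style indicator features, a missing value becomes its own indicator instead of being imputed, and `pos_weight` up-weights the rare positive class. Numeric columns are used as-is here, so on real data they would probably need scaling first.

```python
import csv
import io
import math
import zlib

N_BUCKETS = 2 ** 18                      # assumed size of the hashed weight vector
MISSING = {"", "NA", "NaN", "nan", "?"}  # assumed spellings of a missing value


def hash_features(row, label_col):
    """Turn one CSV row (a dict) into sparse {index: value} features."""
    feats = {0: 1.0}  # index 0 is reserved for the bias term
    for name, raw in row.items():
        if name is None or name == label_col:
            continue
        raw = (raw or "").strip()
        if raw in MISSING:
            # Missing values get their own indicator feature.
            key, val = name + "=<missing>", 1.0
        else:
            try:
                key, val = name, float(raw)          # numeric column
            except ValueError:
                key, val = name + "=" + raw, 1.0     # categorical level
        # crc32 is stable across runs, unlike the built-in hash() for str.
        idx = 1 + zlib.crc32(key.encode("utf-8")) % (N_BUCKETS - 1)
        feats[idx] = feats.get(idx, 0.0) + val
    return feats


def predict(weights, feats):
    """Logistic regression probability of the positive class."""
    z = sum(weights[i] * v for i, v in feats.items())
    z = max(-35.0, min(35.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_one_pass(weights, rows, label_col, pos_weight=1.0, lr=0.1):
    """One pass of stochastic gradient descent over an iterable of rows."""
    for row in rows:
        y = float(row[label_col])
        feats = hash_features(row, label_col)
        p = predict(weights, feats)
        # Errors on positive rows count pos_weight times as much.
        step = lr * (pos_weight if y == 1.0 else 1.0) * (y - p)
        for i, v in feats.items():
            weights[i] += step * v
    return weights


# Tiny made-up dataset standing in for the real file.
DEMO_CSV = """color,size,label
red,1.0,1
blue,0.2,0
,0.1,0
red,,1
blue,0.3,0
green,0.0,0
"""

weights = [0.0] * N_BUCKETS
for _ in range(50):
    # With a real file, open it here instead; only one row is held in memory.
    reader = csv.DictReader(io.StringIO(DEMO_CSV))
    train_one_pass(weights, reader, "label", pos_weight=2.0, lr=0.5)

p_red = predict(weights, hash_features({"color": "red", "size": "0.9"}, "label"))
p_blue = predict(weights, hash_features({"color": "blue", "size": "0.2"}, "label"))
print(p_red, p_blue)
```

For the real file, `io.StringIO(DEMO_CSV)` would be replaced by `open(path, newline="")`. Because rows are read one at a time, memory use is set by `N_BUCKETS` and not by the 1,700,000 rows. With so few positives, accuracy alone would be misleading, so a held-out set and a metric such as precision and recall on the positive class would be worth checking.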
