RPresle

Reputation: 2561

Preprocess large datafile with categorical and continuous features

First, thanks for reading this, and thank you in advance for any clue that helps me solve it.

As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me to improve the process and make it more professional.

My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.

In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example line:

"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97

I have around 900K lines for training and I run my tests on another 100K lines.

As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used by several classifiers.

I tried several things:

  1. LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
  2. OneHotEncoder: if I understand it well, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always runs into a MemoryError. Moreover, its input must be numerical, so everything has to be label-encoded first.
  3. StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
  4. FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem anyway, cheating by creating a new array containing the result of the transformation. I don't think it was built to work that way, and the result didn't really make sense.
  5. DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory (see the sketch after this list for roughly what I tried).
  6. partial_fit: this method is provided by only 5 classifiers. I would like to be able to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
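To be concrete, here is roughly the kind of pipeline I experimented with for points 3 and 5; the column names and file name are placeholders, not my real schema:

    import pandas as pd
    from scipy.sparse import hstack
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import StandardScaler

    # Placeholder column and file names, not my real schema.
    nominal_cols = ["brand", "model", "fuel", "month"]
    continuous_cols = ["price", "mileage"]
    df = pd.read_csv("train.csv", sep=";")

    # DictVectorizer one-hot encodes the string values and returns a
    # scipy sparse matrix by default.
    vectorizer = DictVectorizer(sparse=True)
    X_nominal = vectorizer.fit_transform(df[nominal_cols].to_dict(orient="records"))

    # Scale only the continuous block.
    scaler = StandardScaler()
    X_continuous = scaler.fit_transform(df[continuous_cols])

    # Stack both blocks into one sparse design matrix.
    X = hstack([X_nominal, X_continuous]).tocsr()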

I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.

I would like a way to encode all the nominal values so that they will not be treated as ordered, and that can be applied to large datasets with a lot of categories and limited resources.
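For instance, I suspect FeatureHasher could fit these constraints if I used it properly. This is the kind of usage I have in mind (column names are placeholders and I'm not sure this is the intended way):

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    nominal_cols = ["brand", "model", "fuel", "month"]  # placeholder names
    df = pd.read_csv("train.csv", sep=";")              # placeholder file

    # FeatureHasher hashes each "column=value" pair into one of
    # n_features buckets, so memory stays fixed no matter how many
    # distinct categories there are, and the output is sparse.
    hasher = FeatureHasher(n_features=2**18, input_type="dict")
    X_hashed = hasher.transform(df[nominal_cols].to_dict(orient="records"))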

Is there any way I didn't explore that can fit my needs?

Thanks for any clue and piece of advice.

Upvotes: 9

Views: 13254

Answers (1)

Erik

Reputation: 21

To convert unordered categorical features, you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
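A rough sketch of both options, assuming placeholder column names and a placeholder "target" label column:

    import pandas as pd
    from catboost import CatBoostClassifier

    df = pd.read_csv("train.csv", sep=";")  # placeholder file name

    # pandas one-hot encoding: each nominal column becomes a set of 0/1
    # columns; sparse=True keeps memory usage down with many categories.
    X = pd.get_dummies(df, columns=["brand", "model", "fuel", "month"], sparse=True)

    # CatBoost consumes the categorical columns as-is via cat_features.
    model = CatBoostClassifier(iterations=200, verbose=False)
    model.fit(
        df.drop(columns=["target"]),  # "target" is a placeholder label column
        df["target"],
        cat_features=["brand", "model", "fuel", "month"],
    )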

Upvotes: 1
