RPresle

Reputation: 2561

Preprocess large datafile with categorical and continuous features

First, thanks for reading this, and thank you in advance for any clue that helps me solve it.

As I'm new to Scikit-learn, don't hesitate to provide any advice that can help me to improve the process and make it more professional.

My goal is to classify data between two categories. I would like to find a solution that would give me the most precise result. At the moment, I'm still looking for the most suitable algorithm and data preprocessing.

In my data I have 24 values: 13 are nominal, 6 are binarized and the others are continuous. Here is an example line:

"RENAULT";"CLIO III";"CLIO III (2005-2010)";"Diesel";2010;"HOM";"_AAA";"_BBB";"_CC";0;668.77;3;"Fevrier";"_DDD";0;0;0;1;0;0;0;0;0;0;247.97

I have around 900K lines for training and I run my tests on another 100K lines.

As I want to compare several algorithm implementations, I wanted to encode all the nominal values so they can be used by several classifiers.

I tried several things:

  1. LabelEncoder: this was quite good, but it gives me ordered values that would be misinterpreted by the classifier.
  2. OneHotEncoder: if I understand it well, it is almost perfect for my needs because I can select the columns to binarize. But as I have a lot of nominal values, it always runs into a MemoryError. Moreover, its input must be numerical, so everything has to be label-encoded first.
  3. StandardScaler: this is quite useful, but not for what I need here. I decided to use it to scale my continuous values.
  4. FeatureHasher: at first I didn't understand what it does; then I saw that it is mainly used for text analysis. I tried to use it for my problem anyway, cheating by creating a new array containing the result of the transformation. I don't think it was built to work that way, and the result didn't really make sense.
  5. DictVectorizer: could be useful, but it looks like OneHotEncoder and puts even more data in memory (see the sketch after this list for roughly what I tried).
  6. partial_fit: this method is provided by only 5 classifiers. I would like to be able to use at least Perceptron, KNearest and RandomForest, so it doesn't match my needs.
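To be concrete, here is roughly the kind of pipeline I experimented with for points 3 and 5; the column names and file name are placeholders, not my real schema:

    import pandas as pd
    from scipy.sparse import hstack
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import StandardScaler

    # Placeholder column and file names, not my real schema.
    nominal_cols = ["brand", "model", "fuel", "month"]
    continuous_cols = ["price", "mileage"]
    df = pd.read_csv("train.csv", sep=";")

    # DictVectorizer one-hot encodes the string values and returns a
    # scipy sparse matrix by default.
    vectorizer = DictVectorizer(sparse=True)
    X_nominal = vectorizer.fit_transform(df[nominal_cols].to_dict(orient="records"))

    # Scale only the continuous block.
    scaler = StandardScaler()
    X_continuous = scaler.fit_transform(df[continuous_cols])

    # Stack both blocks into one sparse design matrix.
    X = hstack([X_nominal, X_continuous]).tocsr()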

I looked at the documentation and found this information on the Preprocessing and Feature Extraction pages.

I would like a way to encode all the nominal values so that they will not be treated as ordered, and that can be applied to large datasets with a lot of categories and limited resources.
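For instance, I suspect FeatureHasher could fit these constraints if I used it properly. This is the kind of usage I have in mind (column names are placeholders and I'm not sure this is the intended way):

    import pandas as pd
    from sklearn.feature_extraction import FeatureHasher

    nominal_cols = ["brand", "model", "fuel", "month"]  # placeholder names
    df = pd.read_csv("train.csv", sep=";")              # placeholder file

    # FeatureHasher hashes each "column=value" pair into one of
    # n_features buckets, so memory stays fixed no matter how many
    # distinct categories there are, and the output is sparse.
    hasher = FeatureHasher(n_features=2**18, input_type="dict")
    X_hashed = hasher.transform(df[nominal_cols].to_dict(orient="records"))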

Is there any way I didn't explore that can fit my needs?

Thanks for any clue and piece of advice.

Upvotes: 9

Views: 13254

Answers (1)

Erik

Reputation: 21

To convert unordered categorical features, you can try get_dummies in pandas; see its documentation for more details. Another way is to use catboost, which can handle categorical features directly without transforming them into a numerical type.
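A rough sketch of both options, assuming placeholder column names and a placeholder "target" label column:

    import pandas as pd
    from catboost import CatBoostClassifier

    df = pd.read_csv("train.csv", sep=";")  # placeholder file name

    # pandas one-hot encoding: each nominal column becomes a set of 0/1
    # columns; sparse=True keeps memory usage down with many categories.
    X = pd.get_dummies(df, columns=["brand", "model", "fuel", "month"], sparse=True)

    # CatBoost consumes the categorical columns as-is via cat_features.
    model = CatBoostClassifier(iterations=200, verbose=False)
    model.fit(
        df.drop(columns=["target"]),  # "target" is a placeholder label column
        df["target"],
        cat_features=["brand", "model", "fuel", "month"],
    )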

Upvotes: 1
