Reputation: 57
I am working on a multi-class classification problem, segmenting customers into 3 classes based on their purchasing behavior and demographics. I cannot disclose the data set in detail, but broadly it contains around 300 features and 50,000 rows. I have tried the following methods, but I am unable to achieve accuracy above 50%:
Is there anything else I can try to improve performance (F-score, precision, and recall)?
Upvotes: 0
Views: 6037
Reputation: 4496
Try tuning the parameters below.
n_estimators: this is the number of trees you build before taking the majority vote or the average of the predictions. A higher number of trees gives you better performance but makes your code slower. Choose as high a value as your processor can handle, because this makes your predictions stronger and more stable. Since your data set is fairly large, each fit will take more time, but it is worth trying.
max_features: this is the maximum number of features the random forest is allowed to try at an individual split. There are multiple ways to set max_features in Python (scikit-learn). A few of them are:
None: simply take all the features in every tree; here we put no restriction on the individual trees. (Note that in scikit-learn, the legacy "auto" setting is equivalent to "sqrt" for classifiers, not to using all features.)
sqrt: take the square root of the total number of features at each split. For instance, if there are 100 variables in total, only 10 of them are considered at each split. "log2" is another, similar option for max_features.
0.2: this allows the random forest to consider 20% of the variables at each split. You can pass any fraction between 0 and 1 to use that share of the features.
min_samples_leaf: a leaf is the end node of a decision tree, and a smaller leaf makes the model more prone to capturing noise in the training data. You can start with a minimum value like 75 and gradually increase it, checking which value gives the highest accuracy. A tuning sketch covering all three parameters follows.
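A minimal sketch of what that tuning could look like in scikit-learn (synthetic data stands in for the undisclosed set; the grid values are illustrative starting points, not recommendations):

```python
# Grid search over the three random forest parameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the asker's ~300-feature, 3-class data set.
X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [200, 500, 1000],       # more trees: stronger but slower
    "max_features": ["sqrt", "log2", 0.2],  # features tried at each split
    "min_samples_leaf": [25, 75, 150],      # larger leaves: less noise fitting
}

search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1 matches the multi-class F-score goal
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring with "f1_macro" optimizes the macro-averaged F1, which lines up with the question's goal of improving F-score, precision, and recall across the three classes.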
Upvotes: 2
Reputation: 217
Try doing feature selection first, using PCA or a random forest, and then fit a chained classifier: first a one-vs-rest wrapper, then a random forest or a decision tree. You should get slightly better accuracy.
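An illustrative sketch of that idea (the selection threshold and estimator choices are assumptions, not the answerer's exact setup; the data is synthetic):

```python
# Random forest feature selection, then a one-vs-rest chained classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    # Keep only features whose forest importance exceeds the (default) mean.
    SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0)),
    # One binary decision tree per class, combined one-vs-rest.
    OneVsRestClassifier(DecisionTreeClassifier(max_depth=10, random_state=0)),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```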
Upvotes: 1
Reputation: 5355
How is your training accuracy? I assume the accuracy you report is on validation data. If your training accuracy is way too high, some ordinary overfitting might be the case, although random forests normally handle overfitting very well.
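A quick way to run that check (a sketch on synthetic stand-in data):

```python
# Overfitting check: compare training vs. validation accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)
print("train:     ", rf.score(X_tr, y_tr))
print("validation:", rf.score(X_val, y_val))
```

A large gap between the two scores points to overfitting.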
What you could try is PCA on your data, and then classifying on that. PCA gives you the components that account for most of the variation in the data, so it can be a good idea to try if you cannot classify on the original data (and it also reduces your feature count).
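For example (a sketch; the 95%-variance cut-off is an arbitrary choice):

```python
# Standardize, project onto principal components, then classify.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data

pca_clf = make_pipeline(
    StandardScaler(),        # PCA is scale-sensitive, so standardize first
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
)
pca_clf.fit(X, y)
print("retained components:", pca_clf.named_steps["pca"].n_components_)
```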
A side note: remember that fitting an SVM is at least quadratic in the number of points. Reducing your data to around 10,000-20,000 rows for tuning the parameters, and then fitting an SVM on the full data set with the optimal parameters found on the subset, might also speed up the process. Also remember to try different kernels for the SVM.
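A sketch of that subsample-then-refit idea (synthetic stand-in data; the parameter grid is a placeholder):

```python
# Tune the SVM on a ~10,000-row subsample, then train once on the full
# set with the winning parameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=50_000, n_features=300,
                           n_informative=30, n_classes=3,
                           random_state=0)  # stand-in data

rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=10_000, replace=False)  # random subsample

grid = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10]},  # try several kernels
    cv=3,
)
grid.fit(X[idx], y[idx])  # tune on the cheap subsample only

final_svm = SVC(**grid.best_params_).fit(X, y)  # one expensive fit on all rows
print("best params:", grid.best_params_)
```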
Upvotes: 1