Reputation: 57
I am working on a multi-class classification problem, segmenting customers into 3 classes based on their purchasing behavior and demographics. I cannot disclose the data set in detail, but broadly it contains around 300 features and 50,000 rows. I have tried the following methods, but I am unable to achieve accuracy above 50%:
Is there anything else I can try to improve performance (F-score, precision, and recall)?
Upvotes: 0
Views: 6037
Reputation: 4496
Try tuning the parameters below.
n_estimators: this is the number of trees you build before taking the majority vote or the average of the predictions. A higher number of trees gives you better performance but makes your code slower. Choose as high a value as your processor can handle, because this makes your predictions stronger and more stable. Since your data set is fairly large, each fit will take more time, but it is worth trying.
max_features: this is the maximum number of features the random forest is allowed to try at an individual split. There are multiple ways to set max_features in Python (scikit-learn). A few of them are:
None: simply take all the features in every tree; here we put no restriction on the individual trees. (Note that in scikit-learn, the legacy "auto" setting is equivalent to "sqrt" for classifiers, not to using all features.)
sqrt: take the square root of the total number of features at each split. For instance, if there are 100 variables in total, only 10 of them are considered at each split. "log2" is another, similar option for max_features.
0.2: this allows the random forest to consider 20% of the variables at each split. You can pass any fraction between 0 and 1 to use that share of the features.
min_samples_leaf: a leaf is the end node of a decision tree, and a smaller leaf makes the model more prone to capturing noise in the training data. You can start with a minimum value like 75 and gradually increase it, checking which value gives the highest accuracy. A tuning sketch covering all three parameters follows.
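A minimal sketch of what that tuning could look like in scikit-learn (synthetic data stands in for the undisclosed set; the grid values are illustrative starting points, not recommendations):

```python
# Grid search over the three random forest parameters discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the asker's ~300-feature, 3-class data set.
X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [200, 500, 1000],       # more trees: stronger but slower
    "max_features": ["sqrt", "log2", 0.2],  # features tried at each split
    "min_samples_leaf": [25, 75, 150],      # larger leaves: less noise fitting
}

search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring="f1_macro",  # macro F1 matches the multi-class F-score goal
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Scoring with "f1_macro" optimizes the macro-averaged F1, which lines up with the question's goal of improving F-score, precision, and recall across the three classes.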
Upvotes: 2
Reputation: 217
Try doing feature selection first, using PCA or a random forest, and then fit a chained classifier: first a one-vs-rest wrapper, then a random forest or a decision tree. You should get slightly better accuracy.
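An illustrative sketch of that idea (the selection threshold and estimator choices are assumptions, not the answerer's exact setup; the data is synthetic):

```python
# Random forest feature selection, then a one-vs-rest chained classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    # Keep only features whose forest importance exceeds the (default) mean.
    SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0)),
    # One binary decision tree per class, combined one-vs-rest.
    OneVsRestClassifier(DecisionTreeClassifier(max_depth=10, random_state=0)),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```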
Upvotes: 1
Reputation: 5355
How is your training accuracy? I assume the accuracy you report is on validation data. If your training accuracy is way too high, some ordinary overfitting might be the case, although random forests normally handle overfitting very well.
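A quick way to run that check (a sketch on synthetic stand-in data):

```python
# Overfitting check: compare training vs. validation accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)
print("train:     ", rf.score(X_tr, y_tr))
print("validation:", rf.score(X_val, y_val))
```

A large gap between the two scores points to overfitting.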
What you could try is PCA on your data, and then classifying on that. PCA gives you the components that account for most of the variation in the data, so it can be a good idea to try if you cannot classify on the original data (and it also reduces your feature count).
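For example (a sketch; the 95%-variance cut-off is an arbitrary choice):

```python
# Standardize, project onto principal components, then classify.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=300, n_informative=30,
                           n_classes=3, random_state=0)  # stand-in data

pca_clf = make_pipeline(
    StandardScaler(),        # PCA is scale-sensitive, so standardize first
    PCA(n_components=0.95),  # keep enough components for 95% of the variance
    RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0),
)
pca_clf.fit(X, y)
print("retained components:", pca_clf.named_steps["pca"].n_components_)
```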
A side note: remember that fitting an SVM is at least quadratic in the number of points. Reducing your data to around 10,000-20,000 rows for tuning the parameters, and then fitting an SVM on the full data set with the optimal parameters found on the subset, might also speed up the process. Also remember to try different kernels for the SVM.
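A sketch of that subsample-then-refit idea (synthetic stand-in data; the parameter grid is a placeholder):

```python
# Tune the SVM on a ~10,000-row subsample, then train once on the full
# set with the winning parameters.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=50_000, n_features=300,
                           n_informative=30, n_classes=3,
                           random_state=0)  # stand-in data

rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=10_000, replace=False)  # random subsample

grid = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "linear"], "C": [0.1, 1, 10]},  # try several kernels
    cv=3,
)
grid.fit(X[idx], y[idx])  # tune on the cheap subsample only

final_svm = SVC(**grid.best_params_).fit(X, y)  # one expensive fit on all rows
print("best params:", grid.best_params_)
```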
Upvotes: 1