Reputation: 131
I am working on a multiclass text classification problem, and I have built a gradient boosting model for it.
About the dataset:
The dataset has two columns: "Test_name" and "Description"
There are six labels in the "Test_name" column, with their corresponding descriptions in the "Description" column.
Create a word vector for the description.
Build a corpus using the word vector.
Apply pre-processing such as removing numbers, extra whitespace, and stopwords, and converting to lower case.
Build a document term matrix (dtm).
Remove sparse words from the above dtm.
The above step leads to a count frequency matrix showing the frequency of each word in its corresponding column.
Transform the count frequency matrix into a binary instance matrix, which records each word in a document as 1 if present and 0 if absent.
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels.
Using the h2o package, build a GBM model.
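The steps above, from the document term matrix to the labelled binary matrix, can be sketched in base R as follows. This is only an illustration under stated assumptions: the three rows of `notes` are made up, tokenisation is a plain whitespace split, and the names `notes`, `counts`, `binary`, and `model_df` are hypothetical. The actual pipeline presumably used a text-mining package and also removed sparse terms.

```r
# Toy stand-in for the real data: the column names follow the question,
# the rows are invented for illustration only.
notes <- data.frame(
  Test_name   = c("Cytology Test", "Diagnostic Imaging", "Cytology Test"),
  Description = c("smear smear sample", "chest xray image", "sample biopsy"),
  stringsAsFactors = FALSE
)

# Lower-case and split on whitespace (the real pipeline also strips
# numbers and stopwords).
tokens <- strsplit(tolower(notes$Description), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Count frequency matrix: one row per document, one column per term.
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Binary instance matrix: 1 if the term occurs in the document, 0 otherwise.
binary <- (counts > 0) * 1L

# Append the label as a factor so a classifier treats it as categorical.
model_df <- data.frame(binary, label = factor(notes$Test_name),
                       check.names = FALSE)
```

A data frame like `model_df` could then be handed to h2o with `as.h2o(model_df)` before training.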
Four of the class labels are classified well, but the remaining two are poorly classified.
Below is the output:
Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix: (Extract with `h2o.confusionMatrix(<model>, train = TRUE)`)
Body Fluid Analysis = 401 / 2,759
Cytology Test = 182 / 1,087
Diagnostic Imaging = 117 / 3,907
Doctors Advice = 32 / 752
Organ Function Test = 461 / 463
Patient Related = 101 / 113
Totals = 1,294 / 9,081
The misclassification errors for Organ Function Test (461 / 463) and Patient Related (101 / 113) are far higher than for the other classes. How can I fix this?
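To make the pattern in the output easier to see, the per-class error rates and class shares can be recomputed from the error counts listed above. This is a small base R sketch; the variable names are hypothetical, and the numbers are copied from the reported output. It shows that the two poorly classified labels make up only about 5% and 1% of the training rows, and that the unweighted mean of the per-class rates reproduces the reported Mean Per-Class Error.

```r
# Misclassified counts and row totals copied from the output above.
errors <- c("Body Fluid Analysis" = 401, "Cytology Test" = 182,
            "Diagnostic Imaging" = 117, "Doctors Advice" = 32,
            "Organ Function Test" = 461, "Patient Related" = 101)
totals <- c(2759, 1087, 3907, 752, 463, 113)

per_class_error <- errors / totals        # error rate within each class
class_share     <- totals / sum(totals)   # fraction of training rows per class

print(round(cbind(error = per_class_error, share = class_share), 4))

# Unweighted average over classes, so small classes count as much as large ones.
mean_per_class_error <- mean(per_class_error)

# Overall error weights each class by its size.
overall_error <- sum(errors) / sum(totals)
```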
Upvotes: 0
Views: 1749
Reputation: 466
Just some quick things you can do to improve this:
Run a grid search over the GBM hyperparameters with `h2o.grid` (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html).
If you provide more details and a working example, there is more that can be done to help you.
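As a rough sketch of what such a grid search might look like in R: the search space below is a placeholder rather than a recommendation, and the label column name `"label"`, the grid id `"gbm_grid"`, and the helper `run_gbm_grid` are assumptions made for illustration. The helper needs the h2o package and a running cluster, and it ranks models by mean per-class error so that the small classes weigh as much as the large ones.

```r
# Hypothetical search space; the values are placeholders, not tuned advice.
hyper_params <- list(
  max_depth  = c(3, 5, 7),
  learn_rate = c(0.05, 0.1),
  ntrees     = c(50, 100)
)

# Number of models a full Cartesian search over this space would train.
n_models <- nrow(expand.grid(hyper_params))

# Assumed helper: `train` is an H2OFrame whose label column is named `label`.
run_gbm_grid <- function(train, label = "label") {
  predictors <- setdiff(colnames(train), label)
  h2o::h2o.grid(
    algorithm = "gbm", grid_id = "gbm_grid",
    x = predictors, y = label,
    training_frame = train,
    nfolds = 5, seed = 1,
    hyper_params = hyper_params
  )
  # Return the grid sorted so the lowest mean per-class error comes first.
  h2o::h2o.getGrid("gbm_grid", sort_by = "mean_per_class_error",
                   decreasing = FALSE)
}
```

After `h2o.init()`, it could be called as `grid <- run_gbm_grid(h2o.getFrame("train"))`, assuming the training frame from the question is still in the cluster.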
Upvotes: 0