Multiclass text classification using R

Question

I am working on a multiclass text classification problem and have built a gradient boosting model for it.

About the dataset:

The dataset has two columns: "Test_name" and "Description"

The "Test_name" column contains six labels, and the "Description" column holds the corresponding description for each row.

My approach towards the problem

DATA PREPARATION

Create a word vector for the description.
Build a corpus using the word vector.
Pre-process the corpus: remove numbers, extra whitespace and stopwords, and convert to lower case.
Build a document term matrix (dtm).
Remove sparse terms from the above dtm.
The above step leads to a count frequency matrix showing the frequency of each word in its corresponding column.
Transform the count frequency matrix into a binary instance matrix, which records each word in a document as 1 if present and 0 if absent.
Append the label column from the original notes dataset to the transformed dtm. The label column has 6 labels.
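
The steps above can be sketched in base R as follows. This is only a minimal illustration, not the code used for the results below: the toy `notes` data frame, the short stopword list, the `tokenize` helper and the `min_docs` threshold are all assumptions made for the example. A text-mining package would normally handle these steps, and its sparse-term removal may use a different criterion than the simple document count used here.

```r
# Toy data standing in for the real dataset (hypothetical values).
notes <- data.frame(
  Test_name   = c("Cytology Test", "Diagnostic Imaging",
                  "Diagnostic Imaging", "Cytology Test"),
  Description = c("Pap smear of the cervix, 2 slides",
                  "MRI scan of the chest",
                  "CT scan of the chest and 1 abdomen",
                  "Pap smear  repeat"),
  stringsAsFactors = FALSE
)

# Tiny illustrative stopword list (an assumption; real lists are much longer).
stopwords_small <- c("of", "the", "and", "a")

# Pre-processing: lower case, replace digits and punctuation with spaces,
# split on runs of spaces, then drop empty tokens and stopwords.
tokenize <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[nchar(tokens) > 0 & !(tokens %in% stopwords_small)]
}
tokens_per_doc <- lapply(notes$Description, tokenize)

# Document term matrix of raw counts: one row per document, one column per term.
vocab <- sort(unique(unlist(tokens_per_doc)))
counts <- vapply(
  tokens_per_doc,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
)
dtm <- t(counts)
colnames(dtm) <- vocab

# Remove sparse terms: keep terms found in at least `min_docs` documents
# (the threshold is an assumption chosen for this toy example).
min_docs <- 2
doc_freq <- colSums(dtm > 0)
dtm_kept <- dtm[, doc_freq >= min_docs, drop = FALSE]

# Binary instance matrix: 1 if the term occurs in the document, 0 otherwise.
binary <- (dtm_kept > 0) * 1L

# Append the label column to the transformed matrix.
prepared <- data.frame(binary, label = factor(notes$Test_name))
print(prepared)
```

With the toy data, only the terms that occur in at least two descriptions survive the sparse-term filter, so `prepared` has one 0/1 column per surviving term plus the `label` column.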

Model Building

Build a GBM model using the h2o package.

Results obtained

Four of the six class labels are classified well, but the remaining two are classified poorly.

Below is the output:

Extract training frame with `h2o.getFrame("train")`
MSE: (Extract with `h2o.mse`) 0.1197392
RMSE: (Extract with `h2o.rmse`) 0.3460335
Logloss: (Extract with `h2o.logloss`) 0.3245868
Mean Per-Class Error: 0.3791268
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)

Body Fluid Analysis =   401 / 2,759
Cytology Test       =   182 / 1,087
Diagnostic Imaging  =   117 / 3,907
Doctors Advice      =    32 /   752
Organ Function Test =   461 /   463
Patient Related     =   101 /   113
Totals              = 1,294 / 9,081
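
As a sanity check, the per-class error rates implied by these figures can be recomputed in a few lines of base R, reading each row as misclassified rows / total rows for that class. The vector names `errors`, `totals` and `error_rate` are made up for this illustration; the numbers are copied from the output above.

```r
# Misclassified rows and total rows per class, copied from the output above.
errors <- c("Body Fluid Analysis" = 401, "Cytology Test" = 182,
            "Diagnostic Imaging" = 117, "Doctors Advice" = 32,
            "Organ Function Test" = 461, "Patient Related" = 101)
totals <- c("Body Fluid Analysis" = 2759, "Cytology Test" = 1087,
            "Diagnostic Imaging" = 3907, "Doctors Advice" = 752,
            "Organ Function Test" = 463, "Patient Related" = 113)

# Error rate for each class.
error_rate <- errors / totals
print(round(error_rate, 4))

# Unweighted mean of the per-class error rates; this should agree with the
# reported Mean Per-Class Error (about 0.379).
print(mean(error_rate))

# Overall error rate, weighted by class size (1,294 / 9,081).
print(sum(errors) / sum(totals))
```

The overall error rate is dominated by the four large classes, which is why it looks much better than the mean per-class error: the two smallest classes (463 and 113 rows) carry error rates of roughly 99.6% and 89.4%.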

The misclassification errors for Organ Function Test and Patient Related are far higher than for the other classes (461 of 463 and 101 of 113 rows misclassified, respectively). How can I fix this?
