wahyu ramadani

Reputation: 1

Working with document term matrix in xgboost

I am working on sentiment analysis in R. I have already built a model with naive Bayes, but I want to try another one, xgboost. I ran into a problem when building the xgboost model because I don't know what to do with my document term matrix in xgboost. Can anyone give me a solution?

I tried converting the document term matrix to a data frame, but that doesn't seem to work.

The code below shows how my current train and test data are built:

library(tm)
library(magrittr)  # provides the %>% pipe used below
dtm.tf <- VCorpus(VectorSource(results$text)) %>%
  DocumentTermMatrix()

#split 80:20   
all.data <- dtm.tf
train.data <- dtm.tf[1:312,]
test.data <- dtm.tf[313:390,]

And I have an xgboost template from another data set:

# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-11]), 
                     label = training_set$Exited, nrounds = 10)

# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)

# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)

I want to use the xgboost template above to build my model with my current train and test data. What do I have to do?

Upvotes: 0

Views: 233

Answers (1)

phiver

Reputation: 23608

You need to transform the document term matrix into a sparse matrix. In your case that can be done with the sparseMatrix function from the Matrix package (a recommended package that ships with standard R installations):

sparse_matrix_tf <- Matrix::sparseMatrix(i = dtm.tf$i, j = dtm.tf$j, x = dtm.tf$v,
                                         dims = c(dtm.tf$nrow, dtm.tf$ncol))
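To make the conversion less opaque, here is a minimal sketch of what the i, j and v fields mean. The `toy` list is a hypothetical stand-in for a document term matrix (it is not from the question); it only mimics the triplet fields that the call above reads.

```r
library(Matrix)

# Hypothetical stand-in for a document term matrix: 3 documents, 4 terms,
# stored as triplets (row index i, column index j, count v).
toy <- list(i = c(1, 1, 2, 3, 3),
            j = c(1, 3, 2, 1, 4),
            v = c(2, 1, 1, 1, 3),
            nrow = 3, ncol = 4)

# Same call shape as above: each (i, j) position receives the value v
m <- sparseMatrix(i = toy$i, j = toy$j, x = toy$v,
                  dims = c(toy$nrow, toy$ncol))

class(m)      # a column-compressed sparse matrix class from Matrix
as.matrix(m)  # dense view, reasonable only because this toy example is tiny
```

Only the non-zero counts are stored, which is why this form suits a document term matrix where most entries are zero.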

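Then you can feed this to xgboost and use the label from dtm.tf. Note that xgboost expects a numeric label vector with one entry per row of the matrix; the document names in dtm.tf$dimnames$Docs are used here as a stand-in, so substitute your own sentiment labels.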

classifier = xgboost(data = sparse_matrix_tf, 
                     label = dtm.tf$dimnames$Docs,
                     nrounds = 10)

Complete reproducible example below. I leave the splitting into 80 / 20 to you.

library(tm)
library(xgboost)

data("crude")
crude <- as.VCorpus(crude)
dtm.tf <- DocumentTermMatrix(crude)

sparse_matrix_tf <- Matrix::sparseMatrix(i = dtm.tf$i, j = dtm.tf$j, x = dtm.tf$v,
                                         dims = c(dtm.tf$nrow, dtm.tf$ncol))

classifier = xgboost(data = sparse_matrix_tf, 
                     label = dtm.tf$dimnames$Docs,
                     nrounds = 10)
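For the 80/20 split and the prediction step, a rough sketch could look like the following. The texts and the 0/1 `labels` vector are hypothetical toy data, not taken from the question, and the sketch uses xgb.DMatrix with xgb.train instead of the xgboost() wrapper, on the assumption that this pair behaves the same across package versions. Treat it as one possible layout, not the definitive way to do it.

```r
library(tm)
library(Matrix)
library(xgboost)

# Hypothetical toy data: short texts with made-up 0/1 sentiment labels
texts  <- c("good great film", "bad awful plot", "great acting good fun",
            "awful boring bad", "good fun", "boring bad film",
            "great great good", "awful awful plot", "fun good acting",
            "bad boring")
labels <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Convert the triplet representation to a sparse matrix, keeping term names
sparse_dtm <- sparseMatrix(i = dtm$i, j = dtm$j, x = as.numeric(dtm$v),
                           dims = c(dtm$nrow, dtm$ncol),
                           dimnames = dimnames(dtm))

# 80/20 split by row position, as in the question; splitting after the
# conversion keeps the same term columns in both parts
n_train   <- floor(0.8 * nrow(sparse_dtm))
train_idx <- seq_len(n_train)
test_idx  <- setdiff(seq_len(nrow(sparse_dtm)), train_idx)

dtrain <- xgb.DMatrix(data = sparse_dtm[train_idx, ], label = labels[train_idx])
dtest  <- xgb.DMatrix(data = sparse_dtm[test_idx, ],  label = labels[test_idx])

# Binary classifier; the objective is an assumption for 0/1 sentiment labels
model <- xgb.train(params = list(objective = "binary:logistic"),
                   data = dtrain, nrounds = 10)

# Predict probabilities on the held-out rows and threshold at 0.5
y_pred <- predict(model, dtest) >= 0.5

# Confusion matrix: actual labels against predicted classes
cm <- table(labels[test_idx], y_pred)
```

With your own data, replace `texts` and `labels` with your text column and its numeric sentiment labels, in the same row order.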

Upvotes: 0
