Matt

Reputation: 85

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to build a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article from its text.

I'm attempting this with Quanteda and have the following code:

library(quanteda)

bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)

It seems to work smoothly until predict(), which gives:

Error in newdata %*% log.lik : 
  requires numeric/complex matrix/vector arguments

Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!

Here is a link to the dataset.

Upvotes: 2

Views: 1118

Answers (1)

Adam Obeng

Reputation: 1542

As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:

library("quanteda")

text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)

all_classes <- docvars(bbc_corpus)$category
# Hide the labels of the test documents (rows 1781 onward), keeping 1:1780 for training
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
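If the replace() step is unclear, here is a minimal base R sketch of what it does. The labels and the training size (toy_classes, n_train) are made up for illustration and are not taken from the BBC data:

```r
# Hypothetical toy labels standing in for docvars(bbc_corpus)$category
toy_classes <- c("business", "sport", "tech", "sport", "business")
n_train <- 3  # assumed size of the training portion

# Set the labels after the training rows to NA, then convert to a factor
toy_train <- factor(replace(toy_classes, (n_train + 1):length(toy_classes), NA))

print(toy_train)
print(sum(is.na(toy_train)))  # number of documents held out as test data
```

The documents whose label is NA are the ones treated as unlabelled test data; the rest are what the model is trained on.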

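The error in your code comes from passing testclass, a factor of labels, as the second argument to predict. That argument is newdata and it needs to be a dfm, because it gets matrix-multiplied with the log-likelihoods. You don't even need to specify it. If you don't, predict will use the whole original dfm: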

bbc_pred <- predict(bbcNb)
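As a rough illustration of why a factor fails where a numeric matrix is expected, the following base R sketch multiplies a factor by a matrix and catches the resulting error. The objects here (labels, log_lik, counts) are hypothetical stand-ins, not quanteda internals:

```r
# Hypothetical stand-ins: a factor of labels and a small numeric matrix
labels  <- factor(c("business", "sport"))
log_lik <- matrix(c(-1.2, -0.7, -0.3, -2.1), nrow = 2)

# %*% needs numeric (or complex) arguments, so a factor raises an error
result <- tryCatch(labels %*% log_lik, error = function(e) e)
print(inherits(result, "error"))

# A numeric matrix of the right shape multiplies without complaint
counts <- matrix(c(1, 0, 2, 3), nrow = 2)
print(counts %*% log_lik)
```

The same reasoning applies to predict: whatever is passed as newdata has to be numeric, document-by-feature data, not the class labels.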

Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:

library(caret)

confusionMatrix(
    bbc_pred$docs$predicted[1781:2225],
    all_classes[1781:2225]
)
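If you would rather not pull in caret, a plain cross-tabulation gives the core of the same information. This is only a sketch with made-up predictions and labels (predicted, actual); with the real data you would pass the two vectors from the confusionMatrix call above instead:

```r
# Hypothetical predicted and true labels for five held-out documents
predicted <- factor(c("sport", "other", "sport", "other", "sport"),
                    levels = c("other", "sport"))
actual    <- factor(c("sport", "other", "other", "other", "sport"),
                    levels = c("other", "sport"))

# Cross-tabulate predictions against the true labels
conf_tab <- table(predicted = predicted, actual = actual)
print(conf_tab)

# Accuracy is the share of documents on the diagonal of the table
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
print(accuracy)
```

Giving both factors the same levels keeps the table square, so diag() lines up predicted and actual classes correctly.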

However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:

docvars(bbc_corpus)$category <- factor(
    ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)

(note that this must be done before you extract all_classes from bbc_corpus above).

Upvotes: 4
