Reputation: 85
I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to build a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
Upvotes: 2
Views: 1118
Reputation: 1542
As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
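To see what the replace() call does to the labels, here is a minimal sketch in base R on a made-up ten-element label vector. The labels and the 8/2 split point (n_train) are illustrative assumptions, not taken from the BBC data; the idea is that documents whose label is NA are the ones held out of training.

```r
# Toy labels standing in for docvars(bbc_corpus)$category (made-up values)
all_classes <- c("business", "sport", "tech", "sport", "business",
                 "tech", "sport", "business", "tech", "sport")

# Hypothetical split point; the BBC example above uses 1780
n_train <- 8

# Set every label after the split point to NA so that only the
# first n_train documents carry a class label
trainclass <- factor(replace(all_classes, (n_train + 1):length(all_classes), NA))

print(trainclass)
print(sum(is.na(trainclass)))  # number of held-out documents
```

The vector keeps its full length, so it still lines up row for row with the dfm.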
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
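If you'd rather not pull in caret, a rough equivalent can be sketched with base R's table(). The predicted and actual vectors below are made-up stand-ins for the held-out predictions and true labels, not real output from the model:

```r
# Made-up predicted and true labels for five held-out documents
lvls <- c("other", "sport")
predicted <- factor(c("sport", "other", "sport", "other", "other"), levels = lvls)
actual    <- factor(c("sport", "other", "other", "other", "sport"), levels = lvls)

# Cross-tabulate predictions against the true labels
conf_mat <- table(Predicted = predicted, Actual = actual)

# Accuracy is the share of documents on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Giving both factors the same levels keeps the table square even if one class never appears in the predictions.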
However, as @ken-benoit noted, there is a bug in quanteda that prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(Note that this must be done before you extract all_classes from bbc_corpus above.)
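As a quick check of what the binarization produces, here is the same ifelse() pattern applied to a small made-up category vector (the values are illustrative only):

```r
# Made-up categories standing in for docvars(bbc_corpus)$category
category <- c("sport", "business", "tech", "sport", "politics")

# Collapse everything that is not 'sport' into a single 'other' class
binary <- factor(ifelse(category == "sport", "sport", "other"))

print(table(binary))
```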
Upvotes: 4