Vale Baia

Reputation: 168

Quanteda: error message while tokenizing "unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’"

I have been trying to tokenise and clean my 400 txt documents before fitting a structural topic model (STM). I want to remove punctuation, stopwords, symbols, etc. However, I get the following error message: "Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’". This is my original code:

answers2 <- tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
   remove_symbols = TRUE, remove_separators = TRUE,
   remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
   ngrams = 1L, verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")

I also tried to tokenise a simple string, to check whether the problem was an encoding issue introduced while importing my txt files, but I got the same error message. When I tried to tokenise the text directly, without converting it to a corpus, I got two further errors: "Error: Unable to locate Ciao bella ciao" and "Error: No language specified!". Here is my example code, in case someone wants to reproduce the error:

prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- tokens(prova2_corpus)
prova2_tok <- tokens(prova_corpus)

The loaded packages are: data.table, ggplot2, quanteda, readtext, stm, stringi, stringr, tm, textstem. Any suggestions on how I could tokenise and clean my texts?

Upvotes: 1

Views: 725

Answers (1)

Vale Baia

Reputation: 168

After several attempts, I managed to find a solution. When several text analysis/topic modelling packages are loaded in the same R session, functions named "tokens" from different packages can mask one another, so a bare tokens() call may not reach quanteda's version. You need to force the call to use quanteda's "tokens", i.e. quanteda::tokens(answers). Here is the updated code:

# Call quanteda's tokens() explicitly so that a same-named function
# from another loaded package cannot be picked up instead.
answers2 <- quanteda::tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
   remove_symbols = TRUE, remove_separators = TRUE,
   remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
   verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")

And the updated example code too:

prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- quanteda::tokens(prova_corpus)
prova2_tok <- quanteda::tokens(prova2_corpus)
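
If you want to confirm that masking is the cause before changing your code, base R can show which attached packages define a function called "tokens". The following is only a minimal sketch, assuming quanteda is installed. The alias at the end is an optional workaround of my own, not something quanteda requires:

```r
library(quanteda)

# Every attached package or environment that has an object called "tokens",
# in search-path order. A bare tokens() call uses the first entry.
find("tokens")

# All names defined more than once on the search path, grouped by location.
conflicts(detail = TRUE)

# Optional workaround (an assumption, not part of the original fix):
# bind quanteda's function in the global environment, which is searched first,
# so that later bare tokens() calls resolve to quanteda.
tokens <- quanteda::tokens
prova_tok <- tokens(quanteda::corpus("Ciao bella ciao"))
```

If find("tokens") lists another package ahead of "package:quanteda", that package's function is the one being called. Loading quanteda after the other packages, or prefixing the call with quanteda:: as above, avoids the clash.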

Upvotes: 1
