Reputation: 168
I have been trying to tokenise and clean my 400 txt documents before running structural topic modelling (STM). I want to remove punctuation, stopwords, symbols, etc. However, I get the following error message: "Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’". This is my original code:
answers2 <- tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
ngrams = 1L, verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
I also tried to tokenise a simple text string, just to check whether it was an encoding problem introduced while importing my txt files, but I got the same error message. When I tried to tokenise the text directly, without converting it to a corpus, I got two further errors: "Error: Unable to locate Ciao bella ciao" and "Error: No language specified!". Here is my example code in case someone wants to replicate the error message:
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- tokens(prova2_corpus)
prova2_tok <- tokens(prova_corpus)
The loaded packages are: data.table, ggplot2, quanteda, readtext, stm, stringi, stringr, tm, textstem. Any suggestions on how I could proceed to tokenise and clean my texts?
Upvotes: 1
Views: 725
Reputation: 168
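After several attempts, I managed to find a solution. When several text analysis/topic modelling packages are loaded in the same R session, more than one of them can define a function called "tokens", and the one loaded later masks the other. The call was therefore not reaching quanteda's "tokens" at all. You need to force R to use quanteda's version by prefixing the namespace, i.e. quanteda::tokens(answers_corpus). Here is the updated code: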
answers2 <- quanteda::tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE,
remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
And the updated example code too:
library(quanteda)
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
# Call tokens() through the quanteda namespace so no other package can mask it
prova_tok <- quanteda::tokens(prova_corpus)
prova2_tok <- quanteda::tokens(prova2_corpus)
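If you want to confirm that masking is the cause before changing your code, the following is a minimal sketch using base R's find(), which lists every attached package that defines an object with a given name. The variable name tokens_sources is just for illustration, and the sketch assumes only quanteda is needed to run it. In a session with the other packages from the question attached, I would expect more than one entry to show up, although I have not checked which package is responsible.

```r
library(quanteda)

# List every attached package or environment that defines an object named "tokens".
# Entries follow search-path order, so the first one is what a bare tokens() call uses.
tokens_sources <- find("tokens")
print(tokens_sources)

# Calling through the namespace bypasses the search path entirely,
# so the result does not depend on the order in which packages were loaded.
prova_tok <- quanteda::tokens(corpus("Ciao bella ciao"), remove_punct = TRUE)
print(prova_tok)
```

If "package:quanteda" is not the first entry in the printed list, a bare tokens() call is dispatched to another package, which would explain the errors above.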
Upvotes: 1