katdataecon

Reputation: 185

Remove two stopword lists with the quanteda package in R

I'm working with the quanteda package on a corpus built from a data frame, and here is the basic code I use:

library(quanteda)

fmsi_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

However, I have another stopwords list stored as a data frame, called stpw, that I'd like to take into account.

I tried:

fmsi_des <- dfm(corpus_des, remove=stopwords("spanish","stpw"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

Error in stopwords("spanish", "stpw") : unused argument ("stpw")

Then I created a combined vector with the "spanish" stopwords plus the stopwords from stpw:

all_stops <- c("bogota","vias","medellin","valle","departamento",stopwords("spanish"))

fmsi_des <- dfm(corpus_des, remove=stopwords("all_stops"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)

Error in stopwords("all_stops") : no stopwords available for 'all_stops'

I also created a txt file with my stopwords and tried this:

library(tm)

stopwords = readLines('stpw.txt') 
x  = fd$contract_description        
x  =  removeWords(x,stopwords)

des <- subset(x, !is.na(x))
corpus_des <- corpus(des$fd.contract_description)
fmsi_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
                remove_punct=TRUE, remove_numbers=TRUE)
     

Warning message: In readLines("stp.txt") : Incomplete final line found in 'stpw.txt'

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : incorrect regular expression '(*UCP)\b(bogota|vias|medellin|valle|departamento|+)\b' In addition : Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'nothing to repeat' at '+)\b'

Upvotes: 0

Views: 452

Answers (1)

Ken Benoit

Reputation: 14902

This is a case where knowing the value of return objects in R is the key to obtaining the result you want. Specifically, you need to know what stopwords() returns, as well as what it expects as its first argument.

stopwords(language = "sp") returns a character vector of Spanish stopwords, using the default source = "snowball" list. (See ?stopwords for full details.)
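To make the point about return values concrete, here is a minimal sketch (assuming quanteda is installed; the names sw and res are only illustrative). It checks what stopwords() gives back and shows why passing the name of your own vector as a string fails: the first argument is interpreted as a language, not as an R object name.

```r
library(quanteda)

# stopwords() returns an ordinary character vector
sw <- stopwords("spanish")
is.character(sw)
head(sw)

# The first argument is a language, so the name of an R object passed as a
# string is not a valid value; capture the error message instead of stopping
res <- tryCatch(stopwords("all_stops"),
                error = function(e) conditionMessage(e))
res
```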

So if you want to remove the default Spanish list plus your own words, you concatenate the returned character vector with additional elements. This is what you have done in creating all_stops.

So to remove all_stops -- and here, using the quanteda v3 suggested usage -- you simply do the following:

fmsi_des <- corpus_des %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
    tokens_remove(pattern = all_stops) %>%
    dfm()
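Since the question mentions that the extra stopwords live in a data frame called stpw, here is a hedged sketch of the same approach starting from that data frame. The column name word and the one-document toy corpus are assumptions made for illustration; substitute your own column and corpus. The pipe is written out as separate steps so the example does not depend on magrittr being attached.

```r
library(quanteda)

# Hypothetical stand-in for the asker's data frame; the column name
# `word` is an assumption
stpw <- data.frame(
    word = c("bogota", "vias", "medellin", "valle", "departamento"),
    stringsAsFactors = FALSE
)

# Both pieces are character vectors, so c() combines them
all_stops <- unique(c(stopwords("spanish"), stpw$word))

# Toy corpus for illustration only
corpus_des <- corpus(c(d1 = "Mantenimiento de las vias del departamento en 2019."))

# Tokenise, drop the combined stopwords, then build the dfm
toks <- tokens(corpus_des, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, pattern = all_stops)
fmsi_des <- dfm(toks)

featnames(fmsi_des)
```

Note that all_stops is passed directly as the pattern, without wrapping it in stopwords() again.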

Upvotes: 1
