Benjamin Levy

Reputation: 343

Removing words from a corpus of documents with a tailored list of words

The tm package lets the user 'prune' the words and punctuation in a corpus of documents: tm_map(corpusDocs, removeWords, stopwords("english"))

Is there a way to supply tm_map with a tailored list of words that is read in from a CSV file and used in place of stopwords("english")?
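(For context: removeWords accepts any character vector, so a list read from a file can be passed to it directly. A minimal sketch, where the file name myStopwords.csv and its header-less one-word-per-line layout are assumptions:)

```r
library(tm)

# Assumed layout of myStopwords.csv: one word per line, no header.
# myWords <- read.csv("myStopwords.csv", header = FALSE,
#                     stringsAsFactors = FALSE)[[1]]
myWords <- c("this", "sample")  # stand-in for the CSV contents

corpus <- Corpus(VectorSource("this is a sample document"))
corpus <- tm_map(corpus, removeWords, myWords)
content(corpus[[1]])  # removed words leave extra whitespace behind
```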

Thank you.

BSL

Upvotes: 0

Views: 198

Answers (2)

Vezir

Reputation: 101

Let's take a file (wordMappings):

"from"|"to"
###Words######
"this"|"ThIs"
"is"|"Is"
"a"|"A"
"sample"|"SamPle"

First, removal of the words:

readFile <- function(fileName, separator) {
  read.csv(paste0("data\\", fileName, ".txt"),
           sep = separator,
           quote = "\"",
           comment.char = "#",
           blank.lines.skip = TRUE,
           stringsAsFactors = FALSE,
           encoding = "UTF-8")
}

kelimeler <- c("this is a sample")
corpus <- Corpus(VectorSource(kelimeler))
separatorOfTokens <- ' '
words <- readFile("wordMappings", "|")

toSpace <- content_transformer(function(x, from)
  gsub(sprintf("(^|%s)%s($|%s)", separatorOfTokens, from, separatorOfTokens),
       sprintf(" %s", separatorOfTokens), x))
for (word in words$from) {
  corpus <- tm_map(corpus, toSpace, word)
}

If you want a more flexible solution, for example not just removing words but also replacing them, then:

#Specific transformations: replace each 'from' token with its 'to' counterpart
toMyToken <- content_transformer(function(x, from, to)
  gsub(sprintf("(^|%s)%s($|%s)", separatorOfTokens, from, separatorOfTokens),
       sprintf(" %s%s", to, separatorOfTokens), x))

for (i in seq_len(nrow(words))) {
  print(sprintf("%s -> %s ", words$from[i], words$to[i]))
  corpus <- tm_map(corpus, toMyToken, words$from[i], words$to[i])
}

Now a sample run:

[1] "this -> ThIs "
[1] "is -> Is "
[1] "a -> A "
[1] "sample -> SamPle "
> content(corpus[[1]])
[1] " ThIs Is A SamPle "
> 
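As a stylistic aside, the row-by-row loop can also be written as a single fold over the mapping table. A base-R sketch (no tm required; the from/to pairs from wordMappings are inlined here so the example is self-contained):

```r
# The from/to pairs from wordMappings, inlined as a plain data frame
words <- data.frame(from = c("this", "is", "a", "sample"),
                    to   = c("ThIs", "Is", "A", "SamPle"),
                    stringsAsFactors = FALSE)

# Fold each replacement over the input string in turn
replaceAll <- function(x, mapping) {
  Reduce(function(txt, i) {
    gsub(sprintf("(^| )%s($| )", mapping$from[i]),
         sprintf(" %s ", mapping$to[i]),
         txt)
  }, seq_len(nrow(mapping)), init = x)
}

replaceAll("this is a sample", words)
# -> " ThIs Is A SamPle "
```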

Upvotes: 1

Benjamin Levy

Reputation: 343

My solution, which may be cumbersome and inelegant:

#read in items to be removed
removalList = as.matrix( read.csv( listOfWordsAndPunc, header = FALSE ) )
#
#get term listing from the document-term matrix
termListing = colnames( corpusFileDocs_dtm )
#
#find intersection of terms in removalList and termListing
commonWords = intersect( removalList, termListing )
removalIndxs = match( commonWords, termListing )
#
#create m for term frequency, etc.
m = as.matrix( corpusFileDocs_dtm )
#
#use removalIndxs to drop irrelevant columns from m
allColIndxs = 1 : length( termListing )
keepColIndxs = setdiff( allColIndxs, removalIndxs )
m = m[ ,keepColIndxs ]
#
#thence to tf-idf analysis with revised m
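The column-dropping logic above can be exercised end to end on a toy matrix (the term names and removal list here are made up for illustration):

```r
# Toy term-frequency matrix: rows = documents, columns = terms
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"),
                            c("apple", "the", "banana", "of")))

removalList <- c("the", "of", "and")   # "and" is absent and is simply ignored
termListing <- colnames(m)

# keep only the columns whose terms are not on the removal list
commonWords  <- intersect(removalList, termListing)
removalIndxs <- match(commonWords, termListing)
keepColIndxs <- setdiff(seq_along(termListing), removalIndxs)

m <- m[, keepColIndxs, drop = FALSE]
colnames(m)
# -> "apple" "banana"
```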

Any stylistic or computational suggestions for improvement are gratefully sought.

BSL

Upvotes: 0
