Reputation: 6755
I am interested in replacing all words in a tm Corpus object according to a dictionary stored as a two-column data frame, where the first column is the word to be matched and the second column is the replacement word.
I am stuck on the translate function. I saw this answer, but I can't turn it into a function that can be passed to tm_map.
Please consider the following MWE:
library(tm)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
dictionary <- data.frame(word = c('first', 'second', 'text'),
                         translation = c('primo', 'secondo', 'testo'))
translate <- function(text, dictionary) {
  # Would like to replace each word of text with corresponding word in dictionary
}
corp_translated <- tm_map(corp, translate)
inspect(corp_translated)
# Expected result
# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID
# [[1]]
# primo testo
# [[2]]
# secondo testo
Upvotes: 2
Views: 7284
Reputation: 335
In combination with the tm_map function of the tm package, you can use stri_replace_all_fixed from the stringi package. For instance:
library(tm)
library(stringi)
docs <- c("first text", "second text")
corp <- Corpus(VectorSource(docs))
word <- c('first', 'second', 'text')
tran <- c('primo', 'secondo', 'testo')
corp <- tm_map(corp, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
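Note that fixed matching replaces substrings, so the 'text' entry would also change a word such as "context". A minimal base R sketch of a whole-word alternative follows. The name translate_words and the choice to split on single spaces are assumptions made for illustration; neither comes from tm or stringi.

```r
# Hypothetical helper: translate whole words only, leaving unknown words untouched.
translate_words <- function(text, word, tran) {
  lookup <- setNames(tran, word)               # named vector: original -> translation
  vapply(strsplit(text, " ", fixed = TRUE), function(tokens) {
    hit <- tokens %in% names(lookup)           # which tokens have a dictionary entry
    tokens[hit] <- lookup[tokens[hit]]         # replace only the matched tokens
    paste(tokens, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

word <- c("first", "second", "text")
tran <- c("primo", "secondo", "testo")
translated <- translate_words(c("first text", "second context"), word, tran)
```

With recent versions of tm, a function like this would typically be wrapped in content_transformer before being handed to tm_map, along the lines of tm_map(corp, content_transformer(translate_words), word, tran).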
Upvotes: 3
Reputation: 55380
I would suggest not using a data.frame for a dictionary, since the basic object in R, a vector, already works as a dictionary once its elements are named.
dict <- c('primo', 'secondo', 'testo')
names(dict) <- c('first', 'second', 'text')
Then, to "translate" x, where x might be "second", you simply use:
dict[[x]]
You don't even need a wrapper function.
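Single-bracket indexing also works on several words at once, which is convenient for a whole sentence. It returns NA for a word that has no entry, whereas double brackets raise an error. The sketch below assumes words are separated by single spaces and keeps unknown words unchanged; the variable names are illustrative only.

```r
dict <- c(first = "primo", second = "secondo", text = "testo")

# Split the sentence into words and look them all up in one step
tokens <- strsplit("second text here", " ", fixed = TRUE)[[1]]
looked_up <- unname(dict[tokens])                 # NA for words without an entry

# Fall back to the original word where the dictionary has no translation
result <- ifelse(is.na(looked_up), tokens, looked_up)
sentence <- paste(result, collapse = " ")
```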
If you want to translate in the opposite direction, use
names(dict)[dict %in% x]
Or you can flip the dictionary:
dict.flip <- names(dict)
names(dict.flip) <- dict
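If the dictionary already lives in a data frame, as in the question, setNames can turn it into a named vector, and the same call flips it. This is a small sketch that assumes the translations are unique; duplicate values would produce duplicate names in the flipped vector.

```r
dictionary <- data.frame(word = c("first", "second", "text"),
                         translation = c("primo", "secondo", "testo"),
                         stringsAsFactors = FALSE)

dict <- setNames(dictionary$translation, dictionary$word)   # English -> Italian
dict.flip <- setNames(names(dict), dict)                    # Italian -> English
back <- dict.flip[["testo"]]
```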
Upvotes: 3