Reputation: 1922
I'm doing some text mining in R on a corpus of roughly 25k documents. As part of cleaning the corpus I'm converting the text to lower case. My implementation:
library(tm)

createCorpus <- function(corpusData) {
  # corpusData: single-column data frame, one document per row
  aCorpus <- Corpus(DataframeSource(corpusData))
  ...
  aCorpus <- tm_map(aCorpus, content_transformer(tolower))
}
However, for any document text that contains emojis I'm getting the following error. Note, I've removed the actual text.
Error in FUN(content(x), ...) : invalid input '...' in 'utf8towcs'
Now, I've tried adding str_replace_all(aCorpus$content, "[^[:graph:]]", " ") before transforming to lower case, as suggested in this answer. This produces exactly the same error as above, almost as if it hasn't actually done anything.
I have also tried tm_map(aCorpus, function(x) iconv(enc2utf8(x), sub = "byte")) as suggested here, which yields the error:
Error in enc2utf8(x) : argument is not a character vector
I feel like str_replace_all() is the correct approach, but I must be doing something wrong. How can I remove all emoji characters so that I can clean my corpus?
EDIT: For clarification, the parameter passed to the function is a single-column data frame, where each row is a separate document.
Upvotes: 3
Views: 1036
Reputation: 1922
I managed to solve the issue using:
tm_map(aCorpus, function(x) iconv(enc2utf8(x$content), sub = "byte"))
In place of:
tm_map(aCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))
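The issue is that the function passed to tm_map() receives a document object rather than a plain character vector, which is why enc2utf8(x) complained that its argument is not a character vector. I had to refer directly to the document's text by using x$content as the parameter rather than merely x.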
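For anyone who wants to try the idea outside of tm first, here is a minimal base R sketch on a plain character vector. It assumes the goal is simply to drop everything that cannot be represented in ASCII (so accented letters go too, not only emojis). The helper name strip_non_ascii and the sample strings are made up for illustration, and it uses sub = "" to delete the offending bytes instead of the sub = "byte" used above, which would keep them as <xx> escapes.

```r
# Hypothetical helper (not part of tm): remove every character that has no
# ASCII representation. Assumes the input is UTF-8 encoded text.
strip_non_ascii <- function(x) {
  # sub = "" drops non-convertible bytes instead of returning NA
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up sample documents, one of them containing an emoji
docs <- c("Great Day \U0001F600", "Plain TEXT")

# After stripping, tolower() only ever sees ASCII input
cleaned <- tolower(strip_non_ascii(docs))
print(cleaned)
```

Inside a tm pipeline the same helper could presumably be applied with tm_map(aCorpus, content_transformer(strip_non_ascii)) before the tolower step, though I have not tested that against the corpus from the question.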
Upvotes: 4