Scott

Reputation: 1922

Transforming emoji text in R

I'm doing some text mining in R on a corpus of roughly 25k documents. As part of cleaning the corpus, I convert the text to lower case. My implementation:

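createCorpus <- function(corpusData){
    # Build the corpus from a single-column data frame (one document per row)
    aCorpus <- Corpus(DataframeSource(corpusData))
    ...
    # Lower-case every document; this is the step that fails on emoji
    aCorpus <- tm_map(aCorpus, content_transformer(tolower))
    aCorpus
}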

However, any document whose text contains emojis triggers the following error (I've removed the actual text):

Error in FUN(content(x), ...) : invalid input '...' in 'utf8towcs'

I've tried adding str_replace_all(aCorpus$content,"[^[:graph:]]", " ") before the lower-case transformation, as suggested in this answer. It produces exactly the same error as above, almost as if the call had no effect.

I have also tried tm_map(aCorpus, function(x) iconv(enc2utf8(x), sub = "byte")) as suggested here, which yields the error:

Error in enc2utf8(x) : argument is not a character vector

I feel like str_replace_all() is the correct approach but I must be doing something wrong? How can I remove all emoji characters so that I can clean my corpus?

EDIT For clarification, the parameter passed to the function is a single column data-frame, where each row is a separate document.

Upvotes: 3

Views: 1036

Answers (1)

Scott

Reputation: 1922

I managed to solve the issue using:

tm_map(aCorpus, function(x) iconv(enc2utf8(x$content), sub = "byte"))

In place of:

tm_map(aCorpus, function(x) iconv(enc2utf8(x), sub = "byte"))

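The issue is that the function passed to tm_map() receives an object that is not a plain character vector, which is why enc2utf8(x) failed with "argument is not a character vector". Referring to the text directly with x$content, rather than to x itself, gives enc2utf8() the character data it expects.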
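To see what the encoding step does in isolation, here is a minimal base-R sketch that works on a plain character vector rather than a tm corpus. The sample documents and the helper name strip_non_ascii are made up for illustration. It also swaps in a slightly different call from the one above: it converts to ASCII with sub = "" so that emoji are dropped outright, instead of using sub = "byte". This is one possible approach under those assumptions, not necessarily the best way to clean a real corpus.

```r
# Hypothetical sample documents; the second one contains an emoji.
docs <- c("Plain ASCII Text", "Good Morning \U0001F600 World")

# strip_non_ascii is an illustrative helper, not part of tm.
# iconv() with to = "ASCII" and sub = "" drops every byte that cannot be
# represented in ASCII. That removes emoji, but also accented letters and
# any other non-ASCII text, so it is a blunt instrument.
strip_non_ascii <- function(x) {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
}

# With the emoji gone, tolower() no longer has invalid input to choke on.
cleaned <- tolower(strip_non_ascii(docs))
print(cleaned)
```

Inside tm, a function like this would presumably be wrapped with content_transformer(), the same way the question wraps tolower, so that it is applied to each document's text rather than to the document object.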

Upvotes: 4
