Kate

Reputation: 512

Get R to keep UTF-8 Codepoint representation

This question is related to the utf8 package for R. I have a weird problem in which I want emojis in a data set I'm working with to stay in code point representation (i.e. as '\U0001f602'). I want to use the 'FindReplace' function from the DataCombine package to turn UTF-8 code points into prose descriptions of emojis in a dataset of YouTube comments (using a dictionary I made available here). The only issue is that when I save the output of utf8_encode as an object in R, the nice escaped representation that my dictionary can match against disappears...

First I have to adjust the dictionary a bit:

emojis$YouTube <- tolower(emojis$Codepoint)

emojis$YouTube <- gsub("u\\+","\\\\U000", emojis$YouTube)
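For a single entry, the substitution looks like this (a hypothetical example using U+1F602, just to show what the two steps produce; not an entry from my actual dictionary file):

tolower("U+1F602")
# [1] "u+1f602"

gsub("u\\+", "\\\\U000", "u+1f602")
# [1] "\\U0001f602"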

Then I convert the text to character so that I can use utf8_encode:

emojimovie$test <- as.character(emojimovie$textOriginal)

The following works great; it gives output like \U0001f595 (etc.) that can be matched with my dictionary entries when it 'prints' in the console:

utf8_encode(emojimovie$test)

BUT, when I do this:

emojimovie$text2 <- utf8_encode(emojimovie$test)

and then:

emoemo <- FindReplace(data = emojimovie, Var = "text2", replaceData = emojis, from = "YouTube", to = "Name", exact = TRUE)

I get all NAs. When I look at the output in $text2 with View(), I don't see \U0001f595; I see the actual emojis. I think this is why the FindReplace function isn't working: when the output gets saved to an object, it is represented as emojis again and the function can't find any matches. When I try gsub("\U0001f602", "lolface", emojimovie$text2), however, I can actually match and replace things, but I don't want to do that by hand for all ~2,000 or so emojis. I've tried reading as much as I can about UTF-8, but I can't understand why this is happening. I'm stumped! :P
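For reference, here is what I mean by a single replacement working (a made-up one-line string, not my actual data):

gsub("\U0001f602", "lolface", "that made me \U0001f602")
# [1] "that made me lolface"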

Upvotes: 0

Views: 242

Answers (1)

Patrick Perry

Reputation: 1482

It looks like in the above, you are trying to convert the UTF-8 emoji to a text version. I would recommend going the other direction. Something like

emojis <- read.csv('Emoji Dictionary 2.1.csv', stringsAsFactors = FALSE)

# change U+1F469 U+200D U+1F467 to \U1F469\U200D\U1F467
escapes <- gsub("[[:space:]]*\\U\\+", "\\\\U", emojis$Codepoint)

# convert to UTF-8 using the R parser
codes <- sapply(parse(text = paste0("'", escapes, "'"),
                      keep.source = FALSE), eval)

This will convert the text representations like U+1F469 to UTF-8 strings. Then, you can search for these strings in the original data.
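For that last step, a minimal sketch of the search-and-replace (assuming, as in your post, that emojis$Name holds the prose descriptions and emojimovie$textOriginal holds the comment text) could be:

# replace each emoji with its name; fixed = TRUE treats the
# emoji as a literal string rather than a regular expression
out <- emojimovie$textOriginal
for (i in seq_along(codes)) {
  out <- gsub(codes[i], emojis$Name[i], out, fixed = TRUE)
}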

Note: If you are using Windows, make sure you have the latest release of R; in older versions, the parser gives you the wrong result for strings like "\U1F469".

The utf8::utf8_encode function should really only be used when you have UTF-8 text and are trying to print it to the screen.

Upvotes: 1
