K.Hua

Reputation: 799

R tweets with emojis

I scraped tweets from the Twitter API with the rtweet package, but I don't know how to work with text that contains emojis, because they appear in the form '\U0001f600' and every regex I have tried so far has failed. I can't get anything out of it.

For example

 text = 'text text. \U0001f600'
 grepl('U',text)

gives me FALSE, and

 grepl('000',text)

also gives me FALSE.

Another problem is that they are often stuck to the preceding word (for example i am here\U0001f600).

So how can I make R recognize emojis in that format? What can I put in grepl so that it returns TRUE for any emoji in that format?

Upvotes: 0

Views: 1466

Answers (2)

eniel.rod

Reputation: 855

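Your problem is that you use a single \ character in your code, so R parses \U0001f600 as an escape sequence for one character (the emoji itself) and the string contains no literal U or 000: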

text = 'text text. \U0001f600'

To get those literal characters it should be \\:

text = 'text text. \\U0001f600'
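
A minimal sketch of the difference, using base R only. It shows that the single-backslash string holds one emoji character, and then checks for emoji by code point instead of by regex. The helper has_emoji and its default range (U+1F300 to U+1FAFF) are hypothetical, chosen for illustration; that range is an assumption and does not cover every emoji.

```r
text <- c("text text. \U0001f600", "i am here\U0001f600", "no emoji here")

# The escape is resolved when the string is parsed: the emoji is ONE character.
nchar("\U0001f600")                  # 1
utf8ToInt("\U0001f600") == 0x1F600   # TRUE

# So there is no literal "U" to find, unless the backslash itself is escaped.
grepl("U", text[1])                     # FALSE
grepl("U", "text text. \\U0001f600")    # TRUE

# Hypothetical helper: TRUE if any code point of a string falls inside an
# assumed emoji range (lo and hi are illustrative defaults, not a standard).
has_emoji <- function(x, lo = 0x1F300L, hi = 0x1FAFFL) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))
    any(cp >= lo & cp <= hi)
  }, logical(1), USE.NAMES = FALSE)
}

has_emoji(text)   # TRUE TRUE FALSE
```

Because the check works on code points, it does not matter whether the emoji is attached to the preceding word.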

I had a similar experience using the rtweet library.

In my case the tweets contained Unicode code points, not just emoji, in the following format: "some text<U+code-point>". What I did was "convert" each code point to its graphic representation:

library(stringi)

# gsub() rewrites "<U+code-point>" as "\\ucode-point", the escape format expected below
# stri_unescape_unicode() then un-escapes all Unicode sequences
stri_unescape_unicode(gsub("<U\\+(\\S+)>",
                           "\\\\u\\1", # replace with \\ucode-point
                           "some text with #COVID<U+30FC>19"))
#[1] "some text with #COVIDー19"

If the Unicode code point is not delimited by <> as in my case, change the regular expression from "<U\\+(\\S+)>" to "U(\\S+)". Be careful here: this only works correctly if a space character follows the code point. If words are attached to the code point both before and after, the pattern must be more specific and state how many characters make up the code point, for example "U(....)".

You can refine this regular expression with character classes, or restrict it to hexadecimal digits: "U([A-Fa-f0-9]+)".

Note that the RStudio console will not display the emoji themselves. You can apply this function, but to see the emoji you need an R library made for that purpose. Other characters are displayed, however: "#COVID<U+30FC>19" appears in the RStudio console as "#COVIDー19".

Edit: Actually "\\S+" didn't work for me when there were consecutive Unicode code points like "<U+0001F926><U+200D><U+2642>". In that case it only replaced the first occurrence (most likely because "\\S+" is greedy and also matches ">" and "<", so it runs through to the last ">"). I didn't dig into it further; I just changed my regular expression to "<U\\+([A-Fa-f0-9]+)>", where "[A-Fa-f0-9]" matches hexadecimal digits.
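
A possible sketch for consecutive code points, under two assumptions: the code points appear as "<U+XXXX>" (four hex digits) or "<U+XXXXXXXX>" (eight hex digits), and the rest of the text contains no backslash sequences of its own. Since \u takes exactly four hex digits, the eight-digit form is rewritten to \U instead. The variable names are illustrative.

```r
library(stringi)

x <- "facepalm <U+0001F926><U+200D><U+2642> and #COVID<U+30FC>19"

# Eight-digit code points first: "<U+0001F926>" becomes "\U0001F926" (literal text).
x <- gsub("<U\\+([0-9A-Fa-f]{8})>", "\\\\U\\1", x)

# Then four-digit code points: "<U+200D>" becomes "\u200D" (literal text).
x <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)

# Un-escape every \uXXXX and \UXXXXXXXX sequence into the actual character.
out <- stri_unescape_unicode(x)
print(out)
```

Matching only hex digits with a fixed count keeps each "<U+...>" token separate, so adjacent code points are converted one by one.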

Upvotes: 0

phiver

Reputation: 23608

In R there tends to be a package for most things. In this case it is textclean, which brings along the lexicon package with its many dictionaries. textclean gives you two functions you can use here: replace_emoji and replace_emoji_identifier.

text = c("text text. \U0001f600", "i am here\U0001f600")

# replace emoji with identifier:
textclean::replace_emoji_identifier(text)
[1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis " 

# replace emoji with text representation
textclean::replace_emoji(text)
[1] "text text. grinning face " "i am here grinning face " 

Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for their presence, as in your expected output:

grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
[1] TRUE TRUE

Upvotes: 3
