Reputation: 799
I scraped tweets from the Twitter API with the rtweet package,
but I don't know how to work with text containing emojis, because they come in the form '\U0001f600' and every regex I have tried so far has failed. I can't get anything out of it.
For example:
text = 'text text. \U0001f600'
grepl('U',text)
gives me FALSE
grepl('000',text)
also gives me FALSE.
Another problem is that they are often stuck to the preceding word (for example i am here\U0001f600).
So how can I make R recognize emojis in that format? What can I put in grepl that will return TRUE for any emoji in that format?
Upvotes: 0
Views: 1466
Reputation: 855
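Your problem is that you use a single backslash \ in your code:
text = 'text text. \U0001f600'
With a single backslash, R parses \U0001f600 as a Unicode escape, so the string holds the emoji character itself and contains no literal 'U' or '000' for grepl to find. To keep the literal characters, it should be a double backslash \\:
text = 'text text. \\U0001f600'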
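To see the difference between the two spellings, here is a small base R sketch (assuming a UTF-8 capable session; the variable names `parsed` and `literal` are only for illustration) that compares what each string actually contains:

```r
# Single backslash: R's parser turns the escape into the emoji itself.
parsed <- 'text text. \U0001f600'
# Double backslash: the string keeps a literal backslash, 'U' and digits.
literal <- 'text text. \\U0001f600'

grepl('U', parsed)                          # no literal 'U' in this string
grepl('U', literal)                         # the literal 'U' is found
grepl('\U0001f600', parsed, fixed = TRUE)   # search for the emoji itself

# The escape stands for a single code point, U+1F600.
utf8ToInt('\U0001f600') == 0x1F600
```

So if the tweets already contain real emoji characters, matching on the letter 'U' cannot work; the pattern has to contain the emoji (or its code point) instead.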
I had a similar experience using the rtweet library.
In my case the tweets contained Unicode code points, not just emoji, in the following format: "some text<U+code-point>". What I did in this case was "convert" each code point to its graphic representation:
library(stringi)
#I use gsub() to replace "<U+code-point>" with "\\ucode-point", the appropriate format
# And stri_unescape_unicode() to un-escape all Unicode sequences
stri_unescape_unicode(gsub("<U\\+(\\S+)>",
"\\\\u\\1", #replace by \\ucode-point
"some text with #COVID<U+30FC>19"))
#[1] "some text with #COVIDー19"
If the Unicode code point is not delimited by <> as in my case, change the regular expression from "<U\\+(\\S+)>" to "U(\\S+)". Be careful here, because this only works correctly if a whitespace character follows the code point. If words are attached to the code point both before and after, the pattern must be more specific and state the number of characters that compose it, for example "U(....)".
You can also refine this regular expression using character classes, for example by allowing only hexadecimal digits: "U([A-Fa-f0-9]+)".
Note that the RStudio console will not display the emoji. You can still apply this function, but to see the emoji you must use an R library made for that purpose. Other characters are displayed, however: "#COVID<U+30FC>19" appears in the RStudio console as "#COVIDー19".
Edit: Actually, "\\S+" didn't work for me when there were consecutive Unicode code points like "<U+0001F926><U+200D><U+2642>". In this case it only replaced the first occurrence. I didn't delve into it (most likely the greedy \\S+ also matches the > and < between code points, so a single match swallows the whole run); I just changed my regular expression to "<U\\+([A-Fa-f0-9]+)>", where "[A-Fa-f0-9]" represents hexadecimal digits.
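As an alternative sketch in base R only (not what I originally ran), each "<U+code-point>" token can be decoded directly with strtoi() and intToUtf8(). This sidesteps the question of whether the escape needs four or eight hex digits. The helper name `decode_codepoints` is hypothetical, and the "<U+...>" input format is assumed to be the one shown above:

```r
# Hypothetical helper: replace every "<U+XXXX>" token with the character
# for that code point, using only base R.
decode_codepoints <- function(x) {
  pattern <- "<U\\+([A-Fa-f0-9]+)>"
  m <- gregexpr(pattern, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Strip the "<U+" prefix and ">" suffix, leaving the hex digits.
    hex <- gsub("^<U\\+|>$", "", tokens)
    # Hex string -> integer code point -> UTF-8 character.
    vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
           character(1), USE.NAMES = FALSE)
  })
  x
}

decode_codepoints("some text with #COVID<U+30FC>19")

# Consecutive code points are decoded one by one.
utf8ToInt(decode_codepoints("<U+0001F926><U+200D><U+2642>"))
```

Because the pattern only accepts hexadecimal digits between "<U+" and ">", each token is matched separately even when several follow each other without spaces.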
Upvotes: 0
Reputation: 23608
In R there tends to be a package for most things, and in this case it is textclean. With it comes the lexicon package, which has a lot of dictionaries. textclean gives you two functions you can use, replace_emoji and replace_emoji_identifier:
text = c("text text. \U0001f600", "i am here\U0001f600")
# replace emoji with identifier:
textclean::replace_emoji_identifier(text)
[1] "text text. lexiconvygwtlyrpywfarytvfis " "i am here lexiconvygwtlyrpywfarytvfis "
# replace emoji with text representation
textclean::replace_emoji(text)
[1] "text text. grinning face " "i am here grinning face "
Next you could use sentimentr for sentiment scoring on the emojis, or quanteda for text analysis. If you just want to check for presence, as in your expected output:
grepl("lexicon[[:alpha:]]{20}", textclean::replace_emoji_identifier(text))
[1] TRUE TRUE
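If you would rather not depend on textclean for a simple presence check, a rough base R sketch is to look at the code points directly. The helper name `has_emoji` is hypothetical, and the range U+1F300 to U+1FAFF is an assumption: it covers the main emoji blocks but is not a complete definition of "emoji" (older symbols such as U+2642 fall outside it, and a few non-emoji symbols fall inside it).

```r
# Hypothetical helper: TRUE for each string containing at least one code
# point in the assumed emoji range U+1F300..U+1FAFF.
has_emoji <- function(x) {
  vapply(x, function(s) {
    cp <- utf8ToInt(enc2utf8(s))        # code points of the string
    any(cp >= 0x1F300 & cp <= 0x1FAFF)  # assumed range, see note above
  }, logical(1), USE.NAMES = FALSE)
}

text <- c("text text. \U0001f600", "i am here\U0001f600", "no emoji here")
has_emoji(text)
# [1]  TRUE  TRUE FALSE
```

This also works when the emoji is attached to the preceding word, since it inspects every character rather than splitting on spaces.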
Upvotes: 3