Andrei Catana

Reputation: 55

Extract emojis from tweets in R

I'm doing feature extraction on labelled Twitter data to use for predicting fake tweets. I've spent a lot of time on various GitHub methods, R libraries, and Stack Overflow posts, but I couldn't find a "direct" method of extracting emoji-related features, e.g. the number of emojis, whether the tweet contains an emoji (1/0), or even the occurrence of specific emojis (which might occur more often in fake or real news). I'm not sure whether there is a point in showing reproducible code.

The ore library, for example, offers functions that gather all tweets in an object and extract the emojis, but the output formats are problematic (at least to me) when trying to create features out of the extractions, as mentioned above. The example below uses a WhatsApp text sample. I will add Twitter data from Kaggle to make it somewhat reproducible. Twitter dataset: https://github.com/sherylWM/Fake-News-Detection-using-Twitter/blob/master/FinalDataSet.csv

# save this to '_chat.txt' (the download requires a login)
# https://www.kaggle.com/sarthaknautiyal/whatsappsample

library(ore)
library(dplyr)
library(purrr)  # provides map(), map_chr() and flatten_chr() used further down

emoji_src <- "https://raw.githubusercontent.com/laurenancona/twimoji/gh-pages/twitterEmojiProject/emoticon_conversion_noGraphic.csv"
emoji_fil <- basename(emoji_src)
if (!file.exists(emoji_fil)) download.file(emoji_src, emoji_fil)

emoji <- read.csv(emoji_fil, header=FALSE, stringsAsFactors = FALSE)
emoji_regex <- sprintf("(%s)", paste0(emoji$V2, collapse="|"))
compiled <- ore(emoji_regex)

chat <- readLines("_chat.txt", encoding = "UTF-8", warn = FALSE)

which(grepl(emoji_regex, chat, useBytes = TRUE))
##   [1]   8   9  10  11  13  19  20  22  23  62  65  69  73  74  75  82  83  84  87  88  90  91
##  [23]  92  93  94  95 107 108 114 115 117 119 122 123 124 125 130 135 139 140 141 142 143 144
##  [45] 146 147 150 151 153 157 159 161 162 166 169 171 174 177 178 183 184 189 191 192 195 196
##  [67] 199 200 202 206 207 209 220 221 223 224 225 226 228 229 234 235 238 239 242 244 246 247
##  [89] 248 249 250 251 253 259 260 262 263 265 274 275 280 281 282 286 287 288 291 292 293 296
## [111] 302 304 305 307 334 335 343 346 348 351 354 355 356 358 361 362 382 389 390 391 396 397
## [133] 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
## [155] 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 442 451 452
## [177] 454 459 463 465 466 469 471 472 473 474 475 479 482 484 485 486 488 490 492 493 496 503
## [199] 505 506 507 509 517 518 519 525 526 527 528 531 535 540 543 545 548 549 557 558 559 560
## [221] 566 567 571 572 573 574 576 577 578 580 587 589 591 592 594 597 600 601 603 608 609 625
## [243] 626 627 637 638 639 640 641 643 645 749 757 764

chat_emoji_lines <- chat[which(grepl(emoji_regex, chat, useBytes = TRUE))]

found_emoji <- ore.search(compiled, chat_emoji_lines, all=TRUE)
emoji_matches <- matches(found_emoji)

str(emoji_matches, 1)
## List of 254
##  $ : chr [1:4] "\U0001f600" "\U0001f600" "\U0001f44d" "\U0001f44d"
##  $ : chr "\U0001f648"
##  $ : chr [1:2] "\U0001f44d" "\U0001f44d"
##  $ : chr "\U0001f602"
##  $ : chr [1:3] "\U0001f602" "\U0001f602" "\U0001f602"
##  $ : chr [1:4] "\U0001f44c" "\U0001f44c" "\U0001f44c" "\U0001f44c"
##  $ : chr [1:6] "\U0001f602" "\U0001f602" "\U0001f602" "\U0001f602" ...
##  $ : chr "\U0001f600"
##  $ : chr [1:5] "\U0001f604" "\U0001f604" "\U0001f604" "\U0001f603" ...
##  $ : chr "\U0001f44d"
## ...

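# Convert each matched emoji to its "\xNN..." byte string so it can be
# joined against the V2 column of the lookup table, then count by name (V3).
tibble(
  V2 = flatten_chr(emoji_matches) %>% 
    map(charToRaw) %>% 
    map(as.character) %>% 
    map(toupper) %>% 
    map(~sprintf("\\x%s", .x)) %>% 
    map_chr(paste0, collapse="")
) %>% 
  left_join(emoji, by = "V2") %>% 
  count(V3, sort=TRUE)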
## # A tibble: 89 x 2
##                                                    V3     n
##                                                 <chr> <int>
##  1                             face with tears of joy   110
##  2                     smiling face with smiling eyes    50
##  3         face with stuck-out tongue and winking eye    43
##  4                                       musical note    42
##  5                                      birthday cake    35
##  6                    grinning face with smiling eyes    26
##  7 face with stuck-out tongue and tightly-closed eyes    24
##  8                                      grinning face    21
##  9                                            bouquet    17
## 10                                     thumbs up sign    17
## # ... with 79 more rows

Source: https://gist.github.com/hrbrmstr/e89eb173ae0333f50f94fe5086fedf8b

The textclean library offers two functions that replace emojis with word equivalents. Source: https://cran.r-project.org/web/packages/textclean/textclean.pdf

Another hit, from the description of the utf8 package on CRAN:

Characters with codes above 0xffff, including most emoji, are not supported on Windows.

Does anyone have any other method, direction, package/function I could use?

Upvotes: 1

Views: 1843

Answers (1)

JBGruber

Reputation: 12478

I wrote a function for this purpose in my package rwhatsapp.

As your example is a WhatsApp dataset, you can test it directly using the package (install via remotes::install_github("JBGruber/rwhatsapp")):

df <- rwhatsapp::rwa_read("_chat.txt")
#> Warning in readLines(x, encoding = encoding, ...): incomplete final line found
#> on '_chat.txt'
df
#> # A tibble: 392 x 6
#>    time                author    text             source       emoji  emoji_name
#>    <dttm>              <fct>     <chr>            <chr>        <list> <list>    
#>  1 2015-06-25 01:42:12 <NA>      : ‎Vishnu Gaud …  /home/johan… <NULL> <NULL>    
#>  2 2015-06-25 01:42:12 <NA>      : ‎You were added /home/johan… <NULL> <NULL>    
#>  3 2016-12-18 01:57:38 Shahain   :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  4 2016-12-21 21:54:46 Pankaj S… :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  5 2016-12-21 21:57:45 Shahain   :Wow             /home/johan… <NULL> <NULL>    
#>  6 2016-12-21 22:48:51 Sakshi    :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  7 2016-12-21 22:49:00 Sakshi    :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  8 2016-12-21 22:50:12 Neha Wip… :Awsum😀😀👍🏼👍🏼   /home/johan… <chr … <chr [4]> 
#>  9 2016-12-21 22:51:21 Sakshi    :🙈              /home/johan… <chr … <chr [1]> 
#> 10 2016-12-21 22:57:01 Ganguly   :🙂🙂👍🏻👍🏻        /home/johan… <chr … <chr [4]> 
#> # … with 382 more rows

I extract the emojis from the text and store them in a list column, as each text can contain multiple emojis. Use unnest to expand the list column into one row per emoji:

library(tidyverse)
df %>% 
  select(time, emoji) %>% 
  unnest(emoji)
#> # A tibble: 654 x 2
#>    time                emoji
#>    <dttm>              <chr>
#>  1 2016-12-21 22:50:12 😀   
#>  2 2016-12-21 22:50:12 😀   
#>  3 2016-12-21 22:50:12 👍🏼   
#>  4 2016-12-21 22:50:12 👍🏼   
#>  5 2016-12-21 22:51:21 🙈   
#>  6 2016-12-21 22:57:01 🙂   
#>  7 2016-12-21 22:57:01 🙂   
#>  8 2016-12-21 22:57:01 👍🏻   
#>  9 2016-12-21 22:57:01 👍🏻   
#> 10 2016-12-21 23:28:51 😂   
#> # … with 644 more rows

You can use this function with any text. The only thing you need to do first is to store the text in a data.frame in a column called text (I use a tibble here as it prints more nicely):

df <- tibble::tibble(
  text = readLines("/home/johannes/_chat.txt")
)
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
rwhatsapp::lookup_emoji(df, text_field = "text")
#> # A tibble: 764 x 3
#>    text                                                emoji     emoji_name
#>    <chr>                                               <list>    <list>    
#>  1 25/6/15, 1:42:12 AM: ‎Vishnu Gaud created this group <NULL>    <NULL>    
#>  2 25/6/15, 1:42:12 AM: ‎You were added                 <NULL>    <NULL>    
#>  3 18/12/16, 1:57:38 AM: Shahain: <‎image omitted>      <NULL>    <NULL>    
#>  4 21/12/16, 9:54:46 PM: Pankaj Sinha: <‎image omitted> <NULL>    <NULL>    
#>  5 21/12/16, 9:57:45 PM: Shahain: Wow                  <NULL>    <NULL>    
#>  6 21/12/16, 10:48:51 PM: Sakshi: <‎image omitted>      <NULL>    <NULL>    
#>  7 21/12/16, 10:49:00 PM: Sakshi: <‎image omitted>      <NULL>    <NULL>    
#>  8 21/12/16, 10:50:12 PM: Neha Wipro: Awsum😀😀👍🏼👍🏼    <chr [4]> <chr [4]> 
#>  9 21/12/16, 10:51:21 PM: Sakshi: 🙈                   <chr [1]> <chr [1]> 
#> 10 21/12/16, 10:57:01 PM: Ganguly: 🙂🙂👍🏻👍🏻            <chr [4]> <chr [4]> 
#> # … with 754 more rows
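To get from such a list column to the features asked about in the question (number of emojis, a 1/0 flag, counts of one specific emoji), something along these lines might work. This is only a sketch in base R: `emoji_col` is a hand-made, hypothetical stand-in for the `emoji` list column shown above, and the feature names are made up for illustration.

```r
# Hypothetical stand-in for the `emoji` list column returned above:
# one list element per text, NULL when the text contains no emoji.
emoji_col <- list(
  NULL,
  c("\U0001f600", "\U0001f600"),
  "\U0001f648"
)

features <- data.frame(
  # total number of emojis per text (NULL has length 0)
  n_emoji = lengths(emoji_col),
  # 1 if the text contains at least one emoji, 0 otherwise
  has_emoji = as.integer(lengths(emoji_col) > 0),
  # occurrences of one specific emoji (grinning face) per text
  n_grinning = vapply(emoji_col, function(e) sum(e == "\U0001f600"), integer(1))
)
features
```

The same pattern extends to any other emoji of interest by adding one `vapply()` column per emoji.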

More details

The way this works under the hood is with a simple dictionary and matching approach. First I split the text into characters and put the characters in a data.frame together with the line id (this is a rewrite of unnest_tokens from tidytext):

lines <- readLines("/home/johannes/_chat.txt")
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
id <- seq_along(lines)
l <- stringi::stri_split_boundaries(lines, type = "character")

out <- tibble(id = rep(id, sapply(l, length)), emoji = unlist(l))

Then I match the characters with a dataset of emoji characters (see ?rwhatsapp::emojis for more information):

out <- add_column(out,
                  emoji_name = rwhatsapp::emojis$name[
                    match(out$emoji,
                          rwhatsapp::emojis$emoji)
                    ])
out
#> # A tibble: 28,652 x 3
#>       id emoji emoji_name
#>    <int> <chr> <chr>     
#>  1     1 "2"   <NA>      
#>  2     1 "5"   <NA>      
#>  3     1 "/"   <NA>      
#>  4     1 "6"   <NA>      
#>  5     1 "/"   <NA>      
#>  6     1 "1"   <NA>      
#>  7     1 "5"   <NA>      
#>  8     1 ","   <NA>      
#>  9     1 " "   <NA>      
#> 10     1 "1"   <NA>      
#> # … with 28,642 more rows

Now the new column contains either the emoji name or NA when no emoji was found. After removing the NAs, only the emojis are left:

out <- out[!is.na(out$emoji_name), ]
out
#> # A tibble: 656 x 3
#>       id emoji emoji_name                       
#>    <int> <chr> <chr>                            
#>  1     8 😀    grinning face                    
#>  2     8 😀    grinning face                    
#>  3     8 👍🏼    thumbs up: medium-light skin tone
#>  4     8 👍🏼    thumbs up: medium-light skin tone
#>  5     9 🙈    see-no-evil monkey               
#>  6    10 🙂    slightly smiling face            
#>  7    10 🙂    slightly smiling face            
#>  8    10 👍🏻    thumbs up: light skin tone       
#>  9    10 👍🏻    thumbs up: light skin tone       
#> 10    11 😂    face with tears of joy           
#> # … with 646 more rows

The disadvantage of this approach is that you rely on the completeness of your emoji data. However, the dataset in the package includes all known emojis from the Unicode website (version 13).
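The split, match and filter steps described above can be sketched end to end without the package. Note the assumptions: `dict` is a hypothetical three-entry dictionary standing in for rwhatsapp::emojis, `lines` is made-up input, and the final per-line count is an addition for illustration rather than something the package returns. Any emoji missing from `dict` would be silently dropped, which is exactly the completeness caveat.

```r
library(stringi)

# Hypothetical three-entry dictionary; a real one (e.g. rwhatsapp::emojis)
# would need to cover the full Unicode emoji list.
dict <- data.frame(
  emoji = c("\U0001f600", "\U0001f602", "\U0001f648"),
  name  = c("grinning face", "face with tears of joy", "see-no-evil monkey"),
  stringsAsFactors = FALSE
)

# Made-up input lines
lines <- c("Wow", "Awsum\U0001f600\U0001f600", "\U0001f648 ok \U0001f602")

# 1. Split every line into user-perceived characters
chars <- stri_split_boundaries(lines, type = "character")

# 2. Long table: one row per character, tagged with its line id
out <- data.frame(
  id    = rep(seq_along(lines), lengths(chars)),
  emoji = unlist(chars),
  stringsAsFactors = FALSE
)

# 3. Dictionary lookup; characters not in the dictionary get NA and are dropped
out$emoji_name <- dict$name[match(out$emoji, dict$emoji)]
out <- out[!is.na(out$emoji_name), ]

# 4. Emoji count per line, including zero for lines without any emoji
n_emoji <- tabulate(out$id, nbins = length(lines))
```

Splitting on character boundaries rather than on single code points is what should keep multi-code-point emojis, such as the skin-tone variants in the output above, together as one dictionary key.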

Upvotes: 4
