How to use unnest_tokens on Twitter text data?

Question

I'm trying to run the following code, but it gives me an error message.

data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
      "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
      "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&amp;|&lt;|&gt;"
tidy_data <- data_df %>% 
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "data_df") %>%
filter(!word %in% stop_words$word,
     !word %in% str_remove_all(stop_words$word, "'"),
     str_detect(word, "[a-z]"))

It gives me the following error message:

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), : argument str should be a character vector (or an object coercible to)

How can I fix it?

Julia Silge · Accepted Answer

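The main problem is that you gave your text column the name data but then referred to it later as text. In addition, token = "data_df" is not one of the tokenizers that unnest_tokens() accepts; leaving the argument out uses the default, "words". Try something more like this: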

library(tidyverse)
library(tidytext)

text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation.  . . Video:  . . -  #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
          "#Copingwiththelockdown... Festac town, Lagos.  #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
          "Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma  . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)

remove_reg <- "&amp;|&lt;|&gt;"

data_df %>% 
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords()) %>%
  filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#>    word      
#>    <chr>     
#>  1 said      
#>  2 cant      
#>  3 lil       
#>  4 dance     
#>  5 party     
#>  6 stuck     
#>  7 quarantine
#>  8 happy     
#>  9 friday    
#> 10 cousins   
#> # … with 95 more rows
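
The column-name mismatch can be checked directly. The sketch below uses a shortened, made-up two-element vector (not the tweets above) to show how each constructor names the resulting column; the names short_df, renamed_df and explicit_df are illustrative only.

```r
library(tibble)

# Shortened stand-in for the tweet vector (hypothetical example data)
data <- c("first example tweet", "second example tweet")

# as.data.frame() names the column after the vector: "data", not "text"
short_df <- as.data.frame(data)
names(short_df)
#> [1] "data"

# tibble() also takes the name from the vector, so calling it `text` works
text <- data
renamed_df <- tibble(text)
names(renamed_df)
#> [1] "text"

# Naming the column explicitly avoids relying on the vector's name
explicit_df <- tibble(text = data)
names(explicit_df)
#> [1] "text"
```

Running names() on the data frame before the pipeline is a quick way to confirm which column name mutate() and unnest_tokens() should refer to.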

If you are specifically interested in Twitter data, consider using token = "tweets":

data_df %>% 
  unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#>    word 
#>    <chr>
#>  1 who  
#>  2 said 
#>  3 we   
#>  4 cant 
#>  5 have 
#>  6 a    
#>  7 lil  
#>  8 dance
#>  9 party
#> 10 while
#> # … with 111 more rows

Created on 2020-04-12 by the reprex package (v0.3.0)

This option handles hashtags and usernames well.
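
If token = "tweets" is not available in your installed version of tidytext, one possible workaround is to split on whitespace with token = "regex" and keep only the tokens that start with #. This is a rough sketch on made-up example text, not a replacement for a full Twitter tokenizer: punctuation attached to a token is kept, and the names tweets_df and hashtags are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# Two short example strings (hypothetical data, modelled on the tweets above)
tweets_df <- tibble(text = c("Peace Everywhere. #peace #amani",
                             "Festac town, Lagos. #covid19 #lagos"))

hashtags <- tweets_df %>%
  # Split on runs of whitespace so that "#peace" stays a single token
  unnest_tokens(word, text, token = "regex", pattern = "\\s+") %>%
  # Keep only tokens that begin with a hash sign
  filter(str_detect(word, "^#"))

hashtags
```

Splitting on whitespace is a deliberate simplification: the default "words" tokenizer strips punctuation and would drop the leading # from each hashtag.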
