Bangladeshi Voice

Reputation: 35

How can I tokenize a text column in R? unnest function not working

I am a new R user. I would really appreciate your help with solving this tokenization problem:

My task in brief: I am trying to import a text file into R. One of the text columns is Headline. The dataset is basically a collection of news articles related to a disease.

Issue: I have tried many times to tokenize it using the unnest_tokens function.

It is showing me the following error messages:

Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "character"

Error in unnest_tokens(word, Headline) : object 'word' not found

library(dplyr)
library(tidytext)

DengueNews %>%
unnest_tokens(word, Headline)

Note: Link to the dataset: https://drive.google.com/file/d/18VWg-2sO11GpwxMGF1UbziodoWK9B9Ru/view?usp=sharing. I am following the instructions from https://www.tidytextmining.com/tidytext.html

Upvotes: 3

Views: 1790

Answers (1)

akrun

Reputation: 887501

It is not clear how the data was read. As mentioned in the comments, if the 'Headline' column is of class character, it should work. Here, we use read_excel from the readxl package to read the dataset. By default, text columns are returned as character vectors.

library(dplyr)   # provides the %>% pipe used below
library(readxl)
library(tidytext)
DengueNews <- read_excel("DengueNews.xlsx")
class(DengueNews$Headline)
#[1] "character"

DengueNews %>%
  unnest_tokens(word, Headline)
# A tibble: 217 x 4
   Serial Date  Newscontent                                                                                                                                             word      
    <dbl> <chr> <chr>                                                                                                                                                   <chr>     
 1    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dghs      
 2    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 491       
 3    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… more      
 4    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… hospitali…
 5    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… for       
 6    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dengue    
 7    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… in        
 8    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 24hrs     
 9    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… 1         
10    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… more      
# … with 207 more rows

If we change the column to another class, e.g. factor, it would fail:

library(dplyr)
DengueNews %>%
   mutate(Headline = factor(Headline)) %>%
   unnest_tokens(word, Headline)
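
As a minimal sketch of two possible fixes (the toy `headlines` vector and the object names below are hypothetical stand-ins, not the real dataset): the first error in the question suggests that DengueNews itself may have been a plain character vector rather than a data frame, in which case wrapping it in a tibble gives unnest_tokens a column to work on. If instead the column was read in as a factor, converting it back with as.character before tokenizing should be enough.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the real Headline column
headlines <- c("Dengue cases rise in Dhaka",
               "Hospitals admit more dengue patients")

# Case 1: the object is a bare character vector, not a data frame.
# Wrap it in a tibble so there is a named column to tokenize.
news_from_vector <- tibble(Headline = headlines)
tokens_from_vector <- news_from_vector %>%
  unnest_tokens(word, Headline)

# Case 2: the column is a factor.
# Convert it back to character before calling unnest_tokens().
news_with_factor <- tibble(Headline = factor(headlines))
tokens_from_factor <- news_with_factor %>%
  mutate(Headline = as.character(Headline)) %>%
  unnest_tokens(word, Headline)

tokens_from_factor
```

Running class(DengueNews) and class(DengueNews$Headline) first should show which of the two cases applies.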

Upvotes: 1
