flustercludge

Reputation: 77

Issue with tidytext: unable to apply unnest_tokens to a data frame

I've been trying to apply unnest_tokens from tidytext to a data frame column to generate common bigrams and trigrams. The texts are short, come from more than 200 articles, and are a column subset of a larger CSV.

I've tried the following, to no avail:
1. Setting stringsAsFactors = FALSE.
2. Using unnest_ and unnest_tokens_.

Example: bookparagraphs.csv

a <- data.frame("texts" = bookparagraphs$text[1:10], stringsAsFactors = FALSE)

str(a)

'data.frame':   10 obs. of  1 variable:
$ text: Factor w/ 6552 levels 

Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.

However, tm_map works fine when I convert my texts to a corpus, then to a DTM, and so on. I'm able to count and review word co-occurrences without problems.

I'd like to get better at using tidytext, so I'd like to find out how this works and where I went wrong.

I'd appreciate any suggestions. Thank you.

Upvotes: 1

Views: 1356

Answers (1)

phiver

Reputation: 23608

The error you get in tidytext is because texts is a factor, which means bookparagraphs$text is a factor, probably as a result of reading in bookparagraphs.csv. When you use a <- data.frame("texts" = bookparagraphs$text[1:10], stringsAsFactors = FALSE), stringsAsFactors has no effect on bookparagraphs$text: the argument only controls whether character vectors are converted to factors, and it does not turn an existing factor back into a character vector. Either read bookparagraphs.csv with stringsAsFactors = FALSE, use readr to load the data, or use:

a <- data.frame("texts" = as.character(bookparagraphs$text[1:10]), stringsAsFactors = FALSE)

This coerces bookparagraphs$text to a character vector, and stringsAsFactors = FALSE prevents it from being turned into a factor again.
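To see the difference for yourself, here is a minimal sketch in base R. The bookparagraphs data frame below is a made-up stand-in for your CSV (two invented sentences, with the text column forced to a factor), so the contents are assumptions for illustration only:

```r
# Hypothetical stand-in for bookparagraphs: the text column is forced to be a
# factor, which is what read.csv() produced by default before R 4.0.0.
bookparagraphs <- data.frame(
  text = c("The quick brown fox", "jumps over the lazy dog"),
  stringsAsFactors = TRUE
)

# stringsAsFactors = FALSE only affects character inputs,
# so an input that is already a factor stays a factor.
a_factor <- data.frame(texts = bookparagraphs$text, stringsAsFactors = FALSE)
class(a_factor$texts)

# Converting explicitly with as.character() gives a character column.
a_char <- data.frame(texts = as.character(bookparagraphs$text),
                     stringsAsFactors = FALSE)
class(a_char$texts)
```

Checking class() or str() on the column before calling unnest_tokens is a quick way to confirm the input is a character vector.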

After this, you can use unnest_tokens without an issue.
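For the bigrams and trigrams you mention, something along these lines should work once the column is character. This is only a sketch: it assumes dplyr and tidytext are installed, and the two example sentences are invented placeholders for your real texts.

```r
library(dplyr)
library(tidytext)

# Invented example texts standing in for the real article paragraphs.
a <- data.frame(
  texts = c("the quick brown fox jumps", "the quick brown dog sleeps"),
  stringsAsFactors = FALSE
)

# Split each text into bigrams and count how often each one occurs.
bigrams <- a %>%
  unnest_tokens(bigram, texts, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# Same idea for trigrams; only n and the output column name change.
trigrams <- a %>%
  unnest_tokens(trigram, texts, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

bigrams
trigrams
```

Here token = "ngrams" switches unnest_tokens from its default word tokenizer to n-grams, and n sets the n-gram length.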

Upvotes: 1
