Reputation: 77
I've been trying to apply unnest_tokens from tidytext to a dataframe column to generate common bigrams and trigrams. They're short texts from > 200 articles, taken as a column subset from a larger csv.
I've tried the following, to no avail:
1. setting stringsAsFactors = FALSE
2. using unnest_ and unnest_tokens_
Example:
bookparagraphs.csv
a <- data.frame("texts" = bookparagraphs$text[1:10], stringsAsFactors = FALSE)
str(a)
'data.frame': 10 obs. of 1 variable:
 $ texts: Factor w/ 6552 levels
Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.
However, tm_map works wonderfully when I convert my texts > corpus > DTM, etc.; I'm able to count and review word co-occurrences just fine.
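Roughly, the tm route that works for me looks like this (object names simplified for illustration):

library(tm)
corp <- VCorpus(VectorSource(as.character(bookparagraphs$text[1:10])))
corp <- tm_map(corp, content_transformer(tolower))   # lower-case the texts
corp <- tm_map(corp, removePunctuation)              # strip punctuation
dtm <- DocumentTermMatrix(corp)                      # texts > corpus > DTM
findFreqTerms(dtm, lowfreq = 5)                      # inspect frequent terms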
I'd like to get better at using tidytext, hence I'm looking to find out how this works and where I went wrong.
Appreciate any suggestions! Thank you.
Upvotes: 1
Views: 1356
Reputation: 23608
The error you get in tidytext is because texts is a factor, which means your bookparagraphs$text is a factor, probably from reading in bookparagraphs.csv. When you just use a <- data.frame("texts" = bookparagraphs$text[1:10], stringsAsFactors = FALSE), the stringsAsFactors argument has no effect, because bookparagraphs$text is already a factor. Either read bookparagraphs.csv with stringsAsFactors = FALSE, or use readr to load the data; readr never converts strings to factors.
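For example, a minimal sketch of both loading routes (the file name bookparagraphs.csv is taken from your question; adjust the path as needed):

# base R: keep strings as character on import
bookparagraphs <- read.csv("bookparagraphs.csv", stringsAsFactors = FALSE)

# or with readr, which never creates factors
library(readr)
bookparagraphs <- read_csv("bookparagraphs.csv")

If you would rather not re-read the file, you can coerce the existing column instead: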
a <- data.frame("texts" = as.character(bookparagraphs$text[1:10]), stringsAsFactors = FALSE)
This will coerce bookparagraphs$text to a character vector, and stringsAsFactors = FALSE prevents it from being turned into a factor again.
After this, you can use unnest_tokens without an issue.
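For instance, a rough sketch of the bigram counts you're after (the column name texts follows the example above; n = 2 gives bigrams, n = 3 trigrams):

library(dplyr)
library(tidytext)

a %>%
  unnest_tokens(bigram, texts, token = "ngrams", n = 2) %>%  # set n = 3 for trigrams
  count(bigram, sort = TRUE)                                 # most common bigrams first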
Upvotes: 1