AngryR11

Reputation: 83

dplyr unnest_tokens not working

I am loading one of the 5-core datasets from

http://jmcauley.ucsd.edu/data/amazon/

using

library(sparklyr)
library(dplyr)

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G"
sc <- spark_connect(master = "local", config = config)
df <- spark_read_json(sc = sc, name = "videos", path = "Path/to/reviews_Office_Products_5.json")

where one of the variables is a column of text reviews, like so:

select(df,reviewText)

# Source: lazy query [?? x 1]

# Database: spark_connection

reviewText

1 I bought my first HP12C in about 1984 or so, and it served me faithfully until 2002 wh

2 "WHY THIS BELATED REVIEW? I feel very obliged to share my views about this old workhor

3 I have an HP 48GX that has been kicking for more than twenty years and an HP 11 that i

4 I've started doing more finance stuff recently and went looking for a good time-value-

5 For simple calculations and discounted cash flows, this one is still the best. I used

6 While I don't have an MBA, it's hard to believe that a calculator I learned how to use

7 I've had an HP 12C ever since they were first available, roughly twenty years ago. I'

8 Bought this for my boss because he lost his. He loves this calculator & would not be

9 This is a well-designed, simple calculator that handles typical four-function math. La

10 I love this calculator, big numbers and calculate excellent so easy to use and make my

# ... with more rows

I want to split the reviews into tokens, with one word per row, but that has proven difficult. When I try to use the function unnest_tokens, I get the following error message:

library(stringr)
library(tidytext) 

Word_by_Word <- df %>% unnest_tokens(word, reviewText)

Error in unnest_tokens_.default(., word, reviewText) : unnest_tokens expects all columns of input to be atomic vectors (not lists)

What is happening? How do I fix this without using pull() and coercing the data into the requested format? I cannot pull the data as suggested in "Extract a dplyr tbl column as a vector", or convert it to a tibble, because the dataset is too big: doing either makes the computer run out of memory, even after increasing the 2G limit and running the program on a machine with a lot of memory (avoiding that is the whole point of using dplyr instead).

Upvotes: 3

Views: 5074

Answers (1)

TTNK

Reputation: 429

It appears that you already have the data frame in memory. If so, the error message is pointing the way for you: each entry in reviewText is a list, and unnest_tokens() expects every column to be an atomic vector.

Try using unlist() to transform the reviewText field, either in-place or via mutate().
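
As a rough sketch of the mutate() route, assuming the data really is a local data frame whose reviewText is a list column holding one string per row. The `reviews` tibble and its `id` column are made up for illustration, and this does not cover a remote Spark table that has not been collected:

```r
library(dplyr)
library(tidytext)

# Hypothetical local data: reviewText is a list column, one string per row
reviews <- tibble(
  id = 1:2,
  reviewText = list("I love this calculator", "Bought this for my boss")
)

# Flatten the list column into a plain character vector, then tokenize
words <- reviews %>%
  mutate(reviewText = unlist(reviewText)) %>%
  unnest_tokens(word, reviewText)
```

After this, `words` should have one row per token, with the `id` column carried along. Note that unnest_tokens() lowercases tokens by default.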

Upvotes: 1
