Reputation: 83
I am loading one of the 5-core datasets from
http://jmcauley.ucsd.edu/data/amazon/
using
library(sparklyr)
library(dplyr)
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "2G"
sc <- spark_connect(master = "local", config = config)
df <- spark_read_json(sc = sc, name = "videos", path = "Path/to/reviews_Office_Products_5.json")
where one of the variables is a column of text reviews, like so:
select(df,reviewText)
# Source: lazy query [?? x 1]
# Database: spark_connection
   reviewText
1 I bought my first HP12C in about 1984 or so, and it served me faithfully until 2002 wh
2 "WHY THIS BELATED REVIEW? I feel very obliged to share my views about this old workhor
3 I have an HP 48GX that has been kicking for more than twenty years and an HP 11 that i
4 I've started doing more finance stuff recently and went looking for a good time-value-
5 For simple calculations and discounted cash flows, this one is still the best. I used
6 While I don't have an MBA, it's hard to believe that a calculator I learned how to use
7 I've had an HP 12C ever since they were first available, roughly twenty years ago. I'
8 Bought this for my boss because he lost his. He loves this calculator & would not be
9 This is a well-designed, simple calculator that handles typical four-function math. La
10 I love this calculator, big numbers and calculate excellent so easy to use and make my
# ... with more rows
I want to split the reviews into tokens, with each row containing a word, but that has proven to be difficult. When I try to use the function unnest_tokens, I get the following error message:
library(stringr)
library(tidytext)
Word_by_Word <- df %>% unnest_tokens(word, reviewText)
Error in unnest_tokens_.default(., word, reviewText) : unnest_tokens expects all columns of input to be atomic vectors (not lists)
What is happening? How do I fix this without using pull() and coercing the data into the requested format? I cannot pull the data as suggested in Extract a dplyr tbl column as a vector, nor convert the data to a tibble, because the dataset is too big: if I do either of those, the computer runs out of memory, even after increasing the 2G limit and running the program on a machine with plenty of memory (which is the whole point of using dplyr instead).
Upvotes: 3
Views: 5074
Reputation: 429
It appears that you already have the data frame in memory. If so, the error message is pointing the way: each entry in reviewText is a list, and unnest_tokens() expects every column of its input to be an atomic vector. Try using unlist() to transform the reviewText field, either in place or via mutate().
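A minimal sketch of that suggestion, assuming df is a local data frame (if it is still a remote Spark table, it would first need to be brought into R, e.g. with collect(), which the question notes may not fit in memory):
library(dplyr)
library(tidytext)
# In place: flatten the list column into an atomic character vector
df$reviewText <- unlist(df$reviewText)
# Or the same transformation via mutate(), followed by tokenization
Word_by_Word <- df %>%
  mutate(reviewText = unlist(reviewText)) %>%
  unnest_tokens(word, reviewText)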
Upvotes: 1