Reputation: 25397
I am having a rather large JSON file (Amazon product data) with a lot of single JSON objects. Those JSON objects contain text that I want to preprocess for a specific training task but it is the preprocessing that I need to speed up here. One JSON object looks like this:
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
The task would be to extract reviewText
from each JSON object and perform some preprocessing like lemmatizing etc.
My problem is that I don't know how I could use Spark in order to speed this task up on a cluster.. I am actually not even sure if I can read that JSON file as a stream object-by-object and parallelize the main task.
What would be the best way to get started with this?
Upvotes: 3
Views: 7273
Reputation: 45319
You can use a JSON dataset and then execute a simple sql query to retrieve the reviewText
column's value:
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/reviews.json"
val people = sqlContext.read.json(path)
// Register this DataFrame as a table.
people.registerTempTable("reviews")
val reviewTexts = sqlContext.sql("SELECT reviewText FROM reviews")
Built from examples at the SparkSQL docs (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)
Upvotes: 1
Reputation: 10882
As you have single JSON object per line, you can use RDD
's textFile
to get RDD[String]
of lines. Then use map
to parse JSON objects using something like json4s and extract necessary field.
You whole code will looks as simple as this (assuming you have SparkContext
as sc
):
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit def formats = DefaultFormats
val r = sc.textFile("input_path").map(l => (parse(l) \ "reviewText").extract[String])
Upvotes: 3
Reputation: 86
I would load JSON data into Dataframe and then select field that i need, also you can use map to apply preprocessing like lemmatising.
Upvotes: -2