Stefan Falk

Reputation: 25397

What is the fastest way to transform a very large JSON file with Spark?

I have a rather large JSON file (Amazon product data) containing many individual JSON objects. Those JSON objects contain text that I want to preprocess for a specific training task, and it is this preprocessing that I need to speed up. One JSON object looks like this:

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano.  He is having a wonderful time playing these old hymns.  The music  is at times hard to read because we think the book was published for singing from more than playing from.  Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

The task would be to extract reviewText from each JSON object and perform some preprocessing on it, such as lemmatization.

My problem is that I don't know how to use Spark to speed this task up on a cluster. I am actually not even sure whether I can read the JSON file as a stream, object by object, and parallelize the main task.

What would be the best way to get started with this?

Upvotes: 3

Views: 7273

Answers (3)

ernest_k

Reputation: 45319

You can read the file as a JSON dataset and then execute a simple SQL query to retrieve the reviewText column's values:

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "path/reviews.json"
val reviews = sqlContext.read.json(path)
// Register this DataFrame as a table.
reviews.registerTempTable("reviews")
val reviewTexts = sqlContext.sql("SELECT reviewText FROM reviews")

Built from examples at the SparkSQL docs (http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets)
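For Spark 2.x and later, the same approach goes through SparkSession instead (sqlContext and registerTempTable are deprecated there); a minimal sketch, assuming the same file path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reviews").getOrCreate()

// Spark infers the schema from the JSON objects automatically.
val reviews = spark.read.json("path/reviews.json")
reviews.createOrReplaceTempView("reviews")

val reviewTexts = spark.sql("SELECT reviewText FROM reviews")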

Upvotes: 1

Aivean

Reputation: 10882

As you have a single JSON object per line, you can use SparkContext's textFile to get an RDD[String] of lines. Then use map to parse each JSON object with something like json4s and extract the field you need.

Your whole code will look as simple as this (assuming you have a SparkContext as sc):

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Parse each line as JSON and pull out the "reviewText" field.
// Defining the implicit Formats inside the closure avoids
// serialization issues when this runs on executors.
val reviews = sc.textFile("input_path").map { line =>
  implicit val formats: Formats = DefaultFormats
  (parse(line) \ "reviewText").extract[String]
}
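From there, the preprocessing itself can be chained as further map steps over the RDD; a minimal sketch, using a hypothetical lemmatize placeholder in place of a real NLP library:

// Hypothetical placeholder: swap in a real lemmatizer (e.g. from an NLP library)
def lemmatize(text: String): Seq[String] = text.toLowerCase.split("\\s+").toSeq

val preprocessed = reviews.map(lemmatize)
preprocessed.map(_.mkString(" ")).saveAsTextFile("output_path")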

Upvotes: 3

gabriel9

Reputation: 86

I would load the JSON data into a DataFrame and then select the field I need; you can also use map to apply preprocessing such as lemmatization.
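A minimal sketch of that approach, assuming Spark 2.x and a placeholder where real lemmatization would go:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Load the JSON file and keep only the column we need
val texts = spark.read.json("path/reviews.json")
  .select("reviewText")
  .as[String]

// Placeholder preprocessing; replace with a real lemmatizer
val preprocessed = texts.map(_.toLowerCase)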

Upvotes: -2
