Brandt

Reputation: 565

Loading a JSON dataset into Spark, then using filter, map, etc.

I'm new to Apache Spark and would like to take a dataset saved as JSON (a list of dictionaries), load it into an RDD, and then apply operations like filter and map. This seems like it should be simple, but after looking through Spark's docs, the only thing I found uses SQL queries (https://spark.apache.org/docs/1.1.0/sql-programming-guide.html), which is not how I'd like to interact with the RDD.

How can I load a dataset saved in JSON into an RDD? If I missed the relevant documentation, I'd appreciate a link.

Thanks!

Upvotes: 2

Views: 5619

Answers (2)

Aaron Bannin

Reputation: 127

Have you tried applying json.loads() in the map?

import json

# Read the file as an RDD of text lines, then parse each line as JSON
lines = sc.textFile('/path/to/file')
d = lines.map(lambda line: json.loads(line))
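Once the lines are parsed into dicts, filter and map chain as usual. A minimal sketch, using hypothetical 'age' and 'name' keys for illustration:

# Hypothetical keys for illustration: keep records with age over 21,
# then project out just the names
adults = d.filter(lambda record: record.get('age', 0) > 21)
names = adults.map(lambda record: record['name'])
print(names.collect())

Note that this assumes one JSON object per line; if the file is a single JSON array, calling json.loads() on each line will fail, and you'd need to read and parse the file as a whole instead.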

Upvotes: 1

tgpfeiffer

Reputation: 1858

You could do something like

import org.apache.spark.rdd.RDD
import org.json4s._
import org.json4s.native.JsonMethods._

// flatMap over the Option silently drops any lines that fail to parse
val jsonData: RDD[JValue] = sc.textFile(path).flatMap(line => parseOpt(line))

and then do your JSON processing on those JValues, for example:

jsonData.foreach(json => {
  println(json \ "someKey")
  (json \ "id") match {
    case JInt(x) => ???
    case _ => ???
  }
})

Upvotes: 3
