Reputation: 565
I'm new to Apache Spark and would like to take a dataset saved in JSON (a list of dictionaries), load it into an RDD, and then apply operations like filter and map. This seems like it should be simple, but after looking through Spark's docs, the only approach I found uses SQL queries (https://spark.apache.org/docs/1.1.0/sql-programming-guide.html), which is not how I'd like to interact with the RDD.
How can I load a dataset saved in JSON into an RDD? If I missed the relevant documentation, I'd appreciate a link.
Thanks!
Upvotes: 2
Views: 5619
Reputation: 127
Have you tried applying json.loads() in the map?
import json

# Parse each line of the file as a separate JSON object
lines = sc.textFile('/path/to/file')
d = lines.map(lambda line: json.loads(line))
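Once each line is parsed, d is an RDD of Python dicts, so filter and map work as usual. A minimal sketch of what comes next (the 'age' and 'name' keys are hypothetical, just to illustrate the pattern):

# d is an RDD of dicts; standard RDD operations apply from here on
adults = d.filter(lambda record: record.get('age', 0) >= 18)
names = adults.map(lambda record: record['name'])
print(names.take(5))

Note this assumes one JSON object per line; a file containing a single pretty-printed JSON array would need a different parsing strategy.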
Upvotes: 1
Reputation: 1858
You could do something like
import org.apache.spark.rdd.RDD
import org.json4s._
import org.json4s.native.JsonMethods._

val jsonData: RDD[JValue] = sc.textFile(path).flatMap(line => parseOpt(line))

Since parseOpt returns an Option[JValue], the flatMap silently drops any lines that fail to parse. You can then do your JSON processing on each JValue, like
jsonData.foreach { json =>
  println(json \ "someKey")
  (json \ "id") match {
    case JInt(x) => ???
    case _ => ???
  }
}
Upvotes: 3