Reputation:
I'm using Spark 1.2.1 with the spark-cassandra-connector:
// join with Cassandra
val rdd = some_array.map(x => SomeClass(x._1, x._2)).joinWithCassandraTable(keyspace, some_table)
println(timer, "Join")
// get only the JSON strings and register a temp table from the resulting RDD
val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons)
jsonSchemaRDD.registerTempTable("this_json")
println(timer, "Map")
The output is:
Timer "Join"- 558 ms
Timer "Map"- 290284 ms
I guess the joinWithCassandraTable() function is lazy; if so, what fires it up?
Upvotes: 3
Views: 247
Reputation: 330123
Actually, the part which triggers the evaluation here is sqlContext.jsonRDD. Since you don't provide the schema, it has to materialize jsons in order to infer it.
joinWithCassandraTable is kind of similar, since it has to connect to Cassandra and fetch the required metadata. See Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra
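A minimal sketch of one way to avoid the inference pass is to pass an explicit schema to jsonRDD, so Spark doesn't have to scan the data up front. The field names below are hypothetical, and the imports assume the Spark 1.2.x Scala API, where the SQL data types are exposed through org.apache.spark.sql (on 1.3+ they live in org.apache.spark.sql.types instead):

// Hypothetical sketch (not from the question): supply an explicit schema so
// jsonRDD doesn't need to materialize the data to infer one.
import org.apache.spark.sql._

// Assumed field names; list whatever fields your JSON documents actually contain.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("payload", StringType, nullable = true)
))

val jsons = rdd.map(_._2.getString("this"))
val jsonSchemaRDD = sqlContext.jsonRDD(jsons, schema)
jsonSchemaRDD.registerTempTable("this_json")

With the schema supplied, jsonRDD should stay lazy like the rest of the pipeline, so the cost you measured at the "Map" step moves to whatever action eventually reads this_json.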
Upvotes: 4