Nipun

Reputation: 4319

spark times out when connecting to cassandra

I have a lot of data in a Cassandra cluster with 2 node machines and 1 seed machine. I have a Spark master and 3 slave nodes. Each machine is an 8 GB dual-core machine. So with around 200,000 rows of data, when I do an `rdd.count` on a DataFrame it takes a lot of time and sometimes even times out.

// Load the Cassandra table as a DataFrame
val tabledf = _sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load

// Register it so it can be queried with SQL, then count
tabledf.registerTempTable("tempdf")
val rdd = _sqlContext.sql("select * from tempdf")
val count = rdd.count.toInt

How can I minimize this count time? I am ready to add more worker machines, but will it make any difference?

Upvotes: 0

Views: 145

Answers (1)

zero323

Reputation: 330383

The simplest solution is to cache the input DataFrame:

_sqlContext.cacheTable("tempdf")

otherwise you have to transfer all the data just to perform a simple count.
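A minimal sketch of the caching approach, assuming the same `tempdf` temp table registered in the question:

```scala
// Cache the registered table so repeated actions (like count) read
// from memory instead of pulling every row from Cassandra each time.
_sqlContext.cacheTable("tempdf")

// The first action still scans Cassandra and populates the cache;
// subsequent actions on the cached data are served from memory.
val count = _sqlContext.sql("select * from tempdf").count
```

If you only ever need the count (not the rows), the Spark Cassandra Connector also exposes a server-side count, `sc.cassandraTable("sams", "events").cassandraCount()`, which avoids transferring the rows to Spark at all.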

Upvotes: 1
