Reputation: 4319
I have a lot of data in a Cassandra cluster with 2 node machines and 1 seed machine. I have a Spark master and 3 slave nodes. Each machine is a dual-core machine with 8 GB of RAM. My data is around 200,000 rows, and when I do rdd.count on a DataFrame it takes a lot of time and sometimes even times out.
val tabledf = _sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "events", "keyspace" -> "sams"))
  .load()

tabledf.registerTempTable("tempdf")
val rdd = _sqlContext.sql("select * from tempdf")
val count = rdd.count.toInt
How can I minimize this count time? I am ready to add more worker machines, but would that make any difference?
Upvotes: 0
Views: 145
Reputation: 330383
The simplest solution is to cache the input DataFrame:
_sqlContext.cacheTable("tempdf")
otherwise you have to transfer all the data just to perform a simple count.
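A minimal sketch of the cached workflow, using the keyspace and table names from the question (note that `cacheTable` is lazy, so the first action still pays the full scan cost; only subsequent actions benefit):

    // Load the table once and register it as a temp table.
    val tabledf = _sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "events", "keyspace" -> "sams"))
      .load()

    tabledf.registerTempTable("tempdf")
    _sqlContext.cacheTable("tempdf")   // lazy: data is cached on the first action

    val rdd = _sqlContext.sql("select * from tempdf")
    val firstCount  = rdd.count   // materializes the cache (scans Cassandra once)
    val secondCount = rdd.count   // served from the in-memory cache

If you only ever need the row count and not the rows themselves, the Spark Cassandra connector's RDD API also offers `cassandraCount()`, which pushes the counting down to Cassandra instead of shipping the rows to Spark.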
Upvotes: 1