Reputation: 11
I am trying to use the Spark Cassandra Connector for analytics on top of data in Cassandra and found two types of implementations. Can anyone throw some light on the difference between the two and their advantages/disadvantages? I am trying to decide which one to use for querying large datasets. Thanks
Option 1 - Using Spark Session SQL
sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load()
Option 2 - Using SCC API
CassandraJavaUtil.javaFunctions(sc)
  .cassandraTable("my_keyspace", "my_table", CassandraJavaUtil.mapColumnTo(Integer.class))
  .select("column1");
Upvotes: 1
Views: 64
Reputation: 87329
The difference is that the first uses the DataFrame API, while the second uses the RDD API. I wouldn't expect much performance difference between them. From a practical point of view, I would recommend using the DataFrame API as much as possible, as it can be better optimized when performing operations on data. There are still operations that are available only in the RDD API, such as deletion of data, but that's also easy to achieve on top of DataFrames…
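For example, here is a minimal sketch of such a deletion (in Scala, assuming a hypothetical integer primary-key column named id): the rows to delete are selected with the DataFrame API, and only the delete itself drops down to the RDD API via deleteFromCassandra, which the connector's implicits add to RDDs:

    import com.datastax.spark.connector._  // brings deleteFromCassandra into scope

    // select the keys of the rows to delete using the DataFrame API
    // (table, keyspace, and the "id" key column are assumptions for this sketch)
    val toDelete = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
      .load()
      .filter("column1 < 100")
      .select("id")

    // drop to the RDD API only for the delete itself;
    // each Tuple1 holds the primary-key value of one row to remove
    toDelete.rdd
      .map(row => Tuple1(row.getInt(0)))
      .deleteFromCassandra("my_keyspace", "my_table", keyColumns = SomeColumns("id"))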
If you are worried about performance, then I recommend using at least connector 2.5.0, which has a lot of optimizations that were previously available only in the commercial version, like direct join, etc.
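Direct join kicks in when the connector's Catalyst extensions are loaded and you join a DataFrame against a Cassandra table on its full partition key: instead of a full table scan, the connector issues point lookups. A minimal sketch (table and column names are assumptions; here column1 is assumed to be the partition key):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("direct-join-sketch")
      // load SCC's Catalyst extensions (available since 2.5.0) so eligible
      // joins are rewritten into direct Cassandra lookups
      .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
      .getOrCreate()

    val cassandraDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
      .load()

    // hypothetical small DataFrame of partition-key values to look up
    val ids = spark.range(1, 100).toDF("column1")

    // with the extensions loaded, this can appear as "Cassandra Direct Join"
    // in the physical plan instead of a full scan of the Cassandra table
    val joined = ids.join(cassandraDf, "column1")
    joined.explain()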
Upvotes: 0