Harsha

Reputation: 11

Different Ways of Using Spark Cassandra Connector

I am trying to use the Spark Cassandra Connector for analytics on top of data in Cassandra and found two types of implementations. Can anyone shed some light on the difference between the two and the advantages/disadvantages of each? I am trying to decide which one to use for querying large datasets. Thanks

Option 1 - Using Spark Session SQL

sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> table, "keyspace" -> keyspace))
      .load()

Option 2 - Using SCC API

CassandraJavaUtil.javaFunctions(sc)
        .cassandraTable("my_keyspace", "my_table", CassandraJavaUtil.mapColumnTo(Integer.class))
        .select("column1");

Upvotes: 1

Views: 64

Answers (1)

Alex Ott

Reputation: 87329

The difference is that the first uses the Dataframe API, while the second uses the RDD API. I wouldn't expect much performance difference between them. From a practical point of view, I would recommend using the Dataframe API as much as possible, as it can be better optimized when performing operations on data. There are still operations that are available only in the RDD API, such as deletion of data, but even that is easy to achieve on top of Dataframes…
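For example, here is a rough sketch of doing a delete from a Dataframe by dropping down to the RDD API for the delete itself. The keyspace, table, and column names are the ones from your question and are only illustrative; it assumes column1 is the partition key of my_table and that the Cassandra connection settings are already in the Spark config:

import com.datastax.spark.connector._

// Select the keys to delete with the Dataframe API...
val toDelete = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()
  .filter("column1 < 100")

// ...then switch to the RDD API, which exposes deleteFromCassandra
toDelete.select("column1").rdd
  .map(row => Tuple1(row.getInt(0)))
  .deleteFromCassandra("my_keyspace", "my_table",
    keyColumns = SomeColumns("column1"))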

If you are worried about performance, then I recommend using at least connector 2.5.0, which has a lot of optimizations that were previously available only in the commercial version, like direct join, etc.
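Direct join kicks in when you enable the connector's Catalyst extensions and join a Dataframe against a Cassandra table on its full partition key. A minimal sketch (host and column names are illustrative):

import org.apache.spark.sql.SparkSession

// CassandraSparkExtensions (connector >= 2.5.0) lets the optimizer replace
// a full table scan + shuffle join with per-partition-key lookups
val spark = SparkSession.builder()
  .config("spark.sql.extensions",
    "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

val cassandraTable = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "my_table", "keyspace" -> "my_keyspace"))
  .load()

// Joining a small Dataframe on the partition key; the plan should show
// a direct join against Cassandra instead of a full scan
val ids = spark.range(1, 100).withColumnRenamed("id", "column1")
ids.join(cassandraTable, Seq("column1")).explain()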

Upvotes: 0
