Sandeep Shetty

Reputation: 177

Efficient Filtering on a huge data frame in Spark

I have a Cassandra table with 500 million rows. I would like to filter it with Spark on a field that is the partition key in Cassandra.

Can you suggest the most efficient approach to filter in Spark/Spark SQL based on a list of keys, which is itself quite large?

Basically, I need only those rows from the Cassandra table whose keys are present in the list of keys.

We are using DSE and its features. The approach I am currently using takes a lot of time, roughly an hour.

Upvotes: 0

Views: 899

Answers (1)

Alex Karpov

Reputation: 564

Have you checked repartitionByCassandraReplica and joinWithCassandraTable?

https://github.com/datastax/spark-cassandra-connector/blob/75719dfe0e175b3e0bb1c06127ad4e6930c73ece/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12

joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
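As a minimal sketch of that approach (the keyspace "ks", table "events", partition key column "id", and contact point are all hypothetical placeholders), you turn the list of keys into an RDD and join it against the table:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// The case class field name must match the partition key column name in Cassandra.
case class EventKey(id: String)

val conf = new SparkConf()
  .setAppName("filter-by-partition-keys")
  .set("spark.cassandra.connection.host", "10.0.0.1") // your Cassandra/DSE contact point

val sc = new SparkContext(conf)

// In practice this would be your large list of partition keys.
val keyList: Seq[String] = Seq("key1", "key2", "key3")

// Parallelize the keys and join: the connector issues one targeted query
// per partition key instead of scanning all 500 million rows.
val keyRdd = sc.parallelize(keyList).map(EventKey(_))
val matched = keyRdd.joinWithCassandraTable("ks", "events")
```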

The method repartitionByCassandraReplica can be used to relocate data in an RDD to match the replication strategy of a given table and keyspace. The method will look for partition key information in the given RDD and then use those values to determine which nodes in the Cluster would be responsible for that data.
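Continuing the same hypothetical sketch, repartitioning the key RDD by replica before the join keeps each lookup local to a node that actually owns the data, which is usually where the big speedup comes from:

```scala
// Move each key to a Spark partition on a node that replicates it,
// then join; every query is served by a local replica.
val localKeys = keyRdd.repartitionByCassandraReplica("ks", "events", partitionsPerHost = 10)
val matchedLocally = localKeys.joinWithCassandraTable("ks", "events")
```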

Upvotes: 1
