ktzan

Reputation: 522

Spark-Cassandra Connector -- Spark and Cassandra partitions -- data locality

I have a 16-node cluster where every node has both Spark and Cassandra installed, and I am using the Spark-Cassandra Connector 3.0.0. The Spark cluster has 16 executors with 2 cores each, for a total of 32 cores. I have ~2.2 billion rows (which are also the primary keys) in the Cassandra database, with 4,827 unique partition keys in total. I am using DataFrames/Datasets, the code is in Java, and I set .config("spark.sql.shuffle.partitions", 96) in the Spark configuration. In the code I select all 2.2 billion rows and join on the partition key.
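
For reference, this is roughly how my session is set up (the app name, master URL, and contact point below are placeholders, not my actual values):

    import org.apache.spark.sql.SparkSession;

    SparkSession sp = SparkSession.builder()
            .appName("experiment-join")                              // placeholder
            .master("spark://master:7077")                           // placeholder
            .config("spark.sql.shuffle.partitions", 96)              // as mentioned above
            .config("spark.cassandra.connection.host", "127.0.0.1")  // placeholder contact point
            .getOrCreate();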

  1. In the Spark UI I see a broadcast with 32 tasks, which means Spark's join is used and the 32 tasks correspond to the available cores. Does this mean that 32 Spark partitions will be created initially, and that these 2.2 billion rows will reside in them?

  2. Should I definitely use .repartitionByCassandraReplica before the join? I am not convinced it is needed, but the truth is that if I try to use it I get a "cannot find symbol" compile error. Also, DirectJoin is only activated when I have fewer than 2,600 partition keys.

My aim is to take advantage of data locality and avoid data transfer.

EDIT 1

For question 1, I went through the link you sent and, as you say, the size is based on whatever is in the system.size_estimates table.

For question 2, my Cassandra table is:

 CREATE TABLE experiment (
     experimentid varchar,
     description text,
     rt float,
     intensity float,
     mz float,
     identifier text,
     chemical_formula text,
     filename text,
     PRIMARY KEY ((experimentid), description, rt, intensity, mz, identifier, chemical_formula, filename)
 );

and the spark code is:

Dataset<Row> dfexplist = sp.createDataset(experimentlist, Encoders.STRING()).toDF("experimentid");

Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
                .options(new HashMap<String, String>() {
                    {
                        put("keyspace", "mdb");
                        put("table", "experiment");
                    }
                })
                .load()
                .select(col("experimentid"), col("description"), col("intensity"))
                .join(dfexplist, "experimentid")
                .repartition(col("experimentid"));

Does this achieve data locality? Is there a shuffle happening during or before the join? At the end I repartition by the partition key so as to avoid any future shuffles in later calculations.

Upvotes: 1

Views: 692

Answers (1)

Erick Ramirez

Reputation: 16313

For question 1, the number of Spark partitions doesn't have a direct correlation with the number of cores or tasks. The connector calculates the Spark partitions from the estimated table size (taken from the Cassandra system.size_estimates table) and the input split size. The formula is:

spark_partitions = estimated_table_size / input.split.size_in_mb
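
For illustration only (these numbers are not from your cluster): if input.split.size_in_mb were set to 64 and the estimated table size were 6,400 MB, the connector would aim for 6400 / 64 = 100 Spark partitions.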

If you'd like to know the details, I've explained it in https://community.datastax.com/questions/11500/.

For question 2, it is definitely a good idea to use the repartitionByCassandraReplica() method to take advantage of data locality and minimise shuffling. However, I'm not sure why you're getting that error. If you update your original question with minimal code + data that replicates the issue, I'd be happy to review it and update my answer accordingly. Cheers!
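
One thing worth checking: in the connector's Java API, repartitionByCassandraReplica() is exposed on RDDs through CassandraJavaUtil rather than on Datasets, so calling it on a Dataset<Row> won't compile. As a rough sketch only (the ExperimentKey bean and the partitions-per-host value are placeholders, not from your code), the call would look something like this:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.io.Serializable;

    // Hypothetical bean that carries the partition key.
    public class ExperimentKey implements Serializable {
        private String experimentid;
        public ExperimentKey() { }
        public ExperimentKey(String experimentid) { this.experimentid = experimentid; }
        public String getExperimentid() { return experimentid; }
        public void setExperimentid(String experimentid) { this.experimentid = experimentid; }
    }

    JavaSparkContext jsc = new JavaSparkContext(sp.sparkContext());

    // Build an RDD of partition keys, then group it by Cassandra replica.
    JavaRDD<ExperimentKey> keys = jsc.parallelize(experimentlist).map(ExperimentKey::new);
    JavaRDD<ExperimentKey> byReplica = javaFunctions(keys).repartitionByCassandraReplica(
            "mdb",                        // keyspace
            "experiment",                 // table
            10,                           // partitions per host -- tune for your cluster
            someColumns("experimentid"),  // partition key column(s)
            mapToRow(ExperimentKey.class));

From there, joinWithCassandraTable() on the repartitioned RDD is the usual way to keep the join replica-local.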

Upvotes: 1
