Reputation: 31
I have a DSE cluster where every node in the cluster has both spark and Cassandra running.
When I load data from Cassandra to spark Rdd and do some action on the rdd, i know the data would be distributed into multi nodes. In my case, I want to write these rdds from every node to its local Cassandra dB table directly, is there anyway to do it.
If I do normal rdd collect, all data from spark nodes would be merged and go back to node with driver. I do not want this to happen as the data flow from nodes back to driver node may take Long time, I want the data been save to local node directly to avoid the data movement across the spark nodes.
Upvotes: 1
Views: 199
Reputation: 1
A word of warning: i only use Cassandra and Spark as separate open source projects, i do not have expertise with DSE.
I am afraid the data need to hit the network to replicate, even when every spark node talks to its local cassandra node.
Without replication and running a Spark job to make sure all data is hashed and preshuffled to the corresponding Cassandra node, it should be possible to use 127.0.0.1:9042 and avoid the network.
Upvotes: 0
Reputation: 87119
When Spark executor is reading data from Cassandra it's sending request to the "best node" that is selected based on the different factors:
When it's comes to the writing, and you have multiple executors, then each executor is opening multiple connections to each node, and writing the data using the token-aware routing, meaning that data is sent directly to one of the replicas. Also, Spark is trying to batch multiple rows that are belonging to the same partition into an UNLOGGED BATCH as it's more performant. Even if the Spark partition is colocated with the Cassandra partition, writing could involve an additional network overhead as SCC is writing using the consistency level TWO.
You can get colocated data if you re-partitioned the data to match Cassandra's partitioning), but such re-partition may induce Spark shuffle that could be much more heavyweight compared to the writing data from executor to another node.
P.S. You can find a lot of additional information about Spark Cassandra Connector in the Russell Spitzer's blog.
Upvotes: 0