Reputation: 2104
I need to transfer data from one cluster to another.
The table structure is the same on both clusters. What I need to do is select the data for clustering key A1 from table A on cluster 1 and copy it to table B on cluster 2, under the same clustering key.
There is a high number of entries for that clustering key, I estimate more than 50,000,000.
I do not want to, and cannot, copy the whole table, because the data in this table differs between the clusters.
One option would be to write a script that loops through the data and writes it to cluster 2. That would work, but it sounds inefficient, and it has to address problems such as what to do if the script crashes in the middle of the operation.
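For reference, a minimal sketch of what I mean (assuming the DataStax Java driver 3.x; the contact points and the (id, ck, value) schema are placeholders): it pages through the partition and checkpoints the paging state after every page, so a crash can resume roughly where it left off. Re-running a page is harmless, since Cassandra inserts are idempotent.

import java.nio.file.{Files, Paths}
import com.datastax.driver.core.{Cluster, PagingState, SimpleStatement}

object PartitionCopy extends App {
  // Contact points, keyspace, and the (id, ck, value) schema are placeholders
  val source = Cluster.builder().addContactPoint("10.0.0.1").build().connect("ks")
  val target = Cluster.builder().addContactPoint("10.0.1.1").build().connect("ks")
  val insert = target.prepare("INSERT INTO b (id, ck, value) VALUES (?, ?, ?)")

  val checkpoint = Paths.get("copy.checkpoint")
  val select = new SimpleStatement("SELECT id, ck, value FROM a WHERE id = 'A1'")
  select.setFetchSize(5000)
  if (Files.exists(checkpoint)) // resume from the last completed page
    select.setPagingState(PagingState.fromString(
      new String(Files.readAllBytes(checkpoint))))

  val rs = source.execute(select)
  while (!rs.isExhausted) {
    // Drain only the rows of the current page, then checkpoint its cursor
    for (_ <- 1 to rs.getAvailableWithoutFetching) {
      val row = rs.one()
      target.execute(insert.bind(
        row.getString("id"), row.getString("ck"), row.getString("value")))
    }
    val state = rs.getExecutionInfo.getPagingState
    if (state != null) Files.write(checkpoint, state.toString.getBytes)
  }
  source.getCluster.close()
  target.getCluster.close()
}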
What is the best approach for that?
Upvotes: 2
Views: 743
Reputation: 1538
For bulk data copy, you should think about sstableloader. It is a good tool for taking data from one cluster and loading it into another. Please refer to the documentation below: https://cassandra.apache.org/doc/latest/tools/sstable/sstableloader.html?highlight=sstableloader
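For instance, an invocation could look like the following (the target addresses and the SSTable directory are placeholders; sstableloader expects a directory laid out as keyspace/table). Note that it streams whole SSTables, so it suits copying a full table rather than a single partition.

# Stream the SSTables of table A into cluster 2 (addresses and path are examples)
sstableloader -d 10.0.1.1,10.0.1.2 /var/lib/cassandra/data/ks/a-<table-id>/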
Upvotes: 1
Reputation: 507
Based on what I have experienced, Spark provides the best mechanism for such activities. You can do it with both the RDD and DataFrame APIs. Below is a code snippet:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
import sqlContext.implicits._ // for the $"..." column syntax

// Point each named cluster at its own contact host
sqlContext.setConf("ClusterOne/spark.cassandra.connection.host", "127.0.0.1")
sqlContext.setConf("ClusterTwo/spark.cassandra.connection.host", "127.0.0.2")

// Read the partition from ClusterOne
val dfFromClusterOne = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster"  -> "ClusterOne",
    "keyspace" -> "ks",
    "table"    -> "A"
  ))
  .load
  .filter($"id" === "A1") // only the partition being copied

// Write it to ClusterTwo
dfFromClusterOne
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster"  -> "ClusterTwo",
    "keyspace" -> "ks",
    "table"    -> "B"
  ))
  .mode("append") // append into the existing table B
  .save
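For this snippet to run, the Spark Cassandra Connector has to be on the classpath. With spark-shell that could look like the following (the connector version and Scala build are assumptions, pick the ones matching your Spark version):

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1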
Upvotes: 2