Reputation: 2104
I need to transfer data from one cluster to another.
The table structure is the same on both clusters. What I need to do is select the data for clustering key A1 from table A on cluster 1 and copy it to table B on cluster 2, under the same clustering key.
There is a high number of entries for that clustering key, I estimate more than 50,000,000.
I do not want to, and cannot, copy the whole table, because the data in this table differs between the clusters.
One option would be to write a script that loops through the data and writes it to cluster 2. That would work, but it sounds inefficient, and it has to address problems such as what to do if the script crashes in the middle of the operation.
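For reference, a minimal sketch of what I mean (assuming the DataStax Java driver 3.x; the contact points and the (id, ck, value) schema are placeholders): it pages through the partition and checkpoints the paging state after every page, so a crash can resume roughly where it left off. Re-running a page is harmless, since Cassandra inserts are idempotent.

import java.nio.file.{Files, Paths}
import com.datastax.driver.core.{Cluster, PagingState, SimpleStatement}

object PartitionCopy extends App {
  // Contact points, keyspace, and the (id, ck, value) schema are placeholders
  val source = Cluster.builder().addContactPoint("10.0.0.1").build().connect("ks")
  val target = Cluster.builder().addContactPoint("10.0.1.1").build().connect("ks")
  val insert = target.prepare("INSERT INTO b (id, ck, value) VALUES (?, ?, ?)")

  val checkpoint = Paths.get("copy.checkpoint")
  val select = new SimpleStatement("SELECT id, ck, value FROM a WHERE id = 'A1'")
  select.setFetchSize(5000)
  if (Files.exists(checkpoint)) // resume from the last completed page
    select.setPagingState(PagingState.fromString(
      new String(Files.readAllBytes(checkpoint))))

  val rs = source.execute(select)
  while (!rs.isExhausted) {
    // Drain only the rows of the current page, then checkpoint its cursor
    for (_ <- 1 to rs.getAvailableWithoutFetching) {
      val row = rs.one()
      target.execute(insert.bind(
        row.getString("id"), row.getString("ck"), row.getString("value")))
    }
    val state = rs.getExecutionInfo.getPagingState
    if (state != null) Files.write(checkpoint, state.toString.getBytes)
  }
  source.getCluster.close()
  target.getCluster.close()
}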
What is the best approach for that?
Upvotes: 2
Views: 743
Reputation: 1538
For bulk data copy, you should think about sstableloader. It is a good tool for taking data from one cluster and loading it into another. Please refer to the documentation below: https://cassandra.apache.org/doc/latest/tools/sstable/sstableloader.html?highlight=sstableloader
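For instance, an invocation could look like the following (the target addresses and the SSTable directory are placeholders; sstableloader expects a directory laid out as keyspace/table). Note that it streams whole SSTables, so it suits copying a full table rather than a single partition.

# Stream the SSTables of table A into cluster 2 (addresses and path are examples)
sstableloader -d 10.0.1.1,10.0.1.2 /var/lib/cassandra/data/ks/a-<table-id>/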
Upvotes: 1
Reputation: 507
Based on what I have experienced, Spark provides the best mechanism for such activities. You can do it with both the RDD and DataFrame APIs. Below is a code snippet:
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql._
import org.apache.spark.SparkContext
import sqlContext.implicits._ // for the $"..." column syntax

// Point each named cluster at its own contact host
sqlContext.setConf("ClusterOne/spark.cassandra.connection.host", "127.0.0.1")
sqlContext.setConf("ClusterTwo/spark.cassandra.connection.host", "127.0.0.2")

// Read the partition from ClusterOne
val dfFromClusterOne = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster"  -> "ClusterOne",
    "keyspace" -> "ks",
    "table"    -> "A"
  ))
  .load
  .filter($"id" === "A1") // only the partition being copied

// Write it to ClusterTwo
dfFromClusterOne
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "cluster"  -> "ClusterTwo",
    "keyspace" -> "ks",
    "table"    -> "B"
  ))
  .mode("append") // append into the existing table B
  .save
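For this snippet to run, the Spark Cassandra Connector has to be on the classpath. With spark-shell that could look like the following (the connector version and Scala build are assumptions, pick the ones matching your Spark version):

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1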
Upvotes: 2