dks551

Reputation: 1113

Migrate a huge Cassandra table to another cluster using Spark

I want to migrate our old Cassandra cluster to a new one.

Requirements:-

I have a Cassandra cluster of 10 nodes, and the table I want to migrate is ~100 GB. I am using Spark to migrate the data. My Spark cluster has 10 nodes, each with around 16 GB of memory. The table contains some junk data that I don't want to migrate to the new table; e.g., let's say I don't want to transfer the rows that have cid = 1234. What is the best way to migrate this with a Spark job? I can't put a where filter on the cassandraRdd directly, because cid is not the only column in the partition key.

Cassandra Table:-

test_table (
    cid text,
    uid text,
    key text,
    value map<text, timestamp>,
    PRIMARY KEY ((cid, uid), key)
) 

Sample Data:-

cid   | uid                | key       | value
------+--------------------+-----------+-------------------------------------------------------------------------
 1234 | 899800070709709707 | testkey1  | {'8888': '2017-10-22 03:26:09+0000'}
 6543 | 097079707970709770 | testkey2  | {'9999': '2017-10-20 11:08:45+0000', '1111': '2017-10-20 15:31:46+0000'}

I am thinking of something like the code below, but I suspect it is not the most efficient approach.

val filteredRdd = rdd.filter { row => row.getString("cid") != "1234" }
filteredRdd.saveToCassandra(KEYSPACE_NAME,NEW_TABLE_NAME) 

What would be the best approach here?

Upvotes: 0

Views: 822

Answers (1)

RussS

Reputation: 16576

That method is pretty good. You may want to write it with DataFrames to take advantage of the row encoding, but this will probably bring only a slight benefit. The main bottleneck in this operation will be reading from and writing to Cassandra.

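DF Example
// Needs `import spark.implicits._` for the 'cid column syntax.
spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", ks)
  .option("table", table)
  .load
  // =!= is the Column "not equal" operator; !== is deprecated since Spark 2.0
  .filter('cid =!= "1234")
  .write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", ks2)
  .option("table", table2)
  .save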
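
Since the question is about moving data to a different cluster, here is a rough sketch of the RDD version with a separate connector for each cluster, following the multiple-cluster pattern from the Spark Cassandra Connector documentation. The host names, keyspace, table names, and the shouldMigrate helper are all placeholders I made up for illustration, so adjust them to your setup and treat this as a starting point rather than a tested job.

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MigrateTestTable {
  // Hypothetical values: replace with your own hosts, keyspace, and tables.
  val OldHost  = "old-cluster-host"
  val NewHost  = "new-cluster-host"
  val Keyspace = "my_keyspace"
  val OldTable = "test_table"
  val NewTable = "test_table"
  val JunkCids = Set("1234")

  // Pure predicate so the filtering rule can be checked without a cluster.
  def shouldMigrate(cid: String): Boolean = !JunkCids.contains(cid)

  // Reads the source table using the connector passed in as the implicit one.
  def readSource(sc: SparkContext, connector: CassandraConnector): RDD[CassandraRow] = {
    implicit val c: CassandraConnector = connector
    sc.cassandraTable[CassandraRow](Keyspace, OldTable)
  }

  // Writes the rows to the target table using the connector passed in.
  def writeTarget(rows: RDD[CassandraRow], connector: CassandraConnector): Unit = {
    implicit val c: CassandraConnector = connector
    rows.saveToCassandra(Keyspace, NewTable)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("migrate-test-table"))

    // sc.getConf returns a copy, so each connector gets its own host setting.
    val source = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", OldHost))
    val target = CassandraConnector(sc.getConf.set("spark.cassandra.connection.host", NewHost))

    // The filter runs in Spark after a full scan of the source table.
    val kept = readSource(sc, source).filter(row => shouldMigrate(row.getString("cid")))
    writeTarget(kept, target)

    sc.stop()
  }
}
```

Keeping the junk ids in a Set makes it easy to exclude more than one cid later without touching the rest of the job.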

Upvotes: 1
