Jim Wartnick

Reputation: 2216

Cassandra - copying sstable snapshot from one cluster to another

I know there are several similar questions out there, but I'm still confused about this. As there is a need for this mechanism (copying data from one cluster to another), I'm looking for a little clarification.

Let's assume a very simple scenario. I want to copy a table from one cassandra cluster (C1) to another (C2). The table I'm copying is called "item".

Let's assume both clusters have the same node count (4 nodes each for source and target). I'm not sure whether that matters.

I'm attempting to use snapshots and sstableloader to do the trick. I was able to create a snapshot and copy the snapshot files from C1:N1 (cluster 1, node 1: .../myspace/item-xxxxxx/snapshot/######) to the target table directory on C2:N1 (cluster 2, node 1: .../myspace/item-xxxxxx). I used sstableloader to load the data and ran nodetool repair. Perfect. The only problem is that, because the loaded snapshot came from only one of the source nodes, I only "restored" part of the data (about 485 of the 1k rows). So I'm thinking I'll copy the snapshot from C1:N2 to C2:N1 and load it the same way. The problem is that all of the table files already exist on C2:N1. If I copy the snapshot files from C1:N2 into the table directory on C2:N1, I'll blow away the files that are already there. I didn't check all 4 target nodes, but I did check node 2 of the target, and the item table directory already existed there too, with data files. I'm guessing all of the nodes on the target have data files, so I'm stuck on how to sstableload the snapshot files from the other 3 source nodes.

So, long story short (if that's possible): how am I supposed to load multiple source snapshots (one from each host in the source cluster) into a target cluster? And to complicate matters, does it matter if the source and target clusters have a different number of nodes? (I would think that having fewer nodes on the target could be the bigger problem.)

What is really needed here, in my opinion, is a way to run sstableloader on the SOURCE cluster and have it stream the data to a target cluster. That would make life a lot easier, I would think.

Thanks in advance.

-Jim

Upvotes: 0

Views: 1512

Answers (1)

Chris Lohfink

Reputation: 16430

There are two options for bulk loading, and it seems you may have semi-merged them. You are mostly describing the "copy the sstables" mechanism, which is pretty manual and may not be worth the trouble unless restore performance is the top priority. Using sstableloader is different, though, and doesn't require that.

The sstableloader tool connects to a node, discovers all the nodes in that node's cluster, and uses that connection to build its metadata. It then splits the sstables you select and streams them to the target cluster according to the appropriate token ranges (so you won't need the repair). You can run sstableloader from the source cluster's nodes and point it at the destination cluster; you don't need to copy the sstables over yourself (although if the clusters are in different DCs, copying them first may be a bit faster).
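As a rough sketch of what "run it from the source and point it at the destination" could look like, here is a small Python helper that only builds the sstableloader command line. The host names and the staging path are hypothetical placeholders, not values from the question. It assumes the `-d` option (initial contact hosts in the target cluster) and a directory whose last two path components are the keyspace and table names, which is how the tool has traditionally inferred where the data belongs; check the documentation for your Cassandra version before relying on either.

```python
import shlex


def build_sstableloader_cmd(target_hosts, sstable_dir):
    """Return the argv list for streaming one table's SSTables to a target cluster.

    target_hosts: contact points in the *target* cluster (C2), passed via -d.
    sstable_dir:  directory holding the SSTable files. Assumption: the path
                  ends in <keyspace>/<table>, e.g. .../myspace/item.
    """
    if not target_hosts:
        raise ValueError("at least one target host is required")
    return ["sstableloader", "-d", ",".join(target_hosts), str(sstable_dir)]


# Hypothetical values: replace with your own target hosts and snapshot location.
cmd = build_sstableloader_cmd(
    ["c2-node1.example.com", "c2-node2.example.com"],
    "/var/tmp/load/myspace/item",
)

# Print the command for review; to actually stream, run it on the source node,
# for example with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Running the same command on each source node, against that node's own snapshot files, would cover all of the source data without overwriting anything on the target, since the data is streamed rather than copied into the table directory.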

If you have OpsCenter, it can automate these steps for you through a GUI: https://docs.datastax.com/en/opscenter/5.2/opsc/online_help/services/opscBackupCloneCluster.html

Upvotes: 2
