user3398321

Reputation: 73

Migrate data from one Cassandra cluster to another

Hi, I want to migrate data from my Cassandra cluster to another Cassandra cluster. I have seen many posts suggesting various methods, but they are not very clear or have limitations. The methods I have seen are as follows:

  1. Using the COPY TO and COPY FROM commands: This is easy to use, but it seems to have a limitation on the number of rows it can copy (see the example after this list).
  2. Using SSTABLELOADER: Most articles suggest using sstableloader to move data from one cluster to another, but I did not get clear details on the steps to create SSTables (is it possible to use some nodetool command, or does a Java application have to be written? Are these created per node or per cluster? Once created, how do I move them from one cluster to another?) or on creating snapshots, which is tedious in that they are created per node and have to be transferred to the other cluster. I have also seen answers suggesting parallel ssh to create a snapshot of the whole cluster, but did not get any example for this either (a sketch of what I imagine follows this list).
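
For reference, the COPY commands I am referring to look something like this (host, keyspace, and table names are placeholders):

    # export a table to CSV from the source cluster
    cqlsh source_host -e "COPY mykeyspace.mytable TO '/tmp/mytable.csv' WITH HEADER = true"

    # import the CSV into the destination cluster
    cqlsh target_host -e "COPY mykeyspace.mytable FROM '/tmp/mytable.csv' WITH HEADER = true"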
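
And for the parallel ssh approach, I imagine something like the following, though I am not sure this is right (hosts.txt listing all source nodes and the snapshot tag are my own assumptions):

    # run nodetool snapshot on every node listed in hosts.txt at once
    pssh -h hosts.txt -i "nodetool snapshot -t migration mykeyspace"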

Any help would be appreciated.

Upvotes: 1

Views: 7121

Answers (2)

J.B. Langston

Reputation: 410

If you are able to set up the target cluster with exactly the same topology as the source cluster, the fastest way may be to simply copy the data files from the source to the target cluster, since this avoids the overhead of processing the data to redistribute it to different nodes. In order for this to work, your target cluster must have the same number of nodes, the same rack configuration, and even the same tokens assigned to each node.

To get the tokens for a source node, you can run nodetool info -T | grep Token | awk '{print $3}' | tr '\n' ',' | sed 's/,$/\n/'. You can then copy the comma-separated list of tokens from the output and paste it into the initial_token setting in your target node's cassandra.yaml. Once you start the node, check its tokens using nodetool info -T to verify that it has the correct tokens. Repeat these steps for each node in the target cluster.
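
For example, the end result in the target node's cassandra.yaml would look something like this (the tokens below are made up; paste the exact values printed for the matching source node):

    # cassandra.yaml on the corresponding target node
    initial_token: -9087256976718263248,-3074457345618258603,3074457345618258602
    num_tokens: 3   # must match the number of tokens listed above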

Once you have all of your target nodes set up with exactly the same tokens, DC, and racks as the source cluster, take a snapshot of the desired tables on the source cluster and copy the snapshots to the corresponding nodes' data directories on the target cluster. DataStax OpsCenter can automate the process of backing up and restoring data and will use direct copying for clusters with the same topology. It appears that medusa can do this too, though I have not used that tool before.
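
As a rough sketch of the snapshot-and-copy step (keyspace/table names, paths, and the table directory UUID are illustrative; your data directory layout may differ):

    # on each source node: snapshot the keyspace
    nodetool snapshot -t migration mykeyspace

    # copy each table's snapshot into the matching table directory on the
    # corresponding target node (the one with the same tokens)
    rsync -av /var/lib/cassandra/data/mykeyspace/mytable-<uuid>/snapshots/migration/ \
      target-node:/var/lib/cassandra/data/mykeyspace/mytable-<uuid>/

    # on the target node: pick up the copied SSTables without a restart
    nodetool refresh mykeyspace mytable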

Upvotes: 1

Alex Ott

Reputation: 87069

This is really a question that requires more information to provide a definitive answer. For example, do you need to keep metadata such as WriteTime and TTLs on the data, or not? Does the destination cluster have the same topology (number of nodes, token allocation, etc.)?

Basically, you have the following options:

  1. Use sstableloader - a tool shipped with Cassandra itself that is used for restoring from backups, etc. To perform a data migration you need to create a snapshot of the table to load (using nodetool snapshot) and run sstableloader on that snapshot (see the sketch after this list). The main advantage is that it keeps the metadata (TTL/WriteTime). The main disadvantage is that you need to take and load the snapshot on every node of the source cluster, and you need exactly the same schema and partitioner in the destination cluster;
  2. You can use a backup/restore tool, such as medusa, which basically automates taking the snapshot and loading the data;
  3. You can use Apache Spark to copy data from one table to another using the Spark Cassandra Connector, for example as described in this blog post - just read the table from one cluster and write it to a table in another cluster. This works fine for simple copy operations and gives you the ability to transform the data if necessary, but it becomes more complex if you need to preserve metadata. Plus, it needs Spark;
  4. Use the DataStax Bulk Loader (DSBulk) to export data to files on disk and load them into another cluster (a sketch follows the list as well). In contrast to cqlsh's COPY command, it is heavily optimized for loading/unloading large amounts of data. It works with Cassandra 2.1+ and most DSE versions (except ancient ones).
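
A rough sketch of the sstableloader flow from option 1 (keyspace/table names, paths, and target addresses are placeholders; sstableloader expects the files to sit under a <keyspace>/<table> directory structure):

    # on each source node: snapshot the keyspace
    nodetool snapshot -t migration mykeyspace

    # stage the snapshot files under a <keyspace>/<table> layout
    mkdir -p /tmp/load/mykeyspace/mytable
    cp /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/migration/* \
      /tmp/load/mykeyspace/mytable/

    # stream them to the destination cluster
    sstableloader -d target_node1,target_node2 /tmp/load/mykeyspace/mytable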
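
And a minimal DSBulk round trip for option 4 (host names and the export directory are placeholders):

    # unload from the source cluster to CSV files on disk
    dsbulk unload -h source_host -k mykeyspace -t mytable -url ./export

    # load the same files into the destination cluster
    dsbulk load -h target_host -k mykeyspace -t mytable -url ./export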

Upvotes: 6
