Reputation: 521
I've been using the sstableloader utility to bulk-transfer data between two distinct Cassandra clusters, and I was wondering if anyone else has run into the same issues. The source cluster has data; the destination does not.
I've read the DataStax page on the utility, but I still have some unanswered questions about how it works.
I am using the utility on the source cluster's live nodes and the commands follow this format:
sstableloader -d target.host.ip -v -f /etc/cassandra/cassandra.yaml /cassandra/data/keyspace1/table1-uuid
Both clusters are set up with 256 vnodes and have 6 nodes each. The keyspace uses RF = 3 in both environments, and the tables are all structured the same.
So my questions are as follows:
1) The utility pulls source cluster information from the cassandra.yaml you specify, but you still have to specify an absolute path to the SSTables. So does running sstableloader from a single node give me the entire table at the destination once complete? It seems difficult to verify, since the token ranges are different on the destination cluster.
2) The DataStax documentation says:
To get the best throughput from SSTable loading, you can use multiple instances of sstableloader to stream across multiple machines. No hard limit exists on the number of SSTables that sstableloader can run at the same time, so you can add additional loaders until you see no further improvement.
Does this mean that, for a single table, I would start multiple instances of sstableloader across multiple source machines? Or does it just mean that I can run sstableloader for multiple different tables on multiple machines at the same time? I'm trying to understand whether the throughput gain they mention applies to a single table or just to multiple tables in flight.
3) What syntax modification is needed to run from snapshots instead? I took a snapshot and tested by running the same command pointed further down into the table's snapshot directory, but it didn't parse correctly; it said "snapshot" is an invalid keyspace.
Anyway, thanks. I hope I was clear enough with my questions.
Upvotes: 2
Views: 1764
Reputation: 1578
1) If your RF = 3 and your cluster had 3 nodes, then each node would hold ALL the data. Even then, there could be minor differences due to updates that have not yet propagated to all replicas. If the number of nodes in your cluster is bigger than the RF (in your case 6 nodes, RF = 3), then every node holds a different combination of token ranges covering 50% of the data (RF/num_nodes = 3/6).
Anyhow, you need to run sstableloader on all keyspaces + tables from each of your source nodes to the new cluster's destination nodes (assuming a 1:1 ratio).
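For example, a minimal sketch of one source node's run over every table in a keyspace (the data path, keyspace name, and target IP are taken from your example and are placeholders, not literal values):

# run on each source node; loops over every table directory in the keyspace
# paths, keyspace name, and target IP are placeholders; adjust for your setup
for table_dir in /cassandra/data/keyspace1/*/; do
    sstableloader -d target.host.ip -f /etc/cassandra/cassandra.yaml "$table_dir"
done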
2) Yes, you can run multiple sstableloaders on the same table/keyspace from each of the source nodes to its matching destination node in parallel. But it also means you can do it for different keyspaces/tables, as long as you eventually perform it from all source nodes for all keyspaces/tables to their matching destination nodes (assuming a 1:1 ratio).
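As a rough illustration (the hostnames are placeholders, and this assumes passwordless ssh from a control machine), kicking off one loader per source node in parallel for the same table could look like this:

# start one loader per source node in parallel, then wait for all of them
for src in src-node-1 src-node-2 src-node-3 src-node-4 src-node-5 src-node-6; do
    ssh "$src" "sstableloader -d target.host.ip -f /etc/cassandra/cassandra.yaml /cassandra/data/keyspace1/table1-uuid" &
done
wait

If the parallel streams saturate your network, sstableloader's -t/--throttle option (in Mbits/s) can cap each instance.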
3) Restoring from a backup (snapshot) is a different procedure which does not involve sstableloader. You can read more about it here.
There is also an option to use nodetool refresh to load SSTables from all source nodes to the new destination nodes, but it should only be used when num_nodes = RF. Read more about it here.
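A minimal sketch of that flow on one destination node (the names and paths are placeholders, and this assumes the table's schema already exists on the destination):

# copy the SSTables from the matching source node into the table's live data directory
cp /path/to/copied/sstables/* /cassandra/data/keyspace1/table1-uuid/
# pick up the newly placed SSTables without restarting the node
nodetool refresh keyspace1 table1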
Upvotes: 10