tbsalling

Reputation: 4545

Which node repair strategy to apply during bulk loads?

Currently I am bulk loading 30TB of data into a ten-node cluster running Cassandra 2.1.2. I bulk load from flat files in stages of ~5 TB using 'sstableloader'.

I am aware that 'nodetool repair' must be run periodically on each Cassandra node. But currently (at 10 TB loaded) repairing a single node takes 48+ hours, and there is pressure to complete the bulk load. So which repair strategy is better:

  1. To nodetool repair each node in turn between each 5 TB stage?
  2. To bulk load all 30TB and then start to repair?
  3. To repair nodes simultaneously with sstableloader running?

Ideally I would have a tool to measure the need for repairs, i.e. a measure of the entropy. Does such a thing exist?

Upvotes: 2

Views: 270

Answers (1)

Stefan Podkowinski

Reputation: 5249

There's no real need to run repair between each import run if you're about to bootstrap your cluster with data. The sstableloader tool should take care that all replicas are created correctly in the cluster. You can do a full repair after all imports have finished. However, keep in mind that repair can only make sure the data is replicated across the cluster in a consistent way. If the loader did not save parts of the data at all, for whatever reason, the repair would not be able to notice. So at some point you have to either trust sstableloader or write your own script to validate the results.
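As a rough idea of what such a validation script could look like, here is a minimal sketch in Python. It is only an illustration under assumptions: the names `sample_keys`, `find_missing` and `exists` are hypothetical, and `exists` stands in for whatever lookup you use against the cluster (for example a SELECT by primary key). It samples keys from the source flat files and reports those the cluster cannot find.

```python
import random
from typing import Callable, Iterable, List


def sample_keys(keys: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Reservoir-sample up to k keys from a stream of source keys.

    The stream is read once and never held in memory as a whole,
    so it can be fed from very large flat files.
    """
    rng = random.Random(seed)
    reservoir: List[str] = []
    for i, key in enumerate(keys):
        if i < k:
            # Fill the reservoir with the first k keys.
            reservoir.append(key)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = key
    return reservoir


def find_missing(keys: Iterable[str], exists: Callable[[str], bool]) -> List[str]:
    """Return the keys for which the (hypothetical) cluster lookup finds no row."""
    return [key for key in keys if not exists(key)]
```

You would feed `sample_keys` with the primary keys parsed from your flat files and pass the sample to `find_missing`. An empty result does not prove that everything was loaded, since only a sample is checked, but a non-empty one tells you that a stage needs to be re-run.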

Upvotes: 2
