Vivek Soni

Reputation: 1

Data mismatch after Cassandra migration using NetworkTopology

We have a Cassandra cluster running on AWS EC2 with 4 nodes in the ring, and we wanted to migrate the whole environment to Azure. We followed the procedure for adding a new data center (Azure) to the existing data center (AWS EC2), using NetworkTopologyStrategy for replication and GossipingPropertyFileSnitch as the snitch.

Once the new data center was added, we ran the command below on every node in the new data center: # nodetool rebuild -- "datacenter name"

The existing data center held around 3 TB in total across all its nodes. The rebuild of the new data center took around 6-7 days. Once system.log reported "All sessions completed", we checked the database size on each node in the new data center and found that all 4 nodes hold much less data than the nodes in the existing data center: around 75 GB each, i.e. around 300 GB in total.

Could someone please tell me whether comparing sizes is the correct way to check that the data in the new data center is the same as in the existing data center?

Upvotes: 0

Views: 97

Answers (1)

Sreekar

Reputation: 1025

Data size is not the right way to check for data mismatch.

Size can vary for several reasons. Some that I can think of:

  1. Compaction: Which compaction strategy do you use, and is your data immutable from the application's side? If it is immutable, compaction is not the reason; otherwise it might be, because the two data centers can be at different stages of compaction.
  2. Flush: Did you flush the nodes before checking the sizes? If not, some data might still be in memtables rather than on disk.
  3. Measurement: What are the key cache sizes etc.? How exactly did you calculate the data size? Was it a simple "du" on the data directory, or the individual table files added together? The data directory contains index files as well as the actual table data. Again, size is not the right way to do this.
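If you still want a rough size figure per table, a minimal sketch along these lines counts only the SSTable data files instead of running "du" on everything. It assumes the usual on-disk layout of data directory / keyspace / table directory, that SSTable data files end in "-Data.db", and that "snapshots" and "backups" subdirectories should be ignored. The function name sstable_data_bytes is made up for this example.

```python
import os
from collections import defaultdict

# Subdirectories that hold copies of SSTables rather than live data.
SKIP_DIRS = {"snapshots", "backups"}


def sstable_data_bytes(data_dir):
    """Return {(keyspace, table_dir): bytes}, counting only *-Data.db files."""
    totals = defaultdict(int)
    for root, dirs, files in os.walk(data_dir):
        # Prune snapshot/backup directories so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        parts = os.path.relpath(root, data_dir).split(os.sep)
        if len(parts) < 2:
            continue  # still at the data directory or keyspace level
        key = (parts[0], parts[1])  # assumed layout: <keyspace>/<table_dir>/...
        for name in files:
            if name.endswith("-Data.db"):
                totals[key] += os.path.getsize(os.path.join(root, name))
    return dict(totals)
```

Run it against the data directory on each node after a flush (for example sstable_data_bytes("/var/lib/cassandra/data"), if that is where your data lives). Even then the totals only tell you about disk usage, not about whether the rows match.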

My suggestion is to compare the number of rows in each table first. Make sure all settings are the same in both DCs. Then write a Spark job to check for consistency (by checksum or by individual fields; checksums might be faster). If the Spark job is written well and avoids shuffling data, it should give you a result within a few hours.
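To make the checksum idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes you can obtain the rows of one table from each DC as tuples of column values, by whatever reader you use, and that the values have a stable repr on both sides. It hashes each row and adds the hashes together, so the result does not depend on the order in which rows arrive. The names row_digest and table_fingerprint are made up for this example; in a Spark job the same two steps would be a map followed by a sum, which needs no shuffle of the rows themselves.

```python
import hashlib
from typing import Iterable, Sequence, Tuple

MODULUS = 1 << 256  # keep the running sum the same width as a SHA-256 digest


def row_digest(row: Sequence) -> int:
    """Hash one row (a sequence of column values) to a 256-bit integer."""
    # Assumption: repr() of each value is identical when read from either DC.
    encoded = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(encoded).digest(), "big")


def table_fingerprint(rows: Iterable[Sequence]) -> Tuple[int, int]:
    """Return (row_count, order-independent checksum) for a table's rows."""
    count = 0
    checksum = 0
    for row in rows:
        # Addition is commutative, so row order does not affect the result.
        checksum = (checksum + row_digest(row)) % MODULUS
        count += 1
    return count, checksum
```

Compute table_fingerprint once per table per DC and compare the two tuples. A different count points to missing rows, while an equal count with a different checksum points to rows whose contents differ.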

Note: This is the best I could do without really knowing more details.

Upvotes: 1
