Reputation: 1
We have a Cassandra cluster running on AWS EC2 with 4 nodes in the ring, and we want to migrate the whole environment to Azure. We followed the process for adding a new data center (Azure) alongside our existing data center (AWS EC2). The replication strategy is NetworkTopologyStrategy and the snitch is GossipingPropertyFileSnitch.
Once the new data center was added, we ran the following command on all nodes in the new data center: # nodetool rebuild -- "datacenter name"
The existing data center held around 3 TB in total across all nodes. Rebuilding the new data center took around 6-7 days. Once system.log reported "All sessions completed", we checked the database size on each node in the new data center and found that all 4 nodes hold much less data than the nodes in the existing data center (around 75 GB each, i.e. around 300 GB in total).
Could someone please let me know whether this is the correct way to check that the data in the new data center is the same as in the existing data center?
Upvotes: 0
Views: 97
Reputation: 1025
Data size is not the right way to check for data mismatch.
Size might vary due to various reasons, some of them I can think of:
My suggestion is to first compare the number of rows in each table. Make sure all settings are the same for both DCs. Then write a Spark job to check for consistency (through checksums or by comparing individual fields; checksums might be faster). Make sure the Spark job runs optimally and without shuffling data; it should then be able to run and give you a result in a few hours.
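To make the checksum idea concrete, here is a minimal sketch in plain Python rather than Spark. It is not a drop-in job: the function names (`row_checksum`, `index_by_key`, `compare_tables`) and the sample rows are made up for illustration, and it assumes you have already read the rows of one table from each DC into memory as dictionaries. A real job would distribute the same comparison and read each side from its own DC.

```python
import hashlib
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Key = Tuple[Any, ...]


def row_checksum(row: Row, columns: Iterable[str]) -> str:
    """Return a stable SHA-256 digest over the selected column values."""
    digest = hashlib.sha256()
    for col in columns:
        # repr() keeps None, '' and 0 distinct from each other.
        digest.update(repr(row.get(col)).encode("utf-8"))
        # Separator byte so adjacent values cannot run together.
        digest.update(b"\x1f")
    return digest.hexdigest()


def index_by_key(rows: Iterable[Row], key_columns: List[str],
                 value_columns: List[str]) -> Dict[Key, str]:
    """Map each primary-key tuple to the checksum of its value columns."""
    return {
        tuple(row[k] for k in key_columns): row_checksum(row, value_columns)
        for row in rows
    }


def compare_tables(source_rows: Iterable[Row], target_rows: Iterable[Row],
                   key_columns: List[str],
                   value_columns: List[str]) -> Dict[str, Any]:
    """Compare two row sets by row count, key presence and checksum."""
    src = index_by_key(source_rows, key_columns, value_columns)
    tgt = index_by_key(target_rows, key_columns, value_columns)
    return {
        "source_count": len(src),
        "target_count": len(tgt),
        "missing_in_target": sorted(k for k in src if k not in tgt),
        "extra_in_target": sorted(k for k in tgt if k not in src),
        "mismatched": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }


# Hypothetical sample data standing in for one table read from each DC.
aws_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
azure_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]

report = compare_tables(aws_rows, azure_rows, ["id"], ["name"])
print(report)
```

Comparing the two counts in the report covers the row-count step, while the three key lists tell you which rows to look at if the counts or checksums disagree.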
Note: This is the best I could do without really knowing more details.
Upvotes: 1