Reputation: 81
We are proposing Cassandra as the database backend for a large archiving solution (a high volume of writes compared to reads). We are looking for input on a Cassandra replication and deployment strategy that fits our use case.
The choice of Cassandra was based on the following factors:
Data Estimates
Use Case
We have two data centers - Operations DC and Analytics DC (to isolate the read and write workloads). At the end of this post is a diagram depicting the proposed architecture. Due to storage constraints, we can't store data generated over its lifetime on the Operations DC. Hence, we are planning to move the data from the Operations DC to the Analytics DC per a defined policy (say, after 1 week).
Questions
We are planning to use Cassandra's built-in time-to-live (TTL) feature to expire the data (only in the Operations DC). Data expired in the Operations DC should not be deleted from the Analytics DC. How can we prevent the deletions from being replicated?
I have read that a single Cassandra node can handle 2-3 TB of data. Any documented references to larger Cassandra deployments would help.
How many Cassandra nodes should be deployed to handle such growth, and what would be a recommended deployment strategy?
Performance considerations: Although storage at the Operations DC will be limited (3-7 days of data, about 5-10 TB), storage at the Analytics DC is cumulative and keeps growing over time. Will the database growth at the Analytics DC affect replication and degrade the performance of the Operations DC?
The purpose here is to find out whether Cassandra's built-in features can support the above requirements. I'm aware of the most obvious solution: don't replicate between the two DCs, and instead dump the last week's data from the Operations DC and move it to the Analytics DC.
Upvotes: 4
Views: 760
Reputation: 4426
No
Yes, replication is configured per keyspace.
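For example, a minimal CQL sketch of per-keyspace replication (the DC names 'OpsDC' and 'AnalyticsDC', the keyspace names, and RF 3 are placeholders, not your actual topology); a DC simply omitted from the replication map gets no replicas of that keyspace:

```
-- Keyspace kept only in the Operations DC
CREATE KEYSPACE ops_archive
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3};

-- Keyspace kept only in the Analytics DC
CREATE KEYSPACE analytics_archive
  WITH replication = {'class': 'NetworkTopologyStrategy', 'AnalyticsDC': 3};

-- Keyspace replicated to both DCs
CREATE KEYSPACE shared_data
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 3};
```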
This won’t work out of the box, but it can be made to work. I can think of two relatively easy options. The easiest is to batch-write to both keyspaces/DCs, one with TTLs and one without (a sketch follows below). You could potentially also make a keyspace per month/year, start it replicating to both DCs, and remove the “normal” DC when appropriate.
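A rough sketch of the first option, assuming the two keyspaces from the previous snippet and an illustrative `events` table (the schema, the literal key, and the 7-day TTL are all placeholders):

```
-- Identical table in both keyspaces (illustrative schema)
CREATE TABLE ops_archive.events (id uuid PRIMARY KEY, payload text, created_at timestamp);
CREATE TABLE analytics_archive.events (id uuid PRIMARY KEY, payload text, created_at timestamp);

BEGIN BATCH
  -- Operations copy expires after 7 days (604800 seconds)
  INSERT INTO ops_archive.events (id, payload, created_at)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'some payload', toTimestamp(now()))
    USING TTL 604800;
  -- Analytics copy is written without a TTL, so it never expires
  INSERT INTO analytics_archive.events (id, payload, created_at)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'some payload', toTimestamp(now()));
APPLY BATCH;
```

Note that a logged batch spanning two tables carries some coordination overhead; two separate writes from the client (or an unlogged batch) also work if you can tolerate the small chance of one copy being missed.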
See: Cassandra cluster - data density (data size per node) - looking for feedback and advises
Cassandra does OK up to about 800-1000 instances in a cluster, but it’s often advisable to split into smaller clusters than that for your own ease of operation.
DCs can be asymmetrical - they don't need the same node counts or replication settings.
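For instance, replication factors per DC don't have to match; a minimal sketch (keyspace name, DC names, and the RFs are placeholders):

```
-- e.g. 3 replicas in the Operations DC, 2 in the Analytics DC
CREATE KEYSPACE asymmetric_example
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 2};
```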
Upvotes: 2
Reputation: 87299
I think that in your case it just makes sense to "separate" the DCs - for example, keyspaces in one DC aren't replicated into the other DC - just create keyspaces with the necessary replication settings.
Or you can replicate the "transactional" load into both DCs, and have a job that periodically copies data from the "transactional" keyspace into the "analytics" keyspace and then removes it from the "transactional" keyspace to free the space (a rough sketch follows below).
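A rough CQL sketch of that second approach (keyspace, table, and DC names and the RFs are assumptions; the copy step itself runs outside CQL, e.g. in Spark, a driver script, or cqlsh COPY):

```
-- "Transactional" keyspace is replicated to both DCs
CREATE KEYSPACE transactional
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 3};

-- "Analytics" keyspace lives only in the Analytics DC
CREATE KEYSPACE analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'AnalyticsDC': 3};

-- The periodic job would, roughly:
--   1. read rows older than the retention window from transactional.events
--   2. re-insert them into analytics.events
--   3. delete them from the transactional keyspace to free space, e.g.:
DELETE FROM transactional.events WHERE id = 123e4567-e89b-12d3-a456-426614174000;
```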
But it's not really possible to get exactly what you describe unless you use something like DSE's Advanced Replication (although that works between separate clusters rather than between DCs of one cluster).
Upvotes: 2