Reputation: 81
We are proposing Cassandra as the database backend for a large archiving solution (a high volume of writes compared to reads). We are looking for input on a Cassandra replication and deployment strategy that fits our use case.
The choice of Cassandra was based on the following factors:
Data Estimates
Use Case
We have two data centers - Operations DC and Analytics DC (to isolate the read and write workloads). At the end of this post is a diagram depicting the proposed architecture. Due to storage constraints, we can't store data generated over its lifetime on the Operations DC. Hence, we are planning to move the data from the Operations DC to the Analytics DC per a defined policy (say, after 1 week).
Questions
We are planning to use Cassandra's built-in time-to-live (TTL) feature to expire the data (only in the Operations DC). Data expired in the Operations DC should not be deleted from the Analytics DC. How can we prevent the deletions from being replicated?
I have read that a single Cassandra node can handle 2-3 TB of data. Any documented references to larger Cassandra deployments would help.
How many Cassandra nodes should be deployed to handle such growth, and what would be a recommended deployment strategy?
Performance considerations: Although storage at the Operations DC will be limited (3-7 days of data, about 5-10 TB), storage at the Analytics DC is cumulative and keeps growing over time. Will the database growth at the Analytics DC affect replication and degrade the performance of the Operations DC?
The purpose here is to find out whether Cassandra's built-in features can support the above requirements. I'm aware of the most obvious solution: don't replicate between the two DCs, and instead dump the last week's data from the Operations DC and move it to the Analytics DC.
Upvotes: 4
Views: 760
Reputation: 4426
No
Yes, replication is configured per keyspace.
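For example, a minimal CQL sketch of per-keyspace replication (the DC names 'OpsDC' and 'AnalyticsDC', the keyspace names, and RF 3 are placeholders, not your actual topology); a DC simply omitted from the replication map gets no replicas of that keyspace:

```
-- Keyspace kept only in the Operations DC
CREATE KEYSPACE ops_archive
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3};

-- Keyspace kept only in the Analytics DC
CREATE KEYSPACE analytics_archive
  WITH replication = {'class': 'NetworkTopologyStrategy', 'AnalyticsDC': 3};

-- Keyspace replicated to both DCs
CREATE KEYSPACE shared_data
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 3};
```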
This won’t work out of the box, but it can be made to work. I can think of two relatively easy options. The easiest is to batch-write to both keyspaces/DCs, one with TTLs and one without (a sketch follows below). You could potentially also make a keyspace per month/year, start it replicating to both DCs, and remove the “normal” DC when appropriate.
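A rough sketch of the first option, assuming the two keyspaces from the previous snippet and an illustrative `events` table (the schema, the literal key, and the 7-day TTL are all placeholders):

```
-- Identical table in both keyspaces (illustrative schema)
CREATE TABLE ops_archive.events (id uuid PRIMARY KEY, payload text, created_at timestamp);
CREATE TABLE analytics_archive.events (id uuid PRIMARY KEY, payload text, created_at timestamp);

BEGIN BATCH
  -- Operations copy expires after 7 days (604800 seconds)
  INSERT INTO ops_archive.events (id, payload, created_at)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'some payload', toTimestamp(now()))
    USING TTL 604800;
  -- Analytics copy is written without a TTL, so it never expires
  INSERT INTO analytics_archive.events (id, payload, created_at)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'some payload', toTimestamp(now()));
APPLY BATCH;
```

Note that a logged batch spanning two tables carries some coordination overhead; two separate writes from the client (or an unlogged batch) also work if you can tolerate the small chance of one copy being missed.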
See: Cassandra cluster - data density (data size per node) - looking for feedback and advises
Cassandra does OK up to about 800-1000 instances in a cluster, but it’s often advisable to split into smaller clusters than that for your own ease of operation.
DCs can be asymmetrical - they don't need the same node counts or replication settings.
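For instance, replication factors per DC don't have to match; a minimal sketch (keyspace name, DC names, and the RFs are placeholders):

```
-- e.g. 3 replicas in the Operations DC, 2 in the Analytics DC
CREATE KEYSPACE asymmetric_example
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 2};
```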
Upvotes: 2
Reputation: 87299
I think that in your case it just makes sense to "separate" the DCs - for example, keyspaces in one DC aren't replicated into the other DC - just create keyspaces with the necessary replication settings.
Or you can replicate the "transactional" load into both DCs, and have a job that periodically copies data from the "transactional" keyspace into the "analytics" keyspace and then removes it from the "transactional" keyspace to free the space (a rough sketch follows below).
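A rough CQL sketch of that second approach (keyspace, table, and DC names and the RFs are assumptions; the copy step itself runs outside CQL, e.g. in Spark, a driver script, or cqlsh COPY):

```
-- "Transactional" keyspace is replicated to both DCs
CREATE KEYSPACE transactional
  WITH replication = {'class': 'NetworkTopologyStrategy', 'OpsDC': 3, 'AnalyticsDC': 3};

-- "Analytics" keyspace lives only in the Analytics DC
CREATE KEYSPACE analytics
  WITH replication = {'class': 'NetworkTopologyStrategy', 'AnalyticsDC': 3};

-- The periodic job would, roughly:
--   1. read rows older than the retention window from transactional.events
--   2. re-insert them into analytics.events
--   3. delete them from the transactional keyspace to free space, e.g.:
DELETE FROM transactional.events WHERE id = 123e4567-e89b-12d3-a456-426614174000;
```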
But it's not really possible to get exactly what you describe unless you use something like DSE's Advanced Replication (although that works between separate clusters rather than between DCs of one cluster).
Upvotes: 2