Reputation: 995
In Cassandra, can we "fix" the node in which a specific partition key resides to optimize fetches?
This is an optimization for a specific keyspace and table where data written by one data center is never read by clients in a different data center. If a particular partition key will be queried only in a specific data center, is it possible to avoid network delays by "fixing" it to nodes of the same data center where it was written?
In other words, this is a use case where the schema is common across all data centers, but the data is never accessed across data centers. One way of doing this is to use the data center id as the partition key. However, a specific data center's data need not (and should not) be placed in other data centers. Can we optimize by somehow telling Cassandra the partition-key-to-data-center mapping?
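For illustration, a minimal sketch of that approach (the keyspace, table, and column names here are hypothetical) puts the data center id into the partition key:

CREATE TABLE my_keyspace.events (
    dc_id      text,        -- e.g. 'dc1', 'dc2'
    entity_id  uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((dc_id, entity_id), event_time)
);

On its own this only tags rows with their originating data center; where the rows physically live is still decided by the partitioner and the keyspace's replication settings, which is exactly the mapping being asked about here.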
Is a custom Partitioner the solution for this kind of use case?
Upvotes: 3
Views: 1242
Reputation: 995
Data is too voluminous to be replicated across all data centers. Hence I am resorting to creating a keyspace per data center.
CREATE KEYSPACE "MyLocalData_dc1"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1, dc3:0, dc4: 0};
CREATE KEYSPACE "MyLocalData_dc2"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 0, 'dc2' : 3, dc3:1, dc4: 0};
This way, MyLocalData generated by data center 1 has one backup replica in data center 2, and data generated by data center 2 is backed up in data center 3. Data stays "fixed" in the data center where it is written and read, and cross-data-center network latencies are avoided.
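As a rough usage sketch (the table and column names are made up), an application in data center 1 only ever reads and writes its own keyspace, and an application in data center 2 would do the same against MyLocalData_dc2:

-- Run only by clients in dc1 (hypothetical table)
CREATE TABLE "MyLocalData_dc1".user_events (
    user_id    uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (user_id, event_time)
);

INSERT INTO "MyLocalData_dc1".user_events (user_id, event_time, payload)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, '2014-01-15 10:00:00+0000', 'written and read only in dc1');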
Upvotes: 2
Reputation: 48692
Cassandra determines which node stores a row using a partitioner. Normally you use a partitioner, such as the Murmur3 partitioner, that distributes rows effectively at random and thus uniformly. You can write and use your own partitioner, in Java. That said, you should be cautious about doing this. Do you really want to assign a row to a specific node?
Upvotes: 2
Reputation: 57788
You should be able to use Cassandra's "datacenter awareness" to solve this. You won't be able to get it to enforce that awareness at the row level, but you can do it at the keyspace level. So if you have certain keyspaces that you know will be accessed only by certain localities (and served by specific datacenters) you can configure your keyspace to replicate accordingly.
In the cassandra-topology.properties file you can define which of your nodes is in which rack and datacenter. Then, make sure that you are using a snitch (in your cassandra.yaml) that will respect the topology entries (ex: PropertyFileSnitch).
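For example, cassandra-topology.properties maps each node's IP address to a data center and rack (the addresses below are made up):

# format: <node IP>=<data center>:<rack>
10.10.1.1=dc1:rack1
10.10.1.2=dc1:rack2
10.20.1.1=dc2:rack1
10.20.1.2=dc2:rack2
# used for any node not listed above
default=dc1:rack1

With that file in place, cassandra.yaml would point at it via endpoint_snitch: PropertyFileSnitch.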
Then when you create your keyspace, you can define the replication factor on a per-datacenter basis:
CREATE KEYSPACE "Excalibur"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
To get your client applications to only access certain datacenters, you can specify a LOCAL read consistency (ex: LOCAL_ONE or LOCAL_QUORUM). This way, your client apps in one area will only read from a particular datacenter.
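For example, from cqlsh on an application host in dc1 (the table and column are hypothetical), LOCAL_QUORUM only requires replicas in the coordinator's own data center:

CONSISTENCY LOCAL_QUORUM;
SELECT * FROM "Excalibur".user_profiles WHERE user_id = 42;

Most drivers also let you pin the client to a local data center in their load-balancing policy (e.g. a DC-aware round-robin policy), so the coordinator itself is chosen from the nearby data center.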
a specific data center's data need/should not be placed in other data centers.
While this solution won't solve that part of your question, unless you have disk space concerns (which, in this day and age, you shouldn't), having extra replicas of your data can save you in an emergency. If you should lose one or all nodes in a particular datacenter and have to rebuild them, a cluster-wide repair will restore your data. Otherwise, if keeping the data separate is really that important, you may want to look into splitting the datacenters into separate clusters.
Upvotes: 3