uh_big_mike_boi

Reputation: 3470

Cassandra and Spark

Hi, I have a high-level question regarding cluster topology and data replication with respect to Cassandra and Spark being used together in DataStax Enterprise.

It was my understanding that if there were 6 nodes in a cluster and heavy computing (e.g. analytics) is done, then you could have three Spark nodes and three Cassandra nodes if you want. Or you don't need three nodes for analytics, but your jobs would not run as fast. The reason you don't want the heavy analytics on the Cassandra nodes is that local memory is already being used up to handle the heavy read/write load of Cassandra.

This much is clear, but here are my questions:

  • How does the replicated data work then?
  • Are all the Cassandra-only nodes in one rack, and all the Spark nodes in another rack?
  • Does all the data get replicated to the Spark nodes? How does that work if it does?
  • What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?
  • How do Cassandra and Spark work together on the analytics nodes and not fight for resources?

Upvotes: 1

Views: 1484

Answers (2)

phact

Reputation: 7305

How does the replicated data work then?

Regular Cassandra replication will operate between nodes and DCs. As far as replication goes, this is the same as having a C*-only cluster with two data centers.

Are all the Cassandra-only nodes in one rack, and all the Spark nodes in another rack?

With the default DSE Snitch, your C* nodes will be in one DC and the Spark nodes in another DC. They will all be in a default rack. If you want to use multiple racks, you will have to configure that yourself by using an advanced snitch. GPFS (GossipingPropertyFileSnitch) or PFS (PropertyFileSnitch) are good choices, depending on your orchestration mechanisms. Learn more in the DataStax documentation.

Does all the data get replicated to the Spark nodes? How does that work if it does?

Replication is controlled at the keyspace level and depends on your replication strategy:

SimpleStrategy will simply ask you for the number of replicas you want in your cluster (it is not datacenter-aware, so don't use it if you have multiple DCs).

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

This assumes you only have one DC and that you'll have three copies of each bit of data.

NetworkTopologyStrategy lets you pick the number of replicas per DC.

CREATE KEYSPACE tst WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 2, 'DC2': 3};

You can choose to have a different number of replicas per DC.

What are the recommended configuration steps to make sure the data is replicated properly to the Spark nodes?

The procedure to update the RF is in the DataStax documentation. Here it is verbatim:

Updating the replication factor

Increasing the replication factor increases the total number of copies of keyspace data stored in a Cassandra cluster. If you are using security features, it is particularly important to increase the replication factor of the system_auth keyspace from the default (1) because you will not be able to log into the cluster if the node with the lone replica goes down. It is recommended to set the replication factor for the system_auth keyspace equal to the number of nodes in each data center.

Procedure

Update a keyspace in the cluster and change its replication strategy options:

ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};

Or, if using SimpleStrategy:

ALTER KEYSPACE "Excalibur" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

On each affected node, run the nodetool repair command. Wait until repair completes on a node, then move to the next node.
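A minimal sketch of that repair step, assuming the keyspace you altered is system_auth (run it on each node, one node at a time):

# run on every affected node, waiting for each repair to finish before moving on
nodetool repair system_auth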

Know that increasing the RF in your cluster will generate lots of IO and CPU utilization as well as network traffic, while your data gets pushed around your cluster.

If you have a live production workload, you can throttle the impact by using nodetool getstreamthroughput / nodetool setstreamthroughput.

You can also throttle the resulting compactions with nodetool getcompactionthroughput / nodetool setcompactionthroughput.
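As a rough sketch, assuming you want to roughly halve the defaults while the data moves around (the exact values are illustrative, not recommendations):

# stream throughput cap (megabits/s); check the current value, then lower it
nodetool getstreamthroughput
nodetool setstreamthroughput 100

# compaction throughput cap (MB/s)
nodetool getcompactionthroughput
nodetool setcompactionthroughput 8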

How do Cassandra and Spark work together on the analytics nodes and not fight for resources? If you are not going to limit Cassandra at all in the whole cluster, then what is the point of limiting Spark? Just have all the nodes Spark-enabled.

The key point is that you won't be pointing your main transactional reads/writes at the analytics DC (use something like consistency level LOCAL_ONE or LOCAL_QUORUM to point those requests at the C* DC). Don't worry, your data still arrives at the analytics DC by virtue of replication, but you won't wait for acks to come back from analytics nodes in order to respond to customer requests. The second DC is eventually consistent.

You are right in that Cassandra and Spark are still running on the same boxes in the analytics DC (this is critical for data locality) and have access to the same resources (and you can do things like control the max Spark cores so that Cassandra still has breathing room). But you achieve workload isolation by having two data centers.

DataStax drivers, by default, will consider the DC of the first contact point they connect with as the local DC, so just make sure that your contact point list only includes machines in the local (C*) DC.

You can also specify the local datacenter yourself, depending on the driver. Here's an example for the Ruby driver; check the driver documentation for other languages.

Use the :datacenter cluster method: the first datacenter found will be assumed current by default. Note that you can skip this option if you specify only hosts from the local datacenter in the :hosts option.
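A minimal sketch with the Ruby driver (the host IPs, DC name, and keyspace below are placeholders for your own setup):

require 'cassandra'

# Contact points are all in the transactional DC; :datacenter pins the
# driver's notion of "local", and :consistency keeps requests at
# LOCAL_QUORUM so they never wait on the analytics nodes.
cluster = Cassandra.cluster(
  hosts: ['10.0.0.1', '10.0.0.2'],
  datacenter: 'cassandra',
  consistency: :local_quorum
)
session = cluster.connect('mykeyspace')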

Upvotes: 2

HashtagMarkus

Reputation: 1661

You are correct: you want to separate your Cassandra and your analytics workloads. A typical setup could be:

  • 3 nodes in one datacenter (name: cassandra)
  • 3 nodes in a second datacenter (name: analytics)

When creating your keyspaces, you define them with NetworkTopologyStrategy and a replication factor for each datacenter, like so:

CREATE KEYSPACE myKeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 2, 'analytics': 2};

With this setup, your data will be replicated twice in each datacenter. This is done automatically by Cassandra. So when you insert data in DC cassandra, the inserted data will get replicated to DC analytics automatically, and vice versa. Note: you can define what data is replicated by using separate keyspaces for the data you want to be analyzed and the data you don't.
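For instance, a minimal sketch of that keyspace split (the keyspace names are placeholders):

-- replicated to both DCs, so Spark on the analytics nodes can read it
CREATE KEYSPACE analyzed_data WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 2, 'analytics': 2};

-- kept only in the transactional DC, never copied to the Spark nodes
CREATE KEYSPACE transactional_only WITH replication = {'class': 'NetworkTopologyStrategy', 'cassandra': 3};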

In your cassandra.yaml you should use the GossipingPropertyFileSnitch. With this snitch, you can define the DC and the rack of your node in the file cassandra-rackdc.properties. This information then gets propagated via the gossip protocol, so each node learns the topology of your cluster.
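A rough sketch of those two files on a node in the analytics datacenter (DC name taken from the example above; rack1 is just an assumed rack name):

# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties
dc=analytics
rack=rack1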

Upvotes: 2
