mtyson

Reputation: 8570

Cassandra: Loading bulk test data in a cluster

I'm running Cassandra in a two-node cluster (which will expand in the future).

I want to load a couple million rows of load test data.

I have a Python script that will do this.

Should I run the script on both nodes at the same time? Or should I do it on one and allow Cassandra to replicate?

(The cluster is on AWS EC2 servers in different regions).

Upvotes: 0

Views: 572

Answers (2)

Evan Volgas

Reputation: 2911

> I have a python script that will do this.

I would recommend taking a look at the Python Cassandra driver by DataStax (e.g. http://datastax.github.io/python-driver/getting_started.html#connecting-to-cassandra) and following the instructions they provide for the writes themselves. There are some async methods in that driver that come in handy.
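To make the async suggestion concrete, here is a minimal sketch of a loader built on the driver's `execute_async`. The table `load_test (id, payload)`, the `fake_rows` generator and the `window` size are assumptions for illustration, not anything from the question; tune or replace them for your own schema.

```python
import itertools


def load_rows(session, statement, rows, window=200):
    """Write rows with execute_async, keeping at most `window` requests in flight."""
    written = 0
    rows = iter(rows)
    while True:
        chunk = list(itertools.islice(rows, window))
        if not chunk:
            return written
        # Send the whole chunk without waiting on individual responses.
        futures = [session.execute_async(statement, row) for row in chunk]
        # result() blocks until the write is acknowledged and raises on failure.
        for future in futures:
            future.result()
        written += len(chunk)


def fake_rows(count):
    """Hypothetical test data: (id, payload) tuples."""
    for i in range(count):
        yield (i, "payload-%d" % i)


def run(contact_points, keyspace, count):
    """Connect through any one node and load `count` rows (assumed schema)."""
    # Imported here so the helpers above work without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(contact_points)
    session = cluster.connect(keyspace)
    try:
        insert = session.prepare(
            "INSERT INTO load_test (id, payload) VALUES (?, ?)"
        )
        return load_rows(session, insert, fake_rows(count))
    finally:
        cluster.shutdown()
```

Calling something like `run(["10.0.0.1"], "my_keyspace", 2000000)` from a single machine should be enough; the address and keyspace name are placeholders.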

As for whether you should write to both nodes: no, you should not. Run the script against one node and let Cassandra get the data to wherever it belongs. This is a job for the snitch and the gossiper.

Assuming you are using DataStax Community Edition (btw, you should), you should take advantage of the EC2 Multi-Region Snitch (http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureSnitchEC2MultiRegion_c.html). It's very straightforward to set up and use.
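With a multi-region snitch, replication across regions is usually declared per datacenter on the keyspace via `NetworkTopologyStrategy`. The helper below is only a sketch that builds such a statement; the keyspace name and the datacenter names `us-east` and `us-west` are assumptions and must match whatever `nodetool status` reports for your cluster.

```python
def keyspace_cql(name, replicas_per_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    # Sort so the generated statement is stable regardless of dict order.
    for dc, rf in sorted(replicas_per_dc.items()):
        options.append("'%s': %d" % (dc, rf))
    return "CREATE KEYSPACE IF NOT EXISTS %s WITH replication = {%s}" % (
        name,
        ", ".join(options),
    )


# Hypothetical names: one replica in each of two regions.
print(keyspace_cql("load_test_ks", {"us-east": 1, "us-west": 1}))
```

The resulting string can be passed to the driver's `session.execute` once. After that, a write sent to either node is replicated to the other region according to these settings.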

Upvotes: 1

mikea

Reputation: 6667

If you have 2 nodes in a cluster then you don't need to insert the data on both nodes. Cassandra distributes the data across the nodes based on the partition key and the replication factor.
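As a rough illustration of that idea, the toy model below hashes a partition key to a token, finds the owning node on a ring, and then walks the ring to pick further replicas. It is a deliberately simplified sketch: real Cassandra uses the Murmur3 partitioner, virtual nodes and a replication strategy, whereas here the hash is MD5 and the node tokens are derived from hypothetical node names.

```python
import bisect
import hashlib


def token(value):
    """Toy token: first 8 bytes of an MD5 digest as an integer."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def replicas_for(partition_key, nodes, replication_factor):
    """Return the nodes holding a copy of the row in this simplified ring."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # The owner is the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect.bisect_left(tokens, token(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-a", "node-b"]  # hypothetical node names
for key in ("user:1", "user:2", "user:3"):
    print(key, replicas_for(key, nodes, replication_factor=1))
```

With a replication factor of 1 each key lands on exactly one of the two nodes; with a replication factor of 2 both nodes hold every row. In neither case does it matter which node the client wrote to.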

A good starting point for understanding data distribution in Cassandra is here:

Upvotes: 2
