NG Algo

Reputation: 3890

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as the primary data store for our application, which collects a very large amount of data and therefore requires a large amount of storage and very fast write throughput.

We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us. So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself while running. Are there other options as well?

Upvotes: 2

Views: 2063

Answers (2)

user3492029

Reputation: 1

There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, like the COPY command or a CQL query; all of these methods time out, irrespective of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. This is a big restriction for me, and may be one of the main reasons not to use Cassandra, at least for me.
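
For reference, here is a minimal sketch of the kind of unfiltered full-table read being described, using the DataStax Python driver (keyspace, table, and column names are hypothetical). On a large table, this is exactly the pattern that times out:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")

    # A full scan with no WHERE clause: on a big table this is the
    # read that times out, even after raising the timeout settings
    # in cassandra.yaml.
    for row in session.execute("SELECT id, payload FROM mytable"):
        print(row.id, row.payload)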

Upvotes: 0

Charles Menguy

Reputation: 41428

You should take a look at Sqoop; it has an integration with Cassandra, as shown here.

This will also scale easily. You need a Hadoop cluster to get Sqoop working; the way it works is basically:

  • Slice your dataset into different partitions.
  • Run a Map/Reduce job where each mapper is responsible for transferring one slice.

So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep growing your cluster, the throughput will keep increasing. It's all a matter of what resources you have.
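
To make the partition-and-transfer idea concrete, here is a minimal sketch of the same pattern outside of Hadoop, using the DataStax Python driver (this is not Sqoop's actual code; keyspace, table, and connection details are hypothetical). The ring's token range is split into slices, and each worker exports one slice, just as each mapper handles one slice in the Sqoop job:

    from multiprocessing import Pool
    from cassandra.cluster import Cluster

    # Full token range of the Murmur3Partitioner ring.
    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1
    NUM_SLICES = 8  # one slice per worker, like one slice per mapper

    def token_slices(n):
        """Split the ring into n contiguous (start, end] token ranges."""
        step = (MAX_TOKEN - MIN_TOKEN) // n
        bounds = [MIN_TOKEN + i * step for i in range(n)] + [MAX_TOKEN]
        return list(zip(bounds, bounds[1:]))

    def export_slice(bounds):
        """Dump every row whose partition token falls in this slice."""
        start, end = bounds
        session = Cluster(["127.0.0.1"]).connect("mykeyspace")
        # Restricting on token() lets Cassandra serve the scan as a
        # bounded range read instead of an unfiltered full-table query.
        rows = session.execute(
            "SELECT id, payload FROM mytable "
            "WHERE token(id) > %s AND token(id) <= %s",
            (start, end))
        with open("slice_%s.csv" % start, "w") as out:
            for row in rows:
                out.write("%s,%s\n" % (row.id, row.payload))

    if __name__ == "__main__":
        with Pool(NUM_SLICES) as pool:
            pool.map(export_slice, token_slices(NUM_SLICES))

Adding more slices and workers raises throughput the same way adding mappers does, up to whatever your cluster can serve.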

As far as the load on the Cassandra cluster goes, I can't say for certain since I haven't used the Cassandra connector with Sqoop personally, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, run the export once a day at the time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.

I'm also thinking that, if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case Sqoop works too, because it can export to Hive directly. And once the data is in Hive, you can use the same cluster Sqoop runs on for your analytics jobs.

Upvotes: 1
