Akash W

Reputation: 371

How to compare Hive and Cassandra data in Java when there are around 1 million records

I am using Hive and Cassandra. The table structure and data are the same in both, and there will be almost 1 million records. I need to check that each and every row has the same data in both Cassandra and Hive.

  1. Can I compare two ResultSet objects directly (one with the Cassandra data and another with the Hive data)?
  2. If we iterate over a ResultSet object, can it hold 1 million records at a time? Will there be any performance issues?
  3. What do we need to take care of when dealing with such huge data?

Upvotes: 0

Views: 209

Answers (1)

S. Stas

Reputation: 810

Well, some of the initial conditions seem strange to me. First, 1M records is not a big deal for a modern RDBMS, especially when real-time query responses are not required. Second, it is odd that the Hive and Cassandra tables have the same structure: Cassandra's paradigm is query-first modeling, and it suits different scenarios than Hive does.
However, to answer your questions:
1. Yes. You can write a Java program (I saw Java in the tag list) that connects to both Hive and Cassandra via JDBC and compares the ResultSet rows one by one.
But you need to be sure that the rows come back in the same order from Hive and from Cassandra. That could be done on the Hive side, in the queries, as there are not many ways to control ordering in Cassandra.
2. A ResultSet is just a cursor. It doesn't hold the whole data set in memory, only a batch of records at a time (the batch size is configurable).
3. 1M records is not huge data; billions of records would be. I can't give you a silver bullet that answers every question about dealing with huge data, though, as each case is specific.
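
The row-by-row comparison from point 1 could be sketched roughly as below. This is only an illustration under stated assumptions, not a definitive implementation: each row is modelled as a `List<String>` whose first element is the key, both inputs are assumed to be already sorted ascending by that key using the same string ordering, and the class name `SortedRowDiff` and the report labels are hypothetical. In a real program the two iterators would wrap the Hive and Cassandra ResultSet cursors, so only the current row of each side is held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: walks two key-sorted row streams in lockstep.
class SortedRowDiff {

    /**
     * Compares two row streams that are both sorted ascending by their
     * first column (the key) and returns one report line per mismatch.
     */
    static List<String> diff(Iterator<List<String>> hive,
                             Iterator<List<String>> cassandra) {
        List<String> mismatches = new ArrayList<>();
        List<String> h = next(hive);
        List<String> c = next(cassandra);

        while (h != null || c != null) {
            // An exhausted side counts as "greater", so the other side drains.
            int order = (h == null) ? 1
                      : (c == null) ? -1
                      : h.get(0).compareTo(c.get(0));

            if (order == 0) {
                // Same key on both sides: compare the full rows.
                if (!h.equals(c)) {
                    mismatches.add("DIFFERENT " + h.get(0));
                }
                h = next(hive);
                c = next(cassandra);
            } else if (order < 0) {
                // Hive has a key that Cassandra has not reached: missing there.
                mismatches.add("ONLY_IN_HIVE " + h.get(0));
                h = next(hive);
            } else {
                mismatches.add("ONLY_IN_CASSANDRA " + c.get(0));
                c = next(cassandra);
            }
        }
        return mismatches;
    }

    // Returns the next row, or null once the stream is exhausted.
    private static List<String> next(Iterator<List<String>> it) {
        return it.hasNext() ? it.next() : null;
    }
}
```

Calling `SortedRowDiff.diff(hiveRows.iterator(), cassandraRows.iterator())` returns an empty list when the two sources match. The approach only works if both sides really are sorted the same way, which is the ordering caveat mentioned in point 1.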

Anyway, for your case, I have some concerns:
I don't know the features and limitations of the latest Cassandra JDBC driver.
You have not provided details of the table structure or of future data growth and complexity. I mean that you now have 1M rows with 10 columns in a single database, but later you could have 100M rows in a cluster of 10 Cassandra nodes.
If that's not a problem, then you can try your solution. Otherwise, to keep the comparison simple, I'd suggest doing the following:
1. Export Cassandra's data to Hive.
2. Compare the data in the two Hive tables.
I believe that would be more straightforward and more robust.

But none of the above addresses the choice of tools (Hive and Cassandra) for your task. You could find more about typical Cassandra use cases here to be sure you've made the right choice.

Upvotes: 2
