raluca

Reputation: 153

Denormalize two Cassandra column families with existing data in Cassandra

I have two column families in Cassandra with the same structure:

  CREATE TABLE keyspace.table (
      id bigint,
      firstname text,
      lastname text,
      countryId bigint,
      cityId bigint,
      solr_query text,
      PRIMARY KEY (id)
  );

The second table has three times as many rows as the first. For example, in the first table I would have the row (firstname, lastname, countryId, cityId):

John, Doe, 100, 8

and in the second table I would have the rows (firstname, lastname, countryId, cityId):

John, Doe, 222, 1
John, Doe, 100, 8
John, Doe, 333, 9

I need to merge these into a single column family using collections, so that there is only one row for John Doe with the primary countryId (100) and a collection holding the other pairs. I added a user-defined type:

CREATE TYPE hints_info (
  countryId bigint,
  cityId bigint
);

and added a collection to my first column family:

ALTER TABLE table1 ADD hints map<int, frozen<hints_info>>;

So for the example above my row would look like:

John, Doe, 100, "{1: {countryId :222, cityId :1}, 2: {countryId :333, cityId :9}}"
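A single row's map column can be populated directly in CQL, since map columns support appending entries with +. A sketch, assuming John Doe's row has id = 1:

  UPDATE table1
  SET hints = hints + {1: {countryId: 222, cityId: 1},
                       2: {countryId: 333, cityId: 9}}
  WHERE id = 1;

This only writes one row at a time, though; the open question is how to feed it with the matching rows from the second column family.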

Can I do this from Cassandra, using CQL? That is, take each row from the first column family, match it with the rows from the second column family, and add the missing info that doesn't exist in the first column family into a collection? All my data is already imported into Cassandra. If I were to use C#, for example, what would be the syntax to update the collection in my existing column family with the data from the second column family?

Thank you!

Upvotes: 2

Views: 231

Answers (1)

Nenad Bozic

Reputation: 3784

Cassandra does not support reading from one table and inserting into another in a single operation. CQL is designed for simple read/write operations and does not have the complex query support you find in the relational world.

I would advise creating a third table with the desired structure and copying data from the two original tables gradually until you are satisfied. That way your application can keep working with the existing table structure while you migrate in the background, and you still have the data in the original tables if something goes wrong. Once you are satisfied, you just change your application logic to read from the new table.
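As a sketch, such a third table could reuse the schema and the hints_info UDT from the question (table_merged is a placeholder name):

  CREATE TABLE keyspace.table_merged (
      id bigint,
      firstname text,
      lastname text,
      countryId bigint,
      cityId bigint,
      solr_query text,
      hints map<int, frozen<hints_info>>,
      PRIMARY KEY (id)
  );

Your background job can then write each merged row here while reads continue to hit the original tables.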

As for the migration, you have a couple of options:

  1. Handle this in your application as a background job which pulls from both tables, applies your application logic, and inserts into the new table
  2. Use Spark, which has a Cassandra connector and tools for data transformation, and can write the data into the new table in the shape you want
  3. We created a small migration tool which does exactly this; it has schema and data migrations, and you can write a data migration which pulls from the two original tables and writes to the new one

Option 2 has a lot of overhead just for this use case, since you must install and learn a whole new tool (Spark); that is why we created the migration tool for C#, Java and Scala for this kind of small ETL task. Option 1 is the easiest, and if you already have infrastructure in your code for background jobs, go for it.

Upvotes: 0
