Srinivasan MK
Srinivasan MK

Reputation: 801

Cassandra - How to denormalize two joined tables?

I Know cassandra doesn't support joins, so to use cassandra we need to denormalize tables. I would like to know how? Suppose I have two tables

<dl>
<dt>Publisher</dt>
<dd>Id : <i>Primary Key</i></dd>
  <dd>Name</dd>
  <dd>TimeStamp</dd>
  <dd>Address</dd>
  <dd>PhoneNo</dd>
  
  <dt>Book</dt>
  <dd>Id : <i>Primary Key</i></dd>
  <dd>Name</dd>
  <dd>ISBN</dd>
  <dd>Year</dd>
  <dd>PublisherId : <i>Foreign Key - Referenes Publisher table's Id</i></dd>
  <dd>Cost</dd>
  </dt>
</dl>

Please let me know how can I denormalize these tables in order to achieve the following operations efficiently
1. Search for all Books published by a particular publisher.
2. Search for all Publishers who published books in a given year.
3. Search for all Publishers who has not published books in a given year.
4. Search for all Publishers who has not published books till now.

I saw few articles regarding cassandra. But not able to conclude the denormalize for above operations. Please help me.

Upvotes: 2

Views: 284

Answers (2)

Aaron
Aaron

Reputation: 57748

This sounds like it could get huge, so I'll take the first one and walk through how I would approach it. You don't have to do it this way, it's just one way to go about it. Note that you may have to create query tables for each of your 4 scenarios above. This table will solve for the first scenario only.

First of all, I'll create a type for publisher address.

CREATE TYPE address (
  street text,
  city text,
  state text,
  postalCode text
);

Next I'll create a table called booksByPublisher. I'll use my address type for publisherAddress. And I'll build my PRIMARY KEY with publisherid as the partition key, clustering on bookYear and isbn.

As you want to be able to query all books by a particular publisher, it makes sense to designate that as the partition key. It may prove helpful to have your results sorted year, or at the very least be able to look at a specific year for a specific publisher, so I have bookYear as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn for uniqueness.

CREATE TABLE booksByPublisher (
  publisherid UUID,
  publisherName text,
  publisherAddress frozen<address>,
  publisherPhoneNo text,
  bookName text,
  isbn text,
  bookYear bigint,
  bookCost bigint,
  bookAuthor text,
  PRIMARY KEY (publisherid, bookYear, isbn)
);

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');

INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbox Six','978-0425170342',1999,867,'Tom Clancy');

Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b) like this:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher 
    WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;

 publisherid                          | bookyear | isbn           | bookauthor   | bookcost | bookname         | publisheraddress                                                              | publishername    | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2005 | 978-0307887443 | Ernest Cline |      812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2015 | 978-0804137256 | Ernest Cline |     1560 |           Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000

(2 rows)

If I want, I can also query for all books by Crown Publishing during 2015:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
    WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;

 publisherid                          | bookyear | isbn           | bookauthor   | bookcost | bookname | publisheraddress                                                              | publishername    | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
 b7b99ee9-f495-444b-b849-6cea82683d0b |     2015 | 978-0804137256 | Ernest Cline |     1560 |   Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing |     212-782-9000

(1 rows)

But I cannot query by just bookyear:

aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might 
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"

And don't listen to the error message and add ALLOW FILTERING. That might work fine for a table with 3 rows (or even 300). But it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. As publisherid is our partition key, this query will perform just fine. But if you need to query by bookYear, then you should create a table which uses bookYear as its partitioning key.

Upvotes: 0

Jim Meyer
Jim Meyer

Reputation: 9475

Designing a whole schema is a rather big task for one question, but in general terms denormalization means you will repeat the same data in multiple tables so that you can read a single row to get all the data you need for each type of query.

So you would create a table for each type of query, something along these lines:

  1. Create a table partitioned by publisher id and with book id as a clustering column.
  2. Create a table partitioned by year and with publisher id as a clustering column.
  3. Create a table with a list of all publishers. In an application you could then read this list and programmatically subtract the rows present in the desired year read from the table 2.
  4. I'm not sure what "published till now" means. When you insert a new book, you could check if the publisher is present in table 3. If not, then it's a new publisher.

So within each row of the data, you would repeat all the data you wanted to get back with the query (i.e. the union of all the columns in your example tables). When you insert a new book, you would insert it into all of your tables.

Upvotes: 1

Related Questions