Reputation: 801
I Know cassandra doesn't support joins, so to use cassandra we need to denormalize tables. I would like to know how? Suppose I have two tables
<dl>
<dt>Publisher</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>TimeStamp</dd>
<dd>Address</dd>
<dd>PhoneNo</dd>
<dt>Book</dt>
<dd>Id : <i>Primary Key</i></dd>
<dd>Name</dd>
<dd>ISBN</dd>
<dd>Year</dd>
<dd>PublisherId : <i>Foreign Key - Referenes Publisher table's Id</i></dd>
<dd>Cost</dd>
</dt>
</dl>
Please let me know how can I denormalize these tables in order to achieve the following operations efficiently
1. Search for all Books published by a particular publisher.
2. Search for all Publishers who published books in a given year.
3. Search for all Publishers who has not published books in a given year.
4. Search for all Publishers who has not published books till now.
I saw few articles regarding cassandra. But not able to conclude the denormalize for above operations. Please help me.
Upvotes: 2
Views: 284
Reputation: 57748
This sounds like it could get huge, so I'll take the first one and walk through how I would approach it. You don't have to do it this way, it's just one way to go about it. Note that you may have to create query tables for each of your 4 scenarios above. This table will solve for the first scenario only.
First of all, I'll create a type for publisher address.
CREATE TYPE address (
street text,
city text,
state text,
postalCode text
);
Next I'll create a table called booksByPublisher
. I'll use my address
type for publisherAddress
. And I'll build my PRIMARY KEY with publisherid
as the partition key, clustering on bookYear
and isbn
.
As you want to be able to query all books by a particular publisher, it makes sense to designate that as the partition key. It may prove helpful to have your results sorted year, or at the very least be able to look at a specific year for a specific publisher, so I have bookYear
as the first clustering key. And of course, to create a unique CQL row for each book within a publisher, I'll add isbn
for uniqueness.
CREATE TABLE booksByPublisher (
publisherid UUID,
publisherName text,
publisherAddress frozen<address>,
publisherPhoneNo text,
bookName text,
isbn text,
bookYear bigint,
bookCost bigint,
bookAuthor text,
PRIMARY KEY (publisherid, bookYear, isbn)
);
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Ready Player One','978-0307887443',2005,812,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (b7b99ee9-f495-444b-b849-6cea82683d0b,'Crown Publishing',{ street: '1745 Broadway', city: 'New York', state:'NY', postalcode: '10019'},'212-782-9000','Armada','978-0804137256',2015,1560,'Ernest Cline');
INSERT INTO booksByPublisher (publisherid, publishername, publisheraddress, publisherphoneno, bookname, isbn, bookyear, bookcost, bookauthor)
VALUES (uuid(),'The Berkley Publishing Group',{ street: '375 Hudson Street', city: 'New York', state:'NY', postalcode: '10014'},'212-333-2354','Rainbox Six','978-0425170342',1999,867,'Tom Clancy');
Now I can query all books (out of my 3 rows) published by Crown Publishing (publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b
) like this:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+------------------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2005 | 978-0307887443 | Ernest Cline | 812 | Ready Player One | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(2 rows)
If I want, I can also query for all books by Crown Publishing during 2015:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher
WHERE publisherid=b7b99ee9-f495-444b-b849-6cea82683d0b AND bookyear=2015;
publisherid | bookyear | isbn | bookauthor | bookcost | bookname | publisheraddress | publishername | publisherphoneno
--------------------------------------+----------+----------------+--------------+----------+----------+-------------------------------------------------------------------------------+------------------+------------------
b7b99ee9-f495-444b-b849-6cea82683d0b | 2015 | 978-0804137256 | Ernest Cline | 1560 | Armada | {street: '1745 Broadway', city: 'New York', state: 'NY', postalcode: '10019'} | Crown Publishing | 212-782-9000
(1 rows)
But I cannot query by just bookyear
:
aploetz@cqlsh:stackoverflow2> SELECT * FROM booksbypublisher WHERE bookyear=2015;
InvalidRequest: code=2200 [Invalid query] message="Cannot execute this query as it might
involve data filtering and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"
And don't listen to the error message and add ALLOW FILTERING
. That might work fine for a table with 3 rows (or even 300). But it won't work for a table with 3 million rows (you'll get a timeout). Cassandra works best when you query by a complete partition key. As publisherid
is our partition key, this query will perform just fine. But if you need to query by bookYear
, then you should create a table which uses bookYear
as its partitioning key.
Upvotes: 0
Reputation: 9475
Designing a whole schema is a rather big task for one question, but in general terms denormalization means you will repeat the same data in multiple tables so that you can read a single row to get all the data you need for each type of query.
So you would create a table for each type of query, something along these lines:
So within each row of the data, you would repeat all the data you wanted to get back with the query (i.e. the union of all the columns in your example tables). When you insert a new book, you would insert it into all of your tables.
Upvotes: 1