How many tables/column families are created in Cassandra for this example?

Question

I am reading this post on schemas in Cassandra.

The author creates two tables:

CREATE TABLE tweets (
    tweet_id uuid PRIMARY KEY,
    author varchar,
    body varchar
);

CREATE TABLE timeline (
    user_id varchar,
    tweet_id uuid,
    author varchar,
    body varchar,
    PRIMARY KEY (user_id, tweet_id)
);

Note: As far as the tables are concerned, they don't know that both tables can be "JOINED" on tweet_id. Each table sees tweet_id as a unique column name of type uuid.

If my understanding of the post is correct, the author says that two column families (aka tables) are not physically created. Instead, there is just ONE HUGE table that contains the information for both logical column families.

But how does the lookup happen when I run select * from tweets where tweet_id="xxx"? Is there an internal marker that determines which columns belong to tweets?

Please look at the post, as the author illustrates this with good examples.

My question is: how does tweet_id in table timeline know it should "join" with tweet_id in table tweets?

Aaron · Accepted Answer

No, it is not created as one column family. Both column families are created separately and operate independently of each other. What the author is referring to is the aspect of non-relational data modeling that involves denormalizing your data and creating tables that match your query patterns.

When a "tweet" is made, the application has to be designed to store data about the tweet in two different column families. It stores the tweet once in the tweets column family, and then writes an entry into the timeline column family for each follower. Essentially, data about a particular tweet is duplicated: once for the tweets column family, and once for every follower that the author has.
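
As a rough sketch of that write path, assuming a hypothetical author 'alice' with two followers, 'bob' and 'carol', and a made-up tweet_id (none of these values come from the post), the application would issue something like the following against the two tables defined in the question:

```cql
-- One row in tweets, keyed by tweet_id (all values here are hypothetical).
INSERT INTO tweets (tweet_id, author, body)
VALUES (1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'alice', 'Hello, world');

-- One row in timeline per follower; author and body are copied each time.
INSERT INTO timeline (user_id, tweet_id, author, body)
VALUES ('bob', 1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'alice', 'Hello, world');

INSERT INTO timeline (user_id, tweet_id, author, body)
VALUES ('carol', 1b4e28ba-2fa1-11d2-883f-0016d3cca427, 'alice', 'Hello, world');
```

Nothing in Cassandra links these rows together. The application is responsible for writing the same tweet_id, author, and body to both tables.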

how does tweet_id in table timeline knows it should "join" with tweet_id in table tweets.

Simple: it doesn't know that. Cassandra does not allow joins, and a properly designed application backed by Cassandra will not employ client-side joins, either. Again, each column family is designed in anticipation of each query that might be run. Sometimes, the application may want to query a specific tweet by tweet_id, and it would use the tweets column family for that. On the other hand, the post mentions that the application has a use case to query the 20 most-recent tweets from a particular user, in which case the timeline column family is designed to handle that.
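
To make the two query patterns concrete, here is a minimal sketch against the tables from the question. The tweet_id and the user 'bob' are hypothetical placeholders. Note that a uuid value is written as a bare literal, without quotes.

```cql
-- Look up a single tweet by its id: served entirely by the tweets table.
SELECT * FROM tweets
WHERE tweet_id = 1b4e28ba-2fa1-11d2-883f-0016d3cca427;

-- Fetch up to 20 timeline entries for one user: served entirely by the
-- timeline table, whose partition key is user_id.
SELECT * FROM timeline
WHERE user_id = 'bob'
LIMIT 20;
```

Each query touches exactly one table. The second returns rows in the clustering order of tweet_id; whether that corresponds to "most recent" depends on how the ids are generated and on the clustering order, which is an assumption here and not something the schema above states.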

Summary:

There are two column families being defined.
Each column family is designed to handle a specific query.
There are no joins, either in the database or client-side. The data is denormalized (duplicated) so that the application can quickly query the data in the way that it is required.
