farhawa
farhawa

Reputation: 10417

Cassandra table based query and primary key uniqueness

I have read here that for a table like:

CREATE TABLE user (
    username text,
    password text,
    email text,
    company text,
    PRIMARY KEY (username)
);

We can create a table like:

CREATE TABLE user_by_company (
    company text,
    username text,
    email text,
    PRIMARY KEY (company)
);

In order to support query by the company. But what about primary key uniqueness for the second table?

Upvotes: 1

Views: 67

Answers (2)

Aaron
Aaron

Reputation: 57843

Modify your table's PRIMARY KEY definition and add username as a clustering key:

CREATE TABLE user_by_company (
    company text,
    username text,
    email text,
    PRIMARY KEY (company,username)
);

That will enforce uniqueness, as well as return all usernames for a particular company. Additionally, your result set will be sorted in ascending order by username.

data will be partitioned by the company name over nodes. What if there is a lot of users from one company and less from other one. Data will be partition'ed in a non balanced way

That's the balance that you have to figure out on your own. PRIMARY KEY definition in Cassandra is a give-and-take between data distribution and query flexibility. And unless the cardinality of company is very low (like single digits), you shouldn't have to worry about creating hot spots in your cluster.

Also, if one particular company gets too big, you can use a modeling technique known as "bucketing." If I was going to "bucket" your user_by_company table, I would first add a company_bucket column, and it as an additional (composite) partitioning key:

CREATE TABLE user_by_company (
    company text,
    company_bucket text,
    username text,
    email text,
    PRIMARY KEY ((company,company_bucket),username)
);

As for what to put into that bucket, it's up to you. Maybe that particular company has East and West locations, so something like this might work:

INSERT INTO user_by_company (company,company_bucket,username,email)
  VALUES ('Acme','West','Jayne','[email protected]');

The drawback here, is that you would then have to provide company_bucket whenever querying that table. But it is a solution that could help you if a company should get too big.

Upvotes: 1

undefined_variable
undefined_variable

Reputation: 6228

I think there is typo in the blog (the link you mentioned). You are right with the table structure as user_by_company there will be issue with uniqueness.

To support the typo theory:

In this case, creating a secondary index in the company field in the user table could be a solution because it has much lower cardinality than the user's email but let’s solve it with performance in mind. Secondary indexes are always slower than dedicated table approach.

This are the lines mentioned in the blog for querying user by company. If you were to define company as primary key OR part of primary key there should be no need to create secondary index.

Upvotes: 1

Related Questions