Lucas Franceschi

Reputation: 428

How to model repeated information on many records in Cassandra

I have a very large table with hundreds of millions of records, and I mean to add a field to it whose value would be the same across millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:

I have a generic table:

CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    PRIMARY KEY ((key, key2), time)
)

This table has 700,000,000+ records. I want to add a field to this table, named source. This field indicates where the record came from (since the software has many ways of receiving the information in the readings table). Possible values for this field are "XML: path\to\file.xml", "Direct import from the X database", or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance in the database where we want to manipulate only records from a given source.
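For reference, this is what adding the column itself would look like (a minimal sketch; existing rows would simply have a null source until backfilled):

ALTER TABLE readings ADD source text;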

The queries I want to run that I can't now are:

- Which records on the readings table were gotten from a given source?
- What is the source of a given record?

A solution would be for me to create a table such as:

CREATE TABLE readings_per_source(
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
)

which would allow me to execute the first query, but would also mean creating 700,000,000+ new records in my database, taking a lot of unnecessary storage space since tens of millions of these records would have the same value for source.

If this were a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields; that would mean storing only an additional integer per row in the readings table, plus a new table with as many records as there are distinct sources.
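To make that concrete, this is roughly what a literal translation into CQL would look like (a sketch; the table and column names are just illustrative):

CREATE TABLE source (
    id int PRIMARY KEY,
    name text
);

ALTER TABLE readings ADD source_id int;

Cassandra has no joins, though, so every read would need a second query to resolve source_id into its descriptive name, which is why I'm not sure this approach translates well.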

How does one go about modelling this in Cassandra?

Upvotes: 2

Views: 720

Answers (2)

doanduyhai

Reputation: 8812

Your schema

CREATE TABLE readings_per_source(
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
)

is a very bad idea because source is the partition key and you can have millions of records sharing the same source, i.e. a very, very wide partition --> hot spots

For your second query, What is the source of a given record?, it is quite trivial if you access the data using the record's primary key (key, key2, time). The source column can be added as just a regular column in the table
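For example, a query along these lines (the values are placeholders):

SELECT source FROM readings WHERE key = 1 AND key2 = 2 AND time = '2016-03-21 10:00:00';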

For the first query, Which records on the readings table were gotten from a given source?, it is trickier. The idea here is to fetch all the records having the same source.

Do you realize that this query can potentially return tens of millions of records?

If that's what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all the details) and create an index on the source column

CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    source text,
    PRIMARY KEY ((key, key2), time)
)

CREATE CUSTOM INDEX source_idx ON readings(source) 
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
     'mode': 'PREFIX', 
     'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
     'case_sensitive': 'false'
};

Then, to fetch all records having the same source, use the server-side paging feature of the Java driver (or any other DataStax driver)
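For example, a query like the one below (the source literal is just a placeholder) will be served by the index, and the driver fetches the matching rows page by page:

SELECT * FROM readings WHERE source = 'Manually added';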

Upvotes: 3

Neil Davis

Reputation: 236

http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.

Normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key used to join the tables together), but it requires joins, which take more horsepower to compute during queries.

There's always a trade-off. There's also a trade-off concerning state with fully normalized data, one example being the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present, show the new address. This isn't always desirable.

Often it's desirable to partially normalize to provide historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer address data on the invoice at the time of invoice creation.

This is especially important for pricing and taxes as well. You want the price/tax stored with the invoice so you can show what the customer paid at the time the invoice was created, and so that when accounting runs monthly, yearly, and longer-term numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed since. Otherwise you have an accounting nightmare!
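As a rough CQL sketch of that partial normalization (all table and column names here are hypothetical):

CREATE TABLE invoices (
    invoice_id uuid PRIMARY KEY,
    customer_id uuid,
    -- copied from the customer record when the invoice is created
    billing_address text,
    -- captured at sale time so later price changes don't rewrite history
    unit_price decimal,
    tax decimal
);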

There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.

Sorry for rambling...

Upvotes: 1
