Reputation: 2879
I am using Cassandra to store stock information. Each 'row' has some base fields like time, price, close, open, low, high, etc. On top of these fields I have a list of float-typed values which contain some internal system calculations.
Example for an object:
class StockEntry {
    time timestamp;
    price float;
    close float;
    open float;
    low float;
    high float;
    x float;
    y float;
    z float;
    xx2 float;
    xx3 float;
    // xx... yy... z...
    // a lot more...
}
Creating a lot of columns in a column family and storing all this data is no problem for Cassandra. The problem is querying it. I would like to query on fields like x, y, xx2, etc., and these fields contain highly distinct values (floats with 4 decimal places).
Adding all these columns (100-150) as secondary indexes is unlikely to be a good solution, and is not recommended by the Cassandra docs.
What is the recommended data modeling, considering the requirements, when working with Cassandra?
Upvotes: 2
Views: 3571
Reputation: 399
Cassandra data modeling follows a query-driven design pattern. What this means is that instead of building a model to naturally represent the data (as we might in an RDBMS), we design schemas to accommodate the data access patterns.
So for example, if you knew that the majority of your queries would involve a WHERE clause on column x, with results ordered by the values in column y, you might want to create an additional table in which the partition key is x and the clustering column is y. For example:
CREATE TABLE <tablename> (
    "x" float,
    "y" float,
    "price" float,
    .
    .
    <rest of columns>
    .
    .
    PRIMARY KEY ("x", "y")
);
Now, querying on column x becomes very efficient, as all the data for a particular value of x is stored together.
For queries in which a range of values is required (e.g. y > some threshold), you would be wise to store those columns as clustering columns, since the partition key only supports equality predicates.
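As a sketch of what such a query looks like (assuming the table above is named stock_by_x; the table name and values are illustrative, not from the question):

```cql
-- Equality on the partition key, range on the clustering column:
SELECT "x", "y", "price"
FROM stock_by_x
WHERE "x" = 1.2345 AND "y" > 0.5;
```

Cassandra can serve this from a single partition, scanning rows in clustering order.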
Admittedly, this leads to multiple writes, as the values in columns x and y must be written to both tables. Cassandra encourages this, as storing data in this day and age is cheap. Essentially, in Cassandra you trade off additional writes for blazing fast reads.
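For illustration, the dual write could look something like this (the table and column names here are assumptions, not from the question):

```cql
-- Write the same row to the base table and the query table.
-- A logged batch keeps the two tables eventually consistent,
-- at the cost of some coordinator overhead.
BEGIN BATCH
    INSERT INTO stock_data ("time", "price", "x", "y")
    VALUES ('2013-06-01 10:00:00', 101.25, 1.2345, 0.6789);
    INSERT INTO stock_by_x ("x", "y", "price")
    VALUES (1.2345, 0.6789, 101.25);
APPLY BATCH;
```

An unlogged batch (or two separate INSERTs) would also work if the atomicity guarantee is not needed.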
Therefore, before designing your data model, think about what kind of queries you would most likely be doing and design accordingly.
Upvotes: 2
Reputation: 9734
CREATE TABLE pricing (
    id blob,
    price_tag text,  // open, close, high, low, ...
    time timestamp,
    value float,     // I would suggest blob with custom/thrift serialization
    PRIMARY KEY (id, price_tag, time)
);
This gives very efficient queries for the different price types over time.
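For example, a time-range query for one price type might look like this (the id value here is illustrative):

```cql
-- All 'close' values for one instrument within a time window,
-- read sequentially from a single partition:
SELECT time, value
FROM pricing
WHERE id = 0x0001
    AND price_tag = 'close'
    AND time >= '2013-01-01' AND time < '2013-02-01';
```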
You can find more in this great presentation: http://www.slideshare.net/carlyeks/nyc-big-tech-day-2013?ref=http://techblog.bluemountaincapital.com/
Upvotes: 0