RaceBase
RaceBase

Reputation: 18848

An Approach to Cassandra Data Model

Please note that I am first time using NoSQL and pretty much every concept is new in this NoSQL world, being from RDBMS for long time!!

In one of my heavy used applications, I want to use NoSQL for some part of the data and move out from MySQL where transactions/Relational model doesn't make sense. What I would get is, CAP [Availability and Partition Tolerance].

The present data model is simple as this

ID (integer) |  ENTITY_ID (integer)  |  ENTITY_TYPE (String)  | ENTITY_DATA (Text)  | CREATED_ON (Date) | VERSION (interger)|

We can safely assume that this part of application is similar to Logging of the Activity! I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.

Cassandra says, everything in it is simple Map<Key,Value> type! Thinking in terms of Map level, I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as key and store the rest of the data in values!

After reading through User Defined Types in Cassandra, can I use UserDefinedType as value which essentially leverage as One Key and multiple values! Otherwise, Use it as normal column level without UserDefinedType! One idea is to use the same model for different applications across systems where it would be simple logging/activity data can be pushed to the same, since the key varies from application to application and within application each entity will be unique!

No application/business function to access this data without Key, or in simple terms no requirement to get data randomly!

References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

Upvotes: 1

Views: 254

Answers (1)

ashic
ashic

Reputation: 6495

Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:

create table event(
   id uuid,
   timestamp timeuuid,
   some_column text,
   some_column2 list<text>,
   some_column3 map<text, text>,
   some_column4 map<text, text>,
   primary key (id, timestamp .... );

Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.

In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections. e.g. A Person may have a map of addresses: map. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.

One thing to watch out for... your primary key is unique per records. If you do another insert with the same pk, you won't get an error, it'll simply overwrite the existing data. Everything in cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.

You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).

Lastly, if your data is time series log data, cassandra is a very very good choice.

Upvotes: 2

Related Questions