Praful Bagai

Reputation: 17372

Cassandra explanations

I have been learning Cassandra from the DataStax material and have a few questions about it. Please help me understand the logic behind these.

Suppose I have two tables.

1) User

First Name- Text
Last Name- Text
UserID - UUID PRIMARY KEY

2) Stocks

Stock ID PRIMARY KEY
UserID
ColXYZ

My questions are:

1) I need to filter the User table by first_name = 'XYZ'. This is a bottleneck: since first_name is not the primary key, I won't be able to filter on it. Is there a reason behind this architecture?

2) Since I cannot filter by any column other than the primary key, how would I know the UUID of a user? For example, say user XYZ has the UUID 7892hbwdw81212ww (something like that). How would I find out the UUID of user XYZ in the first place, given that I cannot filter by any other column?

3) For RF > 1, the coordinator forwards requests to different nodes based on the topology and then responds to the client based on the latest timestamp. What if one node is slow to respond and that very node has the most recently updated data? What happens in that case?

4) Who decides which nodes the data is replicated to? I know the coordinator forwards the request, based on the partition key, to the node where the data needs to be stored. But on which other nodes will the data be replicated?

5) Using Cassandra is in itself a big task, since designing the data model takes a lot of effort. The model design has to be perfect (which is not always possible for a newbie like me). Should we seriously consider Cassandra as a datasource?

Upvotes: 1

Views: 180

Answers (2)

mahendra singh

Reputation: 35

ans1. Create an index on the first_name column, for example: create index firstname on User(first_name); Then you can select data by first_name. Please also add 'allow filtering' at the end of the select query.

ans2. The index from ans1 also solves the second problem.

ans3. If you set the consistency level higher than one, Cassandra first compares the data from as many nodes as the consistency level requires, and then returns the most recent data.

ans4. Replication is decided by Cassandra based on network distance.

ans5. It gets easier after some practice with Cassandra. You can use it as a datasource.

Upvotes: 0

ashic

Reputation: 6495

I'll (probably regrettably) bite.. user1162512 :)

  1. Cassandra is geared towards extreme data ingestion rates and very, very fast queries. It stores data in partitions, and partitions are stored and fetched together. Your primary key can have multiple fields. The first field of the PK is called the partition key, and that's what defines which partition some data is on. Advanced querying would require additional complexity, and it is for this reason that Cassandra's querying capabilities are more limited (than, say, SQL Server's). It is very strict in what it allows.

     You can query by partition key, and successive clustering keys (the remaining columns in your PK). You do these on exact equality, though you can do range queries on the last or "innermost" clustering column in a query. The reason for this is that within a partition, data is sorted by each successive clustering key. Say your PK is (A, B, C, D). Then A defines the partition. In the partition, data is first sorted along B, and within that, data is sorted along C and then D. The reason for the strict requirements in querying is so that Cassandra can identify a block of data and simply return that.

     There are options like secondary indexes, but almost always you'd want to hit a partition before using them. Think of each partition as a database. Would you do a query that would hit lots of databases? Would that be good for performance? The limitations are there to ensure sustained latencies in high performance scenarios. Yes, the querying capabilities are limited, but they do allow for use in quite a few use cases given a bit of data modelling. Data modelling in Cassandra is query driven: if your data model is built for your queries, you'll get very good performance. Query driven modelling is a mindshift, and very different to SQL-like approaches.

  2. You'd create another table mapping user name to id; it'd be a lookup table. Denormalisation is quite common. Just remember, you should aim to hit one, at most two, partitions in a query. If you need more advanced searching, then use a proper search server like Lucene, Solr, etc., and then query Cassandra with the key(s).

  3. Alongside RF, you have a notion of read and write consistency levels (CL). You can control these per query. You can specify the read and write CL so that Read + Write > RF. If you do that, you'll have strong consistency. If your read CL is 1 and RF > 1, you might get stale data. This is where the notion of tunable consistency comes in.

  4. The partitioner selects the first partition. The replicas are chosen by the replication strategy. See http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architectureDataDistributeAbout_c.html and http://www.datastax.com/docs/1.0/cluster_architecture/replication

  5. It depends. If you know the types of queries (i.e. what sort of queries, not necessarily all of them), and need very fast ingestion and reads, high availability, built-in cross-datacenter replication, horizontal scalability, and tunable consistency, then Cassandra is a very good data store. For more analytical workloads, you can pair it up with Apache Spark, which would allow you to get to the data in a more flexible manner, but won't be as fast as real-time queries. You will need to put in some time to learn some of the ins and outs if you intend to use it in production, but I guess that goes with any technology. Check out the free videos on DataStax Academy for a good intro.
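To make point 1 a bit more concrete, here's a rough Python sketch of why queries are restricted to "partition key, then clustering columns in order". It's a toy in-memory model, not how Cassandra is implemented: `ToyPartitionedTable` and its methods are made-up names, and the primary key is assumed to be (a, b, c) with `a` as the partition key. The point is only that sorted rows inside a partition let a query be answered by returning one contiguous slice.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class ToyPartitionedTable:
    """Toy model of a table with PRIMARY KEY (a, b, c).

    `a` is the partition key; (b, c) are clustering columns.
    Hypothetical illustration only, not Cassandra's storage engine.
    """

    def __init__(self):
        # One sorted list of clustering keys per partition, plus a
        # parallel list holding the row values in the same order.
        self._keys = defaultdict(list)
        self._values = defaultdict(list)

    def insert(self, a, b, c, value):
        keys, values = self._keys[a], self._values[a]
        i = bisect_left(keys, (b, c))
        if i < len(keys) and keys[i] == (b, c):
            values[i] = value  # same primary key: overwrite (upsert)
        else:
            keys.insert(i, (b, c))
            values.insert(i, value)

    def query(self, a, b, c_min, c_max):
        """Roughly: WHERE a = ? AND b = ? AND c >= ? AND c <= ?"""
        keys = self._keys.get(a, [])
        values = self._values.get(a, [])
        # Equality on b plus a range on c maps to one contiguous slice.
        lo = bisect_left(keys, (b, c_min))
        hi = bisect_right(keys, (b, c_max))
        return values[lo:hi]
```

A filter on `c` alone (skipping `b`) would not map to a single slice in this model, which is the intuition behind the restriction.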
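The lookup table from point 2 can be sketched the same way. This is plain Python standing in for two tables, with hypothetical names (`users_by_id`, `user_ids_by_first_name`); in Cassandra each dict would be its own table and your application would write to both.

```python
import uuid

# Stand-in for the User table: keyed by UserID.
users_by_id = {}
# Stand-in for the lookup table: first name -> set of UserIDs.
user_ids_by_first_name = {}


def add_user(first_name, last_name):
    """Write the user to both 'tables' (denormalised on purpose)."""
    user_id = uuid.uuid4()
    users_by_id[user_id] = {"first_name": first_name, "last_name": last_name}
    user_ids_by_first_name.setdefault(first_name, set()).add(user_id)
    return user_id


def find_users_by_first_name(first_name):
    """First hit the lookup table, then fetch each user by its key."""
    ids = user_ids_by_first_name.get(first_name, ())
    return [users_by_id[user_id] for user_id in ids]
```

So you never need to "remember" a UUID: you look it up by the value you do know, then read the main table by key.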
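For point 3, the Read + Write > RF rule is just arithmetic, so a tiny sketch may help. The function names here are made up for illustration; the numbers are replica counts (for example, a quorum with RF = 3 is 2).

```python
def quorum(rf):
    """Number of replicas in a majority of `rf` replicas."""
    return rf // 2 + 1


def is_strongly_consistent(read_cl, write_cl, rf):
    """True when every read set must overlap every write set.

    If read_cl + write_cl > rf, at least one replica answering the
    read also acknowledged the latest write, so the newest timestamp
    is always among the responses the coordinator compares.
    """
    return read_cl + write_cl > rf
```

With RF = 3, quorum reads plus quorum writes give 2 + 2 > 3, so a single slow replica holding the latest data doesn't matter: the write was also acknowledged by another replica that the read will reach. With read CL 1 and write CL 1, 1 + 1 is not greater than 3, so a stale read is possible.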
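And for point 4, here's a simplified sketch of ring-based replica placement: hash the partition key to a token, find the node that owns that token, then walk the ring clockwise until you have RF distinct nodes. This is an assumption-laden toy: it uses MD5 only to get a deterministic number, the ring layout and the names `token_for` and `replicas_for` are hypothetical, and it ignores racks and datacenters, which a topology-aware replication strategy would take into account.

```python
import hashlib
from bisect import bisect_left


def token_for(key):
    """Map a partition key to a 64-bit token (toy hash, not Cassandra's)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def replicas_for(key, ring, rf):
    """Pick `rf` distinct nodes for `key`.

    `ring` is a list of (token, node_name) pairs sorted by token.
    The first node with token >= the key's token owns the key; the
    remaining replicas are the next distinct nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, token_for(key)) % len(ring)  # wrap around
    chosen = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen
```

Any node can act as coordinator for a request; it runs this kind of calculation to find out which replicas to forward to.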

Hope that helps.

Upvotes: 2
