stack_usr

Reputation: 11

Data modelling in Cassandra to optimize search results

I was just wondering if I could get some clues/pointers on our fairly simple data modelling problem. It would be great if somebody could point me in the right direction.

So we have kind of a flat table, e.g. document, which holds all kinds of metadata attached to a document: UUID documentId, String organizationId, Integer totalPageCount, String docType, String accountNumber, String branchNumber, Double amount, etc.

We are storing this in Cassandra. The UUID is the row key, and we have certain secondary indexes such as organizationId.

This table is actually supposed to hold millions of records. Placing proper indexes helps with a lot of queries, but I am stuck on the generic ones. The problem is that even with something like 100k records, if I throw in a query like select * from document where orgId='something' and amount > 5 and amount < 50 ... I am beginning to see read timeout problems. The query still works (although quite slowly) if I limit the number of records to, let's say, 2000.
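
Roughly, the table looks something like this (a minimal sketch in CQL; names and types are illustrative, taken from the description above):

create table document (
    documentId uuid primary key,
    organizationId text,
    totalPageCount int,
    docType text,
    accountNumber text,
    branchNumber text,
    amount double
);

create index on document (organizationId);

-- the kind of generic query that starts timing out; amount is neither
-- indexed nor part of the key, so recent versions need allow filtering:
select * from document where organizationId = 'something'
    and amount > 5 and amount < 50 allow filtering;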

The above can probably be solved by placing certain params properly, but there are about a dozen such columns on which we need to search.

I am still trying to work out how to scale this horizontally, so as to place multiple records in a single row.

Hoping for a sense of direction.

Upvotes: 0

Views: 435

Answers (2)

oleksii

Reputation: 35935

The problem is not in Cassandra, but in your data model. You need to shift from relational thinking to NoSQL/Cassandra thinking. In Cassandra, you design your queries first if you want decent O(1) speed. Using secondary indexes in Cassandra is frankly a poor choice, because those indexes are distributed across the cluster.

If you don't know your queries upfront, use another technology, not Cassandra. Relational servers are really good if you can fit all your data on one server; otherwise have a look at ElasticSearch.

Another option is to use the DataStax Enterprise edition, which bundles Solr for full-text search.

Lastly, you can have several tables that duplicate information. This allows you to query for a specific property. The process is called de-normalisation, and the idea is that you take a property of your object, make it the primary key, and insert the data into its own table. The outcome is that you can query that particular table, for that particular property value, in O(1) time. The downside is that you now have to duplicate data.
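
Something along these lines, as a rough sketch (table and column names are illustrative): the same document is written once per query pattern, with the searched-for property as the partition key.

create table documents_by_org (
    organizationId text,
    documentId uuid,
    amount double,
    docType text,
    primary key (organizationId, documentId)
);

create table documents_by_account (
    accountNumber text,
    documentId uuid,
    amount double,
    docType text,
    primary key (accountNumber, documentId)
);

-- every write goes to both tables; each read then hits a single partition:
select * from documents_by_org where organizationId = 'something';
select * from documents_by_account where accountNumber = '12345';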

Upvotes: 1

ashic

Reputation: 6495

This is a broad problem, and general solutions are hard to give. However, here's my 2 pennies:

You WANT queries to hit single partitions for quick querying. If your query doesn't hit a row key, it becomes a cluster-wide operation. So select * from docs where orgId='something' and amount > 5 and amount < 50 means you will have issues. Hitting a partition key AND an index is way, way better than hitting the index without the partition key.

Then again, you don't want all docs in a single partition... that's an obvious hotspot, not to mention it can cause size issues - keeping a row around the 100 MB mark is a good idea. Several thousand or even several hundred thousand metadata entries per row should be fine - though much of this depends on your specific data.

So we want to hit partition keys, but also want to take advantage of distribution, while preserving efficiency. Hmmm.....

You can create artificial buckets. Decide how many buckets you want, based on expected data volumes. Assuming a few hundred thousand per partition, n buckets gives you n * hundreds of thousands. Make the bucket id the row key. When querying, use something like:

select * from documents where bucketid in (...) and orgId='something' and amount > 5;

[Note: for this, you may want to make the docid the last clustering key, so you don't have to specify it when doing the range query.]

That will result in n fast queries hitting n partitions, where n is the number of buckets.
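
As a sketch of what such a table might look like (bucket count, names and clustering order are assumptions to adapt to your data) - the bucket id is the partition key, and docid is the last clustering column so the range predicate on amount still works without specifying it:

create table documents (
    bucketid int,
    orgId text,
    amount double,
    docId uuid,
    docType text,
    primary key ((bucketid), orgId, amount, docId)
);

-- assign bucketid at write time, e.g. hash(docId) modulo the bucket count;
-- the query then fans out to exactly as many partitions as there are buckets:
select * from documents where bucketid in (0, 1, 2, 3)
    and orgId = 'something' and amount > 5 and amount < 50;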

Also, consider limiting your results. Do you really need 2000 records at a time?

For some information, it may make sense to have separate tables (i.e. some information with one particular clustering scheme in one table, and another in another). Duplication of some information is often ok - but again, this depends on particular scenarios.

Again, it's hard to give a general answer. But does that help?

Upvotes: 1
