stack_usr

Reputation: 11

Data modelling in Cassandra to optimize search results

I was just wondering if I could get some clues/pointers on our fairly simple data modelling problem. It would be great if somebody could point me in the right direction.

So we have kind of a flat table, e.g. document, which holds all kinds of metadata attached to a document: UUID documentId, String organizationId, Integer totalPageCount, String docType, String accountNumber, String branchNumber, Double amount, etc.

We are storing this in Cassandra. The UUID is the row key, and we have certain secondary indexes such as organizationId.

This table is actually supposed to hold millions of records. Placing proper indexes helps with a lot of queries, but I am stuck on the generic ones. The problem is that even with something like 100k records, if I throw in a query like select * from document where orgId='something' and amount > 5 and amount < 50 ... I am beginning to see read timeout problems. The query still works (although quite slowly) if I limit the number of records to, let's say, 2000.
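
Roughly, the table looks something like this (a minimal sketch in CQL; names and types are illustrative, taken from the description above):

create table document (
    documentId uuid primary key,
    organizationId text,
    totalPageCount int,
    docType text,
    accountNumber text,
    branchNumber text,
    amount double
);

create index on document (organizationId);

-- the kind of generic query that starts timing out; amount is neither
-- indexed nor part of the key, so recent versions need allow filtering:
select * from document where organizationId = 'something'
    and amount > 5 and amount < 50 allow filtering;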

The above can probably be solved by placing certain params properly, but there are about a dozen such columns on which we need to search.

I am still trying to work out how to scale this horizontally, so as to place multiple records in a single row.

Hoping for a sense of direction.

Upvotes: 0

Views: 435

Answers (2)

oleksii

Reputation: 35935

The problem is not in Cassandra, but in your data model. You need to shift from relational thinking to NoSQL/Cassandra thinking. In Cassandra, you design your queries first if you want decent O(1) speed. Using secondary indexes in Cassandra is frankly a poor choice, because those indexes are distributed across the cluster.

If you don't know your queries upfront, use another technology, not Cassandra. Relational servers are really good if you can fit all your data on one server; otherwise have a look at ElasticSearch.

Another option is to use the DataStax Enterprise edition, which bundles Solr for full-text search.

Lastly, you can have several tables that duplicate information. This allows you to query for a specific property. The process is called de-normalisation, and the idea is that you take a property of your object, make it the primary key, and insert the data into its own table. The outcome is that you can query that particular table, for that particular property value, in O(1) time. The downside is that you now have to duplicate data.
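
Something along these lines, as a rough sketch (table and column names are illustrative): the same document is written once per query pattern, with the searched-for property as the partition key.

create table documents_by_org (
    organizationId text,
    documentId uuid,
    amount double,
    docType text,
    primary key (organizationId, documentId)
);

create table documents_by_account (
    accountNumber text,
    documentId uuid,
    amount double,
    docType text,
    primary key (accountNumber, documentId)
);

-- every write goes to both tables; each read then hits a single partition:
select * from documents_by_org where organizationId = 'something';
select * from documents_by_account where accountNumber = '12345';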

Upvotes: 1

ashic

Reputation: 6495

This is a broad problem, and general solutions are hard to give. However, here's my 2 pennies:

You WANT queries to hit single partitions for quick querying. If your query doesn't hit a row key, it becomes a cluster-wide operation. So select * from docs where orgId='something' and amount > 5 and amount < 50 means you will have issues. Hitting a partition key AND an index is way, way better than hitting the index without the partition key.

Then again, you don't want all docs in a single partition... that's an obvious hotspot, not to mention it can cause size issues - keeping a row around the 100 MB mark is a good idea. Several thousand or even several hundred thousand metadata entries per row should be fine - though much of this depends on your specific data.

So we want to hit partition keys, but also want to take advantage of distribution, while preserving efficiency. Hmmm.....

You can create artificial buckets. Decide how many buckets you want, based on expected data volumes. Assuming a few hundred thousand per partition, n buckets gives you n * hundreds of thousands. Make the bucket id the row key. When querying, use something like:

select * from documents where bucketid in (...) and orgId='something' and amount > 5;

[Note: for this, you may want to make the docid the last clustering key, so you don't have to specify it when doing the range query.]

That will result in n fast queries hitting n partitions, where n is the number of buckets.
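
As a sketch of what such a table might look like (bucket count, names and clustering order are assumptions to adapt to your data) - the bucket id is the partition key, and docid is the last clustering column so the range predicate on amount still works without specifying it:

create table documents (
    bucketid int,
    orgId text,
    amount double,
    docId uuid,
    docType text,
    primary key ((bucketid), orgId, amount, docId)
);

-- assign bucketid at write time, e.g. hash(docId) modulo the bucket count;
-- the query then fans out to exactly as many partitions as there are buckets:
select * from documents where bucketid in (0, 1, 2, 3)
    and orgId = 'something' and amount > 5 and amount < 50;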

Also, consider limiting your results. Do you really need 2000 records at a time?

For some information, it may make sense to have separate tables (i.e. some information with one particular clustering scheme in one table, and another in another). Duplication of some information is often ok - but again, this depends on particular scenarios.

Again, it's hard to give a general answer. But does that help?

Upvotes: 1
