Eric Fulton

Reputation: 159

Cassandra Modeling for filter and range queries

I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.

I want to be able to write queries like these:

get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds

or

get all users who are men, 6'0" tall, ages 31-37, with black hair

How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.

EDIT:

Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?

IF I UNDERSTAND THINGS CORRECTLY If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on three nodes. In the second case, I will be putting 2/5 billion records on each node in the best case with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?

Upvotes: 0

Views: 987

Answers (3)

Eric Fulton

Reputation: 159

Just to reiterate the end of the conversation:

"Your understanding is correct, and you are correct in stating that partition keys should be more unique than that. Each partition has a maximum size of 2 GB, but the practical limit is lower. In practice you would want your data partitioned into far smaller chunks than the table above. Given the ad-hoc nature of the queries in your example, I do not think you would be able to do this practically through data modelling alone. I would suggest looking at using a Solr index on a table, which would give you a robust search capability. If you use DataStax, you can even query this via CQL."

Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.
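For reference, a DSE Search query against such a Solr index is sketched below. This is an illustration, not part of the original answer: it assumes a Solr core has already been created on a hypothetical users table, and the field names are made up to match the question.

```sql
-- Assumes DataStax Enterprise with a Solr core created on the table,
-- e.g. via: dsetool create_core ks.users generateResources=true
-- Solr handles the arbitrary combination of range and equality filters
SELECT * FROM users
WHERE solr_query = '{"q": "haircolor:red AND weight:[100 TO *] AND height_in:[61 TO 72]"}';
```

Because Solr maintains its own inverted indexes, any optional combination of the filters from the question can go into the one query, which is exactly what plain Cassandra tables cannot do.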

Upvotes: 0

Gomes

Reputation: 3330

My recommendation is:

1) Keep a main table with a proper, high-cardinality partition key so that records are spread across the cluster; don't add clustering columns that would push a single partition toward the 2 GB row-size limit.

2) Depending on your query patterns, create as many additional tables as you need (inverted-index tables, one per query), because writes in Cassandra are cheap.

3) Use multiple queries to assemble the result you need.

4) As a last option, use the DSE Solr search capability.
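Points 1 and 2 above can be sketched in CQL; the table and column names here are illustrative, not from the original post:

```sql
-- 1) Main table, partitioned by a high-cardinality key (user_id),
--    so records spread evenly across the cluster
CREATE TABLE users (
    user_id   uuid PRIMARY KEY,
    name      text,
    sex       text,
    haircolor text,
    height_in int,
    weight    float,
    age       int
);

-- 2) One inverted-index table per query pattern; every write to users
--    also writes a row here (writes are cheap in Cassandra)
CREATE TABLE users_by_haircolor (
    haircolor text,
    user_id   uuid,
    height_in int,
    weight    float,
    PRIMARY KEY ((haircolor), user_id)
);
```

Per point 3, the application would then query one or more index tables and intersect the resulting user_id sets client-side.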

Upvotes: 0

bechbd

Reputation: 6341

So if you are trying to model your data inside Cassandra, the general rule is that you need one table per query. There are also significant restrictions on what you can filter a query by. To understand some of these restrictions, take a look at this post:

http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause

or my long answer here:

cassandra - how to perform table query?

All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analytical processing on your data (it sounds like you might be), then I would look at using Spark in conjunction with Cassandra. This gives you a fast tool for in-memory processing of your data. If you use DataStax (Community or Enterprise), Spark also has a connector that makes reading and writing data to and from Cassandra easy.

Edited with additional information: based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds", you would need to build a table with the following:

CREATE TABLE user_by_haircolor_weight_height (
    haircolor text,
    weight    float,
    height_in int,
    user      varchar,
    PRIMARY KEY ((haircolor), weight, height_in, user)
);

(user is included in the clustering key so that two users with identical stats do not overwrite each other's row.)

You could then query this by:

SELECT * FROM user_by_haircolor_weight_height
WHERE haircolor='red'
  AND weight>100
  AND height_in>61 AND height_in<73
ALLOW FILTERING;

Note that because weight is restricted by a range, Cassandra will not accept the additional range on height_in without ALLOW FILTERING; alternatively, restrict only weight in CQL and filter height on the client.

For the query "get all users who are men, 6'0" tall, ages 31-37, with black hair" you would need to build a similar table with a PRIMARY KEY ((haircolor, sex), height_in, age)
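Spelled out, that second table and its query might look like the following sketch (column names follow the first table; the full DDL and SELECT are illustrative, not from the original answer):

```sql
CREATE TABLE user_by_haircolor_sex_height_age (
    haircolor text,
    sex       text,
    height_in int,
    age       int,
    user      varchar,
    PRIMARY KEY ((haircolor, sex), height_in, age, user)
);

-- Equality on the full partition key (haircolor, sex) and on the first
-- clustering column (height_in) allows a plain range on age
SELECT * FROM user_by_haircolor_sex_height_age
WHERE haircolor='black' AND sex='m'
  AND height_in=72
  AND age>30 AND age<38;
```

This query is valid plain CQL because every clustering column before the ranged one (age) is restricted by equality.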

In the end, if what you are trying to do is perform ad-hoc queries or a fixed set of analytic queries (i.e. ones that can tolerate a bit more latency than a straight CQL query) on the data stored in your Cassandra tables, then I suggest you look at using Spark. If you need something more real-time to handle ad-hoc queries, look at using Solr to perform Lucene-powered searches on your table.

Upvotes: 2
