Reputation: 13061
I'm studying Cassandra to understand how to model my data so that I can best manage a JSON document like this:
{
  "summary": {
    "elem": [
      {
        "score": 15.8,
        "value": "xxx"
      },
      {
        "score": 15.7,
        "value": "yyy"
      },
      {
        "score": 13.9,
        "value": "zzz"
      }
    ],
    "sens": [
      {
        "score": 23,
        "start": 0,
        "end": 210,
        "value": "kkk"
      },
      {
        "score": 12.1,
        "start": 212,
        "end": 326,
        "value": "nnn"
      }
    ]
  },
  "cats": [
    {
      "name": "c1",
      "val": 10245,
      "sens": [
        {
          "val": "mmm",
          "els": [
            {
              "start": 25,
              "end": 38,
              "value": "ccc"
            }
          ],
          "score": 810,
          "start": 0,
          "end": 210
        },
        ...
      ]
    },
    ...
  ],
  "ecv": {
    "ens": [
      {
        "val": "bbb",
        "text": "jjj",
        "matches": [
          {
            "start": 2706,
            "end": 2719,
            "value": "aaa"
          }
        ],
        "properties": [
          {
            "name": "id",
            "value": "0001"
          },
          {
            "name": "uni",
            "value": "V"
          },
          ...
        ]
      },
      ...
    ],
    "rels": [
      {
        "ens": [
          {
            "text": "pp",
            "start": 0,
            "end": 7,
            "value": "uuu"
          },
          {
            "type": "rrr",
            "start": 25,
            "end": 38,
            "value": "www"
          }
        ],
        "act": {
          "name": "rtr",
          "type": "fff",
          "start": 122,
          "end": 125
        },
        "sens": {
          "value": "ddd"
        }
      },
      ...
    ]
  },
  "doms": [
    {
      "value": "yyy",
      "fas": [
        {
          "val": "ccw",
          "sens": {
            "start": 0,
            "end": 210,
            "value": "xxx"
          },
          "els": [
            {
              "start": 169,
              "end": 178,
              "value": "bhh"
            },
            ...
          ],
          "ents": [
            {
              "val": "ents1",
              "type": "xxx",
              "matches": [
                {
                  "start": 0,
                  "end": 7,
                  "value": "bbb"
                }
              ]
            },
            ...
          ]
        },
        ...
      ]
    },
    ...
  ]
}
I used MongoDB for some months, so I know it would be simple to write this entire document to a MongoDB collection.
I don't know how to design my Cassandra model to store this JSON.
Can someone give me a way to start to "think in Cassandra"?
Thanks!
Upvotes: 0
Views: 547
Reputation: 6495
A vital thing to understand is that there's no "one storage" model in Cassandra. You start with a traditional ER model, from which you derive a logical model, and you combine that with your constraints and performance requirements to get the final physical model. Your tables in Cassandra are that last form; the other forms are "in your head" or documented: ideas rather than tables. So, how do you get to those tables? You think in terms of your queries. Your tables cater to your queries, so without knowing something about your access patterns, there's no way of saying how you should store this JSON.
Data in Cassandra is stored in partitions. A partition is identified by the partition key, which is the first part of the primary key. All rows in a partition are stored together on one machine (and on its replicas). If your query hits a single partition, it's fast. If it hits multiple partitions (for example, an IN on the partition key), it's slower. If it hits all partitions, it's slowest. However, if all queries hit the same partition, you get hotspots (some servers are utilised heavily while others sit idle). This can be bad in large clusters.
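To make the partition idea concrete, here is a minimal Python sketch. It is only a toy model: the node names, the `node_for` and `nodes_touched` helpers, and the MD5-plus-modulo placement are all assumptions made for illustration, and a real cluster uses its own partitioner and token ranges. The point it shows is just that one partition key always maps to one node, while a query over many keys may have to contact several.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: str) -> str:
    """Map a partition key to a node with a toy hash (not Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


def nodes_touched(partition_keys):
    """Return the set of nodes a query over these partition keys would contact."""
    return {node_for(pk) for pk in partition_keys}


# One partition key: exactly one node is involved.
single = nodes_touched(["doc-1"])
# Many partition keys: potentially every node is involved.
many = nodes_touched([f"doc-{i}" for i in range(100)])
```

If every request used the same key, every request would land on whatever single node `node_for` returns for it, which is the hotspot case described above.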
Inside a partition, rows are sorted according to the clustering keys (the "other" keys in the primary key). You can only filter (WHERE) on a clustering key if all previous clustering keys have been specified. In addition, you can only do a range query (>, <, etc.) on the last clustering key in a predicate. You can also create secondary indices, which let you query on equality conditions outside of the clustering key requirements, though they are slower and are updated async.
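The clustering-key rules become easier to see if you picture one partition as a sorted list. The following Python sketch assumes a made-up clustering key of (category, start), loosely named after fields in the question, and a hypothetical `slice_rows` helper; it is not how Cassandra is implemented. It shows why equality on the leading key plus a range on the next one is cheap: the matching rows are one contiguous slice.

```python
from bisect import bisect_left, bisect_right

# Rows of a single partition as (category, start, value).
# The values are illustrative; sorting mimics the clustering order.
PARTITION = sorted([
    ("c1", 0, "mmm"),
    ("c1", 212, "nnn"),
    ("c2", 0, "kkk"),
    ("c2", 50, "ppp"),
])
# Clustering keys only, in the same order as PARTITION.
KEYS = [(cat, start) for cat, start, _ in PARTITION]


def slice_rows(category, start_min=None, start_max=None):
    """Equality on the first clustering key, optional inclusive range on the second."""
    lo_key = (category, float("-inf") if start_min is None else start_min)
    hi_key = (category, float("inf") if start_max is None else start_max)
    lo = bisect_left(KEYS, lo_key)
    hi = bisect_right(KEYS, hi_key)
    return PARTITION[lo:hi]  # always one contiguous slice


all_c1 = slice_rows("c1")
late_c2 = slice_rows("c2", start_min=10)
```

A filter on `start` alone, without a category, has no such slice: the matching rows are scattered through the list, which is the intuition behind the "all previous clustering keys" restriction.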
There are a lot of intricacies there, and the queries you can perform are much more restricted (if any sort of performance is a consideration). So the "Cassandra way" is to think about your query patterns first, and then store data based on those. If you have multiple query patterns, duplicate the same info in different forms. There is no "one true way" of storing data.
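As a rough illustration of "duplicate the same info in different forms", here is a small Python sketch that takes the `summary.sens` entries from the question and builds two query-specific layouts from them. The two queries ("best-scoring sentences of a document" and "sentences of a document in text order"), the `doc-1` id, the `build_tables` helper, and the third sample sentence are all assumptions, since the question does not state its access patterns.

```python
# Trimmed-down document using field names from the question's JSON.
# The third sentence is made up so that the two orderings differ.
doc = {
    "summary": {
        "sens": [
            {"score": 23, "start": 0, "end": 210, "value": "kkk"},
            {"score": 12.1, "start": 212, "end": 326, "value": "nnn"},
            {"score": 30, "start": 330, "end": 400, "value": "qqq"},
        ]
    }
}


def build_tables(doc_id, doc):
    """Store the same sentences twice, once per (hypothetical) query."""
    sens = doc["summary"]["sens"]
    # Layout 1: partition = doc_id, clustered by score, highest first.
    by_score = {
        doc_id: sorted(
            ((s["score"], s["start"], s["value"]) for s in sens), reverse=True
        )
    }
    # Layout 2: partition = doc_id, clustered by position in the text.
    by_position = {
        doc_id: sorted((s["start"], s["end"], s["value"]) for s in sens)
    }
    return by_score, by_position


by_score, by_position = build_tables("doc-1", doc)
```

Each layout answers its own query with a single ordered read of one partition, at the cost of writing every sentence twice. In Cassandra terms that would be two tables with the same partition key and different clustering keys.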
Upvotes: 2