Arun

Reputation: 1752

Performance with rows_per_partition and datamodel in Cassandra

We have an application with 10 tables of master (static) data, each with around 100 rows. Updates to those tables are negligible. The data from all of these tables is shown in select lists in the application.

  1. Will there be any performance improvement if rows_per_partition is changed from the default "NONE" to 100, as below? These master tables are not updated, and they are read all the time.

Example:

ALTER TABLE devloc.regions
with caching = {
    'keys' : 'ALL',
    'rows_per_partition' : '100'
};
  2. One table has 100 columns of data and is queried frequently to display the information. It acts as a lookup table.

    datamodel1

    CREATE TABLE devloc.display_all ( id uuid PRIMARY KEY, datevalue timestamp, col2 text, col3 text, col4 text, col5 text, col6 text, col7 text, ....... upto 100 columns )

    Query: SELECT * FROM devloc.display_all WHERE id = 89d23c25-4921-4d57-8f2c-87a9f4ca204d;

This is a time series table, and the data grows daily for years. Would adding datevalue as a bucketing key improve the performance of this query?

datamodel2

CREATE TABLE devloc.display_all ( id uuid, datevalue timestamp, col2 text, col3 text, col4 text, col5 text, col6 text, col7 text, ....... upto 100 columns, PRIMARY KEY (id, datevalue) );

I completed stress testing for both models and saw good performance when datevalue wasn't used as a bucket.

[Graph: stress test results for both data models]

The first spike is datamodel1 and the second spike is datamodel2. For us, latency matters a lot, even at the millisecond level. Can someone help me understand?

DSE 4.8.5
Read/write consistency level: LOCAL_QUORUM
Replication factor: 3
Datacenters: 2

Upvotes: 0

Views: 2738

Answers (2)

madooc

Reputation: 89

  1. rows_per_partition sets how many rows of each partition are cached in the row cache, which is the first place Cassandra looks when you run a read query. When you read such a row again, Cassandra does not need to find it in the table again, so the read query will be faster.

  2. The partition key is only used by Cassandra to determine where in the ring the partition is stored. Within the partition, the data is ordered by the clustering column (as in your second model). If you have only one row per partition, adding a clustering column to your primary key is not necessary at all.
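
To illustrate point 2, here is a minimal sketch of what the clustering column in datamodel2 makes possible. The table and id are taken from the question; the date literal is a made-up example value.

```cql
-- Works in both models: fetch by partition key only.
-- In datamodel2 this returns every row in the partition, ordered by datevalue.
SELECT * FROM devloc.display_all
WHERE id = 89d23c25-4921-4d57-8f2c-87a9f4ca204d;

-- Relies on datevalue being a clustering column (datamodel2).
-- The date below is a hypothetical example value.
SELECT * FROM devloc.display_all
WHERE id = 89d23c25-4921-4d57-8f2c-87a9f4ca204d
  AND datevalue >= '2016-01-01';
```

If each id only ever has a single row, the second form gives you nothing the first does not.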

Upvotes: 0

mmatloka

Reputation: 2014

  1. rows_per_partition enables row caching and defines how many rows from the start of each partition are kept in the cache. If you have only 100 rows, then yes, it should cache them. This parameter can also take the value ALL. In addition, however, row_cache_size_in_mb must be set to a value large enough to hold all your rows in memory.

  2. In terms of performance, not really (if you query just by id). It would certainly give you ordering, but it seems you have a single row per id (per partition), so you don't need it. Remember that underneath, the clustering key value becomes a prefix of every column name in a given row, so in theory it can add some overhead (have a look at the composite-keyed table part of http://www.planetcassandra.org/blog/composite-keys-in-apache-cassandra/ ).
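
For point 1, a minimal sketch of caching whole partitions, reusing the table from the question. It assumes the same caching syntax the question already uses, and that the row cache has been given some capacity through row_cache_size_in_mb in cassandra.yaml (as far as I know the default is 0, which leaves the row cache disabled).

```cql
-- Cache keys and every row of each partition for this table.
-- Has no effect unless row_cache_size_in_mb in cassandra.yaml is greater than 0.
ALTER TABLE devloc.regions
WITH caching = {
    'keys' : 'ALL',
    'rows_per_partition' : 'ALL'
};
```

With around 100 rows per master table, 'ALL' and '100' should behave the same in practice; 'ALL' just avoids having to revisit the number if a table grows.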

Upvotes: 0
