Reputation: 3374
We have created a Cassandra cluster with 9 nodes, each equipped with 4 cores and 16 GB of RAM. We are writing 15-25 million records with 28 columns each.
The data model we designed is as follows (I renamed the columns and shortened the actual schema for brevity).
CREATE TABLE main_table(
col1 ... col28,
PRIMARY KEY((col1,col2),col_date,col_with_some_seq_number))
WITH CLUSTERING ORDER BY (col_date DESC,col_with_some_seq_number desc) AND default_time_to_live = 5270400;
CREATE MATERIALIZED VIEW mv_for_main_table AS
SELECT [col1.. col11]
FROM main_table
WHERE col1 IS NOT NULL AND col2 IS NOT NULL AND col_date IS NOT NULL AND col_with_some_seq_number IS NOT NULL
PRIMARY KEY ((col1),col2, col_date, col_with_some_seq_number)
WITH CLUSTERING ORDER BY (col2 DESC, col_date DESC, col_with_some_seq_number DESC);
The materialized view just moves one of the partition key columns to the clustering key.
We are loading the data from Spark and have not modified any Cassandra-related configuration.
After ingesting around 150 million records, the ingestion started failing and each node reported a lot of mutation failures.
Are there performance issues with materialized views, or is the definition I have used inefficient?
We tried a few configuration changes, such as reducing the concurrent writes and the throughput in MB. After all these attempts, we dropped the materialized view and then everything started working well.
We have done enough testing to conclude that it is only after adding the materialized view that writes slow down by a huge margin and mutations get dropped.
We are planning to use separate tables instead of materialized views for the above configuration, but I want to know whether there is any mistake in the materialized view or the data model that we have used.
Upvotes: 3
Views: 1935
Reputation: 8812
One place to understand materialized views (MV) in depth: http://www.doanduyhai.com/blog/?p=1930
When a base table has MVs, each write takes a local lock on the base-table partition. This local lock has a cost (see my blog post).
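As a rough illustration of why that lock matters, here is a toy Python model. It is not Cassandra's actual code: the class `ToyReplica`, the helper `run_concurrent_writes`, and the in-memory dicts are all hypothetical, and it ignores replication entirely. It only assumes that each write locks its base partition and reads the existing row before updating the view, so concurrent writes to the same partition have to queue up one behind the other.

```python
import threading


class ToyReplica:
    """Toy model of one replica; not Cassandra's actual implementation."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the lock registry itself
        self._partition_locks = {}       # base partition key -> lock
        self.base_table = {}             # (partition key, clustering key) -> row
        self.view_table = {}             # same rows, keyed the way the view is
        self.lock_acquisitions = {}      # base partition key -> count

    def _lock_for(self, base_pk):
        with self._guard:
            return self._partition_locks.setdefault(base_pk, threading.Lock())

    def write(self, col1, col2, col_date, seq, row):
        base_pk = (col1, col2)
        # Every write to the same base partition serializes on this lock.
        with self._lock_for(base_pk):
            count = self.lock_acquisitions.get(base_pk, 0)
            self.lock_acquisitions[base_pk] = count + 1
            base_key = (base_pk, (col_date, seq))
            view_key = ((col1,), (col2, col_date, seq))
            # Read-before-write: look up the existing row so that a stale
            # view row can be removed before the new one is written.
            if base_key in self.base_table:
                self.view_table.pop(view_key, None)
            self.base_table[base_key] = row
            self.view_table[view_key] = row


def run_concurrent_writes(replica, n_writes):
    """Issue n_writes writes to a single base partition from separate threads."""
    threads = [
        threading.Thread(target=replica.write,
                         args=("a", "x", "2017-01-01", i, {"value": i}))
        for i in range(n_writes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return replica.lock_acquisitions[("a", "x")]
```

Calling `run_concurrent_writes(ToyReplica(), 8)` takes the same partition lock eight times, once per write, which is the extra per-write cost that a table without views does not pay.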
I also have a remark about your hardware sizing: 4 CPUs is below the official recommendation, which is 8 CPUs: http://cassandra.apache.org/doc/latest/operating/hardware.html
Write workloads in Cassandra are CPU-bound. In your case the CPU is also used by Spark, which may explain your bottleneck.
Please post a screen capture of dstat and htop here.
Upvotes: 3
Reputation: 5180
What the materialized view does is create another table and write to it whenever you write to the main table. So if you drop the materialized view and manually create another table, I'm afraid you'll be in the same boat.
In my opinion, the performance problem is due to overloading one particular node. When you demote one of your PARTITION KEY columns to a CLUSTERING KEY column, assuming the same data ingestion pattern (an assumption that clearly holds, because each write is "reflected" to the other table), you create hotspots, because more data tends to land on the same partition. This translates to longer compactions and read repairs, and more stress on the cluster in general (e.g. because each node has to sort more data for each partition).
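To make the hotspot argument concrete, here is a minimal Python sketch that counts rows per partition under both keys. The data is made up (3 values of `col1`, 4 values of `col2`, 5 rows per pair) and the helpers `sample_rows` and `partition_sizes` are hypothetical; only the column names come from the question.

```python
from collections import Counter
from itertools import product


def sample_rows(n_col1=3, n_col2=4, rows_per_pair=5):
    """Build made-up rows: every (col1, col2) pair gets rows_per_pair rows."""
    return [
        {"col1": f"a{i}", "col2": f"b{j}", "col_with_some_seq_number": s}
        for i, j, s in product(range(n_col1), range(n_col2), range(rows_per_pair))
    ]


def partition_sizes(rows, key_columns):
    """Count how many rows fall into each partition for the given partition key."""
    return Counter(tuple(row[c] for c in key_columns) for row in rows)


rows = sample_rows()
base = partition_sizes(rows, ("col1", "col2"))   # partition key of main_table
view = partition_sizes(rows, ("col1",))          # partition key of the view

print(len(base), max(base.values()))   # 12 partitions, 5 rows each
print(len(view), max(view.values()))   # 3 partitions, 20 rows each
```

With the same 60 rows, the view has a quarter as many partitions and each one is four times larger. On real data the ratio depends on how many distinct `col2` values exist per `col1`.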
Upvotes: 2