Reputation: 23
We run DSE in production, split across two data centers: one data center does Spark (analytics) and the other is for Solr, in addition to Cassandra data storage.
Recently, nodes have been going down so frequently that we spend almost all of our time watching them and bringing the DSE process back up.
So far we have tried removing some old data: we built a C# console application that fetches data in a paging manner and deletes it from the production cluster, simply to reduce the storage load on the node(s).
However, I have noticed some recent changes that might be affecting performance, though I am not entirely sure:
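For context, the removal logic is roughly along these lines (a simplified sketch using the DataStax C# driver, not the actual application; the contact point, keyspace, table, and column names are placeholders, as is the 90-day cutoff):

```csharp
using System;
using Cassandra; // DataStax C# driver (CassandraCSharpDriver package)

class OldDataCleaner
{
    static void Main()
    {
        // Placeholder contact point and keyspace - not our real values.
        var cluster = Cluster.Builder()
            .AddContactPoint("10.0.0.10")
            .Build();
        var session = cluster.Connect("my_keyspace");

        // Fetch candidate rows page by page instead of pulling everything at once.
        var select = new SimpleStatement("SELECT id, created_at FROM my_table")
            .SetPageSize(500);

        var cutoff = DateTimeOffset.UtcNow.AddDays(-90); // retention cutoff (placeholder)

        foreach (var row in session.Execute(select)) // the driver pages transparently
        {
            if (row.GetValue<DateTimeOffset>("created_at") < cutoff)
            {
                // Each delete writes a tombstone that lives until gc_grace_seconds
                // passes and compaction finally purges it.
                session.Execute(new SimpleStatement(
                    "DELETE FROM my_table WHERE id = ?",
                    row.GetValue<Guid>("id")));
            }
        }

        cluster.Shutdown();
    }
}
```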
Machine domain change: we are in the process of changing the domain across the organization. As part of that, some machines' domains have already changed and others are still in progress. Does this affect inter-node communication when two machines in the same data center end up in different domains?
Frequent data removal runs: as mentioned, we created a process that removes old data. Since deletes turn that data into tombstones, they can slow down compaction and keep DSE busy for longer, while at the same time a Scala job may be running alongside client requests; this might be what is hanging the DSE process. If that is the case, what is the best way to remove old data?
Total data load vs. node count: as of now, we have almost 6 TB of data (with replication factor 3) and 15 DSE nodes (9 for analytics, 6 for Solr). Do we need to add extra machines to handle the load?
Upvotes: 0
Views: 337
Reputation: 2206
To answer some of your questions:
1) Changing the host domain - does it affect Cassandra? Typically no. Where it could come into play is if you used host names in your code/contact points (and possibly in certificates, if those are used). I don't recall whether it's a requirement or not, but our yaml files have IP addresses, not host names, so those are not affected.
2) Deleting data does have an impact on processing and reads. Lots of deletes can cause problems (overwhelming tombstones, heap pressure, etc.). If you have to delete, you have to delete. If you have time-series-type data, I would recommend TWCS with TTLs for expiring data instead of explicit deletes, as it solves a lot of problems. If not, you'll have to deal with the tombstone and compaction problems that can arise.
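For example, if your C# application builds its session something like the sketch below (placeholder IPs), a domain rename won't touch it; it would only matter if host names were used here, in the yaml, or in certificates:

```csharp
using Cassandra; // DataStax C# driver

// Contact points given as IP addresses - unaffected by a machine domain change.
// If these were host names (e.g. node1 under the old domain), the rename could break them.
var cluster = Cluster.Builder()
    .AddContactPoints("10.10.1.11", "10.10.1.12", "10.10.1.13") // placeholder IPs
    .WithPort(9042)
    .Build();
var session = cluster.Connect();
```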
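For time-series data, the table-level settings would look something like this (a sketch with made-up names, a daily time bucket, and a 30-day retention; shown here executed through the C# driver, but the relevant part is the CQL):

```csharp
using Cassandra; // DataStax C# driver

var cluster = Cluster.Builder().AddContactPoint("10.10.1.11").Build(); // placeholder IP
var session = cluster.Connect("my_keyspace");                          // placeholder keyspace

// Time-bucketed table using TWCS plus a default TTL, so expired data ages out
// with its SSTable instead of leaving long-lived tombstones from explicit deletes.
session.Execute(new SimpleStatement(@"
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id uuid,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': '1' }
      AND default_time_to_live = 2592000")); // 30 days, in seconds

cluster.Shutdown();
```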
3) This question probably needs a bit more clarification before it can get answered. Do you have 6TB of data on each DC (i.e. 6TB of data on analytics DC and 6TB of data on SOLR DC), 6TB total including all replicas (e.g. 3TB for analytics and 3TB for SOLR), or 6TB of data before any replicas are counted (i.e. 18TB total when including replicas)?
As for your initial statement about nodes going down: have you determined why they are going down? What does the Cassandra log file reveal? In our environments, when a node goes down it's typically for one of a few reasons:
1) GC issues - If GCs take too long, this can cause big problems and eventually take out nodes. Look for "GCInspector" in your system.log to get an idea of how long GC is taking.
2) Heap memory issues - Each release of DSE seems to consume more memory, and with DSE 6.x you need to pay attention to some of the yaml changes (some settings now default to off-heap, which consumes more memory than previous versions). In the system log file, especially with DSE 6.x (in our case 6.7), I've seen many occurrences of "Out Of (Heap) Memory" messages that have taken out nodes. If you're on 6.7.3 and you use NodeSync, there is a known memory leak bug (which we encountered) that will cause this problem; upgrade/patch to 6.7.4.
3) O/S memory issues - Some of our environments don't have a lot of memory (20 GB), and the O/S invokes the OOM killer to free up memory, killing DSE and/or the DataStax agent (OpsCenter), or both.
Those are the main ones I've seen over the years, but there could be other reasons that would be revealed in the system log file.
-Jim
Upvotes: 0