Reputation: 361
Today I did some extended maintenance on one node (d1r1n3) of a 14-node DSC 2.1.15 cluster, and finished well within the cluster's max hint window.
After bringing the node back up, the hints on most other nodes disappeared within minutes, except on two nodes (d1r1n4 and d1r1n7), where only part of the hints went away.
After a few hours of still seeing 1 active HintedHandoff task, I restarted node d1r1n7, after which d1r1n4 quickly emptied its hint table.
How can I see which node the hints stored on d1r1n7 are destined for? And, if possible, how can I get those hints processed?
Update: I later found that d1r1n7's hints had vanished, at a time corresponding to the end of the max hint window counted from when node d1r1n3 was taken offline for maintenance. This leaves us unsure whether that was okay or not. Had the hints been delivered successfully, or had they somehow just expired at the end of the max hint window? If the latter, would we need to run a repair on node d1r1n3 after its maintenance (this takes quite some time and I/O... :/)? And what if we now read at [LOCAL_]QUORUM instead of ONE, as we currently do, with one DC and RF=3: could this trigger read repairs on an as-needed basis and maybe spare us a full repair in this case?
Answer: it turned out that hinted_handoff_throttle_in_kb was at the default of 1024 on these two nodes, while the rest of the cluster was at 65536 :)
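To get a feel for why the throttle setting matters so much, here is a rough back-of-the-envelope sketch in Python. It assumes the behaviour described in the cassandra.yaml comments, where the configured throttle is reduced proportionally to the number of nodes in the cluster (divided by cluster size minus one). The 1 GiB hint volume and the function names are made up for illustration, not measured on this cluster.

```python
def effective_throttle_kb(throttle_in_kb, cluster_size):
    """Per-thread hint delivery rate in KB/s after the proportional reduction.

    Assumption (from the cassandra.yaml comments): with two nodes the full
    rate is used, with three nodes half of it, and so on.
    """
    return throttle_in_kb / max(cluster_size - 1, 1)


def drain_seconds(hint_kb, throttle_in_kb, cluster_size):
    """Estimated time to deliver hint_kb kilobytes of hints to one target."""
    return hint_kb / effective_throttle_kb(throttle_in_kb, cluster_size)


HINT_KB = 1024 * 1024  # hypothetical: 1 GiB of stored hints
CLUSTER_SIZE = 14      # from the question

for throttle in (1024, 65536):
    secs = drain_seconds(HINT_KB, throttle, CLUSTER_SIZE)
    print(f"throttle {throttle} KB/s -> about {secs / 60:.1f} minutes")
```

Under these assumptions the two settings differ by a factor of 64 in delivery time, which would be consistent with two nodes lagging for hours while the rest drained within minutes.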
Upvotes: 0
Views: 453
Reputation: 477
In Cassandra 2.1.15, hints are stored in the system.hints table:
cqlsh> describe table system.hints;
CREATE TABLE system.hints (
target_id uuid,
hint_id timeuuid,
message_version int,
mutation blob,
PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (hint_id ASC, message_version ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = 'hints awaiting delivery'
AND compaction = {'enabled': 'false', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
The target_id corresponds to the host ID of the node the hint is destined for.
For example, in my sample 2-node cluster with RF=2:
nodetool status
Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 71.47 KB 256 100.0% d00c4b10-2997-4411-9fc9-f6d9f6077916 rack1
DN 127.0.0.2 75.4 KB 256 100.0% 1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa rack1
I executed the following while node2 (127.0.0.2) was down:
cqlsh> insert into ks.cf (key,val) values (1,1);
cqlsh> select * from system.hints;
target_id | hint_id | message_version | mutation
--------------------------------------+--------------------------------------+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa | e80a6230-ec8c-11e6-a1fd-d743d945c76e | 8 | 0x0004000000010000000101cfb4fba0ec8c11e6a1fdd743d945c76e7fffffff80000000000000000000000000000002000300000000000547df7ba68692000000000006000376616c0000000547df7ba686920000000400000001
(1 rows)
As can be seen, system.hints.target_id matches the Host ID shown by nodetool status for the down node (1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa).
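On a larger cluster, matching IDs by eye gets tedious. The following is a minimal Python sketch that parses the text of nodetool status and looks up which node a given target_id belongs to. The helper name host_ids_by_address is hypothetical, and the sample text is simply the output shown above. To collect the target IDs themselves, something like `SELECT DISTINCT target_id FROM system.hints;` should work, since target_id is the partition key.

```python
import re

# Matches a UUID such as the Host ID column of `nodetool status`.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def host_ids_by_address(status_text):
    """Return {host_id: (address, state)} parsed from `nodetool status` text."""
    nodes = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Node rows start with a two-letter status/state code such as UN or DN.
        if len(fields) < 3 or not re.fullmatch(r"[UD][NLJM]", fields[0]):
            continue
        match = UUID_RE.search(line)
        if match:
            nodes[match.group(0)] = (fields[1], fields[0])
    return nodes


# Sample input: the nodetool status rows from the answer above.
STATUS = """\
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  71.47 KB  256     100.0%            d00c4b10-2997-4411-9fc9-f6d9f6077916  rack1
DN  127.0.0.2  75.4 KB   256     100.0%            1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa  rack1
"""

nodes = host_ids_by_address(STATUS)
target_id = "1ca6779d-fb41-4a26-8fa8-89c6b51d0bfa"  # taken from system.hints
print(target_id, "->", nodes[target_id])
```

In practice the status text would come from running nodetool status on any node, and the lookup would be repeated for each distinct target_id found in the hints table.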
Upvotes: 1