Fernando

Reputation: 1

Cannot join a MariaDB/Galera node to the cluster after a crash

One of our MariaDB/Galera clusters crashed last week. We bootstrapped a new cluster from the first node and joined the second node, but the third node fails to join.

We removed all files from the data directory, and the node started an SST. However, it seems MySQL is picking up a cached UUID from somewhere: after the transfer completes, the node cannot start and join the cluster. Logs:

2021-07-31 19:01:51 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 1
2021-07-31 19:01:51 0 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 00000000-0000-0000-0000-000000000000:-1
2021-07-31 19:01:52 2 [Note] WSREP: State transfer required: 
    **Group state: 6148b40a-ef57-11eb-92ab-77aa611985cb:581967649**
    Local state: 00000000-0000-0000-0000-000000000000:-
2021-07-31 19:01:52 2 [Note] WSREP: New cluster view: global state: 6148b40a-ef57-11eb-92ab-77aa611985cb:581967649, view# 5: Primary, number of nodes: 3, my index: 2, protocol version 3
2021-07-31 19:01:52 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2021-07-31 19:01:52 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.73.64.104' --datadir '/media/dados/mysql/'   --parent '28752'  ''  '''
2021-07-31 19:01:52 2 [Note] WSREP: Prepared SST request: rsync|10.73.64.104:4444/rsync_sst
2021-07-31 19:01:52 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-07-31 19:01:52 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
2021-07-31 19:01:52 2 [Note] WSREP: Assign initial position for certification: 581967649, protocol version: 4

2021-07-31 19:01:52 2 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (6148b40a-ef57-11eb-92ab-77aa611985cb): 1 (Operation not permitted)
     at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.


2021-07-31 19:01:52 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 581967650)
2021-07-31 19:01:52 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2021-07-31 19:01:52 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 6148b40a-ef57-11eb-92ab-77aa611985cb:581967649


2021-07-31 19:55:01 0 [Note] WSREP: SST complete, seqno: 581967651

2021-07-31 19:55:04 0 [Note] WSREP: SST received: ba9d2e19-a7ed-11e8-ae5d-f7d6266c9160:581967651

2021-07-31 19:55:04 2 [ERROR] WSREP: Application received wrong state: 
    **Received: ba9d2e19-a7ed-11e8-ae5d-f7d6266c9160**
    Required: 6148b40a-ef57-11eb-92ab-77aa611985cb
2021-07-31 19:55:04 2 [ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.

The cluster is running with UUID 6148b40a-ef57-11eb-92ab-77aa611985cb, but after the SST this node receives UUID ba9d2e19-a7ed-11e8-ae5d-f7d6266c9160.

Do you have any idea how to solve this issue?

Thanks, Fernando

Upvotes: 0

Views: 1701

Answers (1)

mysqlrockstar

Reputation: 2612

What is your wsrep_sst_donor value? Did you start the node with an empty datadir, in particular with no grastate.dat file left behind? Have you tried increasing the systemd start timeout for the MariaDB service on that node?

sudo mkdir -p /etc/systemd/system/mariadb.service.d
sudo tee /etc/systemd/system/mariadb.service.d/timeoutstartsec.conf <<EOF
[Service]
TimeoutStartSec=1200
EOF
sudo systemctl daemon-reload
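
To follow up on the grastate.dat question, here is a minimal sketch for checking which UUID a node has saved on disk. It assumes the usual grastate.dat layout with a "uuid:" line; the helper name grastate_uuid is hypothetical, and the path in the example comment is taken from the --datadir in your log. Running it on the donor and comparing the result with the wsrep_cluster_state_uuid status value may show where the ba9d2e19-... UUID is coming from.

```shell
# Print the uuid recorded in a Galera grastate.dat file.
# grastate_uuid is a hypothetical helper name, not part of MariaDB or Galera.
grastate_uuid() {
    # Match the line whose first field is exactly "uuid:" and print its value.
    awk '$1 == "uuid:" { print $2 }' "$1"
}

# Example (path assumed from the --datadir shown in the question's log):
# grastate_uuid /media/dados/mysql/grastate.dat
```

If the UUID printed on the donor differs from the one the running cluster reports, the donor's on-disk state is a likely suspect, though this is only a guess from the log.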

Upvotes: 0
