errogaht

Reputation: 303

Galera node can't connect to cluster

Hello, I am using Galera with 10.1.12-MariaDB, and the SST method is xtrabackup-v2.

Please don't recommend SST=rsync; it does not work for me.

I have a healthy cluster of 8 nodes. Sometimes one or a few nodes go down. I just run service mysql start on them, and they successfully rejoin the cluster and all is OK.

BUT sometimes, when the disconnected nodes have been down for a few days, I can't reconnect them to the cluster.

After a few tries I run rm -fr /var/lib/mysql/* and rm -fr /var/log/mysql/*, and that doesn't help either. The nodes log this message in syslog:

mysqld: [ERROR] Binlog file '/var/log/mysql/mariadb-bin.003079' not found in binlog index, needed for recovery. Aborting.

I know how to work around this. When nodes can't connect to the cluster with the message above, I can recover the cluster like this:

  1. shut down all nodes except one
  2. shut down that last node and run rm -fr /var/log/mysql/* on it
  3. bootstrap this last node, now that its binlogs are deleted
  4. connect the other nodes to the cluster with service mysql start
  5. profit - all is OK

But the problem is:

I can't take down all the production nodes, including the last one, because I need 8 nodes to serve a big site's traffic, and a single running node goes down immediately when all the traffic hits it (from overload, of course).

THE QUESTION IS:

Please help me. How can I connect nodes to the cluster when they won't connect and fail with the error mysqld: [ERROR] Binlog file '/var/log/mysql/mariadb-bin.003079' not found in binlog index, needed for recovery. Aborting.?

Upvotes: 1

Views: 2015

Answers (1)

Rick James

Reputation: 142298

How big is the gcache? That controls whether IST can be used for re-attaching a node or not.
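
To make the gcache question concrete, here is a minimal sketch in Python. It assumes you have copied the value shown by SHOW VARIABLES LIKE 'wsrep_provider_options' (a semicolon-separated list that includes gcache.size) and measured how many bytes the cluster replicated over some interval, for example from the change in wsrep_replicated_bytes plus wsrep_received_bytes. The function names and the sample numbers are hypothetical, and the result is only a rough estimate of how long a node can stay down and still rejoin via IST rather than SST.

```python
import re

# Binary size suffixes as used in values such as "128M".
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def gcache_size_bytes(provider_options: str) -> int:
    """Return gcache.size in bytes from a wsrep_provider_options string."""
    for item in provider_options.split(";"):
        key, sep, value = item.partition("=")
        if sep and key.strip() == "gcache.size":
            match = re.fullmatch(r"(\d+)([KMGT]?)", value.strip().upper())
            if match is None:
                raise ValueError(f"unrecognised gcache.size value: {value!r}")
            return int(match.group(1)) * _UNITS[match.group(2)]
    raise KeyError("gcache.size not found in provider options")


def ist_window_hours(gcache_bytes: int, bytes_written: int,
                     interval_seconds: float) -> float:
    """Rough number of hours of writes that fit in the gcache.

    bytes_written is the replication volume observed over interval_seconds.
    A node that is down longer than this will most likely need a full SST.
    """
    if bytes_written <= 0 or interval_seconds <= 0:
        raise ValueError("need a positive write volume and interval")
    bytes_per_second = bytes_written / interval_seconds
    return gcache_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    # Illustrative values only, not taken from the cluster in the question.
    options = "base_port = 4567; gcache.size = 128M; gcache.page_size = 128M"
    size = gcache_size_bytes(options)
    print(size, "bytes of gcache")
    print(round(ist_window_hours(size, 1024**2, 60), 2), "hours of IST window")
```

If the estimated window is a few hours and nodes stay down for a few days, IST is not an option on rejoin, which would be consistent with the nodes always falling back to SST.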

What is the value of expire_logs_days? Is it so small that the binlog was purged before you tried to connect? If you lose one node and need another as the donor for SST, you still have 6 to serve the 'big site'. It sounds like you need to increase the deployment to maybe 10 nodes in order to handle the site even when nodes wink out.

It sounds like you are stuck with SST.

Take a look at the slowlog to see if some queries are taking so long that they are, indirectly, forcing you to have so many machines. Fixing a couple of queries is a lot 'cheaper' than adding extra machines.

Upvotes: 1
