samsamsamsmasma
samsamsamsmasma

Reputation: 17

Galera Cluster: All Nodes Slowed Down After Update Query and Too Many Connections Error

I met a disability while using galeracluser and I have a question.

I'm using MariaDB 10.5.13 version with Galera Cluster. The issue I'm facing is that all nodes are slowing down due to a query, resulting in connection timeouts.

Here's the scenario:

  1. I have a Galera Cluster consisting of nodes A, B, and C.(The instance was t3.Medium of aws.)
  2. I created a table on node A using the default DDL replication setting, which is Total Order Isolation (TOI). The query was as follows: The item table has 2.2 million rows.

CREATE TABLE item_temp AS SELECT COUNT(*) FROM item WHERE item_status = 1;

  1. After creating the table, I executed an update query on node A. (I know I'm changing every row. This was a mistake)
UPDATE item SET item_status = 3 where id = (select id from item_temp);
  1. After the execution of the update query, all nodes slowed down, and the error message 'Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)' kept occurring.

4-1. killed maria db of 'A' node. and I tried restarting 'A' maria db but failed

  1. When I checked the process list on nodes B and C, I noticed that the create query was still visible (appearing in the process list). (Note: appearing in the process list indicates that the create query is still in progress.).
    It was very slow when connecting to the server with the shell.

b node enter image description here c node enter image description here

  1. To address the slowdown, I restarted all nodes. As a result, the update query was not reflected.

My suspicion is that the update query caused the issue, but I'm unsure why all nodes slowed down. Here are my hypotheses:

The create query might still be in the process list, causing all write transactions to be blocked, which could have had an impact. Alternatively, the update query might have been applied immediately and replicated to other nodes simultaneously, resulting in many node updates and impacting the server. Any insights into why all nodes slowed down would be greatly appreciated.

I want to know why not only the node where the update query occurred slowed down, but also to other nodes


+++

Instance RAM usage is as follows.
enter image description here
The innodb buffer is.
enter image description here
The row size of the item table used in the ddl statement is as follows.
0.18 KB


I was analyzing node c and found the following log, after the create statement occurred.
enter image description here

2024-03-15 13:22:34 8 [ERROR] mysqld: Error writing file 'mariadb-bin' (errno: 1950 "Unknown error 1950") 2024-03-15 13:22:34 8 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.item_temp; An attempt was made to binlog GTID 0-1-117195197 which would create an out-of-order sequence number with existing GTID 0-1-117195197, and gtid strict mode is enabled, Error_code: 1950; Error writing file 'mariadb-bin' (errno: 1950 "Unknown error 1950"), Error_code: 1026; the event's master log FIRST, end_log_pos 319, Internal MariaDB error code: 1950 2024-03-15 13:22:34 8 [Warning] WSREP: Event 4 Write_rows_v1 apply failed: 1, seqno 117208461

Upvotes: 0

Views: 390

Answers (1)

danblack
danblack

Reputation: 14646

Back when Galera was fixed to make CREATE TABLE .. SELECT work (MDEV-7673) this placed the cluster into TOI mode for this statement to replicate. This implementation is still there.

So recommendation 1, use CREATE TABLE and then INSERT INTO .. SELECT, especially if you have a large query that ultimately aggregates to a small result set. This small result set will replicate quickly in a non-TOI mode.

3./4. where as a result of the TOI mode where, as your processlist in (5. image shows), was showing, the CREATE TABLE .. SELECT was being executed. The lack of database connections is the existing UPDATE/INSERT queries waiting until TOI is over are consuming all the connections.

5.

The TOI mode CREATE TABLE ... SELECT running over an hour is odd. Even 2M rows I'd expect quicker. Did you configure a larger innodb buffer pool size to match your instance size? Is this a table with a particularly large row size?

6.

A restart unless something is genuinely stalled is usually a mistake. The rollback can be long.

The non-appearance of UPDATE item SET item_status = 3; is expected as without a certification result from the cluster this transaction was never committed. As the server was shutdown, this is considered an aborted transaction as as such was rolled back from the undo log on restart.

Upvotes: 1

Related Questions