Galera Cluster: All Nodes Slowed Down After Update Query and Too Many Connections Error

Question

I met a disability while using galeracluser and I have a question.

I'm using MariaDB 10.5.13 version with Galera Cluster. The issue I'm facing is that all nodes are slowing down due to a query, resulting in connection timeouts.

Here's the scenario:

I have a Galera Cluster consisting of nodes A, B, and C.(The instance was t3.Medium of aws.)
I created a table on node A using the default DDL replication setting, which is Total Order Isolation (TOI). The query was as follows: The item table has 2.2 million rows.

CREATE TABLE item_temp AS SELECT COUNT(*) FROM item WHERE item_status = 1;

After creating the table, I executed an update query on node A. (I know I'm changing every row. This was a mistake)

UPDATE item SET item_status = 3 where id = (select id from item_temp);

After the execution of the update query, all nodes slowed down, and the error message 'Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)' kept occurring.

4-1. killed maria db of 'A' node. and I tried restarting 'A' maria db but failed

When I checked the process list on nodes B and C, I noticed that the create query was still visible (appearing in the process list). (Note: appearing in the process list indicates that the create query is still in progress.).
It was very slow when connecting to the server with the shell.

b node c node

To address the slowdown, I restarted all nodes. As a result, the update query was not reflected.

My suspicion is that the update query caused the issue, but I'm unsure why all nodes slowed down. Here are my hypotheses:

The create query might still be in the process list, causing all write transactions to be blocked, which could have had an impact. Alternatively, the update query might have been applied immediately and replicated to other nodes simultaneously, resulting in many node updates and impacting the server. Any insights into why all nodes slowed down would be greatly appreciated.

I want to know why not only the node where the update query occurred slowed down, but also to other nodes

+++

Instance RAM usage is as follows.

The innodb buffer is.

The row size of the item table used in the ddl statement is as follows.
0.18 KB

I was analyzing node c and found the following log, after the create statement occurred.

2024-03-15 13:22:34 8 [ERROR] mysqld: Error writing file 'mariadb-bin' (errno: 1950 "Unknown error 1950") 2024-03-15 13:22:34 8 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.item_temp; An attempt was made to binlog GTID 0-1-117195197 which would create an out-of-order sequence number with existing GTID 0-1-117195197, and gtid strict mode is enabled, Error_code: 1950; Error writing file 'mariadb-bin' (errno: 1950 "Unknown error 1950"), Error_code: 1026; the event's master log FIRST, end_log_pos 319, Internal MariaDB error code: 1950 2024-03-15 13:22:34 8 [Warning] WSREP: Event 4 Write_rows_v1 apply failed: 1, seqno 117208461

Galera Cluster: All Nodes Slowed Down After Update Query and Too Many Connections Error

b node c node

Answers (1)

Related Questions