MariaDB 10.5.13 with Galera Cluster: Occasional Signal 11 Crashes During High Transaction Load with GET_LOCK/RELEASE_LOCK

Question

Environment:
I am using MariaDB 10.5.13 with a Galera Cluster setup.
innodb storage engine.

Instance type: AWS t3.medium.

Memory usage details:
.

My.cnf configuration:


#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#

# this is read by the standalone daemon and embedded servers
[server]
basedir=/home/admin/service/mariadb
aria_log_dir_path=/home/admin/service/mariadb/data/
character_sets_dir=/home/admin/service/mariadb/share/charsets/
datadir=/home/admin/service/mariadb/data/
pid_file=/home/admin/service/mariadb/data/mariadb.pid
plugin_dir=/home/admin/service/mariadb/lib/plugin/
flashback=On
innodb_flush_method=O_DIRECT
innodb_lock_schedule_algorithm=VATS
# deprecated 10.5
#innodb_scrub_log=FALSE
default_storage_engine=InnoDB
default_time_zone='+0:00'
binlog_checksum = CRC32
## not used character_set_client = utf8mb4
## not used character_set_connection = utf8mb4
## not used character_set_database   = utf8mb4
## not used character_set_results    = utf8mb4
## not used collation_connection     = utf8mb4_unicode_ci
character_set_filesystem = binary
character_set_server     = utf8mb4
collation_server         = utf8mb4_unicode_ci
character_set_client_handshake=FALSE
ft_max_word_len=84
ft_min_word_len=1
innodb_adaptive_flushing=1
innodb_adaptive_hash_index=1
innodb_default_row_format=DYNAMIC
innodb_file_per_table=1
innodb_ft_max_token_size=252
innodb_ft_min_token_size=1
#innodb_large_prefix=1
# deprecated 10.5
#innodb_log_checksums=1
innodb_log_compressed_pages=1
#on 4gb
innodb_log_file_size=192M
innodb_strict_mode=1
innodb_autoinc_lock_mode=2
innodb_random_read_ahead=1
innodb_use_native_aio=1
sync_binlog=1
innodb_flush_log_at_trx_commit=1
log_bin_trust_function_creators=1
thread_handling=pool-of-threads
event_scheduler=ON
innodb_buffer_pool_dump_at_shutdown=1
innodb_buffer_pool_load_at_startup=1
innodb_defragment=1
default_storage_engine=InnoDB
binlog_format=ROW
innodb_strict_mode=1
log_basename=mariadb
log_bin
log-error
log_bin_compress=ON
master_verify_checksum=ON
memlock=ON
myisam_use_mmap=ON
query_cache_type=ON
query-cache-wlock-invalidate=ON
query_cache_size=1048576
slow_query_log=ON
#ssl                                                        FALSE

# deprecated 10.5
#innodb_buffer_pool_instances=2
slave_parallel_mode=optimistic
innodb_buffer_pool_size=700M
key_buffer_size=20M
#(4*1024*1024*1024-(437*1024*1024)-300*1024*1024)/(19*1024*1024)
#(total mem - (437mb, global) - (300mb, os) ) / (19mb, thread per conn)
#on 4gb
max_connections=128
performance_schema = ON
gtid_strict_mode=1
slave_compressed_protocol=1
wait_timeout=150
table_definition_cache=512
skip-name-resolve=1

# this is only for embedded server
[embedded]

# this is only for the mysqld standalone daemon
[mysqld]

#
# * Galera-related settings
#
[galera]
#VATS should not be used with Galera
innodb_lock_schedule_algorithm=FCFS
# Query cache cannot be used in a Galera context..
query_cache_type = 0
query_cache_size = 0
memlock=OFF
# Mandatory settings
wsrep_on=ON
wsrep_auto_increment_control=ON
wsrep_provider=/home/admin/service/mariadb/lib/libgalera_smm.so
wsrep_cluster_address='gcomm://private dns~~,~~ , ~~’
## gcs.fc_limit= wsrep_slave_threads * 5
wsrep_provider_options='gcache.size=512M;gcache.recover=yes;pc.recovery=yes;gcs.fc_limit=20;gcs.fc_factor = 0.8;gcs.fc_master_slave=yes'
wsrep_cluster_name=main_db_cluster
#
# Allow server to accept connections on all interfaces.
#
bind-address=0.0.0.0
#
# Optional setting
wsrep_slave_threads=4
#IST or SST will recover from loss
innodb_flush_log_at_trx_commit=0
innodb_doublewrite=1
wsrep_auto_increment_control=1
wsrep_gtid_mode=ON
wsrep_sst_auth=admin
wsrep_sst_method=mariabackup
# first use
#wsrep_sst_method=rsync
log-slave-updates=ON

Issue: The database sometimes crashes with a signal 11 error. (I understand that Galera Cluster does not support GET_LOCK and RELEASE_LOCK, but this code was written before I became aware of this limitation.)

Question: I know that GET_LOCK and RELEASE_LOCK are not supported in Galera Cluster, but could these queries potentially cause memory issues that lead to crashes? The queries typically run fine, but every few months, the database crashes. I am unsure whether the issue is related to the queries themselves or to memory overload.

Problem Scenarios: Scenario 1:

Between 00:08 and 00:09, around 1000 database transactions occur within a minute. Each transaction uses GET_LOCK and RELEASE_LOCK to acquire a lock and modify specific rows.
The database received a signal 11 error.
Around 25 minutes later, the database crashed with a signal 6 error.

Scenario 2:

Between 00:08 and 00:09, around 1000 database transactions occur within a minute. Each transaction uses GET_LOCK and RELEASE_LOCK to acquire a lock and modify specific rows.
The database encountered a segfault, received a signal 11 error, and crashed.
. .
The log shows that RELEASE_LOCK was related to the problematic query.

Please let me know if there's a more specific issue with the queries or if memory limitations on my instance could be causing these crashes.

I will show you part of the query. Sensitive information has been abbreviated, but the overall flow and query are the same.

This process involves retrieving stores and calculating the order amount for the day in a loop.  
The schema is structured in a typical manner, consisting of company(store), bucket, bucket_item, product and tx(company transaction).

[1] Retrieve the companies.
[2] Loop through the companies.
[3] Begin a transaction.
[4] Retrieve the company bucket (locking read).
[5] Create an internal transaction.
[6] Retrieve the products in the bucket (locking read).
[7] Update the total price of the bucket.
[8] Create a user lock.
[9] Register the store ledger(tx).
[10] Release the user lock.
[11] Link the order to the ledger(tx).
[12] The transactions are then sequentially closed.


-----
[1]
Select company 
Result : 1000

[2]
loop {

[3]


[4]
SELECT
    b.id                    AS id
FROM bucket AS b
WHERE
b.id IN (#{bucketSeq})
FOR UPDATE

[5]


[6]
WITH product_mapping AS (
    SELECT
        p.id                     AS product_seq
    FROM product p
)
SELECT
    bi.id                            AS id,
    bi.product_seq                   AS product_seq
FROM bucket_item AS bi
         LEFT JOIN product_mapping pm ON pm.product_seq = bi.product_seq
WHERE bi.bucket_seq = #{bucketSeq}
    AND bi.type=#{bucketItemType}
GROUP BY bi.id
FOR UPDATE

[7]
UPDATE bucket AS b
SET b.total_price=#{totalPrice},
    b.total_cost=#{totalCost},
    b.priced_at=#{pricedAt},
    b.total_price_tax_free=
        (SELECT IFNULL(SUM(bi.unit_price * bi.amount), 0)
         FROM bucket_item AS bi
         JOIN product AS p ON p.id = bi.product_seq
         WHERE bi.bucket_seq = b.id)
WHERE b.id=#{bucketSeq}

[8]

SELECT GET_LOCK(CONCAT("AA_", #{key}), #{timeoutSeconds})


[9]
INSERT INTO tx (
    happened_at,
    balance
)
    (
        SELECT
            #{happenedAt} AS happened_at,
            IFNULL(lbb.balance,0)+(#{amount}) AS balance
        FROM company AS c
             LEFT JOIN (SELECT
                            t.company_seq AS company_seq,
                            t.balance AS balance
                        FROM tx AS t
                        WHERE t.company_seq = #{companySeq}
                        ORDER BY t.happened_at DESC
                        LIMIT 1
        ) AS lbb ON c.id=lbb.company_seq
        WHERE c.id=#{companySeq}
        LIMIT 1
    )

[10]
< release lock>
SELECT RELEASE_LOCK(CONCAT("AA_", #{key}))


[11]
INSERT INTO bucket_tx (
    happened_at,
    bucket_seq,
    tx_seq
)
    (
        SELECT
            t.happened_at AS happened_at,
            b.id AS bucket_seq,
            t.id AS tx_seq
        FROM
            seq_1_to_1
                JOIN bucket AS b ON b.id = #{bucketSeq}
                JOIN tx AS t ON t.id = #{txSeq}
        WHERE
            NOT EXISTS (
                    SELECT
                        1
                    FROM
                        bucket_tx AS bt
                    WHERE
                        bt.tx_seq = #{txSeq}
                      AND bt.bucket_seq = #{bucketSeq}
                )
        LIMIT 1
    )

[12]


}

samsamsamsmasma · Accepted Answer

It was a problematic database version. We need to proceed with a version update.

Reference: https://jira.mariadb.org/browse/MDEV-24143

MariaDB 10.5.13 with Galera Cluster: Occasional Signal 11 Crashes During High Transaction Load with GET_LOCK/RELEASE_LOCK

Answers (1)

Related Questions