Red October

Reputation: 689

MariaDB 10.4 random performance degradation

I have a server running 5.5.5-10.4.12-MariaDB-1:10.4.12+maria~bionic. The screenshot below shows a typical DB load:

[screenshot: typical database load]

So I have around 400-500 selects per second (mostly against a fairly small table with 500k records), 100-190 updates per second, and around 50-150 simultaneous connections.

My problem is: sometimes, for no apparent reason, the server ends up with 2000-3000 open connections/processes. According to SHOW FULL PROCESSLIST they are standard SQL queries, but in the 'Sending data' state and with 400-500 seconds of runtime. Of course, at that point the server freezes and cannot function normally. I say "for no apparent reason" because at these times I do not see any increase in the number of users or in activity on the website. Restarting the MariaDB service or completely rebooting the server helps to get out of this situation, but not always: sometimes even after a reboot I almost instantly get the same 2000-3000 frozen processes.

Has anyone encountered similar database behavior? I would be grateful for any ideas.

UPD:

  1. All of my SELECTs hit only one table (~500k records, no JOINs and/or subqueries), and most of them have LIMIT 1, so the amount of data returned is small.

  2. The error log shows a lot of records like this: 2020-08-26 22:12:35 787380 [Warning] Aborted connection 787380 to db: ... (Got timeout reading communication packets) (see the variable check after this list)

  3. innodb_lock_wait_timeout is 50 (default)

  4. The slow query log does not show anything unusual

  5. My optimizer_switch settings: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
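For the timeouts in points 2 and 3 (and the connection limit), the current values can be checked in one go. This is just a quick sketch; the variable list is my guess at what looks relevant, not something taken from the original question:

SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('wait_timeout', 'net_read_timeout',
                        'net_write_timeout', 'innodb_lock_wait_timeout',
                        'max_connections');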

Upvotes: 1

Views: 2491

Answers (4)

TheWitness

Reputation: 1

So, I've been struggling with queries hanging in the 'Sending data' state on a system with a very high write profile, and I used the solution above as well as reducing the dirty pages threshold and adding the FORCE INDEX (blah) query hint (due to a MariaDB bug). In my case the MariaDB version is in the 10.5.x series, specifically 10.5.13 right now. The configuration changes were as follows:

# -------------------------------------------------------------------
# Flush as often as possible to increase performance on high
# update tables
# -------------------------------------------------------------------
innodb_max_dirty_pages_pct_lwm = 0.0001
innodb_max_dirty_pages_pct     = 0.0001

# -------------------------------------------------------------------
# Thread handling
# -------------------------------------------------------------------
thread_handling             = pool-of-threads
thread_pool_size            = 112

Before this change I easily had over 50GB of dirty pages; now it averages around 7-10GB. Since the system is entirely NVMe, flushing the pages more often does not cause any appreciable increase in lag, and the combination of the two has helped.
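To see whether the flushing settings are actually keeping the dirty page volume down, the buffer pool counters can be watched; a minimal check, assuming nothing beyond the standard InnoDB status variables:

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_bytes_dirty';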

Note: As mentioned above, there is currently a query optimizer bug in MariaDB where the optimizer will not pick the correct index for queries. There are workarounds, but the best one at the moment is to use the FORCE INDEX (blah) hint on your larger queries; that was the third step taken here. In fact, I was just watching a video from the MariaDB Foundation where they advised always testing your queries on a new version of MariaDB before upgrading. As a workaround, I am now picking the index for MariaDB myself as often as possible. Here are the bug reports: MariaDB issue MDEV-25480 and MariaDB issue MDEV-25830.
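For illustration, a FORCE INDEX hint looks like this; the table and index names below are made up, only the shape of the hint matters:

-- hypothetical table and index names, purely to show the hint syntax
SELECT id, status
FROM orders FORCE INDEX (idx_customer_created)
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 1;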

Note: This system has 56 cores in SMT mode, and I've not seen much benefit from SMT mode at all. Likely will turn it off at some point.

Note: Though you can run EXPLAIN on both UPDATE and DELETE queries, you cannot use FORCE INDEX (blah) on either of them, so this bug really needs to be resolved by the MariaDB Foundation. Being able to use FORCE INDEX (blah) for those statement types would be a nice patch.
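EXPLAIN itself does accept data-changing statements, so you can at least inspect the plan the optimizer picked for them; again, the table and column names here are hypothetical:

-- plans for UPDATE/DELETE can be inspected, even though FORCE INDEX cannot be applied
EXPLAIN UPDATE orders SET status = 'done' WHERE customer_id = 42;
EXPLAIN DELETE FROM orders WHERE created_at < '2020-01-01';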

Upvotes: 0

Red October

Reputation: 689

The solution turned out to be quite simple: after studying the MariaDB documentation (especially this article: https://mariadb.com/kb/en/thread-pool-in-mariadb/), I added the following to my.cnf and the problem is gone:

thread_handling = pool-of-threads
thread_pool_size = 48
# 48 is the number of CPUs
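A quick way to confirm the pool is active after a restart; just a sketch using the standard thread pool variables and the Threadpool status counters:

SHOW GLOBAL VARIABLES LIKE 'thread_handling';
SHOW GLOBAL VARIABLES LIKE 'thread_pool%';
SHOW GLOBAL STATUS LIKE 'Threadpool%';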

Upvotes: 1

hsibboni

Reputation: 463

Anything in the error log?

If the database freezes, it is most likely a disk issue: the disk could be full (MariaDB freezes for 1 minute if it cannot write anything), and the disk can fill up if a temporary table fills it, or if an ALTER on a table using the copy algorithm does. Are you monitoring your disk usage (it is not in the image, and you should)? It could also be that the disk I/Os are all consumed by one query: then all queries would still run, but very slowly. So is the database stuck, or just really slow? It could also be a lock problem.

Since the queries are running for a long time (400-500 s), it is most likely not a lock: unless you have changed it, the lock wait timeout is shorter (at least on InnoDB: 50 s).

If you know that no ALTER TABLE is being run, and there is no disk issue (you might want to check the inodes too), it could still be a lock: run SHOW ENGINE INNODB STATUS\G to check.
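Besides SHOW ENGINE INNODB STATUS\G, lock waits can also be listed from the InnoDB information_schema tables that MariaDB 10.4 still ships; a minimal sketch, assuming those tables are available:

SELECT r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;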

You said that SHOW FULL PROCESSLIST shows only standard SQL requests, so most likely there is no ALTER TABLE.

If you have a poorly written query, a temporary table could fill your disk. You will need to EXPLAIN the queries shown by SHOW FULL PROCESSLIST to analyze this, and rewrite / optimize / limit the size of the result set of such queries; look for "Using temporary" (sometimes you can also have a sort on disk: "Using filesort"). The slow query log will tell you if there are queries using the disk (provided they are not killed when you restart the server).
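For example (the query below is invented, only the pattern matters), take one of the statements from SHOW FULL PROCESSLIST and check the Extra column of its plan:

-- look for "Using temporary" / "Using filesort" in the Extra column
EXPLAIN SELECT c.name, COUNT(*) AS cnt
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY c.name
ORDER BY cnt DESC;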

If you do not have time to optimize the queries, and they are big SELECTs slowing down your whole database just to display information to users (reporting), you can kill the queries that take too long with a script. This should be a last resort (the script that kills long queries could also log them, so that you can analyze them later).
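A minimal sketch of such a check (the 300-second threshold is arbitrary); it only generates the KILL statements, so you can log and review them before running them:

SELECT ID, TIME, LEFT(INFO, 80) AS query,
       CONCAT('KILL ', ID, ';') AS kill_stmt
FROM information_schema.PROCESSLIST
WHERE COMMAND = 'Query'
  AND TIME > 300
  AND INFO LIKE 'SELECT%';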

Temporary tables filling the disk or using all the I/O are the only cases where I have seen a database freeze and then recover after a reboot. For the cases where it freezes again right away, maybe a user is running the same query again (and again).

Edit

It could be that the problem is not your database but your web application: the error log message indicates that the database is killing some connections.

The combination of queries in 'Sending data' and aborted connections is unusual to me. Usually, aborted connections happen when the web application does not close its connections, and they sit in a Sleep state. You could check everything in this post:

  • check for network issues (firewall)
  • check for errors in your web application logs
  • check if max_allowed_packet is big enough (if your SELECTs return one row, it should not be a problem)

If there are sleeping queries, then you are not closing connections properly; you will eventually reach the max_connections limit, and no new connection can be made. Something is still not clear: is the database very slow, or is nothing happening at all? What is happening on the web server side?
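It is also worth seeing what the piled-up connections are actually doing and how close you are to the limit; a quick check, nothing here is specific to this setup:

SELECT COMMAND, STATE, COUNT(*) AS cnt
FROM information_schema.PROCESSLIST
GROUP BY COMMAND, STATE
ORDER BY cnt DESC;

SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';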

It could also be that the driver (MariaDB client) leaves the connection and the query in a 'Sending data' state because it is not fetching the rest of the data. If the client is buffering the output and is killed before it can finish (and it does not close the connection either), this could happen. It does not fit with the LIMIT 1, but it would explain why there are aborted connections and SELECT queries stuck in 'Sending data'. What language is your web application written in? I can think of PHP unbuffered queries with crashing PHP processes as a way to recreate such a situation, but it could be another language-specific issue. Anyway, this would be very rare.

Upvotes: 1

Gordan Bobić

Reputation: 1858

This sounds like a classic case of the query optimizer randomly going brain-dead. It is a long-standing heisenbug.

When you see the queries piled up, run SHOW EXPLAIN FOR thread_id for one of the piled-up IDs and check whether the query plan is nonsensical. If it is, edit the query app-side to include index hints that stop the optimizer from getting it wrong. If you cannot change the queries, you will have to fiddle with the optimizer_switch setting until you identify and disable the specific option that makes the optimizer go crazy.
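A minimal sketch of that workflow; the thread id and the rowid_filter flag below are only examples, not a diagnosis of this particular case:

-- find one of the piled-up threads
SELECT ID, TIME, STATE, LEFT(INFO, 80) AS query
FROM information_schema.PROCESSLIST
WHERE STATE = 'Sending data'
ORDER BY TIME DESC
LIMIT 5;

-- look at the plan that thread is actually executing
SHOW EXPLAIN FOR 12345;

-- if one optimizer feature turns out to be the culprit, it can be switched off
SET GLOBAL optimizer_switch = 'rowid_filter=off';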

Upvotes: 1
