Yhrn

Reputation: 1063

MySQL query slowing down until restart

I have a service that sits on top of a MySQL 5.5 database (InnoDB). The service has a background job that is supposed to run every week or so. At a high level the background job does the following (a rough sketch of one iteration is shown right after the list):

  1. Do some initial DB read and write in one transaction
  2. Execute UMQ (described below) with a set of parameters in one transaction.
    • If no records are returned we are done!
  3. Process the result from UMQ (this is a bit heavy so it is done outside of any DB transaction)
  4. Write the outcome of the previous step to DB in one transaction (this writes to tables queried by UMQ and ensures that the same records are not found again by UMQ).
  5. Goto step 2.
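Roughly, one iteration of steps 2-4 looks like this (the table and column names here are placeholders, not the real schema):

    -- Step 2: run UMQ in its own transaction (the query itself is described below).
    START TRANSACTION;
    -- <UMQ goes here>
    COMMIT;

    -- Step 3: heavy processing happens in application code, outside any transaction.

    -- Step 4: write the outcome in its own transaction so the same records are
    -- not returned by UMQ again (processed_items is a placeholder table).
    START TRANSACTION;
    INSERT INTO processed_items (work_item_id, outcome) VALUES (42, 'OK');
    COMMIT;

    -- Step 5: repeat from step 2 until UMQ returns no rows.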

UMQ - Ugly Monster Query: This is a nasty database query that joins a bunch of tables, has conditions on columns in several of these tables, and includes a NOT EXISTS subquery with some more joins and conditions. UMQ also has an ORDER BY and a LIMIT 1000. Even though the query is bad I have done what I can here: there are indexes on all columns that are filtered on, and all the joins go over foreign key relations.
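To give an idea of its shape, UMQ looks roughly like this (all table and column names below are made up for illustration; the real query is considerably bigger):

    -- Invented example of the UMQ shape: multi-table join, filters on several
    -- tables, a NOT EXISTS subquery with its own joins, ORDER BY and LIMIT.
    SELECT o.id, o.created_at, c.name
    FROM orders o
    JOIN customers c    ON c.id = o.customer_id
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products p     ON p.id = oi.product_id
    WHERE o.status = 'PENDING'
      AND c.country = 'SE'
      AND p.kind = 'PHYSICAL'
      AND NOT EXISTS (
            SELECT 1
            FROM shipments s
            JOIN shipment_items si ON si.shipment_id = s.id
            WHERE si.order_item_id = oi.id
              AND s.state = 'DISPATCHED'
          )
    ORDER BY o.created_at
    LIMIT 1000;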

I do expect UMQ to be heavy and take some time, which is why it's executed in a background job. However, what I'm seeing is rapidly degrading performance until it eventually causes a timeout in my service (maybe 50 times slower after 10 iterations).

First I thought it was because the data queried by UMQ changes (see step 4 above), but that wasn't it: if I took the last query (the one that caused the timeout) from the slow query log and executed it directly myself, I got the same behavior, but only until I restarted the MySQL service. After the restart, the exact same query on the exact same data that took >30 seconds before now took <0.5 seconds. I can reproduce this behavior every time by restoring the database to its initial state and restarting the process.
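For reference, the slow query log can be enabled at runtime like this (the 1-second threshold is just an example value):

    -- Enable the slow query log without a restart and log anything slower than 1 second.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;
    -- The log file location is controlled by slow_query_log_file:
    SHOW GLOBAL VARIABLES LIKE 'slow_query_log%';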

Also, using the trick described in this question, I could see that the query scans around 60K rows after a restart as opposed to 18M rows before. EXPLAIN tells me that around 10K rows should be scanned, and the result of EXPLAIN is always the same. No other processes are accessing the database at the same time, and the lock_time in the slow query log is always 0. SHOW ENGINE INNODB STATUS before and after restart gives me no hints.
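For reference, one way to get that "rows scanned" figure is the session handler counters, along these lines:

    -- Reset the session status counters, run the query, then inspect them.
    FLUSH STATUS;
    -- ... run UMQ here ...
    SHOW SESSION STATUS LIKE 'Handler_read%';
    -- Summing the Handler_read_* counters gives a rough count of rows examined.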

So finally the question: Does anybody have any clue of why I'm seeing this behavior? And how can I analyze this further?

I have the feeling that I need to configure MySQL differently in some way but I have searched and tested like crazy without coming up with anything that makes a difference.

Upvotes: 2

Views: 3353

Answers (2)

Radoslav Bodó

Reputation: 661

Thank you very much for the analysis and answer. I had been chasing this issue for several days during CI on MariaDB 10.1 and Bacula server 9.4 (Debian Buster).

The situation was that, after a fresh server installation during a CI cycle, the first two tests (backup and restore) ran smoothly on the unrestarted MariaDB server, and only the third test showed that one particular ugly query took about 20 minutes (building the directory tree during the restore process from a table with about 30k rows).

The problem would not go away unless the MariaDB server was restarted or the table had been analyzed. ANALYZE TABLE or the restart changed the cardinality of the fields and the internal query processing, exactly as stated in the linked article.

Upvotes: 1

Yhrn

Reputation: 1063

Turns out that the behavior I saw was the result of how the MySQL optimizer uses InnoDB statistics to decide on an execution plan. This article put me on the right track (even though it does not discuss exactly my problem). The most important thing I learned from it is that MySQL calculates these statistics at startup and then only once in a while. These statistics are then used to optimize queries.
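For completeness, the statistics-related settings on the server can be listed like this (the exact set of variables depends on the MySQL/MariaDB version):

    -- List the InnoDB statistics-related settings on this server
    -- (on 5.5 this typically includes innodb_stats_on_metadata and
    -- innodb_stats_sample_pages; newer versions add persistent statistics).
    SHOW GLOBAL VARIABLES LIKE 'innodb_stats%';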

The way I had set up the test data, the table T where most of the writes in step 4 happen started out empty. After each iteration T would contain more and more records, but the InnoDB statistics had not yet been updated to reflect this. Because of this, the MySQL optimizer always chose an execution plan for UMQ (which includes a JOIN with T) that worked well when T was empty but performed worse and worse the more records T contained.

To verify this I added an ANALYZE TABLE T; before every execution of UMQ, and the rapid degradation disappeared. Not lightning-fast performance, but acceptable. I also saw that leaving the database alone for half an hour or so (maybe a bit less, but at least more than a couple of minutes) would allow the InnoDB statistics to refresh automatically.
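Concretely, the per-iteration workaround boils down to this (the SHOW INDEX is only there as a sanity check on the estimates):

    -- Refresh the InnoDB statistics for T before running UMQ again, so the
    -- optimizer's row estimates are not still based on an empty table.
    ANALYZE TABLE T;

    -- Optional sanity check: the Cardinality column should now roughly track
    -- the real number of rows in T.
    SHOW INDEX FROM T;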

In a real scenario the relative differences in index cardinality for the tables involved in UMQ will look quite different and will not change as rapidly, so I have decided that I don't really need to do anything about it.

Upvotes: 3
