QEMU KVM disk IO/SQL replication issue on one of two identical clone VMs

Question

I have a host running two QEMU KVM virtual machines that are identical clones of one another. Both VMs replicate from the same master MySQL database. One VM (vm-01) carries the active load and is running fine. The other, standby VM (vm-02) suddenly fell behind on replication at 08:00 this morning. Replication itself is running without errors, yet the lag keeps growing slowly (about 1s further behind for every 10s of real time). Until today, vm-02 had run without problems for months.
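To put a number on "1s behind for every 10s", the drift can be computed from two readings of Seconds_Behind_Master (from SHOW SLAVE STATUS) taken some time apart. This is only a rough sketch: the function names are made up for illustration, the sample values are hypothetical rather than measured, and the apply ratio assumes the master produces events at a steady rate.

```python
def lag_drift(samples):
    """Estimate how fast replication lag grows.

    samples: list of (wall_clock_seconds, seconds_behind) pairs, ordered by time.
    Returns lag growth in seconds per second of real time.
    """
    (t0, lag0), (t1, lag1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive time interval")
    return (lag1 - lag0) / (t1 - t0)


def apply_ratio(drift):
    """Approximate share of the master's event stream applied per real second.

    Assumes the master generates events at a steady rate.
    """
    return 1.0 - drift


# Hypothetical readings: lag grows from 120s to 180s over 600s of real time.
drift = lag_drift([(0, 120), (600, 180)])
ratio = apply_ratio(drift)
```

A drift of about 0.1 would mean the replica keeps up with roughly 90% of the incoming events, so the shortfall is small but steady rather than a full stall.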

After checking the usual suspects (CPU load, disk space, SQL query errors and so on), everything looks fine except for the virtual disk IO, specifically the write requests (WRRQ). On the host machine:

virt-top 16:01:35 - x86_64 16/16CPU 1596MHz 128915MB
3 domains, 2 active, 2 running, 0 sleeping, 0 paused, 1 inactive D:0 O:0 X:0
CPU: 1.8%  Mem: 32768 MB (32768 MB by guests)

   ID S RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME   NAME                                                                                                     
    3 R    3    1 113K  20K  1.3 12.0  62d21:21 vm-01-ubuntu
    9 R    0  563  97K  11K  0.5 12.0  83:09:51 vm-02-ubuntu
    -                                           (vm-Clone-ubuntu)

Both VMs have binary logging disabled, so they only write the relay log. The active machine (vm-01-ubuntu) handles thousands of RADIUS requests in addition to the exact same SQL statements from the master, and does so with only a few write requests. The standby machine, however, falls behind while issuing hundreds of write requests. Perhaps that is replication catching up, but why so slowly?

Checking disk IO inside the VMs:

vm-01:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad01)   18/09/2019      _i686_  (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,04    0,02    9,85   13,87    0,13   64,09
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0,00    13,91    0,91  147,67     5,20    16,05     0,29     0,11    0,72    0,57    0,73   0,04   0,65

vm-02:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad02)   18/09/2019      _i686_  (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,26    0,01    0,25    6,46    0,09   92,93
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0,00     1,22    0,00   34,19     0,20    21,43     1,26     0,00    0,14    0,96    0,14   0,03   0,09

This doesn't reveal any glaring issues, especially since the busier VM (vm-01) is doing more IO, as expected.
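One caveat with the readings above: run without an interval argument, iostat reports averages since boot, so a change that only started at 08:00 today may be diluted. Below is a minimal sketch for sampling the current write rate inside a guest. It assumes the standard Linux /proc/diskstats field layout and the vda device shown above; the function names are my own, for illustration.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats(text, device):
    """Return (writes_completed, sectors_written) for one device."""
    for line in text.splitlines():
        fields = line.split()
        # Fields: major, minor, name, then read counters, then write counters.
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[7]), int(fields[9])
    raise ValueError(f"device {device!r} not found")


def write_rates(before, after, seconds):
    """Turn two counter snapshots into (writes per second, kB written per second)."""
    writes = (after[0] - before[0]) / seconds
    kbytes = (after[1] - before[1]) * SECTOR_BYTES / 1024 / seconds
    return writes, kbytes


def sample(device="vda", interval=5.0, path="/proc/diskstats"):
    """Measure the write rate of `device` over `interval` seconds."""
    with open(path) as f:
        before = parse_diskstats(f.read(), device)
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read(), device)
    return write_rates(before, after, interval)
```

Calling sample() on each VM at the same time should give figures that are directly comparable with the per-interval WRRQ column from virt-top on the host.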

The host machine has 128 GB of RAM, plenty of SSD space, and is only running at 30% CPU usage. There are no RAID or drive issues.

Any suggestions on where to check next, given that the WRRQ count is the only evidence so far of why vm-02 is falling behind? Or am I chasing a red herring?
