sarlacii

Reputation: 563

QEMU KVM disk IO/SQL replication issue on one of two identical cloned VMs

I have a system running two QEMU KVM virtual machines that are identical clones of one another. Both VMs replicate from the same master MySQL DB. One VM (vm-01) carries an active load and is running fine. However, the other (standby) VM (vm-02) suddenly fell behind with replication at 08:00 this morning. Replication is still running properly, but it keeps falling further behind at a slow rate (about 1 s behind for every 10 s of real time). Until now, vm-02 had run perfectly for months.
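For reference, a minimal sketch of how the lag figure can be sampled on a replica. It assumes the `mysql` client can log in without prompting (for example via `~/.my.cnf`); the helper name `lag_seconds` is hypothetical and only extracts the `Seconds_Behind_Master` field from `SHOW SLAVE STATUS\G` output.

```shell
#!/bin/sh
# lag_seconds (hypothetical helper): read "SHOW SLAVE STATUS\G" output on
# stdin and print the value of the Seconds_Behind_Master field.
lag_seconds() {
    awk '$1 == "Seconds_Behind_Master:" { print $2 }'
}

# Example use on the replica (assumes passwordless client login):
#   mysql -e 'SHOW SLAVE STATUS\G' | lag_seconds
```

Taking two samples a known interval apart and comparing the values gives the rate at which the replica is falling behind.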

After checking all the usual suspects (CPU load, disk space, SQL query errors, etc.), everything turns out to be fine except for the virtual disk IO, specifically the write requests (WRRQ). On the host machine:

virt-top 16:01:35 - x86_64 16/16CPU 1596MHz 128915MB
3 domains, 2 active, 2 running, 0 sleeping, 0 paused, 1 inactive D:0 O:0 X:0
CPU: 1.8%  Mem: 32768 MB (32768 MB by guests)

   ID S RDRQ WRRQ RXBY TXBY %CPU %MEM    TIME   NAME                                                                                                     
    3 R    3    1 113K  20K  1.3 12.0  62d21:21 vm-01-ubuntu
    9 R    0  563  97K  11K  0.5 12.0  83:09:51 vm-02-ubuntu
    -                                           (vm-Clone-ubuntu)

Both VMs have binary logging disabled, so they only write the relay log. The active machine (vm-01-ubuntu) is handling thousands of RADIUS requests in addition to the exact same master SQL commands, and it is running happily with only a few write requests. The standby machine, however, is falling behind while issuing hundreds of write requests. That might be related to replication catching up, but why so slowly?

Checking disk IO inside the VMs:

vm-01:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad01)   18/09/2019      _i686_  (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          12,04    0,02    9,85   13,87    0,13   64,09
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0,00    13,91    0,91  147,67     5,20    16,05     0,29     0,11    0,72    0,57    0,73   0,04   0,65

vm-02:~# iostat -x
Linux 4.4.0-141-generic (vm-finrad02)   18/09/2019      _i686_  (1 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,26    0,01    0,25    6,46    0,09   92,93
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0,00     1,22    0,00   34,19     0,20    21,43     1,26     0,00    0,14    0,96    0,14   0,03   0,09

This doesn't show any glaring issues, especially since the busier VM (vm-01) is doing more IO, as expected.

The host machine has 128 GB of RAM, plenty of SSD space, and is only running at 30% CPU usage. There are no RAID or drive issues.

Any suggestions on where to check next, given that the WRRQ count is the only evidence so far of why vm-02 is falling behind? Or am I chasing a red herring?

Upvotes: 0

Views: 164

Answers (1)

sarlacii

Reputation: 563

The issue is related to the guest OS, not the VM setup.

On Ubuntu, the apt auto-update feature is quite aggressive. In the case of the suspect VMs, apt was constantly attempting to update the repos, writing at a sustained 16 MB/s. This is probably related to the fact that the guest OS is Ubuntu 14.04, whose repos are no longer maintained.
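One way to see which guest process is responsible for heavy writes is to compare the per-process `write_bytes` counters that Linux exposes in `/proc/<pid>/io`. The following is only a sketch, not necessarily how the problem was found here: the function name `top_writers` and the `PROC_ROOT` override are hypothetical, and reading other processes' counters generally requires root.

```shell
#!/bin/sh
# top_writers [N]: list the N processes (default 10) with the largest
# cumulative write_bytes counter, as "bytes pid name", largest first.
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
top_writers() {
    proc_root="${PROC_ROOT:-/proc}"
    for io in "$proc_root"/[0-9]*/io; do
        [ -r "$io" ] || continue              # skip unreadable or vanished entries
        pid="${io%/io}"
        pid="${pid##*/}"
        bytes="$(awk '$1 == "write_bytes:" { print $2 }' "$io" 2>/dev/null)"
        [ -n "$bytes" ] || continue
        name="$(cat "$proc_root/$pid/comm" 2>/dev/null)"
        printf '%s %s %s\n' "$bytes" "$pid" "${name:-?}"
    done | sort -rn | head -n "${1:-10}"
}
```

Calling `top_writers 5` twice, a few seconds apart, shows which counters are growing fastest, since the values are cumulative since each process started.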

The solution was to disable auto-updates and run updates manually instead. As root:

service unattended-upgrades stop
echo manual | tee /etc/init/unattended-upgrades.override

Then edit the apt config to disable the automatic package-list refresh, by changing the value in the line APT::Periodic::Update-Package-Lists "1"; to "0":

cd /etc/apt/apt.conf.d/
cp 10periodic 10periodic.original
# Read from the backup copy: redirecting into the same file being read would truncate it first
awk '$1=="APT::Periodic::Update-Package-Lists" {printf "%s %s\n",$1,"\"0\";"; next}1' 10periodic.original > 10periodic
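As a sketch of the same substitution wrapped in a function, the version below takes separate input and output paths, so the file being read is never the one the shell truncates for the redirect. The function name `disable_list_refresh` is hypothetical; the awk program matches on the first whitespace-separated field, as in the command above.

```shell
#!/bin/sh
# disable_list_refresh SRC DST (hypothetical helper): copy SRC to DST,
# rewriting the Update-Package-Lists setting to "0" and passing every
# other line through unchanged. SRC and DST must be different files.
disable_list_refresh() {
    src="$1"
    dst="$2"
    awk '$1 == "APT::Periodic::Update-Package-Lists" {
             printf "%s \"0\";\n", $1
             next
         }
         1' "$src" > "$dst"
}

# Example use, as root, from /etc/apt/apt.conf.d/:
#   disable_list_refresh 10periodic.original 10periodic
```

Running `grep Update-Package-Lists 10periodic` afterwards is a quick way to confirm the file still has content and carries the new value.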

Lastly, remove the repos from the auto-upgrade list:

nano /etc/apt/apt.conf.d/50unattended-upgrades

Find the "Unattended-Upgrade::Allowed-Origins" section and comment out these lines, so that they read:

//"${distro_id}:${distro_codename}-security";
//"${distro_id}ESM:${distro_codename}";

I then rebooted the VM, and all has been well.

Upvotes: 0
