Dan

Reputation: 6504

Google Compute instance won't mount persistent disk, maintains ~100% CPU

During some routine use of my web server (saving posts via WordPress), my instance suddenly jumped up to 400% CPU usage and wouldn't come back down below 100%. Restarting and stopping/starting the instance didn't change anything.

Looking at the last bit of my serial output:

[    0.678602] md: Waiting for all devices to be available before autodetect
[    0.679518] md: If you don't use raid, use raid=noautodetect
[    0.680548] md: Autodetecting RAID arrays.
[    0.681284] md: Scanned 0 and added 0 devices.
[    0.682173] md: autorun ...
[    0.682765] md: ... autorun DONE.
[    0.683716] VFS: Cannot open root device "sda1" or unknown-block(0,0): error -6
[    0.685298] Please append a correct "root=" boot option; here are the available partitions:
[    0.686676] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    0.688489] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-30-generic #34~14.04.1-Ubuntu
[    0.689287] Hardware name: Google Google, BIOS Google 01/01/2011
[    0.689287]  ffffea00008ae400 ffff880024ee7db8 ffffffff817af477 000000000000111e
[    0.689287]  ffffffff81a7c6c0 ffff880024ee7e38 ffffffff817a9338 ffff880024ee7dd8
[    0.689287]  ffffffff00000010 ffff880024ee7e48 ffff880024ee7de8 ffff880024ee7e38
[    0.689287] Call Trace:
[    0.689287]  [<ffffffff817af477>] dump_stack+0x45/0x57
[    0.689287]  [<ffffffff817a9338>] panic+0xc1/0x1f5
[    0.689287]  [<ffffffff81d3e5f3>] mount_block_root+0x210/0x2a9
[    0.689287]  [<ffffffff81d3e822>] mount_root+0x54/0x58
[    0.689287]  [<ffffffff81d3e993>] prepare_namespace+0x16d/0x1a6
[    0.689287]  [<ffffffff81d3e304>] kernel_init_freeable+0x1f6/0x20b
[    0.689287]  [<ffffffff81d3d9a7>] ? initcall_blacklist+0xc0/0xc0
[    0.689287]  [<ffffffff8179fab0>] ? rest_init+0x80/0x80
[    0.689287]  [<ffffffff8179fabe>] kernel_init+0xe/0xf0
[    0.689287]  [<ffffffff817b6d98>] ret_from_fork+0x58/0x90
[    0.689287]  [<ffffffff8179fab0>] ? rest_init+0x80/0x80
[    0.689287] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    0.689287] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

(Not sure if it's obvious from that, but I'm using the standard Ubuntu 14.04 image)

I've tried taking snapshots and mounting them on new instances, and I've now even deleted the instance and attached the disk to a new one, but I still get the same issue and exactly the same serial output.

I really hope my data has not been hopelessly corrupted. Does anyone have suggestions on recovering data from a persistent disk?

Note that the accepted answer for "Google Compute Engine VM instance: VFS: Unable to mount root fs on unknown-block" did not work for me.

Upvotes: 0

Views: 1015

Answers (2)

Nostalg.io

Reputation: 3752

I posted this on another question, but this question is worded better, so I'll re-post it here.

What Causes This?

That is the million dollar question. After inspecting my GCE VM, I found that there were 14 different kernels installed, taking up several hundred MB of space. Most of the kernels didn't have a corresponding initrd.img file, and were therefore not bootable (including 3.19.0-39-generic).

I certainly never went around trying to install random kernels, and once removed, they no longer appear as available upgrades, so I'm not sure what happened. Seriously, what happened?
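
If you want to check whether your own VM is in the same state, one quick way (just standard Ubuntu tooling, not something from the support thread) is to compare the installed kernel packages against what's actually in /boot:

# List installed kernel packages and the kernel currently running
$ dpkg -l 'linux-image-*' | grep ^ii
$ uname -r

# Every bootable kernel should have both a vmlinuz-* and an initrd.img-* entry here
$ ls -l /boot/vmlinuz-* /boot/initrd.img-*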

Edit: New response from Google Cloud Support.

I received another disconcerting response. This may explain the additional, errant kernels.

"On rare occasions, a VM needs to be migrated from one physical host to another. In such case, a kernel upgrade and security patches might be applied by Google."

How to recover your instance...

After several back-and-forth emails, I finally received a response from support that allowed me to resolve the issue. Be mindful that you will have to change things to match your own VM.

  1. Take a snapshot of the disk first in case we need to roll back any of the changes below.

  2. Edit the properties of the broken instance to disable this option: "Delete boot disk when instance is deleted"

  3. Delete the broken instance.

    IMPORTANT: make sure you do not select the option to delete the boot disk; otherwise, the disk will be removed permanently!

  4. Start up a new temporary instance.

  5. Attach the broken disk to the temporary instance (its root partition will appear as /dev/sdb1).

  6. When the temporary instance is booted up, do the following:

In the temporary instance:

# Run fsck to fix any disk corruption issues
$ sudo fsck.ext4 -a /dev/sdb1

# Mount the disk from the broken vm
$ sudo mkdir /mnt/sdb
$ sudo mount /dev/sdb1 /mnt/sdb/ -t ext4

# Find out the UUID of the broken disk. In this case, the uuid of sdb1 is d9cae47b-328f-482a-a202-d0ba41926661
$ ls -alt /dev/disk/by-uuid/
lrwxrwxrwx. 1 root root 10 Jan 6 07:43 d9cae47b-328f-482a-a202-d0ba41926661 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Jan 6 05:39 a8cf6ab7-92fb-42c6-b95f-d437f94aaf98 -> ../../sda1

# Update the UUID in grub.cfg (if necessary)
$ sudo vim /mnt/sdb/boot/grub/grub.cfg

Note: This ^^^ is where I deviated from the support instructions.

Instead of modifying all the boot entries to set root=UUID=[uuid character string], I looked for all the entries that set root=/dev/sda1 and deleted them. I also deleted every entry that didn't set an initrd.img file. The top boot entry with correct parameters in my case ended up being 3.19.0-31-generic, but yours may be different.
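
Before syncing, it doesn't hurt to double-check the edited file. This is just a sanity check I'd suggest, not part of the support instructions: the first grep should print nothing once all the /dev/sda1 entries are gone, and every remaining menuentry should pair a linux line with an initrd line.

# Should print nothing if no entry still boots from the device name
$ grep -n 'root=/dev/sda1' /mnt/sdb/boot/grub/grub.cfg

# Every remaining entry should have both a linux and an initrd line
$ grep -n -E 'linux|initrd' /mnt/sdb/boot/grub/grub.cfg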

# Flush all changes to disk
$ sudo sync

# Shut down the temporary instance
$ sudo shutdown -h now

Finally, detach the disk from the temporary instance and create a new instance based on the fixed disk. It will hopefully boot.
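
If you prefer to script the console steps above, the rough gcloud equivalent looks like the sketch below. The instance names, disk name, and zone (broken-instance, temp-instance, fixed-instance, broken-disk, us-central1-a) are placeholders I've made up; substitute your own.

# Steps 2-3: keep the boot disk, then delete the broken instance
$ gcloud compute instances set-disk-auto-delete broken-instance --disk broken-disk --no-auto-delete --zone us-central1-a
$ gcloud compute instances delete broken-instance --zone us-central1-a

# Step 5: attach the broken disk to the temporary instance
$ gcloud compute instances attach-disk temp-instance --disk broken-disk --zone us-central1-a

# After the grub fix: detach the disk and boot a new instance from it
$ gcloud compute instances detach-disk temp-instance --disk broken-disk --zone us-central1-a
$ gcloud compute instances create fixed-instance --zone us-central1-a --disk name=broken-disk,boot=yes,auto-delete=no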

Assuming it does boot, you have a lot of work to do. If you have half as many unused kernels as I did, you'll want to purge the unused ones (especially since some are likely missing a corresponding initrd.img file).

I used the second answer (the terminal-based one) in this askubuntu question to purge the other kernels.

Note: Make sure you don't purge the kernel you booted with!
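
A minimal sketch of what that purge looks like, assuming apt/dpkg on Ubuntu 14.04; the version number below is just an example (one of the initrd-less kernels from my VM), not necessarily one you should remove:

# Confirm the kernel you are currently running -- never purge this one
$ uname -r

# List what's installed, purge an unused kernel, then regenerate the grub config
$ dpkg -l 'linux-image-*' | grep ^ii
$ sudo apt-get purge linux-image-3.19.0-39-generic
$ sudo update-grub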

Upvotes: 1

George

Reputation: 1110

In order to recover your data, you need to create a brand new instance that you can SSH into, and attach the corrupted disk to it as a secondary disk. More information can be found in this article. I would suggest taking a snapshot of the corrupted disk before attaching it, for backup purposes.
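
As a rough sketch of that flow with gcloud and standard mount commands (the disk and instance names and the zone are placeholders, and I'm assuming the data partition shows up as /dev/sdb1 with an ext4 filesystem):

# Back up the corrupted disk, then attach it to a working instance as a secondary disk
$ gcloud compute disks snapshot broken-disk --snapshot-names broken-disk-backup --zone us-central1-a
$ gcloud compute instances attach-disk recovery-instance --disk broken-disk --zone us-central1-a

# Inside the recovery instance, mount it read-only and copy your data off
$ sudo mkdir -p /mnt/recovery
$ sudo mount -o ro /dev/sdb1 /mnt/recovery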

Upvotes: 0
