A security patch to the Linux kernel in CentOS 6 that was recently pushed through the CentOS repositories is causing problems for NeCTAR users. The symptoms that people observe are as follows:
- The user's instance fails to reboot properly on the first and subsequent attempts to reboot following the installation of the kernel patch.
- The instance is typically in ACTIVE state according to the NeCTAR Dashboard, but is unresponsive; e.g. you cannot SSH to it or PING it.
- The instance's VNC console will either show a "kernel panic" error message, or it will be blank.
- Attempts to cure the problem with a "soft" or "hard" reboot do not help.
This kernel patch has introduced a bug into the CentOS 6 kernel which causes these failures when running on AMD hardware under an Ubuntu 16.04 hypervisor. The faulty kernel version number is 2.6.32-696.30.1.
See also Denis Lujanski's "CentOS 6 not booting with kernel-2.6.32-696.30.1.el6.x86_64" article in the NeCTAR "Known Issues" forum for more background information.
Reviving a stuck instance
If your instance has gotten into a state where it won't boot, in theory you can use the instance console to tell the "GRUB" bootloader to boot an older kernel. In practice, NeCTAR standard images have GRUB configured with a 5-second window for doing this, and that is typically too short.
To get past this point, you need to modify the "/boot/grub/grub.conf" file on the instance's root disk to change the GRUB timeout.
- If you have the OpenStack command-line tools installed (and the relevant knowledge), it is worth trying "nova rescue ..." or "openstack server rescue ..." to get access to the root disk.
- Otherwise, you should submit a NeCTAR or QRIScloud support request asking for help.
The previously mentioned article provides details on setting the GRUB timeout. We recommend increasing it to 30 seconds. That will give you time to get to the console and use the GRUB menu to boot an older (working) kernel. Your instance should then work until the next soft or hard reboot.
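As an illustration of the timeout change, the following is a minimal Python sketch that rewrites the timeout setting in the text of a GRUB legacy "grub.conf" file. It assumes the file uses the usual "timeout=5" (or "timeout 5") form; the function name set_grub_timeout is hypothetical, and the procedure in the previously mentioned article remains the authoritative one.

```python
import re


def set_grub_timeout(conf_text: str, seconds: int = 30) -> str:
    """Return grub.conf text with the global timeout set to `seconds`.

    Hypothetical helper for illustration only. It assumes a GRUB legacy
    config where the timeout is written as "timeout=N" or "timeout N".
    """
    # Match the first timeout line, keeping its indentation and separator.
    pattern = re.compile(
        r"^([ \t]*timeout)([ \t]*=[ \t]*|[ \t]+)\d+[ \t]*$",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{seconds}",
        conf_text,
        count=1,
    )
    if count == 0:
        # No timeout line was found, so add one at the top of the file.
        new_text = f"timeout={seconds}\n" + conf_text
    return new_text
```

To use it, read "/boot/grub/grub.conf" from the rescued instance's root disk (wherever you have mounted it), pass the contents to set_grub_timeout, keep a backup copy of the original file, and write the result back.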
Longer term solutions
The bug was reported to the CentOS maintainers a couple of weeks ago, but they are yet to provide a proper fix. Instead, they have suggested a workaround that (unfortunately) we cannot apply in NeCTAR because it would have a significant performance impact on all NeCTAR instances.
We strongly recommend that all users of CentOS 6 on NeCTAR upgrade to a newer operating system at the earliest opportunity. Note that upgrading will give other benefits, including new tools, newer versions of existing tools and fewer library dependency problems for 3rd-party software.
As a short term alternative to upgrading to a newer OS, you could do one of the following:
- Remove the package for kernel version 2.6.32-696.30.1 and pin the kernel package to an older version; e.g. 2.6.32-696.29.1. This is NOT recommended, as it opts out of all further kernel security updates.
- Install and use a main-line kernel version 4.x.x-x.el6.elrepo.x86_64 from ELRepo.
The procedures for doing these two tasks are described in the previously mentioned article.
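After applying either alternative, it may be worth confirming that GRUB's default entry no longer points at the faulty kernel before you reboot. The following Python sketch shows one way to check this from the text of "grub.conf". It assumes a GRUB legacy layout with a numeric "default=N" line and one "kernel" line per "title" entry; the function names are hypothetical.

```python
import re

# Faulty kernel version named in this article.
FAULTY_KERNEL = "2.6.32-696.30.1"


def kernels_by_entry(conf_text: str) -> list:
    """Return the kernel line of each 'title' entry, in menu order."""
    entries = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("title"):
            entries.append("")  # start a new menu entry
        elif stripped.startswith("kernel") and entries:
            entries[-1] = stripped
    return entries


def default_entry_is_faulty(conf_text: str, faulty: str = FAULTY_KERNEL) -> bool:
    """Return True if the default GRUB entry boots the faulty kernel.

    Simplification: a missing or non-numeric 'default' is treated as 0.
    """
    match = re.search(r"^[ \t]*default[ \t]*=?[ \t]*(\d+)", conf_text, re.MULTILINE)
    index = int(match.group(1)) if match else 0
    entries = kernels_by_entry(conf_text)
    if index >= len(entries):
        return False
    return faulty in entries[index]
```

If the check returns True, the instance would still boot the faulty kernel on its next reboot, so revisit the steps in the previously mentioned article before restarting.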