Advice: QRIScloud-HA outage: 2017-04-29 9am onwards - completed


(Note: the outage details have changed since this page was first published)

On Saturday April 29th starting at 9am, we will be upgrading Ceph fileserver clients on all compute nodes in the QRIScloud-HA (NSP) cluster.  We will also need to patch and reboot all compute nodes (hypervisors) and all customer instances in the cluster to address a problem with live migration of instances.

The planned schedule is as follows:

  1. Soon after 9am, we will shut down all customer instances that are currently running, suspended or paused.
  2. Patch and reboot the hypervisors.
  3. Test if the live migration issue has been fixed.
  4. If it has been fixed:
    1. Restart all instances that were running.
    2. Update the Ceph clients, using live migration to move instances to the upgraded nodes.
  5. Otherwise:
    1. Upgrade the Ceph clients while instances are shut down.
    2. Restart all instances that were running.

If the fix for live migration succeeds, customer instance outages should be about 30 minutes.  If not, then outages of about 150 minutes are expected.  The total planned outage window is from 9am to 12 noon.
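
For readers who prefer to see the branching spelled out, the schedule above can be sketched as a small Python function. This is only an illustration of the decision logic and the outage estimates given in this notice. The function name `outage_plan` and its parameter are hypothetical, and this is not the tooling used by the operators.

```python
from typing import List, Tuple


def outage_plan(live_migration_fixed: bool) -> Tuple[List[str], int]:
    """Return the ordered steps and the approximate instance outage in minutes.

    Hypothetical helper: it mirrors the published schedule, nothing more.
    """
    # Steps 1 to 3 are the same on both branches.
    steps = [
        "Shut down all running, suspended or paused instances",
        "Patch and reboot the hypervisors",
        "Test whether the live migration issue has been fixed",
    ]
    if live_migration_fixed:
        # Instances come back first; Ceph clients are upgraded afterwards,
        # with live migration moving instances to the upgraded nodes.
        steps += [
            "Restart all instances that were running",
            "Update the Ceph clients using live migration",
        ]
        outage_minutes = 30
    else:
        # Instances stay down until the Ceph clients have been upgraded.
        steps += [
            "Upgrade the Ceph clients while instances are shut down",
            "Restart all instances that were running",
        ]
        outage_minutes = 150
    return steps, outage_minutes


if __name__ == "__main__":
    for fixed in (True, False):
        plan, minutes = outage_plan(fixed)
        print(f"live migration fixed: {fixed}, outage about {minutes} minutes")
        for number, step in enumerate(plan, start=1):
            print(f"  {number}. {step}")
```

The only difference between the two branches is whether instances are restarted before or after the Ceph client upgrade, which is what accounts for the longer outage estimate.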

In the QRIScloud-HA cluster, instances' root and ephemeral file systems are held on the Ceph fileservers distributed across the cluster. There is a small risk that these might be damaged by the Ceph client upgrade. Customer Volume Storage volumes are held on a separate Ceph cluster, but the same Ceph clients are used for both sets of fileservers, so those volumes are potentially at risk as well. 

We therefore advise that QRIScloud-HA customers should take the following precautions:

  • If your services need to be shut down in a particular way, or if you have user-facing services that need to be put into "maintenance mode", please do this before the start of the maintenance window; i.e. before 9am.
  • Make sure that you have up-to-date backups of all data, software and configurations that are held on root file systems, ephemeral file systems and volumes.
  • Make sure that you have copies of your backups stored outside of the QRIScloud-HA cluster. Copies held in NeCTAR Object Storage or in QRISdata collection storage are fine.
  • Avoid doing significant planned maintenance work on your production systems during the outage window.

UPDATE - 2017-05-02 - The changes happened without any incident, though the attempt to get live migration to work did not succeed.


