Advice: Outage in QRIScloud NeCTAR Volume Storage: 2016-01-20 to 2016-01-23 resolved

There is currently a problem with the Ceph storage servers that provide QRIScloud's NeCTAR Volume Storage.  QRIScloud NeCTAR instances with volumes attached to them may be experiencing disk errors, resulting in lockups.

We are investigating the problem urgently.

UPDATE 2016-01-20 14:10 - The Ceph cluster (QLD-Ceph, which stores the NeCTAR Volume Storage data) has sustained a hardware failure.  Because it does not have enough spare storage capacity to cope with a node failure, QLD-Ceph is now running in degraded mode and is likely to go into read-only mode very soon. If you have critical data on your volume that you have not backed up, you should back it up NOW.

When the Ceph cluster goes into read-only mode:

  • Instances that are booted from a volume in the QRIScloud AZ will lock up.
  • Instances that have a volume attached and mounted will experience hard file system errors.  This may result in file system corruption.

Also, if you no longer need your QRIScloud NeCTAR volume, please contact QRIScloud Support (07 3346 4202) urgently.

UPDATE 2016-01-20 16:05 - All instances with affected volumes attached have been paused and locked to prevent damage to their file systems.

UPDATE 2016-01-21 09:25 - Significant progress was made overnight.  A lot of disk space has been freed by sacrificing a large volume.  The Ceph cluster now needs to finish re-replicating so that all disk blocks have replicas on two servers. When that is done, we can safely unpause the paused instances.

UPDATE 2016-01-21 14:40 - The re-replication of QRIScloud NeCTAR Volume Storage is continuing, and we are optimistic that it will be completed overnight.  The next update will be provided tomorrow morning.

UPDATE 2016-01-22 09:30 - Our latest estimate is that the QRIScloud NeCTAR Volume Storage re-replication will be completed this afternoon.

UPDATE 2016-01-23 11:00 - The Ceph cluster is now running safely, and all paused instances have been unpaused and unlocked.  If there are problems with these instances, please report them via the existing support tickets.

Comments

  • Derek Benson
    I added a 1/2 TB volume yesterday to an instance to back up some data temporarily during a reboot. I'd be happy to blow that away if you thought it would help. Derek
  • Matthew DeMaere
    Updates to status much appreciated. Good luck with the maintenance.