Earlier this year, a major hardware failure affected standard NeCTAR volume storage in QRIScloud. While we were able to restore services to this volume storage cluster, we were not able to resolve the underlying issues. To resolve them, we need to perform a complete rebuild of the storage cluster. Because of the cluster's architecture (Ceph), we have been able to split its capacity so that part continues to serve the existing volumes while the remainder is used to start the rebuild.
Since the incident, we have been investigating ways to migrate users transparently from the original cluster to the rebuilt cluster. Unfortunately, our exhaustive investigation yielded no viable method for QRIScloud to perform this migration transparently, so we need affected volume storage users to help us by migrating data from their current volumes onto brand new volumes.
Volumes created before the 19th of April, 2016 in the QRIScloud availability zone with a volume type of QRIScloud are affected by this issue. Volumes that have a QRIScloud-rds volume type are not affected by this issue. QRIScloud support staff will contact users with affected volumes directly in the near future.
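The criteria above can be expressed as a small check. The following is a minimal Python sketch, not an official QRIScloud tool: the function name and its parameters are hypothetical, and it assumes the creation timestamp is compared as a plain calendar date with no timezone adjustment.

```python
from datetime import date

# Cutoff taken from this notice: volumes created before this date are affected.
CUTOFF = date(2016, 4, 19)


def is_affected(created, availability_zone, volume_type):
    """Return True if a volume matches the affected criteria in this notice.

    Hypothetical helper: `created` is the volume's creation date, and the
    other two arguments are the strings shown for the volume in the dashboard.
    """
    return (
        created < CUTOFF
        and availability_zone == "QRIScloud"
        and volume_type == "QRIScloud"
    )


print(is_affected(date(2016, 3, 1), "QRIScloud", "QRIScloud"))      # True
print(is_affected(date(2016, 3, 1), "QRIScloud", "QRIScloud-rds"))  # False
print(is_affected(date(2016, 5, 2), "QRIScloud", "QRIScloud"))      # False
```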
Volume creation and deletion can be done through the NeCTAR dashboard or by using the "cinder" command-line utility. Creating a new volume in the QRIScloud availability zone with no volume type specified will automatically create the new volume in the rebuilt cluster.
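For users who prefer to script volume creation, the sketch below builds a "cinder create" command with the availability zone set and no volume type. It is an illustration only: the function names are hypothetical, and running it assumes the cinder client is installed and the usual OpenStack credentials (the OS_* environment variables) are already set.

```python
import subprocess


def build_create_command(size_gb, availability_zone="QRIScloud"):
    """Build the argument list for `cinder create`.

    No --volume-type option is passed, which per this notice places the
    new volume in the rebuilt cluster. The function name is hypothetical.
    """
    if size_gb <= 0:
        raise ValueError("size_gb must be a positive number of gigabytes")
    return ["cinder", "create", "--availability-zone", availability_zone, str(size_gb)]


def create_volume(size_gb):
    """Run the command; requires the cinder client and OS_* credentials."""
    subprocess.run(build_create_command(size_gb), check=True)


print(" ".join(build_create_command(100)))
# cinder create --availability-zone QRIScloud 100
```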
The hard deadline for completing the volume migrations is the 29th of July, 2016. We strongly encourage instance administrators to migrate data onto new volumes as soon as practical over the next six weeks.
Depending on your use case for volume storage, there may be several options available to you for performing this migration.
Use-case: volume storage is being used as scratch space
Some users of volume storage use their volumes as large-scale working space for computations running on their instances. This use-case is the simplest to deal with. We recommend the following:
- Turn off your job scheduler, and wait for your current set of computations to finish.
- Check that the mounted volume contains nothing that needs to be kept.
- Unmount the volume (e.g. "sudo umount <mount-point>") and detach it from your instance.
- Delete the volume.
- Create a new volume of an appropriate size via the NeCTAR Dashboard.
- Attach your new volume to your instance, and reboot the instance.
- Check that the new volume has been mounted properly.
- Restart your job scheduler.
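The check that the new volume has been mounted properly can be scripted. This is a minimal sketch assuming a Linux instance: the helper name is hypothetical, and "/" is used only so that the example runs anywhere; substitute the mount point of your own volume.

```python
import os
import shutil


def check_mount(mount_point):
    """Return (is_mounted, total_gb) for a directory.

    Hypothetical helper: os.path.ismount reports whether the path is the
    root of a mounted filesystem; total_gb is that filesystem's total size.
    """
    mounted = os.path.ismount(mount_point)
    total_gb = shutil.disk_usage(mount_point).total / 1e9
    return mounted, total_gb


# "/" is a placeholder; replace it with your volume's mount point.
mounted, total_gb = check_mount("/")
print(f"mounted={mounted} size={total_gb:.1f} GB")
```

If the reported size does not match the size of the new volume, the directory is probably still backed by the instance's primary file system rather than by the volume.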
Use-case: volume holds a database or file collection
If your volume contains a database or a file collection that you don't want to lose, there are a couple of options:
- You could back up the database or files to external storage, create a new volume, and then restore them onto the new volume.
- You could create a second volume, attach and mount it on an instance, then use a file copy utility to copy the relevant files to the new volume.
In either case, you will need to take appropriate steps to ensure that you get a consistent copy of the data that is migrated. This will probably entail shutting down user-facing services.
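One way to confirm that a file copy is complete is to compare checksums of the old and new volumes after copying. The sketch below is an illustration under stated assumptions, not a QRIScloud-supplied tool: the function names are hypothetical, it compares file contents only (not ownership or permissions), and it does not report extra files that exist only in the destination.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def compare_trees(source, destination):
    """Return sorted relative paths that are missing or differ in destination."""
    src = tree_digests(source)
    dst = tree_digests(destination)
    return sorted(name for name, digest in src.items() if dst.get(name) != digest)


# Example call, with hypothetical mount points for the old and new volumes:
#   problems = compare_trees("/mnt/old-volume", "/mnt/new-volume")
# An empty list means every source file has an identical copy.
```

Files that change while the copy is running will show up as mismatches, which is one reason to shut down user-facing services before copying.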
Use-case: boot from volume
Some users are currently using volume storage because they needed to "boot from volume". This used to be necessary for some people because the (old) "m1" NeCTAR flavors limited an instance's primary filesystem to 10GB. That was problematic for boot images that had a large amount of software pre-installed.
Unfortunately, there is no good option for migrating "boot from volume" volumes. If you still need to use this, you are going to have to rebuild your instance starting with a fresh image and a new volume.
These days, there are "m2" flavors which provide a 30GB primary file system on local disk. This should be sufficient for images with large software installs. If your image fits easily into a 30GB file system, you may no longer need to use "boot from volume".
Alternatively, if you are using "boot from volume" as a convenience to hold large amounts of data, or a database, there are better ways to achieve this. You could:
- use the ephemeral file system,
- use normal volume storage, and attach / mount it onto your instance, or
- use NFS storage.
What will happen if you miss the deadline
As stated above, the 29th of July is a hard deadline. It is constrained by NeCTAR's OpenStack "Liberty" upgrade schedule and we can't change it.
Any volumes that still remain on the old cluster at that time will be detached from their respective NeCTAR instances. Then the volumes will be saved to offline storage, and the old cluster will be shut down permanently. This will allow the servers to be rebuilt, and disk storage to be introduced into the new cluster.