Hardware issues have affected a small number of servers running QRISdata and QRIScompute services. Services appear to have been impacted from approximately 14:00 on Saturday (2018-05-12). Most services have been restored as of 17:00 Saturday. Work to restore the remaining services will recommence during business hours on Monday.
UPDATE - 2018-05-14 - 09:15
The hardware issues on Saturday caused a number of compute nodes and infrastructure servers to crash and/or restart. Many of the consequent problems have been fixed. For example, the QRIScloud LDAP outage that was affecting Nextcloud and (possibly for some users) HPC access was fixed this morning.
There is an outstanding problem with the Ceph servers in the "QRIScloud RDS" storage cluster. A Ceph recovery is in progress, but the outage is impacting NeCTAR instances that rely on the cluster for volume storage, or for backing storage for root and ephemeral file systems. The latter includes instances in the QFAB pool and instances running on QRIScloud "stage 1" compute nodes.
Affected instances will appear to lock up. Rebooting is not advisable at this stage, as it will not reestablish access to the missing instance file systems. The best option is to wait while the cluster recovers.
UPDATE - 2018-05-14 - 12:45
A QRIScloud engineer is currently in the data centre inspecting the hardware with a view to swapping power supplies as a possible short-term fix.
UPDATE - 2018-05-14 - 13:55
The Ceph server that was down is now working and the storage cluster has been restored to service. Instances that were "stuck" because they were waiting on a file system operation should now be working. If you attempted to reboot an affected instance during the outage, you may need to reboot it again.
If you continue to see problems, please raise a QRIScloud support ticket by emailing <firstname.lastname@example.org>.