A hardware fault has been detected in one of our redundant network switches. This fault is the likely root cause of the intermittent network issues reported to be affecting some QRIScloud services.
In the short term, we will be operating with reduced redundancy while we investigate options for routing traffic around the faulty equipment. In the longer term, some of the impacted equipment has already been slated for decommissioning and will likely run in degraded mode for the remainder of its operation (to September).
Impacted services include the Awoonga HTC cluster and some QRIScloud Virtual Machines (VMs).
For Virtual Machines, you can determine whether your QRIScloud VM is affected by the reduced network redundancy by checking the CPU type with the command-line tool "lscpu". Affected VMs have the model name "AMD Opteron" (Stage 2 compute host). VMs with an "AMD EPYC Processor" (Stage 5 compute host) or "Intel(R) Xeon(R) CPU" (NSP compute host) model name are unaffected by this reduced redundancy.
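The check described above could be scripted along the following lines. This is only a minimal sketch, not an official QRIScloud tool: the function names are hypothetical, it assumes that lscpu reports the CPU in a "Model name" field, and it assumes that matching the model-name strings quoted in this notice as substrings is sufficient.

```python
import subprocess

# Model-name strings quoted in this notice; substring matching is an assumption.
AFFECTED = ("AMD Opteron",)  # Stage 2 compute host
UNAFFECTED = ("AMD EPYC Processor", "Intel(R) Xeon(R) CPU")  # Stage 5 / NSP hosts


def classify_model_name(model_name):
    """Return 'affected', 'unaffected', or 'unknown' for a CPU model name."""
    if any(s in model_name for s in AFFECTED):
        return "affected"
    if any(s in model_name for s in UNAFFECTED):
        return "unaffected"
    return "unknown"


def read_model_name():
    """Run lscpu and return its 'Model name' value, or '' if the field is absent."""
    output = subprocess.run(
        ["lscpu"], capture_output=True, text=True, check=True
    ).stdout
    for line in output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Model name":
            return value.strip()
    return ""
```

On a VM, calling classify_model_name(read_model_name()) would report the result. An "unknown" result means the model name matched none of the strings listed in this notice, in which case the Help Desk is the better source of truth.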
For the Awoonga HTC cluster, we will perform an emergency re-routing of its network connections this Friday evening (19th July) between 5:00pm and 5:30pm to bypass the affected network switch. This will result in a short network outage and may impact running jobs. The Awoonga job queues have been temporarily suspended and will be resumed once the network re-routing is complete.
We apologise for any inconvenience this may cause. If you have any questions, please contact the QRIScloud Help Desk at firstname.lastname@example.org.