For the past few weeks, users have been experiencing intermittent performance issues with volumes in the QRIScloud NeCTAR volume storage cluster.
We have investigated this extensively and have concluded that it is caused by contention on a network path that is shared by the NeCTAR volume storage cluster and the Awoonga HPC cluster. Awoonga has many compute nodes, so when a number of I/O-intensive jobs run at the same time, it can saturate the network link with traffic to RDS collections and/or the shared file systems. When this happens, NeCTAR instances attempting to access volume storage are starved of bandwidth.
Unfortunately, there is no quick fix for this problem. I am told that we would need to physically move the volume storage servers into a different rack, but there will not be enough space in that rack to hold them until some of the old Stage 2 compute nodes have been decommissioned and de-racked. We estimate that it will be 1 to 2 months before we can complete this work.