ORIGINAL: 2017-07-27 (pm) to 2017-07-28 09:50
We have been experiencing significant performance issues with the main QRIScloud GPFS servers in Polaris. This has been causing Nextcloud and Aspera to be very slow, and occasionally to lock up entirely. It has also slowed file access on Polaris HPC.
We think that the problem has been fixed. However, if you experience further issues, please report them to QRIScloud support.
UPDATE: 2017-07-29 to 2017-07-28 09:30: The issues relating to the GPFS server performance recurred on Saturday, and have been impacting the Nextcloud service.
The root cause of these problems is that a group of users have been using a GPFS collection in a way that it was never designed to be used. They ignored the long-standing advice in the QRIScloud Collections Do's and Don'ts page. Specifically:
- They used a collection to hold short term working data files.
- They created millions of small files in a short space of time.
- They attempted to compute against the working files on one of the HPC systems.
- They then deleted the files.
In this case, the traffic has overwhelmed the GPFS infrastructure and the HSM backend. (This kind of data should not be written to an HSM-backed or replicated QRIScloud collection, and data should not be stored as millions of small files. If your applications require many small files, use ZIP, TAR or similar to pack them into larger bundles, and only unpack them onto file storage that is NOT backed by HSM.)
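As a rough illustration of the packing advice above, the following Python sketch bundles a directory of small files into a single TAR archive and later unpacks it elsewhere. The function names and all paths are hypothetical, chosen only for this example; which scratch location is appropriate (and not HSM-backed) depends on the system you are using, so check with QRIScloud support if unsure.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Bundle every file under src_dir into one gzip-compressed TAR archive.

    Returns the number of files packed. The archive should be written
    outside src_dir, otherwise it could end up packing itself.
    """
    src = Path(src_dir)
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the bundle unpacks cleanly.
                tar.add(path, arcname=str(path.relative_to(src)))
                count += 1
    return count


def unpack_archive(archive_path, scratch_dir):
    """Unpack a bundle into scratch_dir.

    scratch_dir is assumed to be local or scratch storage that is
    NOT backed by HSM; this function cannot check that for you.
    """
    dest = Path(scratch_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only unpack archives you created yourself or otherwise trust.
        tar.extractall(dest)
    return dest
```

With this approach, only the single bundle is stored in the collection. For example, `pack_directory("results", "results.tar.gz")` would be run before copying `results.tar.gz` into the collection, and `unpack_archive("results.tar.gz", "/path/to/scratch/results")` before computing against the files, where `/path/to/scratch` stands in for whatever non-HSM working storage is available to you.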
UPDATE: 2017-08-02: The issues have been occurring today too. I have just been informed that the impact on Nextcloud and Aspera may have been exacerbated by a sporadic kernel NFS bug (on the GPFS server) that is triggered under the heavy load that occurs when the server is in "recovery" mode, as it is now.
UPDATE: 2017-08-09: The issues are recurring this afternoon.
UPDATE: 2017-08-09 - 16:30: The immediate cause of this afternoon's problems has been addressed. It was due to a particular user's jobs running on Tinaroo and hammering the GPFS server.
UPDATE: The underlying issues causing the GPFS problems have now been fixed.