It is tempting to think of a QRIScloud RDSI collection as "just like a file system, only bigger". However, the "... only bigger" model glosses over some significant operational issues that collection users need to bear in mind. There are particular aspects where the scale / size of collections demands special care in order to avoid performance issues for yourself and for other users.
We have tried to summarize the various issues as a number of Dos and Don'ts.
- Do educate your users
- Do implement your own data loss risk management strategy
- Don't use your collection as scratch space
- Don't store lots of small files
- Don't store files > 1TB
- Don't make directories too wide or deep
- Don't repeatedly ingest the same files
- Don't mount your collection on a Windows system
- Don't advertise your collection as available for off-net access
- Don't move / rename large directory trees (Tier 2)
- Do monitor your collection's space usage (Tier 2)
- Do respect the DMF cache (Tier 3)
- Do respect the GPFS caches (GPFS)
- Be cautious about putting user home directories onto an NFS-mounted collection
- Do be careful with NFS user file security
- Don't expose HSM collections to users via NFS
The following Dos and Don'ts apply to all kinds of collections.
It is important that you make all direct users of your collection aware of what to do and what not to do. In particular, educate them on how collections are different from normal file systems, as described below.
Note that collections can be accessed via the HPC systems (Tinaroo, Flashlite, Awoonga, etc.), and these systems have the ability to generate large amounts of file I/O. Since the HPC systems have a lot of parallelism and a lot of I/O bandwidth to the file server, a single poorly educated HPC user using a collection the wrong way can create serious operational problems.
Please bear in mind that we (QRIScloud Support) do not have any way of contacting the end users of your collection directly, or even of figuring out who they are. We are reliant on you (the collection owner / manager) to get the message out to the people who need to know.
As you should be aware, QRIScloud RDS collections are NOT backed up in the conventional sense. If you or your users overwrite or delete files, they may not be recoverable at all, or not within timescales that would suit you. File recovery is not a service that we advertise. (The collection replication we do is designed to protect against major disasters and media failures, not to provide a backup system for users. At a pinch we can get deleted or overwritten files back, but not quickly, and only within a relatively small time window.)
We recommend that you do your own risk assessment and, based on that, take steps to reduce or mitigate the risk of data loss. For example, you could:
- Implement your own backup strategy for files that are particularly important or vulnerable.
- Educate your users on strategies for avoiding mistakes.
- Only give trusted and experienced users direct write access.
- Investigate the possibility of mirroring your data set somewhere else.
Clearly, there is a trade-off between giving people enough access to do their work while limiting the damage that their accidents could cause.
For users of UQ RDM storage provisioned in QRIScloud, please seek the advice of the RDM team on what (if any) additional backup is needed for UQ research data.
It is tempting to treat collections as scratch space for short term file storage. For example, you might be tempted to unpack (e.g., unzip) or decompress large files in place (on the collection) to make it easier to run applications against the data.
Please do not do this.
When you unpack or decompress a file, it generates a new file or files. The file replication system does not know that these new files duplicate existing files, so it will write replicas of these new files to tape. Too much replication tends to cause the back-end HSM system to get back-logged, delaying genuine replication and causing performance issues for other collections.
For HPC users:
- Stage your files into /30days or /90days or your job's $TMPDIR, and unpack archives there.
- Applications should read / write / update files in the above file systems.
- Write files back to your collection at the end of your job; i.e. after your applications have finished.
- Please understand what your application is doing.
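The staging pattern above can be sketched as a job-script fragment. The archive names are placeholders, and in a real job the scratch directory would be your job's $TMPDIR or a directory under /30days or /90days:

```shell
# Stage in, work on scratch, stage out once at the end (names are examples).
stage_and_run() {  # usage: stage_and_run <collection_dir> <scratch_dir>
  local collection=$1 scratch=$2
  mkdir -p "$scratch/work"
  cp "$collection/input.tar.gz" "$scratch/"                # stage in
  tar -xzf "$scratch/input.tar.gz" -C "$scratch/work"      # unpack on scratch, never on the collection
  # ... run your application against the files in "$scratch/work" here ...
  tar -czf "$scratch/results.tar.gz" -C "$scratch/work" .  # repack the outputs
  cp "$scratch/results.tar.gz" "$collection/"              # write back once, after the job
}
```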
It is sometimes tempting to store your data as lots of small files. It may simplify application programs or scripting, but this comes at a significant cost, both in terms of resource utilization and operational overheads.
Please do not do this.
Data stored as lots of small files impacts on the collection storage system's performance in many ways:
- It consumes "inode" space in the NFS servers and the back-end replication systems.
- It significantly increases the front-end NFS traffic.
- It significantly increases the work that replication systems have to do.
- It increases the load on the back-end tape systems.
- When data needs to be recovered from a tape, or moved between disk-based storage systems for operational reasons, it increases the time needed to do so.
The recommended way to deal with lots of small files is to store them in an archive (e.g., a TAR or ZIP file). Then you can either:
- modify your applications to read TAR / ZIP files directly (there are common runtime libraries to do this), or
- unpack the archive to your instance's local file system.
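For example, packing a directory before ingest might look like this (directory and archive names are illustrative):

```shell
# Turn a directory of many small files into a single archive before ingest:
# one inode and one replication unit instead of thousands.
pack_dir() {  # usage: pack_dir <source_dir> <archive.tar.gz>
  tar -czf "$2" -C "$1" .
}
# Later, inspect or unpack without touching the collection copy:
#   tar -tzf archive.tar.gz                 # list contents
#   tar -xzf archive.tar.gz -C "$TMPDIR"    # extract to local scratch
```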
Note that on GPFS-based collections, there is a limit on the number of inodes (files + directories + symlinks) that is imposed by the file system. That limit defaults to 200,000 inodes. This can be increased (on request) but there are compelling operational reasons for the limit, so we are unlikely to increase it simply for users' convenience.
This is the converse of the previous point. If you want to archive a large amount of data, it is tempting to "tar" or "zip" it all into a single archive for long-term storage or backup purposes. This is a good idea, up to a point. However, if the archive (or any other single file) is too big, it causes problems.
Please do not store single files that are larger than 1TB. The problems it causes are as follows:
- Our current infrastructure is physically incapable of creating off-site replicas of files greater than about 5TB. (There is insufficient disk space in the St Lucia DMF cache). Files in the 1TB to 5TB range cause serious congestion for off-site replication. We have therefore been forced to disable off-site replication for files that are much larger than 1TB.
- Handling really large files is potentially problematic for you. For example:
- You need a lot of scratch file-space to unpack a > 1TB archive file.
- Copying or transferring a > 1TB file takes a significant time. The probability of a network interruption, or of (undetected) corruption during the transfer, increases with the size of the file.
- Archive file formats like "tar" and "zip" are not resilient to problems caused by data corruption; e.g. undetected errors while a file is in transit, or during copying. Compression or encryption make this worse.
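One way to stay under the limit (a suggestion, not a QRIScloud requirement) is to split an oversized archive into fixed-size parts and record a checksum per part, so each part transfers and replicates independently and corruption is detectable:

```shell
# Split a large archive into parts below the 1 TB limit and checksum each
# part (in practice a chunk size like 500G; the test sizes are tiny).
split_archive() {  # usage: split_archive <archive> <chunk_size>
  split -b "$2" -d "$1" "$1.part-"      # numbered parts: archive.part-00, -01, ...
  sha256sum "$1".part-* > "$1.sha256"   # verify later with: sha256sum -c
}
# Reassemble with:  cat archive.tar.part-* > archive.tar
```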
If you put lots of files into the same directory, it will slow down various file system operations. For example, `ls` has to read all of the directory entries and sort them before displaying them. The performance impact can be worse when NFS is involved. The effect becomes noticeable at thousands of files, and worsens as the count grows. (For some operations, the slowdown is non-linear. For example, the time taken to delete N files from a directory is proportional to N x log N ...)
Creating a directory structure that is very deep can also be bad for performance. For example, when your application opens a file whose pathname has lots of directories (or symbolic links) in it, then the operating system has to do a directory lookup for each one. If the directories are on an NFS server, then each one of those lookups may entail a network interaction which could take milliseconds to complete.
For small numbers of files, these things generally don't matter. When you have hundreds of thousands of files, the structure of your directories can have a significant performance impact.
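If you must keep files loose, a hashed fan-out keeps every directory small without going very deep; the two-level scheme below is just one illustration:

```shell
# Map a file name to a two-level directory fan-out (e.g. base/3f/a2/name),
# so no single directory accumulates thousands of entries.
fanout_path() {  # usage: fanout_path <base_dir> <file_name>
  local h
  h=$(printf '%s' "$2" | sha1sum | cut -c1-4)   # first 4 hex chars of the name's hash
  printf '%s/%s/%s/%s\n' "$1" "${h:0:2}" "${h:2:2}" "$2"
}
# e.g.  dest=$(fanout_path /data results.csv)
#       mkdir -p "$(dirname "$dest")" && mv results.csv "$dest"
```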
We have seen a case where someone was repeatedly ingesting the same (very large) file. This was not crazy: the user was using a custom tool that performed incremental updates to the file. Nevertheless, a 20TB file was being updated every few days, which caused the collection replication system to write multiple copies of the file to tape.
If you really need to repeatedly ingest like this, please contact QRIScloud Support for advice.
It seems convenient to mount your collection on a Windows PC or laptop. However, we recommend that you don't do that. We think that this places the data in your collection at increased risk.
For example, you may have heard of "CryptoLocker". This is malware that can potentially infect any system that runs Microsoft Windows. Once it has got into the system, it goes looking for documents and data files and encrypts them. The idea is that the criminals who are controlling the malware can then extort money from you to decrypt your files so that you can use them again (you hope!).
The problem is that "CryptoLocker" and programs like it can reach out to files on remotely mounted servers and encrypt them as well. That could include your RDSI collection ... if you have made the mistake of mounting it on your PC.
Most network traffic to and from QRIScloud data centres travels over network infrastructure provided by AARNet. While network traffic to and from many locations is (effectively) free, not all traffic is. AARNet deems many IP addresses (in Australia and internationally) to be "off-net". For off-net sites, all network traffic between "us" and "them" is metered. If that traffic exceeds a cap, then the AARNet subscriber (e.g., QCIF) gets charged per megabyte of network traffic from then on. These excess usage charges can be very substantial.
AARNet network charges incurred are not currently passed on to collection owners. It is we who pay, not you. However, we do ask you to consider carefully the network cost implications of allowing people to access your collection from an "off-net" location.
You can use the AARNet3 Network Address Query page to check if a network location is "off-net". Just enter the hostname or IP address and click "Submit".
This advice applies to collections in Tier 1 or Tier 2 storage pools which are kept permanently on disk.
We currently implement collection replication by using the standard "rsync" utility to write a copy of your files to a (private) HSM storage pool. Unfortunately, if you move a large directory tree (e.g., using the "mv" command), the rsync utility does not "understand" the move. Instead, it creates a new copy of the files in the HSM pool, and then deletes the old copy. If you do this on a large tree of files, you could cause rsync to copy huge amounts of data ... unnecessarily. This causes the replication to take a long time and creates a lot of HSM churn and tape traffic.
If possible, please avoid moving large directory trees. If you really need to do this (or if you have done it by accident), please contact QRIScloud Support for assistance. (We can minimise the impact by running an equivalent "mv" command on the replica tree.)
Collections in Tier 1 and Tier 2 storage pools are implemented with file system quotas. The soft quota is (normally) set to be the same as the approved RDSI allocation size for the collection, and the hard quota is currently set to double the soft quota. (The soft / hard ratio is likely to change for operational reasons, with hard quotas being limited to no more than 10% above the soft quota.)
If your collection goes over the soft quota limits, you currently have 60 days' grace to get your collection back under quota. If the collection stays over quota for longer, the soft quota violation escalates to a hard quota violation, as described below. Unfortunately, we currently have no automatic way to notify users that they are over quota. It may be a few days before we let you know.
If your collection goes over hard quota, the NFS servers will fail any requests to create or update files or directories. To recover from this yourself, you will need to delete enough files to get back under the soft quota. This is a bad situation, as it may break your workflows, so try to avoid it.
At any time, you can use SSH to connect to the collection VM for your collection, "cd" to the collection directory, and run the "quota" command. This will tell you how your current usage compares with your assigned quotas.
If you do go over quota, and need assistance to recover, please contact QRIScloud Support.
This advice applies to collections in Tier 3 storage pools, which are stored in HSM; that is, files are migrated automatically between disk and tape storage "on demand".
The Tier 3 storage pools are implemented as a hybrid of disk storage and tape storage, i.e. a relatively small amount of disk space backed by a much larger amount of tape storage. The disk space is a file cache that is shared across all collections in your collection's storage pool. (The tier3a1 file cache is ~120TB and the tier3b1 file cache is ~30TB, reflecting the different usage patterns stated or implied by your RDSI allocation request.)
When you attempt to access a file in your collection that is not currently in the cache, this causes the tape system to load a tape, "seek" to the appropriate point, and transfer the file into the disk cache. This takes time, and may cause other files to be removed from the cache to make space. Doing this repeatedly in a simplistic fashion can impact on performance both for your collection and for others.
There are various things that you can do to minimise the performance impact, including:
- using "dmget" to fetch files in batches,
- using "dmput" to evict files that you no longer need, and
- careful choice of parameters when using tools like "rsync".
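A sketch of batched recall is below. The recall command is parameterised so the pattern can be exercised without a tape system behind it; on a Tier 3 collection you would pass "dmget" to stage files in, and "dmput -r" to release cache space afterwards:

```shell
# Run an HSM command over every file under a directory in one batch, letting
# DMF order the tape operations efficiently instead of one file at a time.
batch_hsm() {  # usage: batch_hsm <dir> <command...>
  local dir=$1; shift
  find "$dir" -type f -print0 | xargs -0 -r "$@"
}
# On a Tier 3 collection (paths are placeholders):
#   batch_hsm /collection/Q0123/run42 dmget       # stage a batch in from tape
#   batch_hsm /collection/Q0123/run42 dmput -r    # release cache space afterwards
```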
GPFS collections consist of a series of on-disk caches with hidden Tier 3 backing storage. The cache sizes are managed on a per-collection basis, so that people using other collections won't cause your files to disappear from your cache. The flip-side is that GPFS doesn't have tools (equivalent to "dmput" / "dmget") that you can use to manage your cache usage.
Here are some things that you should do if your collection is larger than its cache size:
- choose your parameters carefully when using tools like "rsync", and
- if you need to stage large numbers of files or large amounts of data into your collection's cache, please contact QRIScloud Support.
QRIScloud collections can be NFS-mounted on NeCTAR instances in the QRIScloud availability zone. The following Dos and Don'ts apply to that scenario.
It is tempting to use a collection to store home directories for users on (for example) a cluster of NeCTAR instances. There are problems with doing this:
- If you put end users' home directories on an NFS-mounted collection, your users may be unable to log in during or following an NFS outage. (If your admin account's home directory is on NFS, an outage could lock out the very account that is needed to remount the NFS file system.)
- Putting users' home directories in your collection is likely to give users the wrong idea. In particular, they are liable to think that they can use the collection storage like a "normal" file system, which they shouldn't. Reasons why are given above.
If you want to implement user home directories across a cluster, we recommend that you designate one of your instances as an NFS server, export the relevant file trees, and have the other instances NFS-mount them. This leaves you with some problems:
Where do you get the file space from?
Three possibilities spring to mind:
- You could use the NFS server instance's primary or ephemeral file system.
- You could request Volume Storage and mount that on your NFS server.
- You could implement a hybrid system where users have a relatively small home directory, and a per-user directory to store files within your collection.
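The NFS-server setup recommended above needs only a little configuration; the export path, server name, and network range below are placeholders for your own:

```
# /etc/exports on the instance you designate as the NFS server:
/export/home  10.0.0.0/24(rw,sync,no_subtree_check)

# /etc/fstab entry on each client instance:
nfs-server:/export/home  /home  nfs  defaults,_netdev  0  0
```

After editing /etc/exports, run "exportfs -ra" on the server to apply the change.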
What about backup / replication?
That would be for you to implement. Unlike collection storage, QCIF cannot provide a disaster recovery mechanism for Instance file systems or Volume Storage.
What about availability?
If you build an NFS server for home directories on a NeCTAR instance, home directory availability will depend on that instance being alive and accessible. Furthermore, if you use Volume Storage to hold the data, that introduces another single point of failure; i.e., the Ceph cluster that provides the storage. Building a highly available file service on top of standard network instances would be difficult.
By contrast, the RDSI NFS servers are designed to provide high availability with redundancy at the server and disc controller level, and RAID 6 disk arrays.
Suppose that you have set up a multi-user system with individual accounts, and you have created directories on your (NFS-mounted) collection so that users can read and write files in their own names. The question is - how secure are your users' files? Is it possible for one user to access another user's files, despite what the Linux access controls apparently allow?
The answer is that it depends. There are a few things to consider:
- Anyone with root access (e.g., "sudo" access) on any instance in your tenant can access any files on the collection. All they need to do is "su" to the target user, and they have access. (This is a standard Unix / Linux risk ... if you grant root access.)
- Anyone who is a member of your NeCTAR tenant can launch a new instance (which gives them sudo rights), NFS-mount the collection, and access some other user's files.
- It is also possible to make mistakes that will allow one user to see another user's files. For instance, if you create user accounts by hand, and accidentally have inconsistent numeric user and group ids across the VMs that mount the collection, then a user account on one VM may get access to another user's files.
If the (somewhat) fragile nature of access control is a concern to you, you should be extra careful about who is granted sudo rights and tenant membership, and implement a scheme to keep user and group details consistent.
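A quick way to audit that last point is to compare numeric ids for each shared account across every VM that mounts the collection. The helper below just reads the local account database:

```shell
# Print "uid,gid" for a user, or nothing if the account does not exist.
# Run this for each shared account on every VM that mounts the collection;
# any mismatch between VMs means NFS file ownership will not line up and
# cross-user access becomes possible.
uid_gid_of() {  # usage: uid_gid_of <username>
  getent passwd "$1" | cut -d: -f3,4
}
```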
We recommend that you avoid exposing Tier 3 and GPFS collections to end users via an NFS mount:
- There are a number of traps that are liable to cause HSM performance problems for inexperienced users.
- We don't yet have a simple way to install the DMF client-side tools on a NeCTAR instance. Since these tools are required for "cache-friendly" usage (see above), it will be difficult for users to follow that advice.
- We don't have any solution (yet) for users to stage their own GPFS files.