Using QRIScloud Collection Storage – QRIScloud

Introduction

This document is a starting point for people using Collection Storage that has been allocated through QCIF and provisioned in QRIScloud.

You can access the files using a variety of means, including:

Standard Access

using SSH-based file transfer protocols such as SFTP, SCP and RSYNC ... from any machine, or
via the filesystem on Awoonga, Flashlite and Tinaroo.

NFS Only Access

via an NFS mount on a NeCTAR VM running in the QRIScloud availability zone,
using access services implemented on a VM, or
using a custom portal implemented on a VM.

(The access categories are mutually exclusive.)

Non-UQ users can apply for QRIScloud Collection Storage from the QRIScloud Service Catalog. Per UQ policy, all UQ storage users should apply for storage through the UQ RDM system.

Collection Delivery

When we (QRIScloud) "deliver" a new collection to you, we will provide you with the following information in a QRIScloud support ticket.

The collection number; e.g. "Qnnnn" where "nnnn" is a 4-digit number.
The (current) provisioned size.
The name and type of (current) storage pool in which it resides.
A brief summary of the access methods available for the collection.
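If you handle collection numbers in your own scripts, it can help to check them before use. The following is a minimal Python sketch that assumes only the "Qnnnn" format described above. The function name is illustrative and is not part of any QRIScloud tooling.

```python
import re

# "Q" followed by exactly four digits, per the format described above.
_COLLECTION_RE = re.compile(r"Q[0-9]{4}")


def is_collection_number(text):
    """Return True if text looks like a collection number, e.g. "Q0123"."""
    return _COLLECTION_RE.fullmatch(text) is not None
```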

We will also provide links to the relevant documentation for collection owners and users including:

A link to this document.
A link to the Collection Dos and Don'ts document.
A link to the Collection Access Management document (when applicable).
Links to the NFS and Nextcloud documents (as applicable).

Finally, pages for your collection will be added to your My Services page.

How your collection is stored

The online copies of QRIScloud collections (in Brisbane) are stored on large-scale disk arrays and tape in the Polaris data centre.

NFS servers export the individual collection file-trees to various systems:

For Standard access
For NFS only

Disk versus HSM

QRIScloud collections are provisioned in different storage pools that have different performance characteristics. The biggest difference between the pools is whether they are purely disk-based or HSM.

Disk-based pools are (as the name suggests) held entirely on disk storage.
HSM pools are implemented using disk and a tape storage system with tape robots and multiple tape drives. The storage is managed by a Hierarchical Storage Manager (HSM).

The principle of HSM is that files that have not been used recently can be moved to tape, so that the disk space can be used for other things. This movement is referred to as "migration" and typically happens automatically. If you attempt to access a file that has been migrated to tape, the HSM system will restore it to disk. This can take a minute or so, depending on how big the file is.

As a general rule, we provision collections on HSM if the predicted usage patterns warrant it. For example, HSM is typically chosen for archival storage, and storage of really large datasets. The downside of HSM is that there are performance issues for usage patterns that involve reading lots of files in a short space of time. There are various things you can do to optimise this. For example:

If you use "rsync" for incremental ingestion, correct choice of command line switches will avoid unnecessary tape retrieval.
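To illustrate the rsync point, here is a minimal Python sketch that builds an rsync command line for incremental ingestion. The choice of switches is an assumption, not a documented QRIScloud recommendation. By default rsync decides what to send by comparing file size and modification time, which does not require reading file contents, whereas "--checksum" reads every file on both sides and so could trigger tape retrieval. "--whole-file" is added so that a changed file is re-sent in full rather than delta-compared against the existing copy. The source directory, account name and destination path are placeholders.

```python
import shlex


def build_ingest_command(source_dir, destination):
    """Build an rsync command for incremental ingestion into a collection.

    destination is an rsync remote spec such as "user@host:path/". The path
    layout on the server is not specified here, so treat it as a placeholder.
    """
    return [
        "rsync",
        "--archive",     # recurse and preserve times, so the size+mtime check works
        "--whole-file",  # re-send changed files in full; do not read the old copy
        # Deliberately no "--checksum": it would read every existing file.
        source_dir.rstrip("/") + "/",  # trailing slash: copy the directory contents
        destination,
    ]


if __name__ == "__main__":
    cmd = build_ingest_command(
        "results", "USER@data.qriscloud.org.au:PATH/TO/COLLECTION/"
    )
    # Inspect the command first; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Adding rsync's "--dry-run" switch to the list is a cheap way to see what would be transferred before committing to a real run.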

Is your collection data backed up?

QRIScloud collections are NOT backed up in the conventional sense. If you (the collection owner) make a mistake that leads to files or directories being lost, then we (QRIScloud support) do not guarantee to be able to recover them.

We have a replication process which saves copies of files to offline storage, both on-site and off-site. Replication is designed to allow recovery from catastrophic failure of the online storage media; i.e. disks, controllers and servers. It may be possible to restore a previous version of a file (that has been successfully replicated). However:

restoration of a version of a file from a specific time point is probably impossible,
restoration of a file from more than 6 months ago is probably impossible, and
restoration is labor intensive and may take a long time.

The technical details of QRIScloud collection replication will be provided once they have been finalized. However, replication of HSM collections is now implemented.

Reading and writing your collection data

By now, you are probably impatient to know how you and your users are going to read and write files in the collection. In fact, there are a number of ways you / they can do this.

Simple tools for uploading or downloading files

The simple way to upload and download files is to use one of SFTP, SCP or RSYNC via the Standard Access server: data.qriscloud.org.au. (Every registered QRIScloud user who has access to collection storage has an account on this system. Use your QSAC account and password to access the system.) Alternatively, if you have an account on one of the HPC systems, you could use those to upload and download files.
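For scripted transfers from a machine that has the OpenSSH "scp" command installed, an upload can be sketched in Python as follows. The host name is the Standard Access server named above. The user name, local file and remote path are placeholders, because the location of a collection on the server is not given in this document.

```python
import subprocess

HOST = "data.qriscloud.org.au"  # Standard Access server named in this document


def scp_upload_command(local_path, user, remote_path):
    """Return the scp argument list that copies local_path to the server."""
    # "-p" asks scp to preserve modification times and file modes.
    return ["scp", "-p", local_path, f"{user}@{HOST}:{remote_path}"]


def upload(local_path, user, remote_path):
    """Run scp; it prompts for the account password on the terminal."""
    subprocess.run(scp_upload_command(local_path, user, remote_path), check=True)
```

Calling upload("run1.tar.gz", "YOUR_ACCOUNT", "REMOTE/PATH/run1.tar.gz") would then perform the copy, and raises an exception if scp reports a failure.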

There are various tools that support SFTP, SCP and RSYNC on Windows, Mac and Linux platforms, but it is not possible to give a simple one-size-fits-all recommendation on what to use. Instead, we summarize the protocols and recommended tools as follows:

Protocol	Base Protocol	OS platform availability	Secure
SCP	SSH (port 22)	all	yes
SFTP	SSH (port 22)	all	yes
RSYNC	SSH (port 22)	all	yes

Table 1: Basic file transfer protocol summary.

Note that RSYNC is a file synchronization protocol rather than a straightforward file transfer protocol. It is designed for the use case where you want to make / keep the files in one location the same as files in another location.

Tool	Supported Protocols¹	OS platforms	GUI	Notes
Cyberduck	SFTP, SCP, RSYNC	Windows, Mac	yes	Recommended
Filezilla	SFTP, SCP, RSYNC	Windows, Mac, Linux	yes
WinSCP	SFTP, SCP	Windows	yes
scp	SCP	Mac, Linux	no
sftp	SFTP	Mac, Linux	no
rsync	RSYNC	Mac, Linux	no

Table 2: Summary of Basic File Transfer Tools

Footnote 1: The GUI-based tools support various other file transfer protocols, but none that are currently relevant.

So, for example, if you want a GUI-based tool that runs on Windows, both Cyberduck and WinSCP are good options. For details on how to use these tools, please refer to the following "break out" pages:

Using Cyberduck with QRIScloud collections
Using Filezilla with QRIScloud collections
Using Mac / Linux file transfer commands.

Notes:

All of the above are only available to people who know the account credentials for one of the collection's accounts. Anonymous access is not currently supported.
Cyberduck, Filezilla and WinSCP have "file synchronization" modes.
Cyberduck, Filezilla and WinSCP are "third-party" applications that need to be installed on your machine by someone with administrator privilege. You may need to ask your local IT support to help you with this.

Uploading or downloading huge datasets

The SFTP, SCP and RSYNC protocols all run over conventional "TCP/IP" streams with encryption. This means that data transfer using these protocols is likely to be limited by both CPU performance and the performance of your local and long-haul networks. If you have really large datasets to transfer, please contact us to discuss possible alternatives.

Note that transferring large numbers of small files is liable to take a very long time. In addition, a collection containing large numbers of small files is liable to be problematic for other reasons. Therefore, you should use ZIP or TAR or equivalent to bundle the file trees into "archives" before uploading them to a collection.
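As an illustration of the bundling step, the following Python sketch packs a directory tree into a single gzip-compressed TAR archive using the standard "tarfile" module. The function name and the choice of gzip compression are assumptions made for the example. Any equivalent ZIP or TAR tool would do the same job.

```python
import tarfile
from pathlib import Path


def bundle_tree(tree, archive_path):
    """Pack the directory tree into a gzip-compressed tar archive.

    Returns the number of regular files stored in the archive.
    """
    tree = Path(tree).resolve()
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the tree's own directory name.
        tar.add(tree, arcname=tree.name)
    # Re-open the archive to count what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sum(1 for member in tar if member.isfile())
```

Write the archive to a location outside the tree being bundled, so that the archive does not try to include itself. For example, bundle_tree("run1", "run1.tar.gz") turns the "run1" directory into one file that can be uploaded in a single transfer.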

NFS access

The above mechanisms all rely on using some command or tool (e.g. a web browser) to upload and download files. However, if you need to compute against the data in your collection, it may be more appropriate to use NFS (Network File System) to "mount" the collection on a NeCTAR VM. This allows applications running in the NeCTAR VM to read and write files in the collection using normal file I/O.

Before you can use NFS with your collection, you need to ask QRIScloud support to enable NFS access for the virtual machines in your NeCTAR project. For each NeCTAR VM that needs access, you need to install NFS client software, configure the VM's second network interface, set up the configuration files for NFS mounting, and create the Qxxxx accounts and groups on the VM.

There are some limitations and caveats on NFS access to QRIScloud RDSI collections:

NFS access is only available to systems in the same data centre as the collection.
The NFS servers are configured to export the collections with "root squashing". This means that the "root" user on your NeCTAR VM will not be able to change file ownership and change file modes on files in the mounted file system.
It is a bad idea to use an NFS mounted collection to hold "home directories" on your NeCTAR VM.

For more information, please refer to NFS mounting a QRIScloud collection.

Creating a custom data portal

One of the main aims of ARDC is to foster the sharing of research data with other researchers. In order to "publish" your data for other people to see, you may want to create a "data portal" of some kind that organizes and presents the data in a useful and attractive way. You may want to provide added value by providing browsing and searching, and tools for data analysis and for data centric collaboration.

The way to implement such functionality is to arrange that your collection is exported and NFS mounted on one or more NeCTAR VMs, and then build the portal on those VMs. QRIScloud support can offer you advice on how to build and manage such a portal. Depending on your requirements, QCIF Engineering Services may be able to implement a portal for you.
