This document is a starting point for people using Collection Storage that has been allocated through QCIF and provisioned in QRIScloud. Collections are currently allocated under the RDSI ReDS scheme or the QCIF CDS scheme. Both schemes provide free merit-allocated storage to researchers and research groups.
QRIScloud collections are stored on disk and tape storage in data centres in Brisbane or Townsville. You can access the files using a variety of means, including:
- Standard Access:
  - using SSH-based file transfer protocols such as SFTP, SCP and RSYNC ... from any machine,
  - using Nextcloud, or
  - via the filesystem on Euramoo, Flashlite and Tinaroo.
- Mediaflux Access:
  - using the Arcitecta Desktop (aka Hydrogen) in a web browser,
  - using the Arcitecta "thick client" (aka Helium), or
  - using a custom Mediaflux portal.
- NFS Only Access:
  - via an NFS mount on a NeCTAR VM running in the QRIScloud availability zone,
  - using access services implemented on a VM, or
  - using a custom portal implemented on a VM.
(The three access categories are mutually exclusive.)
When we (QRIScloud) "deliver" a new collection to you, we will provide you with the following information in a QRIScloud support ticket.
- The collection number; e.g. "Qnnnn" where "nnnn" is a 4-digit number.
- The (current) provisioned size.
- The name and type of (current) storage pool in which it resides.
- A brief summary of the access methods available for the collection.
We will also provide links to the relevant documentation for collection owners and users including:
- A link to this document.
- A link to the Collection Dos and Don'ts document.
- A link to the Collection Access Management document (when applicable).
- Links to the NFS, Nextcloud and Mediaflux documents (as applicable).
Finally, entries for your collection will be added to your My Services page.
How your collection is stored
The online copies of QRIScloud collections (in Brisbane) are stored on large-scale disk arrays in the Polaris data centre. The data is organized in pools of roughly 220 terabytes (usable), and exposed internally via high-availability NFS servers with automatic failover.
These NFS servers export the individual collection file-trees to various systems:
- for Standard Access, to the SSH file transfer servers and the HPC systems,
- for NFS Only Access, to NeCTAR VMs in your project, and
- for Mediaflux Access, to the Mediaflux servers.
Disk versus HSM
QRIScloud collections are provisioned in different storage pools that have different performance characteristics. The biggest difference between the pools is whether they are purely disk-based or HSM.
- Disk-based pools are (as the name suggests) held entirely on disk storage.
- HSM pools are implemented using disk and a tape storage system with tape robots and multiple tape drives. The storage is managed by a Hierarchical Storage Manager (HSM).
The principle of HSM is that files that have not been used recently can be moved to tape, so that the disk space can be used for other things. This movement is referred to as "migration" and typically happens automatically. If you attempt to access a file that has been migrated to tape, the HSM system will restore it to disk. This can take a minute or so, depending on how big the file is.
As a general rule, we provision collections on HSM if the predicted usage patterns warrant it. For example, HSM is typically chosen for archival storage, and storage of really large datasets. The downside of HSM is that there are performance issues for usage patterns that involve reading lots of files in a short space of time. There are various things you can do to optimise this. For example:
- You can use the DMF "dmget" command to prefetch files from tape in batches.
- If you use "rsync" for incremental ingestion, correct choice of command line switches will avoid unnecessary tape retrieval.
Please refer to the VWranglers article on Using DMF for an overview of DMF commands and advice on using HSM effectively.
Is your collection data backed up?
QRIScloud collections are NOT backed up in the conventional sense. If you (the collection owner) make a mistake that leads to files or directories being lost, then we (QRIScloud support) do not guarantee to be able to recover them.
We have a replication process which saves copies of files to offline storage, both on-site and off-site. Replication is designed to allow recovery from catastrophic failure of the online storage media; i.e. disks, controllers and servers. It may be possible to restore a previous version of a file (that has been successfully replicated). However:
- restoration of a version of a file from a specific time point is probably impossible,
- restoration of a file from more than 6 months ago is probably impossible, and
- restoration is labor intensive and may take a long time.
The technical details of QRIScloud collection replication will be provided once they have been finalized. However, replication of HSM collections is now implemented.
Reading and writing your collection data
By now, you are probably impatient to know how you and your users are going to read and write files in the collection. In fact, there are several ways to do this.
Simple tools for uploading or downloading files
The simple way to upload and download files is to use one of SFTP, SCP or RSYNC via the Standard Access servers: ssh1.qriscloud.org.au and ssh2.qriscloud.org.au. (Every registered QRIScloud user has an account on these systems. Use your QSAC account and password to access the systems.) Alternatively, if you have an account on one of the HPC systems, you could use those to upload and download files.
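As an illustration, transfers from a Mac or Linux command line might look like the sketch below. The username "jdoe", the collection "Q0123" and the "/QRISdata/Q0123" path are all placeholders (the actual path is given in your provisioning ticket); only the server names come from this document.

```shell
# Copy a single file into the collection (placeholder names throughout):
scp results.tar.gz jdoe@ssh1.qriscloud.org.au:/QRISdata/Q0123/

# Start an interactive transfer session:
sftp jdoe@ssh1.qriscloud.org.au

# Synchronize a local directory into the collection:
rsync -av dataset/ jdoe@ssh2.qriscloud.org.au:/QRISdata/Q0123/dataset/
```

All three commands will prompt for your QSAC password unless you have set up SSH keys.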
There are various tools that support SFTP, SCP and RSYNC on Windows, Mac and Linux platforms, but it is not possible to give a simple one-size-fits-all recommendation on what to use. Instead, we summarize the protocols and recommended tools as follows:
| Protocol | Base Protocol | OS platform availability | Secure |
|----------|---------------|--------------------------|--------|
| SCP      | SSH (port 22) | all                      | yes    |
| SFTP     | SSH (port 22) | all                      | yes    |
| RSYNC    | SSH (port 22) | all                      | yes    |
Table 1: Basic file transfer protocol summary.
Note that RSYNC is a file synchronization protocol rather than a straightforward file transfer protocol. It is designed for the use-case where you want to make or keep the files in one location identical to the files in another location.
| Tool      | Protocols        | OS platform availability | Secure | Notes       |
|-----------|------------------|--------------------------|--------|-------------|
| CyberDuck | SFTP, SCP, RSYNC | Windows, Mac             | yes    | Recommended |
| Filezilla | SFTP, SCP, RSYNC | Windows, Mac, Linux      | yes    |             |
| WinSCP    | SFTP, SCP        | Windows                  | yes    |             |
Table 2: Summary of Basic File Transfer Tools
Footnote 1: The GUI-based tools support various other file transfer protocols, but none that are currently relevant.
So, for example, if you want a GUI-based tool that runs on Windows, both CyberDuck and WinSCP are good options. For details on how to use these tools, please refer to the following "break out" pages:
- Using Cyberduck with QRIScloud collections
- Using Filezilla with QRIScloud collections
- Using Mac / Linux file transfer commands.
- All of the above are only available to people who know the account credentials for one of the collection's accounts. Anonymous access is not currently supported.
- Cyberduck, Filezilla and WinSCP have "file synchronization" modes.
- Cyberduck, Filezilla and WinSCP are "third-party" applications that need to be installed on your machine by someone with administrator privilege. You may need to ask your local IT support to help you with this.
Uploading or downloading huge datasets
The SFTP, SCP and RSYNC protocols all run over conventional "TCP/IP" streams with encryption. This means that data transfer using these protocols is likely to be limited by both CPU performance and the performance of your local and long-haul networks. If you have really large datasets to transfer, please contact us to discuss possible alternatives.
Note that transferring large numbers of small files is liable to take a very long time. On the other hand, a collection containing large numbers of small files is liable to be problematic for other reasons. Therefore, you should use ZIP or TAR or equivalent to bundle the file trees into archives before uploading them to a collection.
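For example, a directory tree can be bundled into a single compressed TAR archive before upload. The "mydata" directory name below is a hypothetical example.

```shell
# Create a small example tree (stand-in for a real dataset directory).
mkdir -p mydata
echo "reading 1" > mydata/a.txt
echo "reading 2" > mydata/b.txt

# Bundle the whole tree into one compressed archive for upload.
tar -czf mydata.tar.gz mydata

# List the archive contents to verify before transferring it.
tar -tzf mydata.tar.gz
```

Uploading the single `mydata.tar.gz` file is much faster than transferring the same data as thousands of individual small files.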
Using NFS to access your collection
The above mechanisms all rely on using some command or tool (e.g. a web browser) to upload and download files. However, if you need to compute against the data in your collection, it may be more appropriate to use NFS (Network File System) to "mount" the collection on a NeCTAR VM. This allows applications running in the NeCTAR VM to read and write files in the collection using normal file I/O.
Before you can use NFS with your collection, you need to ask QRIScloud support to enable NFS access for the virtual machines in your NeCTAR project. For each NeCTAR VM that needs access, you need to install NFS client software, configure the VM's second network interface, set up the configuration files for NFS mounting, and create the Qxxxx accounts and groups on the VM.
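As an illustration only, the client-side part of that setup on a Debian/Ubuntu VM might look like the sketch below. The server name, export path and mount point are all hypothetical placeholders; QRIScloud support will give you the actual values when NFS access is enabled for your project.

```shell
# Install the NFS client tools (Debian/Ubuntu).
sudo apt-get install -y nfs-common

# Create a mount point (placeholder collection number).
sudo mkdir -p /data/Q0123

# Example /etc/fstab entry (one line; server and export path are placeholders):
#   nfs.qriscloud.org.au:/export/Q0123  /data/Q0123  nfs  defaults,hard  0  0

# Mount the collection using the fstab entry.
sudo mount /data/Q0123
```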
There are some limitations and caveats on NFS access to QRIScloud RDSI collections:
- NFS access is only available to systems in the same data centre as the collection.
- The NFS servers are configured to export the collections with "root squashing". This means that the "root" user on your NeCTAR VM will not be able to change file ownership or file modes on files in the mounted file system.
- It is a bad idea to use an NFS mounted collection to hold "home directories" on your NeCTAR VM.
For more information, please refer to NFS mounting a QRIScloud collection.
Creating a custom data portal
One of the main aims of RDSI is to foster the sharing of research data with other researchers. In order to "publish" your data for other people to see, you may want to create a "data portal" of some kind that organizes and presents the data in a useful and attractive way. You may want to provide added value by providing browsing and searching, and tools for data analysis and for data-centric collaboration.
The way to implement such functionality is to arrange for your collection to be exported and NFS mounted on one or more NeCTAR VMs, and then build the portal on those VMs. QRIScloud support can offer you advice on how to build and manage such a portal. Depending on your requirements, QCIF Engineering Services may be able to implement a portal for you.