Using QRIScloud Collection Storage

Introduction

This document is a starting point for people using Collection Storage that has been allocated through QCIF and provisioned in QRIScloud.  Collections are currently allocated under the RDSI ReDS scheme or the QCIF CDS scheme.  Both schemes provide free merit-allocated storage to researchers and research groups.

QRIScloud collections are stored on disk and tape storage in data centres in Brisbane or Townsville.  You can access the files using a variety of means, including:

  • Standard Access
    • using SSH-based file transfer protocols such as SFTP, SCP and RSYNC,
    • using Globus GridFTP,
    • using Aspera Shares or Aspera Sync, or
    • via the filesystem on Euramoo, Flashlite and Tinaroo.
  • Mediaflux Access:
    • using the Arcitecta Desktop aka Hydrogen (in a web browser),
    • using the Arcitecta "thick client" aka Helium, or
    • using a custom Mediaflux portal.
  • NFS Only Access:
    • via an NFS mount on a NeCTAR VM running in the QRIScloud availability zone,
    • using access services implemented on a VM, or
    • using a custom portal implemented on a VM.

(Aspera and Globus GridFTP are primarily intended for bulk transfers, e.g. in the 1 TB+ range, and require some additional setup.)

(The three access categories are mutually exclusive.)

You can apply for QRIScloud Collection Storage from the QRIScloud Service Catalog.

Collection Delivery

When we (QRIScloud) "deliver" a new collection to you, we will provide you with the following information in a QRIScloud support ticket:

  • The collection number; e.g. "Qnnnn" where "nnnn" is a 4 digit number.
  • The (current) provisioned size.
  • The name and type of (current) storage pool in which it resides.
  • A brief summary of the access methods available for the collection.

We will also provide links to the relevant documentation for collection owners and users including:

  • A link to this document.
  • A link to the Collection Dos and Don'ts document.
  • A link to the Collection Access Management document (when applicable).
  • Links to the NFS, Aspera, Mediaflux and GridFTP documents (as applicable).

Finally, pages for your collection will be added to your My Services page.

How your collection is stored

The online copies of QRIScloud collections (in Brisbane) are stored on large-scale disk arrays in the Polaris data centre.  The data is organized in pools of roughly 220 terabytes (usable), and exposed internally via high-availability NFS servers with automatic failover.

These NFS servers export the individual collection file-trees to various systems:

  • For Standard access
  • For NFS only
  • For Mediaflux

Disk versus HSM

QRIScloud collections are provisioned in different storage pools that have different performance characteristics.  The biggest difference between the pools is whether they are purely disk-based or HSM.

  • Disk-based pools are (as the name suggests) held entirely on disk storage.
  • HSM pools are implemented using disk and a tape storage system with tape robots and multiple tape drives. The storage is managed by a Hierarchical Storage Manager (HSM).

The principle of HSM is that files that have not been used recently can be moved to tape, so that the disk space can be used for other things.  This movement is referred to as "migration" and typically happens automatically. If you attempt to access a file that has been migrated to tape, the HSM system will restore it to disk.  This can take a minute or so, depending on how big the file is.

As a general rule, we provision collections on HSM if the predicted usage patterns warrant it.  For example, HSM is typically chosen for archival storage and for storage of very large datasets.  The downside of HSM is that performance suffers for usage patterns that involve reading lots of files in a short space of time, because each migrated file must first be restored from tape. There are various things you can do to optimise this.  For example:

  • You can use the DMF "dmget" command to prefetch files from tape in batches.
  • If you use "rsync" for incremental ingestion, correct choice of command line switches will avoid unnecessary tape retrieval.
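The batch prefetch idea in the first bullet can be sketched as follows. This is a minimal illustration, not a QRIScloud-supplied tool: it assumes that "dmget" is on the PATH and is called with file paths only (no options), and the batch size of 100 and the function names are arbitrary choices made for this example. With dry_run=True it only returns the commands it would run.

```python
import subprocess
from typing import Iterable, Iterator, List


def batched(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def dmget_command(batch: List[str]) -> List[str]:
    """Build the argument list for one dmget call (file paths only, no options)."""
    return ["dmget"] + list(batch)


def prefetch(paths: Iterable[str], batch_size: int = 100,
             dry_run: bool = True) -> List[List[str]]:
    """Recall files from tape in batches.

    With dry_run=True (the default) nothing is executed; the commands
    that would be run are returned so they can be inspected first.
    """
    commands = [dmget_command(batch) for batch in batched(paths, batch_size)]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Requesting many files in one call, rather than touching them one at a time, gives the HSM system the chance to plan the tape reads instead of handling each file as a separate request.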

Please refer to the VWranglers article on Using DMF for an overview of DMF commands and advice on using HSM effectively.

Is your collection data backed up?

QRIScloud collections are NOT backed up in the conventional sense.  If you (the collection owner) make a mistake that leads to files or directories being lost, then we (QRIScloud support) do not offer a service to recover them.

We are currently rolling out a replication process where copies of each file will be saved to tape, both on-site and off-site.  However, replication is designed to allow recovery from catastrophic failure of the online storage media; i.e. disks, controllers and servers.  It is not designed to provide a conventional backup of individual files.

The technical details of QRIScloud collection replication will be provided once they have been finalized. However, replication of HSM collections is now implemented.

Reading and writing your collection data

By now, you are probably impatient to know how you and your users are going to read and write files in the collection.  In fact, there are a number of ways you / they can do this.

Simple tools for uploading or downloading files

The simple way to upload and download files is to use one of SFTP, SCP or RSYNC. There are various tools that support these protocols on Windows, Mac and Linux platforms, but it is not possible to give a simple one-size-fits-all recommendation on what to use.  Instead, we summarize the protocols and recommended tools as follows:

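| Protocol | Base protocol | OS platform availability | Secure |
|----------|---------------|--------------------------|--------|
| SCP      | SSH (port 22) | all                      | yes    |
| SFTP     | SSH (port 22) | all                      | yes    |
| RSYNC    | SSH (port 22) | all                      | yes    |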

 Table 1: Basic file transfer protocol summary.

Note that RSYNC is a file synchronization protocol rather than a straightforward file transfer protocol. It is designed for the use case where you want to make or keep the files in one location the same as the files in another location.
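As a rough sketch of the synchronization use case, the following builds an "rsync" command line that runs over SSH. The host name, account name and paths in the usage example are hypothetical placeholders, not real QRIScloud values; use the details supplied with your collection. The function only builds the argument list, and the usage example runs it in dry-run mode.

```python
import subprocess
from typing import List


def rsync_command(source: str, user: str, host: str, dest: str,
                  dry_run: bool = False) -> List[str]:
    """Build an rsync-over-SSH command that makes dest match source.

    --archive  recurse and preserve times, permissions and symlinks
    --partial  keep partially transferred files so a rerun can resume
    --dry-run  report what would be transferred without copying anything
    """
    command = ["rsync", "--archive", "--verbose", "--partial", "-e", "ssh"]
    if dry_run:
        command.append("--dry-run")
    # A trailing slash on the source means "the contents of this directory".
    command += [source, f"{user}@{host}:{dest}"]
    return command


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = rsync_command("results/", "q0000user", "collections.example.org",
                        "/data/Q0000/results/", dry_run=True)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run rsync
```

By default rsync decides what to copy by comparing file size and modification time. The "--checksum" option makes it read the full content of every file on both sides instead, so it is left out here; on an HSM pool that kind of read is likely to trigger tape retrieval.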

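| Tool      | Supported protocols [1] | OS platforms        | GUI | Notes       |
|-----------|-------------------------|---------------------|-----|-------------|
| Cyberduck | SFTP, SCP, RSYNC        | Windows, Mac        | yes | Recommended |
| Filezilla | SFTP, SCP, RSYNC        | Windows, Mac, Linux | yes |             |
| WinSCP    | SFTP, SCP               | Windows             | yes |             |
| scp       | SCP                     | Mac, Linux          | no  |             |
| sftp      | SFTP                    | Mac, Linux          | no  |             |
| rsync     | RSYNC                   | Mac, Linux          | no  |             |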

Table 2: Summary of Basic File Transfer Tools

Footnote 1: The GUI-based tools support various other file transfer protocols, but none that are currently relevant.

So, for example, if you want a GUI-based tool that runs on Windows, both Cyberduck and WinSCP are good options. For details on how to use these tools, please refer to the following "break out" pages:

  • Using Cyberduck with QRIScloud collections
  • Using Filezilla with QRIScloud collections
  • Using Mac / Linux file transfer commands.

Notes:

  1. All of the above are only available to people who know the account credentials for one of the collection's accounts.  Anonymous access is not currently supported.
  2. Cyberduck, Filezilla and WinSCP have "file synchronization" modes.
  3. Cyberduck, Filezilla and WinSCP are "third-party" applications that need to be installed on your machine by someone with administrator privilege.  You may need to ask your local IT support to help you with this.

Uploading or downloading huge datasets

The SFTP, SCP and RSYNC protocols all run over conventional "TCP/IP" streams with encryption.  Without getting too technical, this means that data transfer using these protocols is likely to be limited by both CPU performance and the performance of your local and long-haul networks.
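To get a feel for what the network limit means in practice, here is a back-of-the-envelope estimate of transfer time from dataset size and link speed. The link speeds and the "efficiency" factor are illustrative assumptions, not measurements of any QRIScloud service, and the result is a best case because it ignores encryption CPU cost and protocol overhead.

```python
def transfer_hours(size_bytes: float, link_mbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Estimate the hours needed to move size_bytes over a link.

    efficiency is the fraction of the nominal link speed actually
    achieved (1.0 means the link is fully used, which is optimistic).
    """
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000 * efficiency)
    return seconds / 3600


# 1 TB (10**12 bytes) over a fully used 100 Mbit/s link: about 22.2 hours.
print(round(transfer_hours(10**12, 100), 1))
# The same data at half the nominal speed takes twice as long: about 44.4 hours.
print(round(transfer_hours(10**12, 100, efficiency=0.5), 1))
```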

If you have really large datasets to transfer, it is worth considering alternative file transfer tools and protocols that are designed for bulk data transfer.  We currently offer two such tools:

  • Globus GridFTP provides increased data transfer rates by running multiple TCP/IP file transfers in parallel. Globus GridFTP is enabled for all collections, but you need to sign up with Globus, and download and install "end point" software on your own computer.  For a summary of the steps involved, please read "Globus GridFTP - for data transfer".
  • Aspera provides increased data transfer rates using UDP/IP (datagrams) to transfer the data, and some proprietary "magic" to handle rate limiting, and retransmission. If you want to use Aspera, you need to contact QRIScloud support to configure access to your collection for the Aspera service.  Then you need to download and install the Aspera plugin for your web browser.  For a summary of the steps involved, please read "Using Aspera for large file transfers".

We have also had some success with "pull" ingestion from remote FTP servers using the Linux "lftp" command. The "lftp" command is capable of performing multiple file transfers in parallel.  If you want to try this approach you will need to connect your collection to a NeCTAR VM via an NFS mount (see below) and install and run "lftp" on the VM.
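A "pull" ingestion of this kind might look like the sketch below, which builds an "lftp" command that mirrors a remote directory using several parallel transfers. The FTP server, the directories and the choice of four parallel transfers are hypothetical; the sketch also rejects paths containing spaces, quotes or semicolons rather than attempting to quote them for lftp.

```python
import subprocess
from typing import List

_UNSAFE = set(" \t\n\"';")


def lftp_mirror_command(ftp_url: str, remote_dir: str, local_dir: str,
                        parallel: int = 4) -> List[str]:
    """Build an lftp command that mirrors remote_dir into local_dir.

    Uses lftp's "mirror" command with --parallel=N so that up to N
    files are downloaded at the same time, then quits.
    """
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    for path in (remote_dir, local_dir):
        if not path or any(ch in _UNSAFE for ch in path):
            raise ValueError(f"unsupported characters in path: {path!r}")
    script = f"mirror --parallel={parallel} {remote_dir} {local_dir}; quit"
    return ["lftp", "-e", script, ftp_url]


if __name__ == "__main__":
    # Hypothetical server and mount point, for illustration only.
    cmd = lftp_mirror_command("ftp://ftp.example.org", "/pub/dataset",
                              "/data/Q0000/incoming")
    print(cmd)
    # subprocess.run(cmd, check=True)  # uncomment to actually run lftp
```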

NFS access

The above mechanisms all rely on using some command or tool (e.g. a web browser) to upload and download files.  However, if you need to compute against the data in your collection, it may be more appropriate to use NFS (Network File System) to "mount" the collection on a NeCTAR VM.  This allows applications running in the NeCTAR VM to read and write files in the collection using normal file I/O.

Before you can use NFS with your collection, you need to ask QRIScloud support to enable NFS access for the VMs in your NeCTAR project. For each NeCTAR VM that needs access, you need to install NFS client software, configure the VM's second network interface, set up the configuration files for NFS mounting, and create the Qxxxx accounts and groups on the VM.

There are some limitations and caveats on NFS access to QRIScloud RDSI collections:

  • NFS access is only available to systems in the same data centre as the collection.
  • The NFS servers are configured to export the collections with "root squashing".  This means that the "root" user on your NeCTAR VM will not be able to change file ownership and change file modes on files in the mounted file system.
  • It is a bad idea to use an NFS mounted collection to hold "home directories" on your NeCTAR VM.

For more information, please refer to NFS mounting a QRIScloud collection.
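Once the mount has been set up, a quick sanity check from the VM can save some confusion. The sketch below reports whether a directory exists, whether it is a mount point, and whether the current user can read and write it. The path "/data/Q0000" is a hypothetical mount location. Because of root squashing, it is more informative to run the check as one of the Qxxxx accounts than as "root".

```python
import os
from typing import Dict


def describe_mount(path: str) -> Dict[str, bool]:
    """Report basic facts about a directory that should be an NFS mount."""
    return {
        "exists": os.path.isdir(path),
        # True when path is the root of a mounted file system.
        "is_mount_point": os.path.ismount(path),
        # Access checks apply to the user running this script.
        "readable": os.access(path, os.R_OK | os.X_OK),
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Hypothetical mount location, for illustration only.
    for key, value in describe_mount("/data/Q0000").items():
        print(f"{key}: {value}")
```

If "exists" is true but "is_mount_point" is false, the directory is there but the collection is not mounted on it, so anything written will land on the VM's local disk instead of in the collection.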

Creating a custom data portal

One of the main aims of RDSI is to foster the sharing of research data with other researchers. In order to "publish" your data for other people to see, you may want to create a "data portal" of some kind that organizes and presents the data in a useful and attractive way.  You may want to provide added value by providing browsing and searching, and tools for data analysis and for data centric collaboration.

The way to implement such functionality is to arrange for your collection to be exported and NFS mounted on one or more NeCTAR VMs, and then build the portal on those VMs. QRIScloud support can offer you advice on how to build and manage such a portal.  Depending on your requirements, QCIF Engineering Services may be able to implement a portal for you.

Hints for troubleshooting

TBD

 
