Section 3: Basic QRISdata questions
By analogy with cloud computing, cloud storage is where you store your data on infrastructure run by someone else "on the internet".
Yes it is. The models for resourcing and allocating QRISdata storage are different from commercial cloud storage providers, but QRISdata storage qualifies as cloud storage.
QRISdata storage is held in physically secure data centres in Brisbane and Townsville, on systems that are managed by IT professionals. It provides a more reliable place to store research data than putting it on USB drives and portable hard disks.
QRISdata is designed to complement well-managed data storage provided by your institution. In addition to simple storage, QRISdata's merit-based allocation processes encourage sharing and publication of data as a way to foster research collaboration, and preserve valuable research results.
You can start the process of applying for QRISdata storage via the QRIScloud Portal's Services page. Someone from the QCIF eRA team will then contact you to discuss your needs, and prepare the formal RDSI storage application.
You can apply for as much storage as you need: a few gigabytes to hundreds of terabytes, or even more. It is a relatively simple process to increase the allocation size, if you find that you need more space; see FAQ 3.9 on how collection usage limits are implemented.
However there are a couple of practical caveats:
- Requests for large amounts of storage are subject to capacity constraints.
- For really large requests, we are only able to provide HSM collections; see FAQ 3.3.4.
Please do not apply for more space than you need. Our stakeholders measure us on actual data stored, not on allocated storage. If you are allocated storage and don't use it within the agreed time-frame, we reserve the right to take it back.
- Requests for 1 terabyte or more will be assessed by the QCIF RDSI Resource Allocation Committee (RAC). This can take up to 1 month, so we will often "pre-allocate" some storage for you to get started.
- For smaller storage requests, a "fast track" assessment procedure is used.
Your storage will then be provisioned and we will send you the collection's details and links to the QRIScloud collection storage documentation. We will SMS the collection's passwords to your mobile phone number.
No. RDSI and NeCTAR resources are requested, allocated and accounted using different processes and mechanisms.
There is no clear official definition from RDSI that says what a Collection is. However, we take it to encompass a collection of data (e.g. files) that is related to scientific / academic research activities.
ReDS stands for Research Data Services. It is / was the merit allocated component of RDSI storage capacity. ReDS collections are intended to be research data holdings of lasting value and importance.
CDS stands for Collection Development Storage. It represents the 20% of storage capacity that was set aside for each RDSI "node" for the purposes of developing collections that would later qualify as ReDS collections.
Data is eligible for storage as RDSI ReDS storage if it is "nationally significant data". This is defined by RDSI as data that is:
- "viable and relevant as an input for future research",
- "intended to be available to and usable by other researchers, with adequate supporting metadata", and
- "recognised by a research organisation as valuable".
The RDSI merit allocation process takes into account a number of factors including:
- The storage capacity of the node
- Your time-frame and readiness to ingest data.
- The significance of the data collection itself; see above.
Applications for well-described, widely relevant data collections that can be quickly ingested and openly shared are viewed favourably.
We currently offer two kinds of NFS-based RDSI storage:
- Disk storage for collections up to 100 terabytes.
- HSM storage for larger collections, and for collections that do not need to be immediately accessible.
In the future we will also be adding:
- RDSI volume storage, which will have the same operational properties as NeCTAR volume storage.
- Collection storage managed by Mediaflux.
The storage that holds collection data is divided into Storage Pools. Each pool corresponds to a file system on one of the NFS servers.
- The main "Tier 1" and "Tier 2" disk storage pools are 229 terabytes in size. (There are also some smaller pools for special use-cases.)
- There are three storage pools for HSM storage:
- The "Tier 3a" pool has a large (120 terabyte) front-end disk cache, and is designed for use-cases where files are likely to be accessed occasionally.
- The "Tier 3b" pool has a smaller (30 terabyte) disk cache, and is designed for holding archival data.
- The "Tier 3c" pool also has 30 terabyte disk cache, but it is configured with a two-copy replication policy, and is reserved for replicas of collections where the primary copy is at a different RDSI node.
This size was chosen so that a pool can be restored from tape in 24 hours in the event of a total failure of the pool's file system.
We don't have enough disk space to give everyone what they would like. Building and running a large-scale disk array costs a significant amount of money. Since QRIScloud is not operated on a "cost recovery" basis, the economics mean that resources need to be rationed.
HSM stands for Hierarchical Storage Management. With HSM, the primary copy of your files will be held on tapes, with a cache copy held on disk for faster access. The problem is that the HSM disk caches are relatively small compared to the amount of data held on tape. If you try to access a file that is no longer in the cache, it has to be retrieved from tape.
As a consequence, HSM is not suitable if you require fast access to your files at all times. Furthermore, if you are going to access a number of files in a short space of time, you need to use the DMF tools to instruct the HSM system to perform a bulk retrieval.
As of January 2016, we will provide a new collection owner with the collection identifier, together with instructions on how to access the collection and how to manage the access control groups (if applicable).
Each collection has a unique identifier of the form "Qnnnn" where "nnnn" is a 4 digit number.
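As a minimal sketch, a script that accepts collection identifiers could check them against the "Qnnnn" form described above. The function name is hypothetical and is not part of any QRIScloud tool.

```python
import re

# Pattern taken from the text: the letter "Q" followed by exactly four digits.
COLLECTION_ID = re.compile(r"Q[0-9]{4}")


def is_collection_id(value: str) -> bool:
    """Return True if value has the form Qnnnn, for example Q0042."""
    return COLLECTION_ID.fullmatch(value) is not None


print(is_collection_id("Q0042"))
print(is_collection_id("Q42"))
```

This only checks the shape of the identifier; it cannot tell you whether a collection with that identifier actually exists.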
The access methods determine how you and your collaborators will be able to access the data in your collection. This needs to be specified when a collection is provisioned. There are currently 3 access methods to choose from for new collections: Standard, NFS-only and Mediaflux. Some old collections use a Legacy access method, but these are in the process of being migrated (Q 3.4.7).
The access method also determines how per-user access control works:
- For Standard and Mediaflux collections, per-user access is implemented using access groups.
- For Legacy collections, access control uses per-collection shared credentials.
- For NFS-only collections, access control is your responsibility.
The Standard access methods allow you to read and write your data using a variety of tools. These tools include Cyberduck, Filezilla and WinSCP, various command-line utilities, and fast file transfer tools like GridFTP and Aspera.
In addition, collections with Standard access are NFS mounted on Euramoo, Flashlite and (soon) Tinaroo. This will allow users of these systems who are members of the appropriate access groups to be able to access the collection data via the file system.
The NFS-only access method allows you to NFS mount the collection on NeCTAR instances running in the QRIScloud availability zone. You need to nominate the NeCTAR tenants that can mount the collection, and then you need to configure the mounts on each instance. Once you have done that, you can implement whatever data access services and access controls you want.
For more information, please refer to NFS access to QRIScloud Collections.
Mediaflux is a sophisticated and powerful data management product that is available for managing RDSI collections. For more details, please refer to the Getting Started with Mediaflux document, and the collection of training videos that it links to.
Yes, you can change your mind.
- Switching between "Standard" and "NFS-only" is relatively straight-forward, but it does entail a collection outage.
- Switching between "Mediaflux" and other access methods is a significant amount of work as it entails copying all existing data in your collection. This is likely to require an extensive outage.
We are currently in the process of migrating all collections to use one of the new access methods. Please refer to Upcoming changes to QRIScloud Collections for an overview. All owners of affected collections have been contacted about the migration, and the need to choose an access method.
There is a document called Guide to Managing Collection Access that explains how a collection administrator can manage user access. Note that this only applies to collections configured with the Standard or Mediaflux access methods.
An access group is essentially a managed list of (specific) people with access to a (specific) collection. Each collection has two access groups: a "read-only" group and a "read-write" group. Within each group, a person can have one of three roles: "user", "administrator" or "owner".
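For illustration only, the access group model could be represented in Python as below. The class and method names are hypothetical and do not correspond to any QRIScloud API. The assumption that administrators and owners are the people who approve access requests is taken from the access request procedure described elsewhere in this FAQ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Role(Enum):
    USER = "user"
    ADMINISTRATOR = "administrator"
    OWNER = "owner"


class Access(Enum):
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


@dataclass
class AccessGroup:
    """Hypothetical model of one access group for one collection."""

    collection_id: str
    access: Access
    # Maps a person's account name to their role within this group.
    members: Dict[str, Role] = field(default_factory=dict)

    def can_write(self, person: str) -> bool:
        # Only members of the read-write group may modify collection data.
        return self.access is Access.READ_WRITE and person in self.members

    def can_approve_requests(self, person: str) -> bool:
        # Assumption: administrators and owners approve access requests.
        return self.members.get(person) in (Role.ADMINISTRATOR, Role.OWNER)


# Each collection has exactly two groups; the member names are made up.
rw = AccessGroup("Q0042", Access.READ_WRITE, {"alice": Role.OWNER, "bob": Role.USER})
ro = AccessGroup("Q0042", Access.READ_ONLY, {"carol": Role.USER})
```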
The procedure for "inviting" a person is described in the Guide to Managing Collection Access. This procedure works for any person with current AAF access. If you wish to "invite" people who do not have AAF access, please contact QRIScloud Support.
The procedure is described in the Guide to Managing Collection Access.
RDSI collections that are designated as "public" will appear in the Collections register. A collection page in the register will give:
- The collection's title and FoR codes.
- A link to the collection's portal (if one has been created and registered by the collection owner).
- Links for requesting direct collection access, subject to approval by the access group owners or administrators.
We do not provide a way for users to find out about non-public collections. However, if you are aware of a collection and want to request access, you can email the owner and ask them to "invite" you.
If you were expecting to be invited, copy and paste the URL into your web browser. If you were not expecting it, either ignore it (do nothing!) or reach out to the person who (apparently) sent it to you.
A valid "user" invitation URL will look like this:
where <hex-digits> is a string of 32 digits and letters ('a' through 'f'). If it looks different, be suspicious.
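If you want to check the <hex-digits> part mechanically, the rule above can be sketched as follows. The function name is hypothetical, and the sketch checks only the hex string itself, not the rest of the URL.

```python
import re

# From the text: 32 characters, each a digit or a letter 'a' through 'f'.
HEX_DIGITS = re.compile(r"[0-9a-f]{32}")


def looks_like_invitation_token(token: str) -> bool:
    """Return True if token is exactly 32 lowercase hexadecimal characters."""
    return HEX_DIGITS.fullmatch(token) is not None


print(looks_like_invitation_token("0123456789abcdef" * 2))
print(looks_like_invitation_token("not-a-token"))
```

A string that passes this check is not proof that an invitation is genuine; it only means the token has the expected shape.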
The first thing that happens is that you are directed to the QRIScloud Portal's AAF login page:
- If you belong to an AAF member organization (e.g. any Australian University), login as described in FAQ 1.5.4.
- If you do not have an AAF login, you can apply to QCIF for an AAF VHO account as described in FAQ 1.5.6. Once that has been set up, start this procedure again, using the AAF VHO as your organization.
If you don't have a QRIScloud account associated with your AAF identity, the next thing that happens is that you will be asked to register, and acknowledge the QRIScloud Terms and Conditions.
Finally, you will be sent to a form for requesting access to the collection. Fill it in and submit it, and your request will be sent to the collection's owners / administrators for approval.
Q3.5.7 - What happens when my request is approved?
When your request is approved, you will receive an email from the QRIScloud portal with instructions on accessing the collection.
In your MyServices page, you can see all of the collection access groups that you belong to. If you click on the link for a collection group, you will get a page that gives some details on accessing the collection.
No. QRISdata does not provide a backup service for collection data. Instead, we replicate the data to protect against storage system failures and major data centre catastrophes.
With a backup system, you would have a reasonable expectation that we could restore your files if you accidentally deleted or overwrote them. A typical backup system guarantees to keep old versions of files for months or years, and provides mechanisms that allow the backup administrator to restore them.
In a replication system, the primary goal is to keep copies of files so that we can restore to the most recent "known good" state of a collection. Restoration of files from older states may be possible, but is not the primary goal.
For on-disk collections, the file system is scanned periodically, and any file that has changed since the last scan is copied to a "shadow" HSM system.
For HSM collections (and the "shadow" HSM for on-disk collections), the replicas are created by the HSM system itself. The normal replication policy is to create two tape replicas of each file in the tape store in the Polaris data centre, and a third replica in the tape store in the St Lucia data centre.
The design goals for QRISdata collection replication state that the first on-site tape replica should be completed within 24 hours, and that the off-site replica should be completed within 48 hours.
The design goals for QRISdata collection replication state that the on-site replica of a file should be retained for 4 weeks after deletion, and the off-site replica should be retained for 12 weeks.
In the design phase, RDSI decided not to fund the provision of backup for collection storage. Instead, they directed their funding to maximize the available storage.
One of the issues is that a tape library system (as used in a typical HSM system) can hold a maximum number of tapes. Once all of the tapes are full of data, you have to start manually adding and removing tapes. In our case, it is simply not practical to do this on a day-to-day basis, because the tape libraries are in a data centre that is 30 kilometers away, and we do not have operations staff stationed there.
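To make the time windows concrete, here is a small sketch that turns the stated design goals into dates. The function names are hypothetical, and the figures are design goals rather than guarantees, so treat the results as indicative only.

```python
from datetime import datetime, timedelta

# Design goals quoted in the text above.
ONSITE_REPLICA_TARGET = timedelta(hours=24)
OFFSITE_REPLICA_TARGET = timedelta(hours=48)
ONSITE_RETENTION = timedelta(weeks=4)
OFFSITE_RETENTION = timedelta(weeks=12)


def replication_targets(changed_at: datetime):
    """Return (on-site, off-site) times by which replicas should exist."""
    return changed_at + ONSITE_REPLICA_TARGET, changed_at + OFFSITE_REPLICA_TARGET


def retention_deadlines(deleted_at: datetime):
    """Return (on-site, off-site) times until which replicas of a deleted file should be kept."""
    return deleted_at + ONSITE_RETENTION, deleted_at + OFFSITE_RETENTION


onsite, offsite = retention_deadlines(datetime(2016, 1, 1))
print("on-site replica retained until:", onsite.date())
print("off-site replica retained until:", offsite.date())
```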
Collection data can be accessed in the following ways, depending on the collection's access method:
- For Standard access:
- Using the "data.qriscloud.org.au" access system; see Q3.7.1.
- Using SSH based protocols such as "scp", "rsync" and "sftp" via the above system; see Q3.7.1.
- Using Globus GridFTP; see Q3.7.2.
- Using Aspera Shares or Aspera Drive; see Q3.7.3.
- Via auto-provisioned NFS mounts on Euramoo, Flashlite and Tinaroo
- For Mediaflux access; see Q3.4.5.
- Using the Arcitecta Desktop
- Using the Arcitecta File Explorer
- Using a custom Mediaflux portal.
- For NFS-only access; see Q3.7.4
- Via an NFS mount on a NeCTAR VM
- Using data access services that you set up on such a VM
The "data.qriscloud.org.au" system is a load-balancer for two access machines: "ssh1.qriscloud.org.au" and "ssh2.qriscloud.org.au". These machines allow you to read and write files using SSH and SSH-based file transfer protocols.
- You can login to the machines ("data", "ssh1" or "ssh2") using an SSH client such as the "ssh" command on Mac OSX and Linux, or "putty" on Windows:
- Use your QSAC to login, or your UQ credentials if the account names match: see Q1.2.6, Q1.2.7 & Q1.2.8.
- Once logged in, you will have a standard Linux command environment, similar to what you have when you connect to a NeCTAR VM and typical HPC systems.
- The files for each collection are auto-mounted as "Qnnnn" directories beneath the "/data" directory. Access is restricted to users who are members of the respective collection access groups.
- You can use a desktop file transfer command or tool to copy files between your desktop and the collection via the access machines:
- On Windows you can use Cyberduck, Filezilla, WinSCP among others.
- On Mac OSX or Linux you can use Cyberduck (Mac OSX only), Filezilla and command line tools such as "scp", "rsync" or "sftp".
- The path to your collection will be as above; i.e. "/data/Qnnnn", where "Qnnnn" is your collection's identifier.
- The "ssh1" and "ssh2" machines are preferred over "data". The latter is connection rate-limited which can cause problems when transferring lots of files.
- If you have an account on one of the HPC systems that mount the collections (i.e. Euramoo, Flashlite and Tinaroo) you can access the files via the "/RDS" directory.
Note: if you are transferring large numbers of small files, you will typically get better performance using "rsync" rather than "scp" or "sftp". The latter need to create a separate SSH-enabled TCP connection for each file transferred. When the files are small, the connection overhead is large compared to the time taken to transfer the bytes.
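As a rough sketch, the following Python helper assembles an "rsync" command line for uploading a local directory to a collection. The helper name, the user name "jbloggs" and the directory "results" are made up for illustration; the host name and the "/data/Qnnnn" path come from the text above. The sketch only builds and prints the command, it does not run it.

```python
import shlex
from typing import List


def build_rsync_command(local_dir: str, collection_id: str, user: str,
                        host: str = "ssh1.qriscloud.org.au") -> List[str]:
    """Build an rsync command that copies the contents of local_dir into a collection."""
    # A trailing slash on the source tells rsync to copy the directory's contents.
    source = local_dir.rstrip("/") + "/"
    destination = f"{user}@{host}:/data/{collection_id}/"
    # -a preserves the directory tree and file attributes;
    # --partial keeps partially transferred files so an interrupted run can resume.
    return ["rsync", "-a", "--partial", source, destination]


cmd = build_rsync_command("results", "Q0042", "jbloggs")
print(shlex.join(cmd))
```

To actually run the transfer you could pass the list to "subprocess.run", or simply paste the printed command into a terminal.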
Globus GridFTP is a file transfer protocol that is designed for large-scale transfers on high bandwidth networks. The Globus GridFTP for data transfer document provides brief instructions on using GridFTP with QRISdata collections.
Aspera is a suite of proprietary high-speed file transfer software. We have a license that allows all QRIScloud users to download, install and use the client-side software. The primary services are:
- Aspera Shares which provides web-based access to your collection.
- Aspera Drive which supports "synchronization" of files between a collection and another computer.
For more details, please refer to Getting Started with Aspera.
- Unfortunately, we have been unable to get Aspera to work properly from network locations in the USQ campus due (we believe) to issues with USQ's network firewalls. We recommend that you use SSH based protocols instead. (Or QRIScloud's NextCloud service.)
- Aspera does not perform well for transfers that involve moving a lot of small files. If you have many small files to transfer you will get a much better file throughput using "rsync"; see Q3.7.1.
- We no longer support the Aspera "ascp" command line tool.
The first step is that the collection manager needs to lodge a QRIScloud support request to "export" the collection to a specified NeCTAR project. Once that has been done, the NFS access to QRIScloud Collections document explains how to set up an NFS mount on a NeCTAR instance.
You are free to install and use other software on your NeCTAR instance, and use that to manage your collection data. However, the onus will be on you to manage security and access control.
Note: NFS access is only available for "NFS-only" collections. NFS-only and Standard Access are mutually exclusive, and NFS-only collections are NOT auto-mounted on the HPC / HTC systems.
That is a complicated question.
- On the one hand, exporting your collection does not directly expose it to the internet. The exported collection is only directly accessible via a private IP address, and it should be impossible for anything outside of the Polaris data centre to access it.
- On the other hand, once your collection has been attached to an instance in your NeCTAR project, anyone who has (or can gain) privileged access to the instance has unfettered access to that data.
Thus, while providing NFS access to your collection is not insecure per se, it is definitely increasing the risk to your data.
(In theory, the cryptolocker problem exists for NFS mounts on Linux; see Q3.7.6. However there have been no reports of cryptolocker criminals targeting Linux systems. A typical Linux-based NeCTAR instance is less likely to be targeted because 1) it is not a Windows system, and 2) because the standard NeCTAR images don't include a web browser or email client. Web pages and email are the most common attack vectors.)
We strongly recommend that you DO NOT mount your collection on your home system (or any other system outside of QRIScloud) as a "network share". Doing this will place your RDSI data at risk. Please read this page for more information.
The short explanation is that Windows network shares make your data vulnerable if your home system is infected with cryptolocker ransomware. If your RDSI collection data does get locked by cryptolocker criminals, please contact QRIScloud support urgently.
The normal method for "ingesting" data into a collection is to go to the system where the data currently lives and "upload" it to the collection using one of the supported access methods; see Q3.7.
If your data is held on removable media (e.g. external hard drives, memory sticks, DVDs) you will need to plug them in one at a time.
If you have really large amounts of data to ingest, then conventional upload is liable to be problematic. Transferring terabytes of data over the network takes a long time, especially if your local networking is slow. Some of the alternatives that we can try include:
- Using a high bandwidth transfer method such as Globus GridFTP or Aspera
- Running transfers as background processes.
- Optimizing transfer patterns; e.g. instead of downloading files from somewhere to your laptop or PC and then uploading them, transfer them directly.
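To get a feel for how long a large upload might take, a back-of-the-envelope estimate can be sketched as below. The function name and the "efficiency" parameter are hypothetical; real transfers are affected by protocol overhead, other network traffic and storage speed, so treat the result as a best case.

```python
def transfer_time_days(size_bytes: float, link_mbit_per_s: float,
                       efficiency: float = 1.0) -> float:
    """Estimate the transfer time in days for size_bytes over a link of the given speed.

    efficiency is the fraction of the nominal link speed actually achieved (0 < efficiency <= 1).
    """
    bits = size_bytes * 8
    usable_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    seconds = bits / usable_bits_per_s
    return seconds / 86400  # 86400 seconds in a day


TB = 10 ** 12
# 10 terabytes over a 100 Mbit/s link at full nominal speed.
print(round(transfer_time_days(10 * TB, 100), 1), "days")
```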
If you have lots of really small files to ingest / upload then you are going to be in for a hard time, no matter what approach you take. We strongly advise you NOT to do this. Instead, we recommend that you use a utility like "zip" or "tar" to bundle up the files into larger "archive" files before you upload them. If you need to compute against the little files, there are two approaches:
- Copy or download the ZIP / TAR archive file to a local file system on the machine where you are doing the computation, unzip / untar the bundle into a local directory (i.e. not NFS mounted!) and compute against that tree.
- Modify your application so that it can open and use the archive file directly. (There are standard runtime libraries for doing this in most mainstream programming languages.)
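Both ideas can be sketched with Python's standard "tarfile" module: one function bundles a directory of small files into a single compressed archive before upload, and another reads one file straight out of the archive without unpacking it. The function names are hypothetical, and the sketch assumes the archive is on a local file system.

```python
import tarfile
from pathlib import Path


def bundle_directory(src_dir, archive_path) -> None:
    """Bundle src_dir and everything beneath it into a gzip-compressed TAR file."""
    src = Path(src_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps member names relative, e.g. "samples/a.txt".
        tar.add(str(src), arcname=src.name)


def read_member(archive_path, member_name: str) -> bytes:
    """Read one file directly from the archive without extracting it to disk."""
    with tarfile.open(str(archive_path), "r:*") as tar:
        handle = tar.extractfile(member_name)
        if handle is None:
            # extractfile returns None for members that are not regular files.
            raise ValueError(f"{member_name} is not a regular file")
        return handle.read()
```

For example, after calling bundle_directory("samples", "samples.tar.gz"), a file originally at "samples/a.txt" can be read with read_member("samples.tar.gz", "samples/a.txt"). Note that reading a member from a compressed TAR file means scanning the archive from the start, so applications that need random access to many members may be better served by ZIP files or uncompressed TAR files.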
QCIF has a couple of systems that provide alternatives to over-the-net uploading.
- DustBuster is a portable 20 terabyte NAS system. We can loan you this system temporarily to load up your data. When you return the system, we can plug it into a fast network and upload the data to your collection.
- Hoover is a system that we can use to read and upload data from a range of portable USB media.
We have access to fast node-to-node data transfer mechanisms that can be used for high-volume ingestion.
If you need advice or assistance with ingesting your data, please contact QRIScloud support.
For Tier 1 and Tier 2 collections, we enforce quotas on each collection's data usage. For Tier 3 (HSM) collections, quotas are not currently enforced.
Disk usage for on-disk collections is controlled by quotas implemented using the XFS quota mechanism on the NFS server:
- The "soft" quota is set to your current allocation size. (The quota system allows you to exceed the soft quota for a short period of time; see Q3.9.2.)
- The "hard" quota is set to a value larger than your current allocation size.
Currently, the hard quota is set at twice the allocation size, but this is subject to change.
The following applies to Tier 1 or Tier 2 collections:
- When the collection goes over its soft quota, you should start seeing a message each time you login to the collection VM, warning you that you need to reduce your data usage. You can also check this by running the "quota" command.
- When the collection goes over its hard quota, you will be unable to create or modify files. The only thing that you can do is to delete files.
- If a collection is over soft quota for more than 50 days, the quota violation escalates to a hard quota violation.
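The quota rules can be summarised in a short sketch. The names below are hypothetical; the factor of two for the hard quota and the 50 day grace period are the current values quoted in this FAQ and are subject to change. The sketch treats usage at or above the hard limit as a hard quota violation, which is a simplification.

```python
from enum import Enum

HARD_QUOTA_FACTOR = 2        # hard quota is currently twice the allocation
SOFT_QUOTA_GRACE_DAYS = 50   # time allowed over soft quota before escalation


class QuotaState(Enum):
    OK = "ok"
    OVER_SOFT = "over soft quota"
    OVER_HARD = "over hard quota"


def quota_state(used_bytes: int, allocation_bytes: int,
                days_over_soft: int = 0) -> QuotaState:
    """Classify a Tier 1 or Tier 2 collection's quota state."""
    soft = allocation_bytes
    hard = allocation_bytes * HARD_QUOTA_FACTOR
    if used_bytes >= hard:
        return QuotaState.OVER_HARD
    if used_bytes > soft:
        # A long-running soft quota violation escalates to a hard one.
        if days_over_soft > SOFT_QUOTA_GRACE_DAYS:
            return QuotaState.OVER_HARD
        return QuotaState.OVER_SOFT
    return QuotaState.OK


TB = 10 ** 12
print(quota_state(used_bytes=3 * TB // 2, allocation_bytes=TB, days_over_soft=10))
```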
You need to either delete files or take other steps to reduce your usage below your allocated amount, OR request an increase in your RDSI allocation.
You should not wait until the soft quota violation escalates to a hard quota violation. When your collection is in that state, it is difficult to recover to a state where you can use your collection normally.
Contact QRIScloud support urgently and we will advise you on how to proceed.
BEWARE: The only thing that you can safely do when you are over hard quota is to delete files and directories. If you attempt to compress files or create ZIP files or TAR files in place, your files are liable to be truncated and lost.
Not yet. Currently, we only implement quotas on bytes stored, but we are also considering imposing quotas on the file counts.
Collections that contain large numbers of small files present a number of technical and operational problems. Data access performance, replication scanning, writing of replicas to tape and migration of collections are all impacted.
- If your collection is configured with "Standard" or "Mediaflux" access, then you can grant read-only or read-write access to your collection to anyone who has a QRIScloud account. (Anyone with AAF access can get a QRIScloud account.)
- If your collection is configured as "NFS only", it is up to you to implement your own collection access controls.
No. Usually your institution's library manages Digital Object Identifiers (DOI) and other identifiers for publications and data. Please contact your library for institution-specific questions, or Belinda Weaver at QCIF for general questions.
- We don't currently have a way to implement open access for Standard Access or Mediaflux collections. However, we can include your collection in the list of collections that other QRIScloud users can request access to.
- If your collection is NFS-only, then you are free to expose the data as you see fit. However, you need to be mindful of data security and privacy concerns.
This is not something that is currently supported. There are some other options:
- You could NFS mount your collection on a NeCTAR instance, then implement a portal that does fine-grained access control.
- A Mediaflux collection could be configured to provide fine-grained control.
- It is technically possible to split a collection into sub-collections with separate read-write and read-only access groups. However, this results in overhead for QCIF operational and administrative staff.
Your collection's metadata includes things such as:
- The collection's title and description
- The collection's custodian, requester and technical contacts.
- The collection's FoR codes.
- The organization and organizational unit that the collection belongs to.
- The URL of a public portal for the collection.
- The URL of a public metadata record for the collection.
If you need changes to be made to the metadata for your collection, please submit a QRIScloud support request.
Collections are stored as files and directories on a POSIX compliant file system. This means that the low-level access control mechanisms for files and directories in a collection are based on POSIX users and groups, POSIX file permissions and the POSIX ACL (access control list).
By contrast, the QRIScloud collection access model is based on each collection having one group of users with read-write (RW) access to the collection, and a second group of users who have read-only (RO) access. Ideally, the person who created the file is not supposed to have more access than anyone else.
The QRIScloud access model is implemented as follows:
- The RW and RO groups are implemented as POSIX groups. Thus collection Q0042 has POSIX groups called Q0042RW and Q0042RO.
- Each collection user maps to a distinct POSIX user.
- There is a distinct collection user identity for each collection (e.g. Q0042) which is used when the actual user identity is not available. For example, files and directories ingested using Aspera are owned by the collection's user identity.
- Each file or directory should have "rwx" as its owner and group access settings, and "---" for "other" access.
- Each file or directory should have the collection's RW group as its POSIX group.
- All directories should have the setgid ("inherit group") access bit set, so that newly created subdirectories are also owned by the RW group.
- All files and directories should inherit ACLs that:
- Grant "r-x" access to members of the collection's RO group.
- Provide default group access and umask settings.
- Forbid access to "other" users.
Unfortunately, there are some problems:
- The person who creates a file or directory will be the POSIX owner of the object. That means they will be able to change the access bits and the ACLs.
- If a file or directory (somehow) has access other than "rwxrwx---" / "drwxrws---", or has its POSIX group set incorrectly, or has its ACLs set incorrectly, it could prevent users in the RW or RO groups from doing what they should be able to do. The problems can spread to newly created subdirectories of a "broken" parent directory.
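If you suspect that part of your collection has incorrect permission bits, a scan along the following lines can help you find the affected files. This is a sketch only: the function names are hypothetical, it checks only the permission bits against "rwxrwx---" for files and "drwxrws---" for directories, and it does not check POSIX group ownership or ACLs. It does not check the top-level directory itself, only what is beneath it.

```python
import os
import stat
from typing import List, Optional, Tuple

EXPECTED_FILE_PERMS = 0o770    # rwxrwx---
EXPECTED_DIR_PERMS = 0o2770    # rwxrws--- (group rwx plus the setgid bit)


def check_mode(st_mode: int) -> Optional[str]:
    """Return a description of the problem, or None if the mode is as expected."""
    expected = EXPECTED_DIR_PERMS if stat.S_ISDIR(st_mode) else EXPECTED_FILE_PERMS
    if stat.S_IMODE(st_mode) == expected:
        return None
    wanted = stat.filemode(stat.S_IFMT(st_mode) | expected)
    return f"mode is {stat.filemode(st_mode)}, expected {wanted}"


def find_mode_problems(root: str) -> List[Tuple[str, str]]:
    """Walk the tree beneath root and list entries with unexpected permission bits."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st_mode = os.lstat(path).st_mode
            if stat.S_ISLNK(st_mode):
                continue  # symbolic links carry no meaningful permissions
            problem = check_mode(st_mode)
            if problem is not None:
                problems.append((path, problem))
    return problems
```

For example, find_mode_problems("/data/Q0042/some-subdirectory") returns a list of (path, problem) pairs. Scanning a large collection this way takes time in proportion to the number of files, so it is best limited to the subtree you are concerned about.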
If the owners, permissions or ACLs on a file or directory are incorrect, you have two options:
- If you are the owner of the file or directory (according to the "ls -l" command) you should be able to use the "chmod" command to correct permissions, and the "setfacl" command to add missing ACLs.
- If you are not the owner, or if you need to change the owner, then you should raise a support ticket.
It is also worth noting that fixing a collection takes time in proportion to the number of files. This is another case where having lots of small files causes pain.
Sometimes QRIScloud operations staff need to move a collection from one physical storage medium to another. Unfortunately, our infrastructure does not allow this to be done without an outage. It is also sometimes necessary to make access control changes that impact on your ability to use the collection.
If we need to migrate your collection, we will contact you and provide you with a detailed description of the migration procedure. There are some points in the procedure where we need to coordinate with you (the collection owner) via a support ticket. We would ask you to respond to migration tickets promptly.
We have written a "QRIScloud Collections Dos and Don'ts" document with recommendations on how to use QRISdata collections safely, and without causing operational problems.