Section 2: Basic QRIScompute questions

Q2.1 - What is cloud computing?

Cloud computing is a metaphor for doing computing tasks on a computer infrastructure run by someone else "on the internet". (The origins of the term are uncertain, and there is no single precise definition.) The difference between cloud computing and the classic IT service model is that the infrastructure you use is typically owned and run by service providers that are external to your organization.

In this case, QCIF runs the Brisbane- and Townsville-based cloud computing infrastructure on behalf of QCIF member organizations, and we form part of the NeCTAR Research Cloud, an Australia-wide cloud computing federation with capacity in all States and in the ACT.

Q2.1.1 - What is a virtual machine?

Virtual machines (VMs) allow a physical computer to be shared among a number of users, with each user appearing to have exclusive access to the machine.

Virtual machines are typically implemented using software known as a "hypervisor" which mediates each virtual computer's access to the physical computer hardware, and stops the VMs from interfering with each other.

Q2.1.2 - Is cloud computing like HPC?

Not really.  Typical cloud computing systems are built using standard computing hardware that is optimized for cost-effective performance rather than for raw speed.  By contrast, High Performance Computing (HPC) systems tend to provide high-end processors, offering some combination of large numbers of cores, lots of memory, high-performance inter-processor communication and high-performance disk I/O.

Despite this, a lot of computational tasks that run on HPC systems will run just fine on a cloud computing facility.  If you want advice on this, please contact QRIScloud support, and we can arrange for an eResearch Analyst to look at your computational problem and help you figure out the best way to address it.

Q2.2 - What is the NeCTAR Research Cloud?

The NeCTAR Research Cloud is a federation of cloud computing facilities located in each of the Australian State capital cities and in Canberra.  The infrastructure is implemented and managed using the OpenStack cloud computing framework.

Q2.2.1 - What is OpenStack?

"OpenStack is a set of software tools for building and managing cloud computing platforms for public and private clouds. Backed by some of the biggest companies in software development and hosting, as well as thousands of individual community members, many think that OpenStack is the future of cloud computing. OpenStack is managed by the OpenStack Foundation, a non-profit which oversees both development and community-building around the project" - source.

Q2.2.2 - What is a NeCTAR RC "Project"?

A NeCTAR RC project consists of a collection of resources (Instances, Objects, Volumes and so forth) that project members can use.  A project is managed by a project manager (who controls who the members are) and has an associated NeCTAR Allocation; see below.  (Refer to FAQ 4.1 for explanations of "Instance", "Objects" and "Volumes".)

Q2.2.3 - What is a NeCTAR RC "Project Trial" project?

A Project Trial (PT) is a NeCTAR RC project with limited resources and time-span that is intended to let you try out the cloud before you commit to using it.  A PT has the resources for running up to 2 instances using up to 2 VCPUs, and a time limit of 3 months.

Q2.2.4 - How do I get a PT?

Simply visit the NeCTAR RC Dashboard. You will first be directed to your home institution's AAF login page. Then you will be asked to read the NeCTAR terms and conditions. Finally, a PT project will be created automatically for you.

Q2.3 - How do I apply for NeCTAR RC resources?

Visit the NeCTAR RC Dashboard (see above), and fill in and submit an application using the Request an Allocation page.  You will need to set out your resource requirements and your project duration, and provide a research description and a technical justification for your resource request. NeCTAR RC resources are allocated based on the research and technical merit of your application, the resources you are applying for, and resource availability.

We encourage you to contact QRIScloud support if you need help in making the application.  We can arrange for a QCIF eResearch analyst to advise and assist you.

Q2.3.1 - What is a NeCTAR Allocation?

A NeCTAR allocation is effectively permission for you and your team to use up to a certain level of NeCTAR cloud resources over a particular period of time.  The allocation provides the resource quotas for a NeCTAR project.
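
To see the quotas that your allocation currently provides, you can query the compute limits API.  Below is a minimal sketch (not official NeCTAR documentation) using the openstacksdk Python library; it assumes your OpenStack credentials are set in the usual OS_* environment variables, and the exact attribute names can vary a little between SDK versions.

    # A minimal sketch of inspecting the compute quotas your allocation
    # provides, using openstacksdk. Credentials are read from the OS_*
    # environment variables.
    import openstack

    conn = openstack.connect()
    limits = conn.compute.get_limits().absolute
    print("Instance quota:", limits.instances)
    print("VCPU quota:", limits.total_cores)
    print("RAM quota (MB):", limits.total_ram)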

Q2.3.2 - What NeCTAR resources should I apply for?

The basic computational resources that you need to apply for are Instances, VCPUs and VCPU-hours.  These come with a modest amount of disk storage (see Flavor) associated with your virtual machines.  In addition, you can apply for VM-independent NeCTAR storage in the form of Object Storage and / or Volume Storage.

The terms used above (Instance, VCPU, VCPU-hours, virtual machine, Flavor, Volume Storage, Object Storage, etc) are explained in section 4 of the FAQ.

Q2.3.3 - Does a NeCTAR allocation guarantee me access?

Unfortunately, no.  A NeCTAR allocation gives you quotas for a given number of Instances and VCPUs. However, when you attempt to launch an Instance, the launch can still fail with an error message.

This can be caused by a variety of things, but a common cause is that OpenStack could not find the required number of free cores or the required amount of memory in the specified Availability Zone. If this happens, you could try launching a smaller Instance, or launching in a different (less full) Availability Zone.
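
As an illustration, here is a minimal sketch, using the openstacksdk Python library, of launching an Instance and falling back to a second Availability Zone if the first attempt fails.  This is not an official NeCTAR example, and the image, flavor and zone names are hypothetical placeholders.

    # A minimal sketch of launching an Instance with a fallback Availability
    # Zone, using openstacksdk. Credentials come from the OS_* environment
    # variables; the image, flavor, and zone names below are placeholders.
    import openstack
    from openstack.exceptions import SDKException

    conn = openstack.connect()

    flavor = conn.compute.find_flavor("m1.small")             # hypothetical
    image = conn.image.find_image("NeCTAR Ubuntu 20.04 LTS")  # hypothetical

    for zone in ["QRIScloud", "melbourne-qh2"]:  # try a less full zone next
        try:
            server = conn.compute.create_server(
                name="my-test-instance",
                flavor_id=flavor.id,
                image_id=image.id,
                availability_zone=zone,
            )
            conn.compute.wait_for_server(server)  # raises if it goes to ERROR
            print("Launched in", zone)
            break
        except SDKException as exc:
            print("Launch failed in", zone, "-", exc)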

Q2.3.4 - Should NeCTAR be used for training?

Training is a grey area. The current NeCTAR policy is that the resources can be used for training purposes at the discretion of NeCTAR nodes. We advise the following:

  • It is inadvisable to expect students or trainees to use their PTs in a training course.  The exception is basic "How to get started with NeCTAR" training, with the proviso that the trainer should instruct the users to Terminate their instances.
  • If a lecturer or trainer requests an allocation for training purposes, the onus is on them to ensure that best practice is followed:
    • If trainees are allowed to launch instances, they should be properly advised on how to secure instances, and on the need to Terminate them promptly.
    • In either case, the lecturer / trainer (or the staff member who requested the allocation on their behalf) is responsible for the "housekeeping".
  • If an allocation is required for student project work, the allocation should ideally be requested by the supervisor. Alternatively, the supervisor should be listed as the Chief Investigator. In either case, the supervisor should take responsibility once the project is completed.

Q2.4 - What can I run on QRIScompute?

We place no restrictions on the kinds of application that you can run, provided that they meet the general rules set out by NeCTAR.

Q2.4.1 - Can I run Microsoft Windows on QRIScompute?

The simple answer is No.

For legal reasons, we (QCIF) cannot run Microsoft Windows on any of the hardware that runs QRIScloud, and we cannot allow users to do this either.

Q2.4.2 - Can I run licensed software on QRIScompute?

In principle, yes. However, we cannot give a definite answer to this question without examining what the license conditions are. Please contact QRIScloud support for advice.

Note that software licenses that are tied to specified IP addresses or MAC addresses can be problematic.

Q2.4.3 - Can I run application "XYZ" on QRIScompute?

Generally speaking, if an application runs on a modern version / distribution of the Linux operating system, it will run on a QRIScompute virtual machine. 

Q2.4.4 - Can I run GUI based applications on QRIScompute?

Yes, you can. However, the standard NeCTAR Linux images do not have a "desktop environment" installed, so you will typically need to install a substantial number of additional packages.

Q2.5 - Can I get a Service Account?

NeCTAR has recently started providing service accounts for NeCTAR projects that need them.

If you are a NeCTAR tenant manager, you can request a service account for your project via NeCTAR Support. In the service request, mention that you want a "robot account" and say which NeCTAR project it should be associated with. (A request for a robot account on a PT would be refused.)

Q2.5.1 - Why would I need a Service Account?

If you are running a service on NeCTAR instances, you may need unattended scripts that interact with OpenStack services.  For example, you might want your instance's nightly backup script to save a copy of the backup into Swift Object Storage.  When a script interacts with OpenStack, it needs to provide an identity and an OpenStack password. If you embed your own NeCTAR identity and password into a script then:

  • You are exposing your personal identity information to anyone with root access to the instance, or the ability to gain root access.
  • If you reset your personal password, you will break all of the scripts where you have embedded it.

Using a service account ameliorates these problems.
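
For example, a nightly backup script might look something like the following sketch, which uses the openstacksdk Python library.  The container and file names are hypothetical, and the script assumes that the service account's credentials (not your personal ones) are supplied via the standard OS_* environment variables rather than being embedded in the script.

    # A minimal sketch of a nightly backup upload to Swift Object Storage
    # using openstacksdk. Assumes the OS_USERNAME / OS_PASSWORD environment
    # variables hold the service ("robot") account credentials, not your
    # personal NeCTAR identity. Container and file names are hypothetical.
    import openstack

    conn = openstack.connect()

    container = "nightly-backups"           # hypothetical container name
    backup_file = "/var/backups/db.tar.gz"  # hypothetical backup path

    conn.object_store.create_container(name=container)  # idempotent in Swift
    with open(backup_file, "rb") as fh:
        conn.object_store.upload_object(
            container=container,
            name="db-latest.tar.gz",
            data=fh.read(),
        )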

Q2.6 - What is QRIScompute Special Compute?

As part of the QRIScloud Stage 2 equipment tranche, we purchased some compute nodes with extra resources.  These form the basis of our "Special Compute" services: "big memory" and "GPU".

You can request access to Special Compute facilities via the QRIScloud Portal Services page.

Q2.6.1 - What is available on the Big Memory compute nodes?

QRIScloud currently has 4 Big Memory compute nodes which each have 64 cores and 1 Terabyte of RAM. Allowing for overheads, we are able to provide special flavors with the following dimensions:

Flavor          Memory   VCPUs   Primary disk   Ephemeral disk
qld.mem-900G    900GB    60      10GB           2048GB
qld.mem-510G    510GB    30      10GB           1024GB
qld.mem-255G    255GB    15      10GB           512GB
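
If your project has been granted access, these flavors should be visible to it.  The following openstacksdk sketch (illustrative only; it assumes OS_* credentials in the environment) lists any big-memory flavors available to your project.

    # A minimal sketch of listing the QRIScloud big-memory flavors visible
    # to your project, using openstacksdk.
    import openstack

    conn = openstack.connect()
    for flavor in conn.compute.flavors():
        if flavor.name.startswith("qld.mem-"):
            print(flavor.name, flavor.ram, flavor.vcpus, flavor.disk)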

Q2.6.2 - What is available on the GPU nodes?

QRIScloud currently has 4 GPU compute nodes which are available with the following dimensions:

Flavor                 Memory   VCPUs       Primary disk   Ephemeral disk
qld.xlarge-gpu-k20m    32GB     8 (Intel)   10GB           240GB
qld.xxlarge-gpu-k20m   32GB     8 (Intel)   30GB           240GB

Each GPU node has one Tesla K20m card. (We have tried configurations with 2 cards in one node but the "PCI pass through" proved to be problematic with OpenStack.)

Q2.6.3 - How do I get access to QRIScloud Special Compute?

Use the QRIScloud Portal Services page to request access. If you are eligible, and your use-case is a good match for the facility, we will set up an instance for you to use.  We normally allocate in blocks of 2 weeks, and we can snapshot your instance while some other user gets to use the service.

Q2.7 - What are Nimrod and Kepler?

Nimrod is a framework for running "parameter sweep" computations. Kepler is a tool for designing and running computational workflows. Nimrod and Kepler can be combined in various ways for doing large-scale scientific computations.

Q2.7.1 - How do I access QRIScompute Nimrod and Kepler?

The QCIF Nimrod team can arrange to run Nimrod / Kepler based computations on NeCTAR, MASSIVE and in other places. You can request access to these resources via the QRIScloud Portal Services page.

Q2.7.2 - Can I use other Nimrod / Kepler facilities?

The Terrestrial Ecosystems Research Network (TERN) project provides Nimrod and Kepler services; please read "Nimrod and Kepler Services on the CoESRA".

Q2.8 - What are Euramoo and Awoonga?

Euramoo was a "virtual" cluster implemented using NeCTAR OpenStack instances that was suitable for running low-end HPC jobs with modest CPU and memory requirements.  Euramoo has now been shut down.

The Intel compute hardware in Euramoo has been reconfigured as a cluster where the compute nodes are no longer virtualized, and renamed Awoonga. Awoonga now shares the "/home", "/30days" and "/90days" file systems with Flashlite and the UQ-only Tinaroo system.

For more information on the HPC / HTC systems, please refer to Section 6 of the FAQs.

Q2.9 - Can QCIF provide access to genuine HPC facilities?

Yes:

  • The old QCIF-funded Barrine system has been decommissioned, except as a way to access files stored in Barrine's HSM storage.
  • QCIF manages a significant number of "shares" of resources in the NCI HPC facilities.
  • The QCIF-funded Flashlite "data intensive" HPC system is now available.  It is designed for batch jobs that require a lot of memory or high-performance node-local file I/O.
  • The QCIF member universities have submitted a LIEF grant proposal for a new large general-purpose HPC system.  If this grant is approved, the facility should be open to all QCIF members' users.

We may also be able to help you with access to member university HPC facilities at UQ, QUT, USQ and CQU.

Q2.9.1 - How do I get access to QRIScompute HPC facilities?

  • You can register for access to Flashlite and Awoonga, via the QRIScloud Portal Services page.
  • You can request to use QCIF-controlled NCI shares via the QRIScloud Portal Services page.
  • For access to other HPC resources, please open a QRIScloud Support request and we will arrange for someone to discuss your needs and put you in contact with the appropriate group.

Q2.10 - What is QRIScomputeHA?

The QRIScomputeHA service provides applications with a higher level of availability and reliability for compute and storage.  It is designed for implementing long-lived research data portals and similar applications that need higher levels of uptime.

QRIScomputeHA resources are requested and managed through the Nectar dashboard. For more information, please refer to the "Getting Started Guide for QRIScomputeHA".  Before making an allocation request, you should familiarize yourself with Nectar cloud basics and Nectar allocation management.  It is also advisable to "cut your teeth" using regular NeCTAR instances.

From a technical perspective, the HA cluster is part of the QRIScloud NeCTAR availability zone, and is substantially dependent on common NeCTAR services such as Keystone, Nova compute, Neutron networking, Swift and Glance.  The things that differentiate QRIScomputeHA from "regular" QRIScloud NeCTAR compute are:

  • The QRIScomputeHA compute nodes use Ceph storage rather than compute node disks to hold the file systems for instances. This means that we can safely migrate instances between compute nodes.  This should avoid the need for instance downtime during hypervisor upgrades, and should allow faster recovery from compute node hardware failure.
  • The QRIScomputeHA compute pool is separate from the main QRIScloud compute pool, and we reserve sufficient capacity to allow migration, recovery and (when required) placement of specified instances on different compute nodes.
  • The QRIScomputeHA volume storage pool is separate from the main QRIScloud volume storage pool.

There are some important caveats / reasons why we describe this as "higher availability" not "high availability":

  • QRIScomputeHA is dependent on network access to Polaris and within Polaris. This access is occasionally disrupted for various reasons.
  • The dependencies on regular NeCTAR services mean that disruption to NeCTAR can affect QRIScomputeHA.
  • Implementing a highly available service is more complicated than just using QRIScloud HA. You typically need to design your system to include a load balancer, redundant front-end servers, and a highly available database or file system backend.  Doing this properly requires experience; a small illustrative sketch follows this list.
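
As a small illustration of one piece of such a design, the sketch below shows the kind of health-check endpoint that a load balancer polls to decide whether a front-end server should stay in the pool.  It is purely illustrative (not QRIScloud code) and uses only the Python standard library.

    # A minimal, purely illustrative health-check endpoint for a front-end
    # server behind a load balancer. Uses only the Python standard library.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health":
                # A real check would also verify that the database or other
                # backend is reachable before reporting healthy.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()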