Section 6: Using Euramoo and Flashlite
Other sources of information:
- RCC's documentation for Euramoo and Flashlite
- QCIF's Euramoo user documentation can be found here.
- Some user documentation is available using "module help" on the respective systems. Run "module avail" to list the available documentation modules, and "module help <module-name>" to see the documentation; e.g. "module help README" on Euramoo.
Euramoo and Flashlite are managed compute services offered by QCIF and UQ RCC. They provide login accounts for individual users, varying amounts of "operational" file storage, and a PBS batch job submission system.
Both Euramoo and Flashlite have a significant amount of computational capacity (cores, etcetera). However, unlike NeCTAR, the model is that the user does not have exclusive use, or guaranteed capacity.
Flashlite is an HPC or "high performance computing" system aimed at large-scale data intensive jobs that can benefit from large amounts of memory or large amounts of high-speed working "disk" storage. The hardware, the high-speed interconnect and the fact that Flashlite has a "bare-metal" operating system mean that MPI / OpenMPI jobs that span multiple compute nodes should work well.
Euramoo is aimed at job mixes where the individual jobs require 10 cores or less and a modest amount of memory (max 120Gb), and don't have significant file I/O or inter-node I/O requirements. We refer to it as an HTC or "high throughput computing" system.
These systems are NOT suitable for running the following:
- Web "portals".
- Anything requiring guaranteed "instantly available" compute capacity.
- Extremely long-running computational tasks. There is a 2 week wall-clock time limit for jobs on Flashlite and Euramoo.
- Massively compute intensive (without being data intensive) tasks.
- Anything that involves running Windows applications.
The links for applying for Euramoo and Flashlite access are on the QRIScloud Services page. If you need to use both Euramoo and Flashlite, please make two separate applications. These resources have different approval procedures.
Euramoo and Flashlite access is available to researchers at QCIF member and affiliate organizations. However, access to Flashlite may be refused if your computational requirements are not a good fit for Flashlite's capabilities.
We would like you to provide the following:
- A brief description of your research from a scientific perspective.
- A brief technical description of the kind of computations that you are intending to perform, including typical core and memory requirements, indicative job counts and durations, and temporary file storage requirements.
- Any applications that you intend to use. (But see Q6.3 etc below.)
- Requirements to access "group" resources.
You can request the creation of a Linux "group" to facilitate sharing of files with co-researchers. We will also consider requests for a shared directory with an associated file space quota:
- to enable sharing research data, and
- to facilitate installation of shared resources such as software or reference databases.
This page gives a catalog of software applications that are (or will be) available on Euramoo.
In both Euramoo and Flashlite, you can find out what is currently available by logging in and running the "module avail" command. On Euramoo there are a couple of caveats:
- The modules are different on the AMD and Intel node types. To see what is available on Intel, you need to start an interactive job (see Q6.5.6) on an Intel node and run "module avail" there.
- Euramoo Biolinux nodes do not use "modules". To find out what is available there, refer to the software list on the Biolinux website.
In general, you are free to build and install free applications in your home directory, provided that they are appropriate for the system, and don't have licensing constraints.
If you run into problems, we may be able to answer questions. However, we would strongly encourage you to try to solve them yourself first; e.g. by reading the build / install documentation, and Googling any error messages.
Generally speaking, no. Neither QCIF nor RCC has sufficient resources to provide a software installation service.
We encourage users to "self help" and to work together to install and maintain the applications that they use. We can facilitate this by providing file-spaces for people who want to be the "build master" for a group of applications.
First, check the following document to see if it answers your questions:
If not, then raise a QRIScloud support ticket, and we will advise if we are able.
The other places to look for help would be the software's support forum or mailing list (search the archives), and general web search; i.e. "google it".
Please lodge a QRIScloud support request, giving details of the software. The answer to this question will depend on a number of factors.
No. You cannot run Windows applications on Flashlite or Euramoo. The base operating system for all compute nodes is some version of Linux.
Euramoo has a mixture of Intel and AMD compute nodes, but the login nodes are AMD only. Currently, you need to be running on one of the Intel nodes to use the Intel compilers. The simple way to achieve this is to launch an "interactive job" (see Q6.5.6) with node type Intel, and load the Intel module.
You need to use SSH to login to Flashlite or Euramoo, supplying the username and password for your QSAC credentials. We recommend that you use the hostnames for the load-balancer nodes for logging in:
- Euramoo - "euramoo.qriscloud.org.au"
- Flashlite - "flashlite.rcc.uq.edu.au"
Users with a QSAC account name that matches their UQ account name can use their UQ password in place of their QSAC password when logging into Flashlite and Euramoo.
There are a few reasons login might not be working. For example:
- You could be using the wrong username and password, or the wrong SSH key.
- For Flashlite (and Tinaroo), if your account has just been enabled or your QSAC credentials have just been created, the changes can take up to 2 hours to propagate. During that period, login may fail.
- Your access could be temporarily blocked because you (or someone else) have had too many failed attempts to login from "your" IP address. If you are using a network connection on your institution's internal network, the IP address of the institution's NAT is the one that matters.
- There could be an outage; check the QRIScloud Announcements page for details of any current outages.
There is more information on the QRIScloud Known Issues page.
Your QSAC password can be reset from the QRIScloud Portal via the "MyCredentials" page. If you are using your UQ password for logging in, refer to the UQ ITS website for instructions on how to change or reset that password.
Yes you can. Follow the normal Linux procedure for generating a keypair, and creating and securing a "~/.ssh/authorized_keys" file on Euramoo and / or Flashlite.
Use the PBS job submission system; see Q6.5.
Please do not run compute jobs on the Euramoo or Flashlite login nodes.
PBS is the generic name for the job scheduling system that is used on Euramoo, Flashlite and other HPC systems for managing the system's computational workload. Euramoo and Flashlite currently run the Torque version of PBS. This differs in some respects from the PBSPro version that was originally used on Barrine.
Reference: Portable Batch System.
Jobs are submitted using the "qsub" command. Typically, you write a PBS script that contains the job parameters and the commands the job will run, then use "qsub" to submit the job to the queue. The job parameters include:
- The "-A" parameter that specifies the account string for your job.
- One or more "-l" parameters that specify the resources required for the job. These can include:
  - a "walltime" parameter which puts an upper bound on the time that a job can run.
  - "mem" or "vmem" parameters which specify how much memory / virtual memory a job requires.
  - a "nodes" parameter that says how many nodes are required and how many cores each one should have.
- The name of the PBS script.
For more information on job submission, refer to the "qsub(1)" manual entry.
For hints on how to use qsub on Euramoo and Flashlite, run "qsub -I". This prints out the (brief) local usage information.
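As an illustration of the job parameters described above, here is a minimal sketch of a PBS script. The account string "UQ-RCC" and the resource values are taken from the interactive example elsewhere in this FAQ and must be replaced with your own; the job name "example-job" is a made-up placeholder. The "#PBS" lines are read by "qsub" and are ordinary comments to the shell.

```shell
#!/bin/bash
#PBS -A UQ-RCC
#PBS -N example-job
#PBS -l nodes=1:ppn=4:intel
#PBS -l mem=3gb,vmem=3gb
#PBS -l walltime=2:00:00

# PBS sets PBS_O_WORKDIR to the directory where "qsub" was run.
# Fall back to the current directory so the script also runs outside PBS.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir"

echo "Job ${PBS_JOBID:-(not under PBS)} started on $(uname -n) at $(date)"

# ... the commands that do the real work go here ...

echo "Job finished at $(date)"
```

If the script is saved as "example-job.pbs" (a hypothetical file name), it would be submitted with "qsub example-job.pbs".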
The account string determines which "account" your job usage is booked against. Valid account strings consist of letters and hyphens. The following table shows what valid account codes look like.
|Organization / Category||Systems||Account code pattern|
|UQ||Flashlite / Euramoo||UQ-XXX or UQ-XXX-XXX (not "qris-uq"!)|
|Other QRIScloud member organizations||Flashlite / Euramoo||qris-xxx|
|Other organizations||Flashlite / Euramoo||qris-other|
The account strings that you are authorized to use will correspond to one of the group names listed when you run the "groups" command on a login node. (Note that not all groups are valid account strings; check the table above.) If you are authorized to use more than one account string, please use the one that is most appropriate to the work you are currently doing.
You must specify the account string using the "-A" parameter, either in the PBS script or on the "qsub" command line.
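The following sketch shows one way to pick out candidate account strings from the output of "groups". The helper name "filter_account_strings" is hypothetical, and the patterns are only a reading of the table above (UQ-XXX, UQ-XXX-XXX, qris-xxx). A group that matches the pattern is not necessarily one you are authorized to use.

```shell
# Hypothetical helper: keep only the words that look like account strings.
# Reads group names (space or newline separated) on standard input.
filter_account_strings() {
    tr ' ' '\n' | grep -E '^(UQ-[A-Za-z]+(-[A-Za-z]+)?|qris-[A-Za-z]+)$'
}

# On a login node: list candidate account strings for the current user.
groups | filter_account_strings || echo "no matching groups found"
```

The chosen string then goes either in the script as a "#PBS -A <account-string>" line, or on the command line as "qsub -A <account-string> <script>".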
Watch this space.
The Euramoo system is populated with compute nodes that support 3 distinct hardware types (AMD, Intel and Intel + GPU), and two distinct operating system types. PBS node types are used to select the kind of compute node to run your job on.
The current Euramoo node types are as follows:
|System||Node type||Number of nodes||Cores per node||Memory per node||Description|
|Euramoo||intel||42||10||~118Gb||Intel cores / CentOS 6.x|
|Euramoo||amd||48||8||~30Gb||AMD cores / CentOS 6.x|
|Euramoo||biolinux||7||10||~118Gb||Intel cores / Ubuntu 12.04 LTS / BioLinux 7|
|Euramoo||biolinux||10||8||~30Gb||AMD cores / Ubuntu 12.04 LTS / BioLinux 7|
|Euramoo||gpu / render||4||16 + 1||~47Gb||Intel cores / CentOS 6.x + NVIDIA Tesla K20m GPU|
|Flashlite||n/a|| || || ||Node types currently not selectable on Flashlite|
The Euramoo node types are subject to change, and the number of nodes is fluid. See also Q6.5.13.
On Flashlite, the hardware and OS are more uniform, and node types are not used for selecting where your job will be run. Instead, PBS determines where to run your jobs based on the requested amount of memory, number of cores and other attributes.
The PBS scheduler on Euramoo has been configured to not allow this. A single job running on multiple nodes would most likely entail significant amounts of inter-node communication. This kind of job is not suitable for Euramoo.
You are required to specify "-l nodes=1:..." as a resource parameter for PBS jobs on Euramoo.
It sounds like a contradiction in terms, but an interactive job is how you get an interactive shell on a compute node. For example, on Euramoo:
$ qsub -I -A UQ-RCC -l nodes=1:ppn=4:intel,mem=3gb,vmem=3gb,walltime=2:00:00
The above gives you an interactive shell on a node with 4 intel cores, 3Gb of memory and a time-limit of 2 hours. (Replace the account string with one that you are permitted to use!)
- The minimum and maximum permitted "walltime" resources are limited for an interactive job.
- You must specify both a "vmem" and "mem". (The "mem" and "vmem" requirement is due to an issue with the qsub filter for interactive jobs.)
- The node count must always be 1.
- It is not possible to submit a job from another job. This means that running "qsub" in an interactive job will fail.
At the moment we don't have a collection of examples, but we are working on it.
Running "qstat -a" tells you about any jobs of yours that are queued or running. The "E" file for your job will tell you whether a particular job has succeeded or failed, and what resources it actually used.
If you submit a job requesting more cores or memory than is available, your job is liable to get stuck in the queue indefinitely. Therefore, it is important to understand what is available.
The limits for Euramoo cores (processors) are straightforward. The table above lists the maximum number of CPUs available for each node type. You just need to avoid requesting a "ppn" value that is greater than the number of CPUs.
The limits for Euramoo memory are more complicated. The nominal memory sizes listed above are the amount of physical RAM that is allocated to each compute node virtual machine. The VMs then use that memory for running the operating system itself, various system processes, and all processes launched as part of your PBS job. The issue is that the amount of memory used by the operating system and the system processes is variable.
When the batch scheduler decides whether a job can be run, it looks at the amount of memory and virtual memory that the node's operating system reports as free. So for example, if you requested to run a job with "vmem=32g" on an "amd" node, the scheduler would wait until there was a node with that much virtual memory >>free<<. Given the normal overheads, that is unlikely to ever happen.
The solution is to allow for the overheads in your request. Conservatively, it is unlikely that the memory overheads will exceed 2gb, so if you ask for 30gb or less on an "amd" node or 118gb or less on an "intel" node, then your jobs should run.
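The arithmetic can be sketched as follows. The helper name "safe_mem_gb" is hypothetical, the 2gb overhead is the conservative figure quoted above, and the 32gb physical size for an "amd" node is an assumption based on the "vmem=32g" example.

```shell
# Hypothetical helper: subtract an overhead allowance from a node's
# physical memory (both in whole gigabytes).
# Usage: safe_mem_gb <physical_gb> [overhead_gb]
safe_mem_gb() {
    local physical_gb=$1
    local overhead_gb=${2:-2}   # conservative allowance for OS overheads
    echo $(( physical_gb - overhead_gb ))
}

# Assumed 32gb "amd" node: request 30gb rather than the full 32gb.
mem_request="$(safe_mem_gb 32)gb"
printf '%s\n' "-l nodes=1:ppn=8:amd,mem=${mem_request},vmem=${mem_request}"
```

The printed string is the resource request that would be passed to "qsub" for a job wanting all of the cores and as much memory as can safely be requested on such a node.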
The other possibility is that the system (Euramoo / Flashlite) is simply very busy. You can use the "qstat -q" command to get a summary of the numbers of jobs for each queue type. The "qstat -Q" command displays similar information. (Please refer to "man qstat" for more details.)
A final possibility is that the scheduler has put your jobs on hold. This sometimes happens when the system is busy and you have a job array queued. If you suspect that this is the problem, please raise a support ticket.
Sometimes a Euramoo or Flashlite compute node will go offline. When this happens, PBS will be unable to contact the "mom" service on the node to check the state of jobs, or perform control actions on them. The upshot is that "qdel" on a job on an offline node doesn't do anything.
There are many reasons that a Euramoo / Flashlite application could be crashing. Too many to enumerate. However, here are some starting points for diagnosing problems.
- Read and try to understand any error messages written to the job's "E" and "O" files, or to application-specific logfiles. There are often important clues in the message that point to the underlying problem.
- Check the application's documentation for hints on trouble-shooting.
- Google the error messages.
- If your application is crashing silently, one possible explanation is that it is attempting to use more memory than your job requested.
Under normal circumstances, if your SSH session to Euramoo is disconnected while you are running an interactive job, the job will be terminated immediately. (This could happen because you need to restart your computer, or because your computer has been unplugged or disconnected from the wireless network.)
If you think ahead, you can avoid this by using the "screen" utility as follows:
- Login to a specific Euramoo login node; e.g. euramoo1.qriscloud.org.au. (Or euramoo2 or euramoo3.)
- On the login node, start an interactive shell running under "screen"; for example:
user@euramoo1:-> screen bash
- From the screen session, use "qsub" to request / start your interactive job; for example:
user@euramoo1:-> qsub -I -A ... -l ...
- Do your work on the compute node.
- When you are finished with the interactive job:
  - Exit the interactive job by typing "exit" or CTRL-D.
  - Exit the screen session by typing "exit" or CTRL-D ... again.
  - Logout from Euramoo by typing "exit" or CTRL-D ... again.
If you need to disconnect, or if your network connection drops out, you reconnect to your session as follows:
- Login to the same Euramoo login node you ran "screen" on.
- Reconnect to the screen session by running the following:
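A minimal sketch of the reconnect step, assuming the standard GNU "screen" options ("-ls" to list sessions, "-r" to reattach). The helper name "reattach_screen" is hypothetical; the two commands inside it can equally be typed by hand.

```shell
# Hypothetical helper: show the screen sessions on this login node, then
# reattach. Pass a session id as an argument if more than one is listed.
reattach_screen() {
    screen -ls          # list sessions and their ids
    screen -r "$@"      # reattach to the (named) detached session
}
```

Run "reattach_screen" on the login node where the session was started. If screen reports that the session is still attached (for example, because the dropped connection has not yet been noticed), "screen -d -r" detaches it first and then reattaches.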
Obviously, if you leave it too long after disconnecting, your interactive job will have gone past its walltime and will have been killed. But if it is still running, you should now be reconnected to it.
Various details of the PBS resource limits and scheduler behaviour change from time to time. There are a couple of PBS commands that an ordinary user can use to get some insights into this:
- Running "qstat -q" and "qstat -Q" shows queue summaries, and some of the relevant resource limits.
- Running "qmgr -c "print server" | less" shows the current PBS configuration parameters.
Unfortunately, a lot of the information displayed by those commands is difficult to understand, and there is little useful documentation to help you understand. Also, many aspects of PBS / scheduler behavior are controlled by other things: the above commands only show part of the picture.
Both Flashlite and Euramoo are set up with relatively small user home directories in "/home/<user>" and larger short-term filespaces in "/30days/<user>" and "/90days/<user>". Files in the short-term trees may be automatically deleted 30 or 90 days after creation; see Q6.6.9.
Each job also has access to a large amount of fast local scratch space while it is running. See below.
For more details, please refer to the Euramoo User Guide and the Flashlite User Guide.
We can arrange for larger shared file spaces to be set up to facilitate specific research projects; see Q6.2.3 above.
WARNING: there are NO BACKUPS for any user or project files on either Flashlite or Euramoo.
The single copy of your files on the disk is the only copy that there is:
- If you or an application that you run overwrites or deletes files, they are gone forever.
- If the operators accidentally delete files, they are gone forever.
- If a hacker breaks in and damages files, they are gone forever.
- If there is a hard disk failure that is beyond the capabilities of the RAID system (or similar) to deal with, the affected files are gone forever.
- If there is a fire in the data centre, all of your Euramoo / Flashlite files may be gone forever.
There are no "backup tapes" to recover from. No matter how nicely you ask us, and no matter how important files are to your research and/or your job security. Beware!
(There are a number of reasons why none of our HPC or HTC systems offer backups of user files. Running a backup system for a ~50TB file system requires a lot of resources, both in terms of media and I/O bandwidth. Getting a consistent backup of a file system where jobs are running 24/7 is impossible. Restoring files from backups is labor intensive.)
You can find out how much file space is available to you by running either "quota" or "/usr/local/bin/rquota". This can show you your own personal file quotas, and the quotas for project groups that you belong to. (Use the "-g <group>" option for group quotas.)
Note that there are distinct quotas for file space and file count.
Generally no. We discourage people from using Euramoo and Flashlite file systems for keeping large amounts of personal data. But see Q6.2.3 above.
Generally no. The file count quotas are intended to stop people from organizing and storing their data as lots of tiny files. Doing that causes a range of problems, and is not conducive to efficient use of the computing resources.
We recommend using a lightweight database, a more sophisticated flat-file organization, or "zipping" the small files into ZIP or TAR archives.
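As a sketch of the archive approach, the following packs a directory of small files into a single compressed TAR archive, which then counts as one file against the file count quota. The directory and file names are made up for the demonstration.

```shell
# Create a throwaway directory containing a few small files.
workdir=$(mktemp -d)
mkdir "$workdir/samples"
for i in 1 2 3 4 5; do
    echo "sample $i" > "$workdir/samples/sample_$i.txt"
done

# Pack the whole directory into one compressed archive.
tar -czf "$workdir/samples.tar.gz" -C "$workdir" samples

# List the archive contents without unpacking it.
tar -tzf "$workdir/samples.tar.gz"

# Once the archive has been checked, remove the original small files.
rm -r "$workdir/samples"
```

To get the files back, "tar -xzf samples.tar.gz" unpacks the archive into the current directory; a job can do this in its scratch area rather than in a quota-limited file space.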
Yes. See FAQ 6.2.3 above.
If you want other people in your team to be able to access your files, they should be created with the appropriate group and the appropriate access modes. The "chmod" and "chgrp" commands can be used to adjust the group and modes of your files, and the "newgrp" command allows you to temporarily change your primary group.
(The access rules for collections work differently; see FAQ 6.7.1.)
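A minimal sketch of sharing a directory with a group using "chgrp" and "chmod". To keep the example runnable anywhere it uses a temporary directory and your current primary group; on Euramoo or Flashlite you would substitute your own directory and the project group created for your team.

```shell
# Stand-ins for a real shared directory and a real project group.
share_dir=$(mktemp -d)
team_group=$(id -gn)            # assumption: replace with your project group
touch "$share_dir/results.txt"

# Give the directory tree to the group.
chgrp -R "$team_group" "$share_dir"

# Group members may read files and enter directories; others get no access.
chmod -R g+rX,o-rwx "$share_dir"

# Set-group-ID on the directory: files created in it later inherit the group.
chmod g+s "$share_dir"

ls -ld "$share_dir" "$share_dir/results.txt"
```

Adding "g+w" as well would let group members modify the files; whether that is appropriate depends on how your team works.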
Keep a copy of any data that you cannot afford to lose on a separate system that has regular backup procedures in place. We recommend that you talk to your local IT support staff about this. They may provide a place to store your files that will be properly backed up.
You can use the SSH-based file transfer protocols to upload and download files.
For file transfers, we recommend that you connect directly to the login nodes (euramoo1, flashlite1, etcetera) rather than connecting via the load balancers (euramoo, flashlite):
- The load balancers cannot sustain as high a data transfer rate as the login nodes.
- The load balancers have rate limiting in place to throttle the number of parallel connections, and the rate at which new connections are established. This can interfere with bulk file transfers.
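As an illustration, here is a sketch of an upload that goes directly to a named login node. The helper name "upload_to_euramoo" is hypothetical, and the "/30days/<user>" destination is assumed from the file space layout described elsewhere in this FAQ.

```shell
# Hypothetical helper: copy a local directory to your /30days area using
# a specific login node (euramoo1) rather than the load balancer.
# Usage: upload_to_euramoo <local-directory> [remote-username]
upload_to_euramoo() {
    local src_dir=$1
    local user=${2:-$USER}
    # -a preserves times and modes, -v lists files as they are copied,
    # --partial keeps partly transferred files so a rerun can resume them.
    rsync -av --partial "$src_dir" \
        "${user}@euramoo1.qriscloud.org.au:/30days/${user}/"
}
```

For example, "upload_to_euramoo ./dataset myname" would copy the local "dataset" directory to "/30days/myname/dataset". Plain "scp -r" or "sftp" to the same hostname works too; "rsync" is used here because an interrupted transfer can simply be rerun.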
Each compute node on Euramoo and Flashlite has a significant amount of scratch file storage space that a job can use. Each job gets its own scratch area. Use $TMPDIR to access it.
Note: a job's scratch area is cleaned automatically when the job ends. If there are files that need to be saved, your job script should copy them to one of the other file spaces.
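The pattern can be sketched as a fragment of a job script. The "RESULTS_DIR" variable and the file names are made up for illustration; in a real job the results directory would be somewhere under "/30days/<user>" or "/90days/<user>", and the "tr" command stands in for the real computation.

```shell
# Inside a PBS job, TMPDIR points at the job's scratch area. The /tmp
# fallback only exists so that this fragment can be tried outside a job.
scratch="${TMPDIR:-/tmp}"
job_tmp=$(mktemp -d "$scratch/work.XXXXXX")

# Assumption: RESULTS_DIR names a directory that outlives the job.
results_dir="${RESULTS_DIR:-$(mktemp -d)}"
mkdir -p "$results_dir"

# Do the I/O-heavy work in scratch.
cd "$job_tmp"
echo "intermediate data" > work.tmp
tr 'a-z' 'A-Z' < work.tmp > output.txt

# Copy what must be kept before the job ends and scratch is cleaned.
cp output.txt "$results_dir/"
```

Only "output.txt" is copied out; intermediate files such as "work.tmp" are left behind in scratch.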
In some cases, yes. In other cases, no. Read on ...
You can access QRISdata collections from Euramoo and Flashlite.
- You can use SSH-enabled file transfer utilities like "scp", "sftp" or "rsync" to transfer files to and from a QRISdata RDSI collection. We recommend that you transfer files to your "/30days" or "/90days" directory, or to local scratch space on the compute nodes themselves (if applicable).
- Euramoo and Flashlite users now have direct access to QRISdata collections that are configured for "Standard Access". The collection tree will be available as "/RDS/Qnnnn" where "Qnnnn" is the collection's ID. You need to request permission to access a collection from the respective collection owner.
In theory, you could access NeCTAR (or other) Object Storage. If you need to do this, please contact QRIScloud Support for advice on how to proceed.
You cannot directly access regular NeCTAR (or other) Volume Storage from Euramoo or Flashlite. If you have data in QRIScloud-based volume storage that you need to access, we recommend that you attach the volume to a NeCTAR instance, and then use an SSH-based file transfer tool to copy files between the instance and Euramoo or Flashlite.
Medici is a (currently) experimental project which aims to make the same files available transparently in multiple places. The basic idea is to cache copies of files near to where they are needed, and use highspeed networking to copy files and keep them in sync.
When Medici becomes a production system, it will allow Flashlite and Euramoo to access files stored in various centres in UQ St Lucia, and at other Queensland Universities.
UQ Research Computing Centre (RCC) runs free HPC training courses for Flashlite and Euramoo users on the UQ St Lucia Campus. These are open to all Flashlite and Euramoo users.
- RCC Euramoo introductory training sessions are normally held on the last Friday of each month, from 2pm to 5pm.
- Other RCC HPC training sessions are advertised in the RCC Training page.
- Please contact <firstname.lastname@example.org> to register for a training session.
Unless you are specifically authorized to do this, the answer is No. If you need to use software that requires this, please contact QRIScloud support.
Barrine has been the "workhorse" HPC system for the University of Queensland since 2010. The hardware is now off-maintenance and it is not cost effective to continue running it. The system is being run in a degraded mode until the end of 2015 and RCC intends to decommission it in early 2016.
**UPDATE** - As of late Feb 2016, Barrine no longer runs jobs, but there is still a login node for retrieving user files stored on the system. User files have all been moved to HSM.
The replacement for Barrine is a 6,000 core HPC system called Tinaroo. Please refer to the RCC Tinaroo page for more details, and instructions on how to apply for access. Access is restricted to UQ students and staff only.
We have an arrangement with Mathworks to allow users of Euramoo to make use of their home institutions' Matlab site license on Euramoo. So far, this has been implemented for University of Queensland, University of Southern Queensland, Griffith University, and James Cook University. (For others, your institution needs to provide us access to their Matlab license server.)
For Flashlite (and Tinaroo), only UQ users are currently able to use Matlab.
To use Matlab interactively, you should:
- Start an interactive job (see Q6.5.6) on an Intel or AMD node. (We recommend you request at least 4gb of memory to run matlab.)
- Use "module load matlab" to load Matlab. (If you need a specific version, use "module avail" to check if it is available, and select it specifically.)
- Run the "matlab" command.
All Euramoo and Flashlite users can run compiled Matlab using the "mcr" command.
(There is an approved way to get an "interactive" session on a compute node; see Q6.5.6. We are talking about something different here. We are talking about how to investigate what a non-interactive job is doing.)
It is technically possible to connect to a compute node, but we discourage it ... unless you are doing it for the right reasons:
- If you do not have a batch or interactive PBS job currently running on a compute node, then it is NOT OK to connect to it.
- If you do have a job running on a node, then:
- it is OK to connect to the node to check its progress
- at a pinch, it is OK to run other commands with significant resource demands ... provided that your PBS job has "exclusive" use of the node.
When you connect to a compute node directly, there is an elevated risk that what you do on the node will interfere with other users' jobs:
- If you run compute intensive tasks that PBS doesn't know about, that could "steal" CPU cycles from other users, and cause their jobs to exceed their wall-time.
- If you run memory hungry processes that cause the node to get short of memory, it could trigger the "OOM killer" to kill processes that belong to other people's jobs.
In order to connect to a compute node, you need to use a public / private key to enable SSH from your account on a login node to the compute nodes; see Q6.8.7 for instructions. Then Q6.8.8 explains how to connect.
This assumes that you have not configured your Euramoo (or Flashlite) account to allow SSH access.
- Login to a Euramoo / Flashlite login node.
- Make sure that your home directory cannot be modified by anyone else.
$ chmod go-w ~
- Create a ".ssh" directory in your home directory, and make it private. Nobody else should be able to read or write it.
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
- Generate an SSH keypair.
$ cd ~/.ssh
$ ssh-keygen -t rsa -C "$USER - internode key" -f id_rsa
- Copy the public key to your "authorized keys" file.
$ cat id_rsa.pub >> authorized_keys
$ chmod 600 authorized_keys
WARNING: DO NOT copy the SSH keypair that you generated (as above) to another system. This keypair should ONLY be used for SSH >within< the specific cluster on which you generated it.
NOTE: If you get the permissions on "~", "~/.ssh" and/or the files in "~/.ssh" incorrect, the SSH client or server are liable to prevent you from using the keys to login. (Typically, they won't tell you why!)
Assuming that you have set up your SSH keys correctly (see Q6.8.7), then the procedure for logging into a compute node is as follows:
- SSH to a Euramoo or Flashlite login node in the normal way.
- On the login node, identify the hostname of the compute node where your job is (or was) running. For example, the following can be used to lookup the hostname for a running job with job id "123":
$ qstat -f 123 | grep exec_host
    exec_host = eura51-intel/9
The hostname consists of the characters before the slash.
- On the login node, use "ssh" to connect to the compute node. For example:
$ ssh eura51-intel
If you have multiple keys in your "~/.ssh" directory, you can use "-i" to select a specific key:
$ ssh -i ~/.ssh/id_rsa eura51-intel
Note: you should only try to login to a compute node from a login node. Login to a compute node from outside of the cluster (Flashlite or Euramoo respectively) is blocked by the cluster firewall rules.
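The hostname lookup described above can be scripted. The sketch below uses a hypothetical helper, "exec_host_to_hostname", which extracts the characters before the first slash from an "exec_host" line; it is demonstrated on the sample line from this FAQ rather than on live "qstat" output.

```shell
# Hypothetical helper: read "qstat -f" output on standard input and print
# the hostname part of the exec_host line (everything before the first "/").
exec_host_to_hostname() {
    sed -n 's/^[[:space:]]*exec_host = \([^/]*\)\/.*$/\1/p'
}

# Demonstration using the sample line shown above.
sample='    exec_host = eura51-intel/9'
node=$(printf '%s\n' "$sample" | exec_host_to_hostname)
echo "ssh $node"
```

On a login node the same helper would be fed from the real command, e.g. "qstat -f 123 | exec_host_to_hostname", where "123" is your job id.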
If you have a passphrase on your ssh key, then it may be convenient to use "ssh-agent" so that you only need to unlock your key once at the start of your Euramoo / Flashlite session:
$ ssh-agent bash
$ ssh-add <private-key-pathname>
You will be prompted for the passphrase when you add the key to the agent.
The short answer is to use SSH X11 tunneling and forwarding. Here are some generic instructions on how to do this:
You will also need X11 server software installed and configured on the machine where you are displaying your results; e.g. your PC, Laptop, Workstation or whatever.
- For Mac OS X, install XQuartz; X11 has not been bundled with OS X since version 10.8.
- For Linux systems running a standard Desktop environment (Gnome, KDE, etc), X11 support is pre-installed and configured. (If you are not running a Desktop environment, the simple solution is to install one.)
- For Windows there are various alternatives: for example Xming, Cygwin/X, MobaXterm, XWin32, VcXsrv.
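With an X11 server running locally, the forwarding itself is requested with the "-X" option of the OpenSSH client. The sketch below wraps this in two hypothetical helpers: one for logging in, and one for checking, once logged in, that the DISPLAY variable has been set by the forwarding.

```shell
# Hypothetical helper: log in with X11 forwarding enabled.
# Usage: ssh_with_x11 <username>@euramoo.qriscloud.org.au
ssh_with_x11() {
    ssh -X "$@"
}

# Hypothetical helper: succeeds if DISPLAY is set, which is what SSH does
# on the remote side when X11 forwarding has been established.
x11_forwarding_active() {
    [ -n "${DISPLAY:-}" ]
}
```

After logging in, running "x11_forwarding_active && echo ok" (or simply "echo $DISPLAY") gives a quick indication of whether graphical programs will be able to open windows on your local machine.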