Section 6: Using Awoonga, Flashlite and Tinaroo
Other sources of information:
- RCC's documentation for Awoonga, Flashlite and Tinaroo.
- QCIF's Awoonga User Guide can be found here.
- Some user documentation is available using "module help" on the respective systems. Run "module avail" to list the available documentation modules, and "module help <module-name>" to see the documentation; e.g. "module help README".
Q6.1 - What are Awoonga, Flashlite and Tinaroo?
Awoonga and Flashlite are managed compute services offered by QCIF and UQ RCC. Tinaroo is a similar (but larger) system that is operated by UQ RCC exclusively for UQ members. All three systems provide login accounts for individual users, varying amounts of "operational" file storage, and a PBS batch job submission system.
All three systems have a significant amount of computational capacity (cores, etcetera). However, unlike NeCTAR, the model is that the user does not have exclusive use or guaranteed capacity.
All three systems have access to QRICloud RDS collections in Standard Access mode.
Flashlite is an HPC or "high performance computing" system aimed at large-scale data intensive jobs that can benefit from large amounts of memory or large amounts of high-speed working "disk" storage. Flashlite has high-speed networking, which means that MPI / OpenMPI jobs spanning multiple compute nodes should perform well.
Awoonga is aimed at job mixes where the individual jobs require 24 cores or less and 254GB of RAM or less. The compute nodes have a relatively small and relatively slow "scratch" file system, and low-speed networking. It does not support MPI jobs spanning multiple nodes.
We often refer to Awoonga as an HTC or "high throughput computing" system rather than an HPC system.
Tinaroo is an HPC system that is similar to Awoonga (i.e. 24 core / 254GB nodes) except that it has fast networking, faster processors, and more compute nodes. This is the preferred system for large-scale compute-intensive applications that can exploit MPI beyond 24 cores.
These systems are NOT suitable for running the following:
- Web "portals".
- Anything requiring guaranteed "instantly available" compute capacity.
- Extremely long-running computational tasks. There is a 2 week wall-clock time limit for jobs on all HPC systems.
- Anything that involves running Windows applications.
The links for applying for Awoonga and Flashlite access are on the QRIScloud Services page. If you need to use both Awoonga and Flashlite, please make two separate applications. These resources have different approval procedures.
The link for applying for Tinaroo access (UQ only!) is on the RCC website.
Awoonga and Flashlite access can be granted to researchers at QCIF member and affiliate organizations. However, Flashlite access is limited to people who can demonstrate that their application needs Flashlite's special capabilities.
Tinaroo access is limited to UQ staff, students and affiliates.
We would like you to provide the following:
- A brief description of your research from a scientific perspective.
- A brief technical description of the kind of computations that you are intending to perform, including typical core and memory requirements, indicative job counts and durations, and temporary file storage requirements.
- Any applications that you intend to use. (But see Q6.3 etc below.)
- Requirements to access "group" resources.
You can request the creation of a Linux "group" to facilitate sharing of files with co-researchers. We will also consider requests for a shared directory with an associated file space quota:
- to enable sharing research data, and
- to facilitate installation of shared resources such as software or reference databases.
However, we do not normally provide "shared" accounts for running jobs on HPC, and we do not condone unauthorized sharing; e.g. a research assistant using his or her professor's account to run more jobs at a time.
You can find out what applications are currently available using the "module avail" command. We do not maintain a catalog.
In general, you are free to build and install free applications in your home directory, provided that they are appropriate for the system, and don't have licensing constraints.
If you run into problems, we may be able to answer questions. However, we would strongly encourage you to try to solve them yourself first; e.g. by reading the build / install documentation, and Googling any error messages.
Generally speaking, no. Neither QCIF nor RCC has sufficient resources to provide a software installation service.
We encourage users to "self help" and to work together to install and maintain the applications that they use. We can facilitate this by providing file-spaces for people who want to be the "build master" for a group of applications.
First, check the following document to see if it answers your questions:
If not, then raise a QRIScloud support ticket, and we will advise if we are able.
The other places to look for help would be the software's support forum or mailing list (search the archives), and general web search; i.e. "google it".
Please lodge a QRIScloud support request, giving details of the software. The answer to this question will depend on a number of factors.
No. You cannot run Windows applications on these HPC systems. These systems are all Linux based.
You need to use SSH to login to Flashlite or Euramoo, supplying the username and password for your QSAC credentials. We recommend that you use the hostnames for the load-balancer nodes for logging in:
- Euramoo - "euramoo.qriscloud.org.au"
- Flashlite - "flashlite.rcc.uq.edu.au"
Users with a QSAC account name that matches their UQ account name can use their UQ password in place of their QSAC password when logging into Flashlite and Euramoo.
There are a few reasons login might not be working. For example:
- You could be using the wrong username and password, or the wrong SSH key.
- For Flashlite (and Tinaroo), if your account has just been enabled or your QSAC credentials have just been created, the changes can take up to 2 hours to propagate. During that period, login may fail.
- Your access could be temporarily blocked because you (or someone else) has had too many failed attempts to login from "your" IP address. If you are using a network connection on your institution's internal network, the IP address of the institution's NAT is the one that matters.
- There could be an outage; check the QRIScloud Announcements page for details of any current outages.
There is more information on the QRIScloud Known Issues page.
Your QSAC password can be reset from the QRIScloud Portal via the "MyCredentials" page. It is not possible to set the password to something more memorable.
If you are using your UQ password to login, refer to the UQ ITS website for instructions on how to change or reset that password.
Yes you can. Follow the normal Linux procedure for generating a keypair, and creating and securing a "~/.ssh/authorized_keys" file on the HPC systems.
Use the PBS job submission system; see Q6.5.
Please do not run compute jobs on the HPC login nodes.
PBS is the generic name for the job scheduling system that is used on Awoonga, Flashlite and Tinaroo for managing the system's computational workload.
There are three versions of PBS:
- OpenPBS is the original open source version which is not actively maintained.
- TORQUE is a fork of OpenPBS that was developed and maintained by Adaptive Computing Enterprises, Inc. (formerly Cluster Resources, Inc.)
- PBS Pro is the most recent version. Until recently, it was only available under a commercial license. This has now changed and PBS Pro is dual licensed.
Awoonga, Flashlite and Tinaroo previously ran TORQUE. As of November 2017, they are in the process of being switched to PBS Pro.
Jobs are submitted using the "qsub" command. Typically, you write a PBS script that contains the job parameters and the commands the job will run. Then you use "qsub" to submit the job to the queue. The job parameters include:
- The "-A" parameter that specifies the account string for your job.
- One or more "-l" parameters that specify the resources required for the job. These can include:
  - a "walltime" parameter which puts an upper bound on the time that a job can run.
  - "mem" or "vmem" parameters which specify how much memory / virtual memory a job requires.
  - a "nodes" parameter that says how many nodes are required and how many cores each one should have.
- The name of the PBS script.
For more information on job submission, refer to the "qsub(1)" manual entry.
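As an illustration of the parameters described above, here is a minimal sketch of a PBS script, written out using a shell here-document. The account string and resource values are copied from the interactive job example elsewhere in this FAQ. The job name, the file name "myjob.pbs" and the final "echo" command are placeholders, and on a system that uses node types you would append the node type to the "nodes" value. Treat it as a starting point rather than a recommended configuration.

```shell
# Write a minimal PBS script to "myjob.pbs" (hypothetical file name).
cat > myjob.pbs <<'EOF'
#!/bin/bash
#PBS -N myjob
#PBS -A UQ-RCC
#PBS -l nodes=1:ppn=4
#PBS -l mem=3gb,vmem=3gb
#PBS -l walltime=2:00:00

# PBS starts the job in your home directory; change to the directory
# that "qsub" was run from.
cd "$PBS_O_WORKDIR"

# Replace this with the commands that the job should run.
echo "Running on $(hostname)"
EOF

# Submit it from a login node with:
#   qsub myjob.pbs
```

Replace the account string with one that you are permitted to use before submitting.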
For hints on how to use qsub on Euramoo and Flashlite, run "qsub -I". This prints out the (brief) local usage information.
The account string determines which "account" your job usage is booked against. Valid account strings consist of letters and hyphens. The following table shows what valid account codes look like.
|Organization / Category||Systems||Account code pattern|
|UQ||Flashlite / Euramoo||UQ-XXX or UQ-XXX-XXX (not "qris-uq"!)|
|Other QRIScloud member organizations||Flashlite / Euramoo||qris-xxx|
|Other organizations||Flashlite / Euramoo||qris-other|
The account strings that you are authorized to use will correspond to one of the group names listed when you run the "groups" command on a login node. (Note that not all groups are valid account strings; check the table above.) If you are authorized to use more than one account string, please use the one that is most appropriate to the work you are currently doing.
You must specify the account string using the "-A" parameter, either in the PBS script or on the "qsub" command line.
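To narrow down which of your groups might be usable as account strings, a small filter along the following lines may help. This is only a sketch: the function name is ours, and the patterns are a loose reading of the table above ("UQ-XXX", "UQ-XXX-XXX" and "qris-xxx"). A group that matches is not guaranteed to be a valid account string.

```shell
# Read group names (space or newline separated) on standard input and
# print the ones that look like account strings, one per line.
candidate_account_strings() {
    tr ' ' '\n' | grep -E '^(UQ(-[A-Za-z]+){1,2}|qris-[A-Za-z]+)$'
}

# Usage on a login node (prints nothing if no group matches):
groups | candidate_account_strings || true
```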
Watch this space.
The Euramoo system is populated with compute nodes that support 3 distinct hardware types (AMD, Intel and Intel + GPU), and two distinct operating system types. PBS node types are used to select the kind of compute node to run your job on.
The current Euramoo node types are as follows:
|System||Node type||Nodes||CPUs per node||Memory per node||Hardware / operating system|
|Euramoo||intel||42||10||~118Gb||Intel cores / CentOS 6.x|
|Euramoo||amd||48||8||~30Gb||AMD cores / CentOS 6.x|
|Euramoo||biolinux||7||10||~118Gb||Intel cores / Ubuntu 12.04 LTS / BioLinux 7|
|Euramoo||biolinux||10||8||~30Gb||AMD cores / Ubuntu 12.04 LTS / BioLinux 7|
|Euramoo||gpu / render||4||16 + 1||~47Gb||Intel cores / CentOS 6.x + NVIDIA Tesla K20m GPU|
|Flashlite||n/a||Node types currently not selectable on Flashlite|
The Euramoo node types are subject to change, and the number of nodes is fluid. See also Q6.5.13.
On Flashlite, the hardware and OS are more uniform, and node types are not used for selecting where your job will be run. Instead, PBS determines where to run your jobs based on the requested amount of memory, number of cores and other attributes.
The PBS scheduler on Euramoo has been configured to not allow this. A single job running on multiple nodes would most likely entail significant amounts of inter-node communication. This kind of job is not suitable for Euramoo.
You are required to specify "-l nodes=1:..." as a resource parameter for PBS jobs on Euramoo.
It sounds like a contradiction in terms, but an interactive job is how you get an interactive shell on a compute node. For example, on Euramoo:
$ qsub -I -A UQ-RCC -l nodes=1:ppn=4:intel,mem=3gb,vmem=3gb,walltime=2:00:00
The above gives you an interactive shell on a node with 4 intel cores, 3Gb of memory and a time-limit of 2 hours. (Replace the account string with one that you are permitted to use!)
- The minimum and maximum permitted "walltime" resources are limited for an interactive job.
- You must specify both a "vmem" and "mem". (The "mem" and "vmem" requirement is due to an issue with the qsub filter for interactive jobs.)
- The node count must always be 1.
- It is not possible to submit a job from another job. This means that running "qsub" in an interactive job will fail.
At the moment we don't have a collection of examples, but we are working on it.
Running "qstat -a" tells you about any jobs of yours that are queued or running. The "E" file for your job will tell you whether a particular job has succeeded or failed, and what resources it actually used.
If you submit a job requesting more cores or memory than is available, your job is liable to get stuck in the queue indefinitely. Therefore, it is important to understand what is available.
The limits for Euramoo cores (processors) are straightforward. The table above lists the maximum number of CPUs available for each node type. You just need to avoid requesting a "ppn" value that is greater than the number of CPUs.
The limits for Euramoo memory are more complicated. The nominal memory sizes listed above are the amount of physical RAM that is allocated to each compute node virtual machine. The VMs then use that memory for running the operating system itself, various system processes, and all processes launched as part of your PBS job. The issue is that the amount of memory used by the operating system and the system processes is variable.
When the batch scheduler decides whether a job can be run, it looks at the amount of memory and virtual memory that the node's operating system reports as free. So for example, if you requested to run a job with "vmem=32g" on an "amd" node, the scheduler would wait until there was a node with that much virtual memory >>free<<. Given the normal overheads, that is unlikely to ever happen.
The solution is to allow for the overheads in your request. Conservatively, it is unlikely that the memory overheads will exceed 2gb, so if you ask for 30gb or less on an "amd" node or 118gb or less on an "intel" node, then your jobs should run.
The other possibility is that the system (Euramoo / Flashlite) is simply very busy. You can use the "qstat -q" command to get a summary of the numbers of jobs for each queue type. The "qstat -Q" command displays similar information. (Please refer to "man qstat" for more details.)
A final possibility is that the scheduler has put your jobs on hold. This sometimes happens when the system is busy and you have a job array queued. If you suspect that this is the problem, please raise a support ticket.
Sometimes a Euramoo or Flashlite compute node will go offline. When this happens, PBS will be unable to contact the "mom" service on the node to check the state of jobs, or perform control actions on them. The upshot is that "qdel" on a job on an offline node doesn't do anything.
There are many reasons that a Euramoo / Flashlite application could be crashing. Too many to enumerate. However, here are some starting points for diagnosing problems.
- Read and try to understand any error messages written to the job's "E" and "O" files, or to application-specific logfiles. There are often important clues in the message that point to the underlying problem.
- Check the application's documentation for hints on trouble-shooting.
- Google the error messages.
- If your application is crashing silently, one possible explanation is that it is attempting to use more memory than your job requested.
Under normal circumstances, if your SSH session to Euramoo is disconnected while you are running an interactive job, the job will be terminated immediately. (This could happen because you need to restart your computer, or because your computer has been unplugged or disconnected from the wireless network.)
If you think ahead, you can avoid this by using the "screen" utility as follows:
Login to a specific Euramoo login node; e.g. euramoo1.qriscloud.org.au. (Or euramoo2 or euramoo3.)
On the login node, start an interactive shell running under "screen"; for example
user@euramoo1:-> screen bash
From the screen session, use "qsub" to request / start your interactive job; for example
user@euramoo1:-> qsub -I -A ... -l ...
Do your work on the compute node.
When you are finished with the interactive job:
Exit the interactive job by typing "exit" or CTRL-D.
Exit the screen session by typing "exit" or CTRL-D ... again.
Logout from Euramoo by typing "exit" or CTRL-D ... again.
If you need to disconnect, or if your network connection drops out, you reconnect to your session as follows:
Login to the same Euramoo login node you ran "screen" on.
Reconnect to the screen session by running the following:
Obviously, if you leave it too long after disconnecting, your interactive job will have gone past its walltime and will have been killed. But if it is still running, you should now be reconnected to it.
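A sketch of the reconnect step, assuming that you have a single "screen" session on that login node, is shown below. It uses the standard "screen -ls" and "screen -d -r" options; the "reattach" function name is ours.

```shell
# Reattach to a screen session on the current login node.
# With no argument this works when there is exactly one session;
# otherwise pass the session id shown by "screen -ls".
reattach() {
    screen -ls                # list your sessions on this node
    screen -d -r "$@"         # detach it elsewhere if needed, then reattach
}
```

Run "reattach" after logging in to the same login node, or "reattach <session-id>" if "screen -ls" lists more than one session.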
Various details of the PBS resource limits and scheduler behaviour change from time to time. There are a couple of PBS commands that an ordinary user can use to get some insights into this:
- Running "qstat -q" and "qstat -Q" show queue summaries, and some of the relevant resource limits.
- Running "qmgr -c "print server" | less" shows the current PBS configuration parameters.
Unfortunately, a lot of the information displayed by those commands is difficult to understand, and there is little useful documentation to help you understand. Also, many aspects of PBS / scheduler behavior are controlled by other things: the above commands only show part of the picture.
Both Flashlite and Euramoo are set up with relatively small user home directories in "/home/<user>" and larger short-term filespaces in "/30days/<user>" and "/90days/<user>". Files in the short-term trees may be automatically deleted 30 or 90 days after creation; see Q6.6.9.
Each job also has access to a large amount of fast local scratch space while it is running. See below.
For more details, please refer to the Euramoo User Guide and the Flashlite User Guide.
We can arrange for larger shared file spaces to be set up to facilitate specific research projects; see Q6.2.3 above.
WARNING: there are NO BACKUPS for any user or project files on either Flashlite or Euramoo.
The single copy of your files on the disk is the only copy that there is:
- If you or an application that you run overwrites or deletes files, they are gone forever.
- If the operators accidentally delete files, they are gone forever.
- If a hacker breaks in and damages files, they are gone forever.
- If there is a hard disk failure that is beyond the capabilities of the RAID system (or similar) to deal with, the affected files are gone forever.
- If there is a fire in the data centre, all of your Euramoo / Flashlite files may be gone forever.
There are no "backup tapes" to recover from, no matter how nicely you ask us, and no matter how important the files are to your research and/or your job security. Beware!
(There are a number of reasons why none of our HPC or HTC systems offer backups of user files. Running a backup system for a ~50TB file system requires a lot of resources, both in terms of media and I/O bandwidth. Getting a consistent backup of a file system where jobs are running 24/7 is impossible. Restoring files from backups is labor intensive.)
You can find out how much file space is available to you by running either "quota" or "/usr/local/bin/rquota". This can show you your own personal file quotas, and the quotas for project groups that you belong to. (Use the "-g <group>" option for group quotas.)
Note that there are distinct quotas for file space and file count.
Generally no. We discourage people from using Euramoo and Flashlite file systems for keeping large amounts of personal data. But see Q6.2.3 above.
Generally no. The file count quotas are intended to stop people from organizing and storing their data as lots of tiny files. Doing that causes a range of problems, and is not conducive to efficient use of the computing resources.
We recommend using a lightweight database, a more sophisticated flat-file organization, or "zipping" the small files into ZIP or TAR archives.
Yes. See FAQ 6.2.3 above.
If you want other people in your team to be able to access your files, they should be created with the appropriate group and the appropriate access modes. The "chmod" and "chgrp" commands can be used to adjust the group and modes of your files, and the "newgrp" command allows you to temporarily change your primary group.
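For example, the following sketch hands an existing directory tree to a project group and makes it readable by that group only. The function name is ours, and the modes shown are one reasonable choice rather than a site requirement.

```shell
# Share a directory tree with a project group (sketch).
# Usage: share_with_group GROUP DIR
share_with_group() {
    chgrp -R "$1" "$2"            # hand the files to the group
    chmod -R g+rX,o-rwx "$2"      # group may read; capital X adds execute
                                  # only to directories and to files that
                                  # are already executable
    chmod g+s "$2"                # new files created in DIR inherit the group
}
```

You must be a member of the group that you pass to "chgrp"; run "groups" to check.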
(The access rules for collections work differently; see FAQ 6.7.1.)
Keep a copy of any data that you cannot afford to lose on a separate system that has regular backup procedures in place. We recommend that you talk to your local IT support staff about this. They may provide a place to store your files that will be properly backed up.
You can use the SSH-based file transfer protocols to upload and download files.
For file transfers, we recommend that you connect directly to the login nodes (euramoo1, flashlite1, etcetera) rather than connecting via the load balancers (euramoo, flashlite):
- The load balancers cannot sustain as high a data transfer rate as the login nodes.
- The load balancers have rate limiting in place to throttle the number of parallel connections, and the rate at which new connections are established. This can interfere with bulk file transfers.
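As a sketch of a bulk upload that goes to a specific login node, the function below wraps "rsync" over SSH. The function name, the user name and the paths in the usage comment are hypothetical, and the fully qualified login node name is an assumption based on the load-balancer hostname given earlier; check the user guide for the actual login node names.

```shell
# Copy a local directory to an HPC file space over SSH (sketch).
# Usage: push_to_hpc SRC_DIR USER@LOGIN_NODE DEST_DIR
push_to_hpc() {
    # -a preserves times and modes, -v lists files as they are sent,
    # --partial keeps partly transferred files so that a rerun can resume.
    rsync -av --partial "$1/" "$2:$3/"
}

# Example (hypothetical user and paths; note the specific login node):
#   push_to_hpc ./inputs myuser@flashlite1.rcc.uq.edu.au /30days/myuser/inputs
```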
Each compute node has a significant amount of scratch file storage space that a job can use. Each job gets its own scratch area. You should use $TMPDIR to access it.
- A job's scratch area is cleaned automatically when the job ends. If there are files that need to be saved, your job script should copy them to one of your other directories before it exits.
- The size and performance characteristics of the scratch area vary for the different HPC systems. You can find out how much space is currently available using (for example) "df -h $TMPDIR". If you need a huge amount of scratch storage, or if you need it to be very fast, please raise a support ticket.
- Avoid filling up the scratch of a compute node. It is likely to make your job fail, and possibly make other people's jobs fail too. (The scratch area is shared and has no quotas.) If you suspect that you might fill up scratch space, we would prefer you to use "qsub" resources that will cause your job to request a complete compute node.
- Do not write temporary files to other places on a compute node. You are liable to cause problems for the operators and other users.
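The points above can be sketched as the body of a job script. The "results" directory and the "echo" command are placeholders for your own destination and computation, and the fallbacks for TMPDIR and PBS_O_WORKDIR are only there so that the sketch can be tried outside a PBS job.

```shell
#!/bin/bash
# Sketch of a job script body that works in the job's scratch area.
# Inside a job, TMPDIR points at the job's scratch area and
# PBS_O_WORKDIR is the directory that "qsub" was run from.
TMPDIR=${TMPDIR:-$(mktemp -d)}
results_dir=${PBS_O_WORKDIR:-$PWD}/results   # hypothetical destination

df -h "$TMPDIR"                  # how much scratch space is available?

cd "$TMPDIR" || exit 1
# ... replace the next line with your real computation ...
echo "example output" > output.txt

# Save what you need before the job ends; scratch is cleaned afterwards.
mkdir -p "$results_dir"
cp output.txt "$results_dir/"
```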
In some cases, yes. In other cases, no. Read on ...
You can access QRISdata collections from HPC.
- You can use SSH-enabled file transfer utilities like "scp", "sftp" or "rsync" to transfer files to and from a QRISdata RDSI collection.
- HPC users have direct access to QRISdata collections that have been configured for "Standard Access". Each collection is auto-mounted as "/RDS/Qnnnn" or "/QRISdata/Qnnnn" where "Qnnnn" is the collection's ID. (Typically "/RDS" is a symbolic link to "/QRISdata".)
- In either case, you may need to request permission to access a collection from the respective collection owner.
We strongly discourage people from "computing" against files in a RDSI collection, or using collections as "scratch" storage space. Instead, we recommend that you copy files from your collection to your "/30days" or "/90days" directory, or to local scratch space on the compute nodes themselves. Only copy files back to your collection that need to be kept. If you have lots of little files, archive them (e.g. using "tar" or "zip") before copying them back.
In theory, you could access NeCTAR (or other) Object Storage from HPC. If you need to do this, please contact QRIScloud Support for advice on how to proceed.
You cannot directly access regular NeCTAR (or other) Volume Storage from Awoonga or Flashlite. If you have data in QRIScloud-based volume storage that you need to access, we recommend that you attach the volume to a NeCTAR instance, and then use an SSH-based file transfer tool to copy files between the instance and Awoonga or Flashlite.
Medici is a "data fabric" project that aims to make the same files available transparently in multiple places. The basic idea is that copies of files are cached near to where they are needed, and high speed networking is used to move files between caches and keep them in sync.
Medici allows HPC users to access files held in various storage clusters in UQ St Lucia, and ultimately at other Queensland Universities. From the user perspective, Medici files are just files in a QRISdata collection, where that collection resides on a GPFS-based file system.
As mentioned above, Standard Access QRISdata collections are mounted on the HPC systems as "/QRISdata/Qxxxx". If you use a collection inappropriately from HPC, it is possible to trigger a variety of "operational problems" that affect all users.
The technical problem is that the collections are essentially huge file systems that contain a vast number of files, and replicate them to offline and offsite media to protect against data loss. To do this economically, we store the files on large-scale infrastructure (file servers, storage clusters, tape silos) that is shared by thousands of users. The infrastructure is designed on the assumption that most of the data doesn't change most of the time.
The two main concerns are:
- Users doing things that place a lot of load on the front-end file servers; e.g. reading or writing lots of files in a short space of time.
- Users doing things that trigger unnecessary file replication to the back-end tape and off-site systems. This causes load on the tape systems AND wastes tape space.
The problem is that the HPC systems can easily amplify poor practices to a level that impacts performance for all users. The HPC compute nodes have fast I/O paths to the file servers, and users have the ability to submit arrays of jobs that all do the same thing.
Here are some basic rules.
- Don't use "/QRISdata/Qxxxx" for scratch space. Ever.
- Don't unpack archives into a directory within "/QRISdata/Qxxxx". Unpack them into a scratch file system.
- Don't create lots of little files in a collection. If your application reads and writes lots of little files, then you need to store them in the collection as archives (e.g. ZIP or TAR files), and write your job scripts to do the packing / unpacking into scratch space.
- Avoid computing against files in collections. If your application is file I/O intensive, stage the necessary files to HPC hosted storage; "/30days", "/90days" or scratch space on the compute nodes.
- Avoid repeatedly writing / rewriting files. Each write could trigger re-replication of the files.
- Avoid "reorganizing" and renaming directories with lots of files in them. Renaming a directory may trigger re-replication of all files "beneath" it.
For more information, please read the QRIScloud Collection Dos and Don'ts document.
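The packing / unpacking rule can be sketched with two small helper functions. The function names and the archive names in the usage comments are hypothetical, and "/QRISdata/Qxxxx" stands in for your own collection directory.

```shell
# Sketch: keep many small files in a collection as a single TAR archive.

# Usage: stage_in ARCHIVE SCRATCH_DIR
stage_in() {
    tar -xzf "$1" -C "$2"                  # unpack into scratch, not the collection
}

# Usage: stage_out SCRATCH_DIR SUBDIR ARCHIVE
stage_out() {
    tar -czf "$1/$2.tar.gz" -C "$1" "$2"   # pack SUBDIR inside scratch first
    cp "$1/$2.tar.gz" "$3"                 # then write one file to the collection
}

# In a job script:
#   stage_in  /QRISdata/Qxxxx/inputs.tar.gz "$TMPDIR"
#   ... run the computation in "$TMPDIR" ...
#   stage_out "$TMPDIR" results /QRISdata/Qxxxx/results.tar.gz
```

Packing in scratch first means that the collection sees a single write of a single file, which avoids both the small-file load and repeated re-replication.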
RCC asks that Tinaroo support requests be emailed to <firstname.lastname@example.org>. However, we will pass on any Tinaroo support requests sent via QRIScloud Support channels.
UQ Research Computing Centre (RCC) runs free HPC training sessions on the UQ St Lucia Campus once a month. These are open to all HPC users.
- The times for RCC HPC introductory training sessions are advertised on the RCC Training page.
- Please contact <email@example.com> to register for a training session.
Unless you are specifically authorized to do this, the answer is No. If you need to use software that requires this, please contact QRIScloud support, or RCC support.
Barrine was the "workhorse" HPC system for the University of Queensland from 2010 to 2015. It was decommissioned in early 2016. There is still a login node for retrieving user files stored on the system. User files have all been moved to HSM.
Euramoo was originally intended as an interim replacement for Barrine. It was a "cluster in the cloud" system implemented using a hybrid of Intel and AMD hardware and hybrid operating systems running on OpenStack-managed virtual machines. It proved to be more difficult to manage than conventional HPC and it was rebuilt as a smaller "bare metal" cluster (without the AMD cores) and renamed as Awoonga.
We have an arrangement with Mathworks to allow users of Awoonga to make use of their home institutions' Matlab site license on Awoonga. So far, this has been implemented for University of Queensland, University of Southern Queensland, Griffith University, and James Cook University. (For others, your institution needs to provide us access to their Matlab license server.)
For Flashlite and Tinaroo, only UQ users are currently able to use Matlab.
To use Matlab interactively, you should:
- Start an interactive job (see Q6.5.6) on an Intel or AMD node. (We recommend you request at least 4gb of memory to run matlab.)
- Use "module load matlab" to load Matlab. (If you need a specific version, use "module avail" to check if it is available, and select it specifically.)
- Run the "matlab" command.
Note that different versions of Matlab are available depending on licensing.
All HPC users can run compiled Matlab using the "mcr" command as this does not require access to the license servers.
(There is an approved way to get an "interactive" session on a compute node; see Q6.5.6. We are talking about something different here. We are talking about how to investigate what a non-interactive job is doing.)
It is technically possible to connect to a compute node, but we discourage it ... unless you are doing it for the right reasons:
- If you do not have a batch or interactive PBS job currently running on a compute node, then it is NOT OK to connect to it.
- If you do have a job running on a node, then:
  - it is OK to connect to the node to check its progress
  - at a pinch, it is OK to run other commands with significant resource demands ... provided that your PBS job has "exclusive" use of the node.
When you connect to a compute node directly, there is an elevated risk that what you do on the node will interfere with other users' jobs:
- If you run compute intensive tasks that PBS doesn't know about, that could "steal" CPU cycles from other users, and cause their jobs to exceed their wall-time.
- If you run memory hungry processes that cause the node to get short of memory, it could trigger the "OOM killer" to kill processes that belong to other people's jobs.
In order to connect to a compute node, you need to use a public / private key to enable SSH from your account on a login node to the compute nodes; see Q6.8.7 for instructions. Then Q6.8.8 explains how to connect.
This assumes that you have not configured your HPC account to allow SSH access.
- SSH to an HPC login node in the normal way
- Make sure that your home directory cannot be modified by anyone else.
$ chmod go-w ~
- Create a ".ssh" directory in your home directory, and make it private. Nobody else should be able to read or write it.
$ mkdir ~/.ssh
$ chmod 700 ~/.ssh
- Generate an SSH keypair.
$ cd ~/.ssh
$ ssh-keygen -t rsa -C "$USER - internode key" -f id_rsa
- Copy the public key to your "authorized keys" file.
$ cat id_rsa.pub >> authorized_keys
$ chmod 600 authorized_keys
WARNING: DO NOT copy the SSH keypair that you generated (as above) to another system. This keypair should ONLY be used for SSH >within< the specific cluster on which you generated it.
NOTE: If you get the permissions on "~", "~/.ssh" and/or the files in "~/.ssh" incorrect, the SSH client or server are liable to prevent you from using the keys to login. (Typically, they won't tell you why!)
Assuming that you have set up your SSH keys correctly (see Q6.8.7), then the procedure for logging into a compute node is as follows:
- SSH to a login node in the normal way.
- On the login node, identify the hostname of the compute node where your job is (or was) running. For example, the following can be used to lookup the hostname for a running job with job id "123":
$ qstat -f 123 | grep exec_host
exec_host = eura51-intel/9
The hostname consists of the characters before the slash.
- On the login node, use "ssh" to connect to the compute node. For example:
$ ssh eura51-intel
If you have multiple keys in your "~/.ssh" directory, you can use "-i" to select a specific key:
$ ssh -i ~/.ssh/id_rsa eura51-intel
Note: you should only try to login to a compute node from a login node. Login to a compute node from outside of the HPC cluster is blocked by the cluster firewall rules.
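If you do this often, the hostname lookup can be wrapped in a small function. This is a sketch that assumes the "exec_host = <hostname>/<n>" output format shown in the example above; the "job_host" name is ours.

```shell
# Print the compute node hostname for a job id (sketch).
# Only the first host is printed if the job spans several nodes.
job_host() {
    qstat -f "$1" | sed -n 's|^[[:space:]]*exec_host = \([^/]*\)/.*|\1|p'
}

# Usage on a login node, for job id 123:
#   ssh "$(job_host 123)"
```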
If you have a passphrase on your ssh key, then it may be convenient to use "ssh-agent" so that you only need to unlock your key once at the start of your HPC session:
$ ssh-agent bash
$ ssh-add <private-key-pathname>
You will be prompted for the passphrase when you add the key to the agent.
The short answer is to use SSH X11 tunneling and forwarding.
yourPC$ ssh -X user@hpc
theHPC$ qsub -X -I ...
First of all, you need X11 server software installed and configured on the machine where you are displaying your results; e.g. your PC, Laptop, Workstation or whatever.
- For Mac OSX, X11 support is pre-installed and configured.
- For Linux systems running a standard Desktop environment (Gnome, KDE, etc), X11 support is pre-installed and configured. (If you are not running a Desktop environment, the simple solution is to install one.)
- For Windows there are various alternatives: for example Xming, Cygwin/X, MobaXterm, XWin32, VcXsrv.
The next step is to enable X11-forwarding when you start the SSH connection from your computer to the HPC login node:
- If you are using the "ssh" command, include the "-X" option on the command line.
- If you are using PuTTY, use the PuTTY Configuration dialog and check (tick) the "Enable X11 Forwarding" option in "Connection > SSH > X11".
That gives the ability to run an X11 application on the login node. (You can test it by running "xclock" or "xeyes" on the login node, and see the corresponding window appear on your computer. Use "^C" to kill it.)
The final step is to get X11 connectivity to your interactive job. To do this, simply add the "-X" option to the "qsub" command that you use to request the job. Once again, you can test using "xclock" or "xeyes".
Note that X11 connectivity to a non-interactive job is not possible under normal circumstances. There is no SSH connection between your login session and the batch job for tunneling the X11 traffic, and the compute nodes are blocked from establishing out-going X11 connections ... even if your computer has been configured to accept them.