QRIScloud Known Issues

The following known operational issues relate to systems that QRIScloud supports.

QRIScloud NeCTAR issues:

  1. The QRIScloud availability zone is full, and the ability to launch new instances is curtailed.  Users attempting to launch new instances in QRIScloud are likely to see a "No available host" error message. If you see this message, we recommend that you try to launch in a different availability zone.
  2. Instances that were created prior to 6:00 PM AEST on 6 Oct 2015 need to be hard rebooted to allow instance snapshots to work properly.  (You only need to do this once.)
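
Both workarounds above can also be carried out from the command line. The following is a minimal sketch, assuming the OpenStack command-line client ("openstack") is installed and your NeCTAR credentials are already loaded into the environment. The helper names and every argument in the usage lines (zone, flavor, image and server names) are hypothetical placeholders, not values taken from QRIScloud.

```shell
# Hypothetical helpers; all zone, flavor, image and server names are placeholders.

# Launch an instance in an explicitly chosen availability zone,
# for use when the default zone reports "No available host".
launch_in_zone() {
    zone="$1"; flavor="$2"; image="$3"; name="$4"
    openstack server create \
        --availability-zone "$zone" \
        --flavor "$flavor" \
        --image "$image" \
        "$name"
}

# Hard reboot an instance (needed once for instances created before
# 6:00 PM AEST on 6 Oct 2015, so that snapshots work properly).
hard_reboot() {
    openstack server reboot --hard "$1"
}

# Usage:
#   launch_in_zone <other-zone> <flavor> <image> <server-name>
#   hard_reboot <server-name>
```

The same actions are available through the dashboard; the functions only make the zone choice and the hard reboot explicit.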

Login / data transfer issues:

These issues apply to the QRISdata Collection access services, Euramoo, Flashlite and Tinaroo:

  1. Each of these systems is accessed via an "haproxy" load balancer that balances the load across 2 or more login or data transfer nodes:
    • The load balancers employ connection rate limiting.  If there are too many simultaneous connections or connection attempts from the same system, connections can be closed abruptly with this message:
          ssh_exchange_identification: read: Connection reset by peer
    • The data transfer rate through the load balancers is limited by hardware. 
    • For bulk data transfers, it is advisable to connect to the login nodes / data transfer nodes directly to avoid these problems.  This can also be used as a work-around if you are experiencing problems with login sessions.
      Service      Load balancer             Login / transfer nodes
      Data access  data.qriscloud.org.au     ssh1.qriscloud.org.au
                                             ssh2.qriscloud.org.au
      Euramoo      euramoo.qriscloud.org.au  euramoo1.qriscloud.org.au
                                             euramoo2.qriscloud.org.au
                                             euramoo3.qriscloud.org.au
      Flashlite    flashlite.rcc.uq.edu.au   flashlite1.rcc.uq.edu.au
                                             flashlite2.rcc.uq.edu.au
      Tinaroo      tinaroo.rcc.uq.edu.au     tinaroo1.rcc.uq.edu.au
                                             tinaroo2.rcc.uq.edu.au
  2. The login / data transfer nodes are all configured with "fail2ban" to protect against password guessing.
    • After a small number (typically 3) of failed login attempts, further attempts at login will be refused, even if you get the account and password correct. 
    • This "banning" typically lasts for 10 minutes.
    • If you use SSH keys to log in, and your SSH client offers multiple SSH keypairs, each "offer" that is not accepted is counted by "fail2ban" as a failed login attempt.  If you have lots of keypairs in (for example) your "~/.ssh" directory, you can be banned before your SSH client offers the right keypair.  The solution is to use the "-i" option to specify the key to be used.
  3. The different systems accept different account / password credentials:
    • The QRIScloud collection access services accept QSAC credentials only.
    • Euramoo and Flashlite will also accept a UQ password, if your QSAC and UQ account names are the same.
    • The Tinaroo documentation states that "QSAC credentials are not guaranteed to work".
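
The direct-connection and single-key workarounds described above can be combined. The following is a minimal sketch, assuming an OpenSSH client. The helper names, key path, file name and user name are hypothetical. The "IdentitiesOnly=yes" option is an addition not mentioned above: it asks OpenSSH to offer only the key named with "-i", even when an SSH agent holds other keys.

```shell
# Hypothetical helpers; the key path, file name and user name in the
# usage lines are placeholders.

# Log in offering only the named key, so that unused keypairs are not
# counted by fail2ban as failed login attempts.
ssh_one_key() {
    key="$1"; shift
    ssh -i "$key" -o IdentitiesOnly=yes "$@"
}

# Copy files with scp, again offering only the named key.
scp_one_key() {
    key="$1"; shift
    scp -i "$key" -o IdentitiesOnly=yes "$@"
}

# Usage (connecting to a node directly rather than via the load balancer):
#   ssh_one_key ~/.ssh/id_example myuser@tinaroo1.rcc.uq.edu.au
#   scp_one_key ~/.ssh/id_example results.tar myuser@ssh1.qriscloud.org.au:
```

If you have already been banned, the helpers will not help until the ban (typically 10 minutes) has expired.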

Euramoo issues:

  1. PBS file stage-in and stage-out does not work, and is liable to CRASH the PBS system. Please do not attempt to use it.
  2. Inconsistencies with the "modules" system on Euramoo:
    • Running "module avail" on a Euramoo login node tells you about modules for the "amd" node type only.  To get the information for the "intel" node type, you need to run "module avail" on (for example) an interactive session on an "intel" node.
    • The sets of software installed on "amd" and "intel" node types are inconsistent.
    • Modules (and "/sw") should not be used on "biolinux" nodes.  UPDATE - the modules have now been removed from the biolinux nodes, though the "/sw" trees are still mounted.
    • The module organization is different to Flashlite / Tinaroo.
  3. RDSI collections are not accessible to jobs on Biolinux nodes.  Attempting to access a file or directory via "/RDS" will fail.
  4. There is a bug in the way that the "qsub" command handles memory resource requests.  If you specify a "vmem" resource without a "mem" resource, the qsub filter is supposed to set "mem" to "vmem".  In fact, this is not happening for interactive jobs, and "mem" is left at its default value ("1GB").  The workaround is to specify both "mem" and "vmem" for interactive jobs.
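
The "mem" / "vmem" workaround in item 4 can be wrapped in a small shell function so that both resources are always set together. This is a minimal sketch, assuming that Euramoo's PBS accepts job-wide requests of the form "-l mem=...,vmem=..."; the exact resource syntax may differ on the system, and the helper name is hypothetical.

```shell
# Hypothetical helper: start an interactive PBS job with "mem" and "vmem"
# set to the same value, so "mem" is not left at its 1GB default.
# Any further arguments are passed through to qsub unchanged.
qsub_interactive() {
    mem="$1"; shift
    qsub -I -l "mem=${mem},vmem=${mem}" "$@"
}

# Usage:
#   qsub_interactive 4GB
#   qsub_interactive 8GB -N myjob
```

Batch jobs are not affected by the bug as described, so the wrapper is only intended for interactive sessions.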

Flashlite issues:

  1. Flashlite new account provisioning is sometimes failing to create users' home directories.
  2. Flashlite is sometimes failing to automount collections in the /RDS tree.

Please also refer to the RCC Active Incidents page for current Euramoo, Flashlite, Tinaroo & Barrine incidents.
