Advice: Scheduler changes on Awoonga, Flashlite and Tinaroo

UQ RCC operations staff are in the process of changing Awoonga, Flashlite and Tinaroo from the TORQUE batch job scheduler to PBS Pro. The goal of the changeover is to improve job scheduling behavior and provide better facilities for managing and monitoring jobs. In the long term, it will also allow us to move towards a single scheduler that spans all three clusters.

Unfortunately, while TORQUE and PBS Pro are similar, there are important differences between them:

  • The names of some of the environment variables used by the two schedulers are different.
  • The options for requesting job arrays and specifying resources are different.
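
As a rough illustration of both kinds of difference, the sketch below shows a small array job with the TORQUE form in comments and the PBS Pro form as live directives. The resource values (1 node, 4 cores, indices 0-9) are made up for the example, and the fallback to 0 is only there so the script can be tried outside the scheduler.

```shell
#!/bin/bash
# TORQUE form, kept as plain comments for comparison:
#   -l nodes=1:ppn=4      resources: 1 node, 4 cores per node
#   -t 0-9                job array, index exposed as $PBS_ARRAYID
#
# PBS Pro form of the same (illustrative) request:
#PBS -l select=1:ncpus=4
#PBS -J 0-9

# Read the array index under either name; fall back to 0 outside a job.
index="${PBS_ARRAY_INDEX:-${PBS_ARRAYID:-0}}"
echo "array index: $index"
```

Reading the index this way lets one script body run under either scheduler, although the directives at the top still have to match the scheduler that receives the job.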

In an attempt to make the transition transparent to users, RCC have implemented a wrapper for "qsub" that translates TORQUE syntax jobs and job parameters to their PBS Pro equivalents.  Unfortunately, the translation is somewhat "heuristic" in its approach and there are edge cases where it is not working properly.  (For example, we have seen cases where the problem was caused by the original script quoting arguments in an unexpected way.)

What should you do if you have this problem?

Approach #1: Simplify / fix the job script

Identify the line in your original (TORQUE compatible) job script that is causing the translation to break, figure out why, and then fix it.  Unfortunately, in order to do this you need to understand Linux shell syntax; i.e. how shell quoting and variable expansion really work.
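
If shell quoting is unfamiliar, the following minimal sketch shows the two behaviors that most often surprise people. It is not the pattern that broke the wrapper (that depends on your script); the variable name and value are hypothetical.

```shell
#!/bin/bash
# Hypothetical value containing a space, for illustration only.
name="my input.dat"

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args $name)    # unquoted: split on the space into 2 arguments
quoted=$(count_args "$name")    # double quotes: expanded, kept as 1 argument
single='$name'                  # single quotes: no expansion, literal text

echo "unquoted: $unquoted, quoted: $quoted, single: $single"
```

Checking each suspicious line of a job script against these three cases is usually a quick way to see whether an argument is being split or left unexpanded.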

Approach #2: Translate the job script to PBS Pro

If you look at the "/usr/local/bin/qsub" shell script, you will see that it is ultimately calling the PBS Pro version of the "qsub" command; i.e. "/opt/pbs/bin/qsub".

Therefore, it is possible to side-step any problems with TORQUE to PBS Pro translation as follows:

  1. Manually translate your job script to use PBS Pro compatible options and PBS Pro compatible environment variable names.
  2. Submit the job directly to the PBS Pro scheduler by using the "/opt/pbs/bin/qsub" command instead of "qsub".
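
Before submitting directly, it may help to check that no TORQUE-style constructs are left in the translated script. The function below is a hypothetical helper, not an RCC tool, and its patterns cover only the common cases (`nodes=`, `ppn=`, `-t` arrays and `PBS_ARRAYID`), so treat a clean result as a hint rather than proof.

```shell
#!/bin/bash
# Hypothetical helper: print (with line numbers) any lines of a job script
# that still look like TORQUE syntax. The pattern list is not exhaustive.
find_torque_syntax() {
    grep -nE '^#PBS.*(nodes=|ppn=)|^#PBS[[:space:]]+-t[[:space:]]|PBS_ARRAYID' "$1"
}
```

A possible use, with `myjob.pbs` standing in for your own script, is `find_torque_syntax myjob.pbs || /opt/pbs/bin/qsub myjob.pbs`, which submits only when nothing was flagged.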

If this works for you, you could make "/opt/pbs/bin/qsub" your personal default, by modifying your PATH variable or adding a shell alias.  But if you do this, you need to remember that you have done this, because you may need to change it again in the future.
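
Either change could look something like the following sketch, assuming a bash login shell and that the lines go in a startup file such as `~/.bashrc`. Only one of the two is needed.

```shell
# Option A: shell alias. This affects interactive shells only; scripts that
# call "qsub" will still get the wrapper.
alias qsub=/opt/pbs/bin/qsub

# Option B: put the PBS Pro directory first in PATH, so a plain "qsub"
# resolves to /opt/pbs/bin/qsub in scripts as well.
export PATH="/opt/pbs/bin:$PATH"
```

Running `type qsub` afterwards shows which version the shell will actually use, which is a handy reminder if you later forget that you made the change.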

In the future ...

The above describes the situation as of now.  Things may change again:

  • RCC is considering switching to the SLURM scheduler at some point in the future.
  • If RCC decide to stick with PBS Pro as the scheduler in the long term, it may make sense to make the PBS Pro version of "qsub" the default one.  If that happens, users may need to modify their scripts explicitly.

UPDATE: 2017-11-24 - 14:40.  Earlier, people were seeing this error when submitting jobs:

    /usr/local/bin/qsub: line 64: unexpected EOF while looking for matching `"'

It was due to a simple bug in the "qsub" wrapper script, and it has now been fixed.

 
