
Latest update: 2 Oct 2005

Magnesium

Our little cluster

Sponsored by the Magnus Ehrnrooth Foundation

Introduction

Magnesium is the front end for our local little Debian Linux cluster. For now, the cluster has 15+1 nodes with the following hardware specs:
  • 2.4 GHz AMD64 CPUs ("3400+") on Asus K8V SE Deluxe (K8T800) motherboards
  • 1.5 GB DDR333 RAM (11 nodes) / 3.0 GB DDR333 RAM (4 nodes)
  • 340 GB /wrk, RAID-0, XFS (2 × Seagate Barracuda 7200.7 SATA)

The nodes are connected by pseudo-gigabit on-board Ethernet and a ZyXEL GS-1116 switch. The latency is 18 µs (not that bad) and the maximum throughput 460 Mbps (not that good) for packet sizes over 1 MB (TCP; see figure, tested with NetPIPE).

See below for latest news and changes.


Very brief rules of use

  • No interactive jobs are allowed; use qsub.
  • If possible, jobs should run on /wrk, not on /home.
  • Install programs only in your home directory, even if you could install them elsewhere, and do not modify anything outside it. Otherwise cluster management will become impossible.

Usage

The cluster is intended for computations, nothing else. The front end is to be used for job submission to the queue system. Interactive work is only allowed on the front end, and even then, no interactive calculations are allowed! Only preparation of job scripts and such is to be performed. Program compilation should also be done elsewhere, if possible.

The front end will be busy enough acting as file server and queue master, as well as the home of test.q.

See below for info on different queues and node access.


User accounts

Magnesium is a separate entity, only loosely connected to our other machines. Therefore you need a separate user account for it. This you get by mailing me at mikael.johansson@helsinki.fi or knocking on my door when I'm behind it. There are a few things that need to be set up for you before you can begin to use the cluster.

Queue system

Magnesium uses the Sun Grid Engine (SGE) queueing system, version 6.0u4. All jobs must be run via the queue(s)!

Job scripts

  • To submit a job to the queue, you need to prepare a job script. The job script is like a normal script of commands with additional settings related to the queue, like which queue to use, where to report job status, etc.

  • Just to get the idea, a very simple example job script could be the following:
        #!/bin/sh
        #$ -o outputfile.out 
        #$ -N NameOfJob 
        #$ -q test.q
    
        sleep 60
        echo "took a nap."
    
  • This "job" of course does nothing useful. It does show the general idea of a job script, though: First comes a bunch of parameters (lines beginning with #$), then the commands that will be run once the queue system submits the job to some node for execution.

  • The most important job parameters are:
    option  explanation
    -o      The file where the stdout of the job is directed.
    -j      Combines stderr with stdout.
    -N      The name of the job. It is a good idea to give your different jobs different names; it eases identification.
    -S      Which shell to use.
    -V      Copies the current environment to the job session.
    -cwd    Executes the job from the current working directory.
    -m      Sends email on specific occasions; common occasions are e (end of job) and b (beginning of job).
    -M      Your e-mail address.
    -q      Name of the queue. See below for a list of available queues.

  • For a complete list of options/parameters, execute man qsub.

  • A few example job scripts (a minimal sketch in the same spirit is given after this list):

  • Both scripts do things a bit differently, but have one thing in common: The job files are first copied to the local /wrk disks of the nodes!
  • This is very important! The /home directories are mounted over NFS and should not host calculations. In other words: It is forbidden to perform calculations in your /home directory! The exception to the rule is jobs in the test.q, which run on the front end.

  • So copy your files somewhere under /wrk/users/$USER, change to that directory, perform your calculation, move the resulting files back to your /home directory (which is visible from all nodes). NOTE! You must delete the files from /wrk after your calculation!

  • There is also no guarantee as to how long the files on /wrk will stay there. They will be deleted automatically after 90 days. If /wrk is getting full, files that don't belong to the running job on the node can be deleted without further warning.
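
To make the above concrete, here is a minimal sketch of a job script following the copy-to-/wrk pattern. It is not an official example: myprogram and the file names are placeholders, and the script relies on the standard SGE variables JOB_NAME, JOB_ID and SGE_O_WORKDIR.

        #!/bin/sh
        #$ -N wrk-example
        #$ -q chem-smallmem.q
        #$ -o wrk-example.out
        #$ -j y
        #$ -cwd
        #$ -S /bin/sh

        # Scratch directory on the node-local /wrk disk.
        WRKDIR=/wrk/users/$USER/$JOB_NAME.$JOB_ID
        mkdir -p $WRKDIR

        # Copy the input from the submission directory (on /home) and run on /wrk.
        # input.inp and myprogram are placeholders for your own files and program.
        cp input.inp $WRKDIR/
        cd $WRKDIR
        myprogram input.inp > result.out

        # Move the results back to /home and clean up /wrk.
        cp result.out $SGE_O_WORKDIR/
        cd /
        rm -rf $WRKDIR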

Submitting jobs

  • After preparing the job script, submit it to the queue with the qsub command, for example:
    • qsub turbomole-job.cmd

  • To see your job statuses (queueing or running, etc.), use the qstat command. With qstat -f you get more info. With qstat -ext even more. And for the brave, there's qstat -f -ext -r, which already wants a console 184 columns wide... The qstat output is always quite wide, so a useful alias might be:
    • alias q='qstat | cut -c 1-102 | grep -E job-ID\|$USER\|---'

Parallel jobs

General note on parallel jobs: If you don't know what you are doing, then don't!

Magnesium supports parallel jobs via MPICH (version 1.2.6), an implementation of the MPI message passing interface. LAM/MPI might be coming; coexistence with MPICH would not be without problems, though, as SGE and LAM/MPI are not the best of friends.

SGE, the queueing system, is not optimal for managing parallel jobs, so there are a few things that you must use in your job script when running parallel jobs.

  1. You need to specify that you want to run a parallel job with the -pe option. Below, <numprocs> defines how many CPUs you want to use. The maximum allowed is 6! Adhere. Also, use the -V switch.
    • #$ -V
    • #$ -pe mpich <numprocs>

  2. You have to set the environment variable MPICH_PROCESS_GROUP to no:
    • export MPICH_PROCESS_GROUP=no

  3. Optimise MPICH message transfer speed with:
    • export P4_SOCKBUFSIZE=128000

    The P4_SOCKBUFSIZE setting is not related to SGE, but it significantly improves performance when large packets are transferred. MPICH cannot be faster than raw TCP: the latency increases to 25 µs and the maximum transfer speed drops to about 410 Mbps, see figure. Values larger than 128000 could perhaps be useful.

Many parallel program packages need a directory that is visible to all processes on the nodes used. If this is the case, the only alternative is to run at least partly on /home. But even if some files need to be accessible to all processes, others, like two-electron integral files and such, might not be. Find out the situation for your program. In any case, it is forbidden to store 2e-integrals on /home! (The disk quota effectively takes care of this anyway, but it is still forbidden.) Minimize other disk access to /home as well! For example, don't dump MO files after every SCF iteration, etc.

Parallel Turbomole specific

  • For Turbomole, there are three additional environment variables that have to be set in the job script:

    1. You need to tell Turbomole that you would like to run the parallel version with
      • export PARA_ARCH=MPI

    2. Turbomole wastes one CPU for keeping track of the parallel run. For this reason, you need to reserve one more CPU than what will be used for the actual calculation. In other words, the environment variable PARNODES needs to be set to an integer one less than the <numprocs> requested (see above):
      • export PARNODES=<numprocs-1>

    3. You have to specify the environment variable HOSTS_FILE, otherwise Turbomole will ignore what SGE tells it to use, and just start on whatever nodes it feels like:
      • export HOSTS_FILE=$TMPDIR/machines

  • For more info on parallel control keywords for Turbomole, see the TM manual.

Example parallel job script:
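
As a rough sketch only: the queue, the input, and the jobex command below are placeholders; Turbomole itself is assumed to be set up in your environment and passed on via -V; and note the remarks above about directories visible to all nodes.

        #!/bin/sh
        #$ -N tm-parallel
        #$ -o tm-parallel.out
        #$ -j y
        #$ -cwd
        #$ -V
        #$ -q chem-smallmem.q
        #$ -pe mpich 6

        # Required for MPICH jobs under SGE (see above).
        export MPICH_PROCESS_GROUP=no
        export P4_SOCKBUFSIZE=128000

        # Turbomole parallel settings (see above): 6 slots requested,
        # one of which does the bookkeeping, so PARNODES=5.
        export PARA_ARCH=MPI
        export PARNODES=5
        export HOSTS_FILE=$TMPDIR/machines

        # Placeholder command; it runs in the submission directory, which must
        # be visible to all nodes if the program needs shared files.
        jobex -ri > jobex.out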

General note on parallel jobs: If you don't know what you are doing, then don't.

Available queues

For now, magnesium has three different queues: test.q, chem-smallmem.q and chem-largemem.q. You should choose which queue you would like to use with the -q parameter in your job script, for example:
  • #$ -q chem-smallmem.q

The queues have different limits. Choose the one that suits your needs.

queue            time limit  mem limit a)  real mem  max disk b)  slots
test.q           10 min      1000 MB       1.5 GB    5 GB         1
chem-smallmem.q  1400 h      2500 MB       1.5 GB    340 GB       11
chem-largemem.q  700 h       4500 MB       3.0 GB    340 GB       4
a) The mem limit refers to the total memory used by your job processes, including data and stack. If you exceed the memory limit on chem-largemem.q, your job is not suitable for magnesium.
b) The max disk is really the maximum that you in principle have available. In reality, it will be lower. On test.q, the maximum size includes your home dir (it is in fact your disk quota). On chem-*.q, /wrk might not be totally empty.

The test.q can be used, for example, to test whether a job really would start with the job script you've meticulously prepared. It's also useful for diagnosing programs that seem to crash right after starting.

The jobs in test.q run on the front end, so you don't necessarily have to copy your files to a /wrk-dir. And if you do, you don't have to log onto a node to check the files produced.

You can choose to run in either chem-smallmem.q or chem-largemem.q by selecting any chem queue with a wildcard, like chem*, i.e.:

  • #$ -q chem*
This way, your job will be submitted to the first slot that becomes (is) available on a smallmem or largemem host, in this order.

To see for example how many free slots the queues have available:

  • qstat -g c

Available programs

All available program packages are installed under /home/opt; ls that for a list. For now, the installed programs are:
  • Turbomole 5-7-1 (serial and parallel)
  • Gaussian03 (serial only)
If you have some other programs that you would like to share with other people, i.e., have installed under /home/opt, mail me. First test that they work on magnesium, though.
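
Purely as an illustration of pointing a job at one of these packages (the directory name below is an assumption, check the real one with ls /home/opt), a Turbomole environment could be set up roughly like this:

        # Assumed install directory; verify the real name with: ls /home/opt
        export TURBODIR=/home/opt/TURBOMOLE
        export PATH=$TURBODIR/scripts:$TURBODIR/bin/`sysname`:$PATH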

Accessing the nodes

Ideally, you should not need to log on to the compute nodes of magnesium. Running interactive jobs on the nodes is completely forbidden. But there are situations when logging on to the nodes is necessary, like checking files after a job crash/completion. Anyway, there is no strict rule against logging in.

If you need to log on, use ssh mgXX where XX identifies the node and runs from 01–15. For example

  • ssh mg07
You should not be asked for a password when you log in. If the system does ask for one, contact me.

NOTE! There is no need to log on to a node for following job output and such. For this, there exists a special command, dsh (for distributed shell, or dancer's shell). dsh runs the command you specify on a host you specify. You only need to tell dsh which node (machine) you want to run on with the -m parameter. An example:

  • dsh -m mg01 "tail -f /wrk/users/borg/output.out"

You could of course do the same thing directly with ssh, if you prefer:

  • ssh mg01 "tail -f /wrk/users/borg/output.out"

The advantage of dsh is that you can specify more than one machine to perform an action on. With the -c switch, your command is executed simultaneously on all specified machines. More info via man dsh.

There is a special parameter for executing a command on all nodes: -a. For example, if you would like to see the main processes on all nodes, you could execute one of these:

  • dsh -a -M "export TERM=vt100; top -b -n 1 | grep -A 4 USER"
  • dsh -a "export TERM=vt100; top -b -n 1 | grep $USER | head -4"

To quickly clean up (delete) all your files from all /wrk disks (note that the front end is not included with the -a switch):

  • dsh -a -M -c "rm -rf /wrk/users/$USER/*"

For a list of all nodes as well as some info about them, you can use qhost. With the -j switch, it's an alternative to qstat:

  • qhost -j

Latest news and changelog

  • 02 Oct 2005: All 15 nodes up. Enabled printing from the front end.
  • 30 Sep 2005: Updated "largemem" host info. Updated kernels to 2.6.12-1.
  • 15 Jun 2005: MPICH parallel jobs enabled.
  • 13 Jun 2005: Added chem-* queues. Updated kernels to 2.6.11-9 as 2.6.8-11 panics.
  • 11 Jun 2005: All 11 nodes up.
  • 10 Jun 2005: /home was converted from XFS to ext3 just in case.
  • 09 Jun 2005: Winter hibernation ceases, Mg awakens with one node up.

Comments

Questions, comments and oddity reports are welcome at mikael.johansson@helsinki.fi (RTFM reservation).

Mg in the night

Half a cluster on a warm summer's night.

[Half of Mg in the night]
