Latest update: 2 Oct 2005

Magnesium

Our little cluster

Sponsored by The Magnus Ehrnrooth foundation

Introduction

Magnesium is the frontend for our local little Debian Linux cluster. For now, the cluster has 15+1 nodes with the following hardware specs:

The nodes are connected by pseudo-gigabit on-board ethernet and a ZyXEL GS-1116 switch. The latency is 18 µs (not that bad), the maximum throughput 460 Mbps (not that good) for packet sizes over 1 MB (TCP, see figure, tested with NetPIPE).

See below for latest news and changes.




Very brief rules of use


Usage

The cluster is intended for computations, nothing else. The front end is to be used for job submission to the queue system. Interactive work is only allowed on the front end, and even then, no interactive calculations are allowed! Only preparation of job scripts and such is to be performed. Program compilation should also be done elsewhere, if possible.

The front end will be busy enough acting as file server and queue master, as well as home of the test.q queue.

See below for info on different queues and node access.


User accounts

Magnesium is a separate entity, only loosely connected to our other machines. Therefore you need a separate user account for it. This you get by mailing me at mikael.johansson@helsinki.fi or knocking on my door when I'm behind it. There are a few things that need to be set up for you before you can begin to use the cluster.

Queue system

Magnesium uses the Sun Grid Engine (SGE 6) queueing system, version 6.0u4. All jobs must be run via the queue(s)!

Job scripts
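A minimal sketch of what a serial job script could look like; the program, directory and file names are placeholders only, and the queue choices are explained further down:

    #!/bin/bash
    #$ -S /bin/bash              # shell to interpret the job script with
    #$ -N myjob                  # job name
    #$ -q chem-smallmem.q        # queue to run in (see Available queues)
    #$ -cwd                      # start the job in the submission directory

    # Hypothetical program and file names; replace with your own.
    mkdir -p /wrk/yourusername/myjob
    cp input.inp /wrk/yourusername/myjob
    cd /wrk/yourusername/myjob
    /home/opt/someprogram/bin/someprogram input.inp > myjob.out
    cp myjob.out $SGE_O_WORKDIR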

Submitting jobs
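Jobs are submitted and monitored with the standard SGE commands, for example:

    qsub myjob.sh        # submit the job script to the queue system
    qstat                # show the status of jobs in the queues
    qstat -f             # show the status of all queues
    qdel <job-id>        # remove a job from the queue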


Parallel jobs

General note on parallel jobs: If you don't know what you are doing, then don't!

Magnesium supports parallel jobs via the MPICH message passing interface (version 1.2.6). LAM/MPI might be coming, but its coexistence with MPICH is not without problems, and SGE and LAM/MPI are not the best of friends.

SGE, the queueing system, is not optimal for managing parallel jobs, so there are a few things that you must use in your job script when running parallel jobs.

  1. You need to specify that you want to run a parallel job with the -pe option. Below, <numprocs> defines how many CPUs you want to use. The maximum allowed is 6! Adhere. Also, use the -V switch.

  2. You have to set the environment variable MPICH_PROCESS_GROUP to no:

  3. Optimise MPICH message transfer speed with:
    The P4_SOCKBUFSIZE is not related to SGE, but will significantly improve performance if large packets are transferred. MPICH cannot be faster than raw TCP: the latency increases to 25 µs, and the maximum transfer speed drops to about 410 Mbps, see figure. Values larger than 128000 could perhaps be useful.
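In bash syntax, the lines the three points above refer to could look like this; the name of the parallel environment is assumed here to be mpich (check the actual name with qconf -spl):

    #$ -pe mpich <numprocs>          # point 1: request <numprocs> slots, at most 6
    #$ -V                            # point 1: export your environment to the job

    export MPICH_PROCESS_GROUP=no    # point 2
    export P4_SOCKBUFSIZE=128000     # point 3: larger TCP socket buffers for MPICH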

Many parallel program packages need a directory that is visible to all processes on the nodes used. If this is the case, the only alternative is to run at least partly on /home. But even if some files need to be accessible to all processes, others, like two-electron integral files and such, might not be. Find out the situation for your program. In any case, it is forbidden to store 2e-integrals on /home! (The disk quota effectively takes care of this anyway, but it is still forbidden.) Minimize other disk access to /home as well! For example, don't dump MO files after every SCF iteration, etc.

Parallel Turbomole specific

Example parallel job script:
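A sketch of what such a script could look like, assuming a hypothetical program my_mpi_prog, a parallel environment named mpich, and the standard SGE MPI integration that writes a machine file to $TMPDIR:

    #!/bin/bash
    #$ -S /bin/bash
    #$ -N mpitest
    #$ -q chem-smallmem.q
    #$ -cwd
    #$ -pe mpich 4                   # 4 slots; the maximum allowed is 6
    #$ -V

    export MPICH_PROCESS_GROUP=no
    export P4_SOCKBUFSIZE=128000

    # my_mpi_prog and its input are hypothetical; replace with your own.
    mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
        /home/opt/my_mpi_prog/bin/my_mpi_prog input.inp > mpitest.out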

General note on parallel jobs: If you don't know what you are doing, then don't.

Available queues

For now, magnesium has three different queues: test.q, chem-smallmem.q and chem-largemem.q. You should choose which queue you would like to use with the -q parameter in your job script, for example:
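    #$ -q chem-smallmem.q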

The queues have different limits. Choose the one that suits your needs.

queue             time limit   mem limit a)   real mem   max disk b)   slots
test.q                10 min        1000 MB     1.5 GB          5 GB       1
chem-smallmem.q       1400 h        2500 MB     1.5 GB        340 GB      11
chem-largemem.q        700 h        4500 MB     3.0 GB        340 GB       4
a) The mem limit refers to the total memory used by your job processes, including data and stack. If you exceed the memory limit on largemem.q, your job is not suitable for magnesium.
b) The max disk is really the maximum that you in principle have available. In reality, it will be lower. On test.q, the maximum size includes your home dir (it is in fact your disk quota). On chem-*.q, /wrk might not be totally empty.

The test.q can be used, for example, to check whether a job would actually start with the job script you've meticulously prepared. It's also useful for diagnosing programs that seem to crash right after start.

The jobs in test.q run on the front end, so you don't necessarily have to copy your files to a /wrk-dir. And if you do, you don't have to log onto a node to check the files produced.

You can also leave the choice between chem-smallmem.q and chem-largemem.q to the queue system by selecting any chem queue with a wildcard, like chem*, i.e.:
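In the job script this would be:

    #$ -q chem*

or, equivalently, on the qsub command line (the quotes keep the shell from expanding the wildcard):

    qsub -q 'chem*' myjob.sh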

This way, your job will be submitted to the first slot that is, or becomes, available on a smallmem or largemem host, in that order.

To see for example how many free slots the queues have available:
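One way to do this in SGE 6 is the cluster queue summary:

    qstat -g c        # shows used and available slots per queue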


Available programs

All available program packages are installed under /home/opt; ls that directory for a list. For now, the installed programs are:

If you have some other programs that you would like to share with other people, i.e. install them under /home/opt, mail me. First test that they work on magnesium, though.

Accessing the nodes

Ideally, you should not need to log on to the compute nodes of magnesium. Running interactive jobs on the nodes is completely forbidden. But there are situations when logging on to the nodes is necessary, like checking files after a job crash/completion. Anyway, there is no strict rule against logging in.

If you need to log on, use ssh mgXX, where XX identifies the node and runs from 01 to 15. For example:
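    ssh mg07          # log on to node 7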

You should not be asked for a password when you log in. If the system does ask for one, contact me.

NOTE! There is no need to log on to a node for following job output and such. For this, there exists a special command, dsh (for distributed shell, or dancer's shell). dsh runs the command you specify on a host you specify. You only need to tell dsh which node (machine) you want to run on with the -m parameter. An example:
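(The path and output file below are placeholders; substitute your own /wrk directory and file.)

    dsh -m mg03 tail /wrk/yourusername/myjob.out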

You could of course do the same thing directly with ssh, if you prefer:
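With the same placeholder path as above:

    ssh mg03 tail /wrk/yourusername/myjob.out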

The advantage of dsh is that you can specify more than one machine to perform an action on. With the -c switch, your command is executed simultaneously on all specified machines. More info via man dsh.

There is a special parameter for executing a command on all nodes: -a. For example, if you would like to see the main processes on all nodes, you could execute one of these:
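(ps aux here is just one possibility; any process listing will do.)

    dsh -a ps aux           # run on all nodes, one after the other
    dsh -a -c ps aux        # run concurrently on all nodes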

To quickly clean up (delete) all your files from all /wrk disks (note that the front end is not included with the -a switch):
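(yourusername is a placeholder; double-check the path before running, rm -rf does not ask for confirmation.)

    dsh -a -- rm -rf /wrk/yourusername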

For a list of all nodes as well as some info about them, you can use qhost. With the -j switch, it's an alternative to qstat:
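    qhost             # list all nodes with architecture, load, memory, etc.
    qhost -j          # additionally show the jobs running on each node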


Latest news and changelog


Comments

Questions, comments and oddity reports are welcome at mikael.johansson@helsinki.fi (RTFM reservation).

Mg in the night

Half a cluster on a warm summer's night.

[Half of Mg in the night]