Latest update: 2 Oct 2005
Magnesium
Our little cluster
Sponsored by
The Magnus Ehrnrooth foundation
Introduction
Magnesium is the frontend for our local little
Debian Linux cluster. For now,
the cluster has 15+1 nodes with the following hardware specs:
- 2.4 GHz AMD64 CPUs ("3400+") on Asus K8V SE Deluxe K8T800 motherboards
- 1.5 GB DDR333 RAM (11 nodes) / 3.0 GB DDR333 RAM (4 nodes)
- 340 GB /wrk, RAID-0, XFS (2 × Seagate Barracuda 7200.7 SATA)
The nodes are connected by pseudo-gigabit on-board ethernet and a
ZyXEL GS-1116 switch. The latency is 18 µs (not that bad),
the maximum throughput 460 Mbps (not that good) for packet sizes
over 1 MB (TCP, see figure, tested with
NetPIPE).
See below for latest news and changes.
Very brief rules of use
- No interactive jobs allowed; use qsub.
- If possible, jobs should run on /wrk, not on /home.
- Only install programs in your home directory, even if you
could install elsewhere. Do not modify anything else either.
Otherwise cluster management will become impossible.
Usage
The cluster is intended for computations, nothing else. The front end is
to be used for job submission to the queue system.
Interactive work is only allowed on the front end, and even then,
no interactive calculations are allowed! Only preparation of
job scripts and the like is to be done there. Program
compilation should also be done elsewhere, if possible.
The front end will be busy enough as a file server and queue master,
as well as the host of test.q.
See below for info on different queues and
node access.
User accounts
Magnesium is a separate entity, only loosely connected to our other
machines. Therefore you need a separate user account for it. This
you get by mailing me at
mikael.johansson@helsinki.fi or knocking on my door when I'm behind
it. There are a few things that need to be set up for you before you can
begin to use the cluster.
Queue system
Magnesium uses the Sun Grid
Engine SGE6 queueing system (version 6.0u4). All jobs must be
run via the queue(s)!
Job scripts
Submitting jobs
- After preparing the job script, submit it to the queue with the
qsub command, giving the name of the script as its argument.
- To see the status of your jobs (queueing, running, etc.),
use the qstat command. With qstat -f you
get more info, and with qstat -ext even more. And for the brave,
there's qstat -f -ext -r, which wants a console 184
columns wide... The qstat output is always quite wide,
so a useful alias might be:
- alias q='qstat | cut -c 1-102 | grep -E job-ID\|$USER\|---'
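The submission steps above can be sketched as follows. This is only a minimal illustration, not a script from this page: the file name myjob.sh, the job name and the choice of test.q for a first try are assumptions. The qsub call is guarded so that the snippet does nothing harmful on a machine without SGE.

```shell
#!/bin/sh
# Write a minimal serial job script. Lines starting with "#$" are read
# by qsub as options; the file name myjob.sh is a made-up example.
cat > myjob.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N myjob
#$ -q test.q
echo "Running on $(hostname)"
EOF

# Submit the script and list your own jobs, but only where SGE is installed.
if command -v qsub >/dev/null 2>&1; then
    qsub myjob.sh
    qstat -u "$USER"
else
    echo "qsub not found; run this on the front end"
fi
```

Here -S sets the shell for the job, -cwd starts it in the directory it was submitted from, and -N gives it a name that shows up in the qstat listing.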
Parallel jobs
General note on parallel jobs: If you don't know what you are doing,
then don't!
Magnesium supports parallel jobs via the
MPICH message passing interface
(version 1.2.6).
LAM/MPI might be coming. Coexistence with MPICH is not
without problems; SGE and LAM/MPI are not the best of friends.
SGE, the queueing system, is not optimal for managing parallel jobs, so
there are a few things that you must use in your job script when
running parallel jobs.
- You need to specify that you want to run a parallel job with the
-pe option. Below, <numprocs> defines
how many CPUs you want to use. The maximum allowed is 6!
Adhere. Also, use the -V switch.
- #$ -V
- #$ -pe mpich <numprocs>
- You have to set the environment variable
MPICH_PROCESS_GROUP to no:
- export MPICH_PROCESS_GROUP=no
- Optimise MPICH message transfer speed with:
- export P4_SOCKBUFSIZE=128000
The P4_SOCKBUFSIZE is not related to SGE,
but will significantly improve performance if large packets
are transferred. MPICH cannot be faster than raw TCP. The
latency increases to 25 µs, and the maximum transfer
speed drops to about 410 Mbps,
see figure. Values larger than 128000 could perhaps be useful.
Many parallel program packages need a directory that is visible to all
processes on the nodes used. If this is the case, the only alternative is
to run at least partly on /home. But even if some files need to be
accessible to all processes, others, like two-electron integral files
and such, might not be. Find out the situation for your program.
In any case, it is forbidden to store 2e-integrals on /home!
(The disk quota effectively takes care of this anyway, but it is still
forbidden.) Minimize other disk access to /home as well! For example,
don't dump MO files after every SCF iteration, etc.
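For files that do not need to be shared, one way to keep them off /home is to work in a per-job directory on /wrk. The following is a rough sketch under some assumptions: the helper name make_scratch is hypothetical, the /wrk/users/$USER layout is taken from the dsh examples further down this page, and the file names in the comments are placeholders. It only prepares a directory on the node where the job script itself runs.

```shell
# make_scratch ROOT: create a per-job directory under ROOT and print its path.
# JOB_ID is set by SGE inside a job; "manual" is a fallback for trying it by hand.
make_scratch() {
    dir="$1/job.${JOB_ID:-manual}"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}

# Intended use inside a job script (paths and file names are assumptions):
#   SCRATCH=$(make_scratch "/wrk/users/$USER") || exit 1
#   cp input.inp "$SCRATCH"/ && cd "$SCRATCH"
#   ... run the program here, so that integral files land on /wrk ...
#   cp output.out "$SGE_O_WORKDIR"/
```

SGE_O_WORKDIR is set by SGE to the directory the job was submitted from, so only the final results are copied back to /home.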
Parallel Turbomole specific
- For Turbomole, there are three additional environment variables that
have to be set in the job script:
- You need to tell Turbomole that you would like to run the
parallel version with
- Turbomole wastes one CPU for keeping track of the parallel run.
For this reason, you need to reserve one more CPU than what
will be used for the actual calculation. In other words, the
environment variable PARNODES needs to be set
to an integer one less than the <numprocs>
requested (see above):
- export PARNODES=<numprocs-1>
- You have to specify the environment variable
HOSTS_FILE, otherwise Turbomole will ignore what SGE
tells it to use, and just start on whatever nodes it feels like:
- export HOSTS_FILE=$TMPDIR/machines
- For more info on parallel control keywords for Turbomole, see
the TM manual.
Example parallel job script:
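What follows is a minimal sketch assembled from the settings described above, not the example originally shown here. The slot count of 4, the queue and the commented-out program call are assumptions. The variable that selects the parallel Turbomole version is deliberately left out, since its name is not given in the text above.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
#$ -pe mpich 4
#$ -q chem-smallmem.q

# Settings from the parallel-jobs section above.
export MPICH_PROCESS_GROUP=no
export P4_SOCKBUFSIZE=128000

# (Select the parallel Turbomole version here; see the TM manual.)

# SGE sets NSLOTS to the number of slots granted by -pe. The default of 4
# is an assumption that only matters when trying the script outside SGE.
NSLOTS=${NSLOTS:-4}

# Turbomole keeps one CPU for bookkeeping, so use one less for the calculation.
export PARNODES=$((NSLOTS - 1))
export HOSTS_FILE=$TMPDIR/machines

echo "Using $PARNODES of $NSLOTS slots for the calculation"

# Placeholder for the actual calculation, for example a Turbomole module:
# dscf > dscf.out
```

Deriving PARNODES from NSLOTS instead of typing both numbers keeps the two in step when the -pe request is changed.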
General note on parallel jobs: If you don't know what you are doing,
then don't.
Available queues
For now, magnesium has three different queues: test.q,
chem-smallmem.q and chem-largemem.q.
Choose the queue you would like to use with the -q
parameter in your job script, for example #$ -q test.q.
The queues have different limits. Choose the one that suits your
needs.
| queue | time limit | mem limit a) | real mem | max disk b) | slots |
| test.q | 10 min | 1000 MB | 1.5 GB | 5 GB | 1 |
| chem-smallmem.q | 1400 h | 2500 MB | 1.5 GB | 340 GB | 11 |
| chem-largemem.q | 700 h | 4500 MB | 3.0 GB | 340 GB | 4 |
a) The mem limit refers to the total memory used by your
job processes, including data and stack. If you exceed the memory
limit on chem-largemem.q, your job is not suitable for magnesium.
b) The max disk is the maximum that is available to you
in principle. In reality, it will be lower. On test.q,
the maximum size includes your home dir (it is in fact your
disk quota). On chem-*.q, /wrk might not be totally empty.
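The memory limits in the table can be turned into a simple rule of thumb. The helper below is a hypothetical sketch (the name pick_queue is made up) that only looks at the mem limit column and ignores the time limits.

```shell
# pick_queue MB: print the smallest chem queue whose mem limit covers MB megabytes.
pick_queue() {
    if [ "$1" -le 2500 ]; then
        echo chem-smallmem.q
    elif [ "$1" -le 4500 ]; then
        echo chem-largemem.q
    else
        echo "job needs more than 4500 MB; not suitable for magnesium" >&2
        return 1
    fi
}

# Example: a job expected to use about 3000 MB in total.
pick_queue 3000
```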
The test.q can be used for example to test if a job
really would start with the job script you've meticulously prepared.
It's also useful for diagnosis of programs that seem to crash right
after start.
The jobs in test.q run on the front end, so you don't necessarily have to
copy your files to a /wrk-dir. And if you do, you don't have to log onto
a node to check the files produced.
You can choose to run in either chem-smallmem.q or
chem-largemem.q by selecting any chem queue with a wildcard,
i.e., -q chem*.
This way, your job will be submitted to the first slot that becomes
(is) available on a smallmem or largemem host, in this order.
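When the wildcard is given on the command line rather than in the job script, it is worth quoting it. The wrapper below is a small sketch; the name submit_chem is hypothetical.

```shell
# submit_chem SCRIPT...: submit a job script to whichever chem queue has room.
# The quotes stop the local shell from expanding chem* against file names
# in the current directory before qsub gets to see it.
submit_chem() {
    qsub -q 'chem*' "$@"
}

# On the front end:
#   submit_chem myjob.sh
```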
To see for example how many free slots the queues have available:
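One way to get such an overview is the cluster queue summary of SGE 6, qstat -g c, which may or may not be the command originally shown here. The filter below is a sketch that trims the summary to queue name and available slots; the name free_slots is hypothetical, and finding the AVAIL column from the header line is an assumption about the output layout.

```shell
# free_slots: read "qstat -g c" output on stdin, print "<queue> <available slots>".
# The AVAIL column is located from the header line. The first column title
# ("CLUSTER QUEUE") is two words, hence the offset of one for the data rows.
free_slots() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "AVAIL") col = i - 1; next }
        /^-+$/  { next }
        col > 0 && NF >= col { print $1, $col }
    '
}

# On the front end:
#   qstat -g c | free_slots
```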
Available programs
All available program packages are installed under /home/opt; ls
that for a list. For now, the installed programs are:
- Turbomole 5-7-1 (serial and parallel)
- Gaussian03 (serial only)
If you have some other programs that you would like to share with other
people, i.e., install under /home/opt,
mail me. First, test that it works on magnesium, though.
Accessing the nodes
Ideally, you should not need to log on to the compute nodes of magnesium.
Running interactive jobs on the nodes is completely forbidden.
But there are situations when logging on to the nodes is
necessary, like checking files after a job crash/completion.
Anyway, there is no strict rule against logging in.
If you need to log on, use ssh mgXX, where XX identifies
the node and runs from 01 to 15, for example ssh mg01.
You should not be asked for a password when you log in. If the system does
ask for one, contact me.
NOTE! There is no need to log on to a node for following
job output and such. For this, there exists a special command,
dsh
(for distributed shell, or dancer's shell). dsh runs the
command you specify on a host you specify. You only need to tell
dsh which node (machine) you want to run on with the
-m parameter. An example:
- dsh -m mg01 "tail -f /wrk/users/borg/output.out"
You could of course do the same thing directly with ssh, if you prefer:
- ssh mg01 "tail -f /wrk/users/borg/output.out"
The advantage of dsh is that you can specify more than one machine
to perform an action on. With the -c switch, your command is
executed simultaneously on all specified machines. More info via man dsh.
There is a special parameter for executing a command on all nodes; -a.
For example, if you would like to see the main processes on all nodes, you
could execute one of these:
- dsh -a -M "export TERM=vt100; top -b -n 1 | grep -A 4 USER"
- dsh -a "export TERM=vt100; top -b -n 1 | grep $USER | head -4"
To quickly clean up (delete) all your files from all /wrk disks (note that
the front end is not included with the -a switch):
- dsh -a -M -c "rm -rf /wrk/users/${USER:?}/*"
For a list of all nodes as well as some info about them, you can
use qhost. With the -j switch (qhost -j), it's an alternative
to qstat.
Latest news and changelog
- 02 Oct 2005: All 15 nodes up. Enabled printing from the front end.
- 30 Sep 2005: Updated "largemem" host info. Updated kernels to 2.6.12-1.
- 15 Jun 2005: MPICH parallel jobs enabled.
- 13 Jun 2005: Added chem-* queues. Updated kernels to 2.6.11-9 as 2.6.8-11 panics.
- 11 Jun 2005: All 11 nodes up.
- 10 Jun 2005: /home was converted from XFS to ext3 just in case.
- 09 Jun 2005: Winter hibernation ceases, Mg awakens with one node up.
Comments
Questions, comments and oddity reports are welcome at
mikael.johansson@helsinki.fi
(RTFM reservation).
Mg in the night
Half a cluster on a warm summer's night.