HOWTO_Torque/Maui_-_grid_scheduler_and_resource_manager

Introduction

More and more Linux clusters are being built. To build a cluster, you need a lot of software, among which the most important are the scheduler and resource manager.

Many commercial schedulers come with an embedded resource manager. However, most open source schedulers don't, so you need to install a separate resource manager.

This HOWTO will show you how to install and configure Torque, a resource manager, and Maui, an advanced cluster scheduler. Both are open source.

Hardware setup & naming convention

I will set up a 4-node cluster, using 4 machines with different hardware setups. I have a mix of AMD Sempron and Intel Pentium III processors, and the amount of installed RAM ranges from 128MB to 1024MB. The only thing the nodes have in common is Gentoo.

I will use a P-III 750MHz/128MB laptop both to submit jobs and as a compute node, and a P-III 933MHz/512MB RAM machine as the master/head node. The last two machines will serve as compute nodes.

Hostname  Role                          Hardware
main      head node/master node         P-III 933MHz/512MB RAM
kitty     job submission/compute node   P-III 750MHz/128MB RAM Laptop
valinor   compute node
caladan   compute node

It is mandatory that your /etc/hosts file is set up correctly and that, on each line, the short hostname comes first, followed by the FQDN. As an alternative, you could set up a DNS server to resolve names in your network.
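A quick way to verify that ordering is to scan the hosts file for entries whose first name contains a dot. This is only a sketch: the function name fqdn_first is made up for this example, and the addresses and the example.com domain in the comment are placeholders, not values from this cluster.

```shell
#!/bin/sh
# Expected layout (addresses and domain are made-up placeholders):
#   192.168.0.1   main    main.example.com
#   192.168.0.2   kitty   kitty.example.com

# fqdn_first [FILE]: print "IP NAME" for every non-comment entry whose first
# name contains a dot, i.e. where the FQDN comes before the short hostname.
# FILE defaults to /etc/hosts.
fqdn_first() {
    awk '!/^[ \t]*#/ && NF >= 2 && $2 ~ /\./ { print $1, $2 }' "${1:-/etc/hosts}"
}
```

Calling fqdn_first with no argument checks /etc/hosts; no output means every entry starts with a short hostname.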

I will also use the shell variables $TORQUECFG and $MAUIDIR to refer to some directory paths; each one is defined in the step where it is first used.

Requirements

All clusters require cluster management software; I use xCAT. If you don't already have cluster management software installed, I suggest you follow my xCAT HOWTO (coming soon).

We're going to install Torque and Maui. The latest Maui release is in the portage tree, but the Torque ebuild is for an old version, so we will need to download and compile the latest version of Torque ourselves. Don't emerge anything yet!

Maui is free, but you must register on their website at http://www.clusterresources.com/product/maui/index.php to download it. Once you've got the Maui package, copy it to /usr/portage/distfiles.

Your nodes can obtain their IP addresses through DHCP or use static addresses; either way, name resolution must work, so make sure that your /etc/hosts files or DNS servers are properly set up.

Installation

Network and name resolution must work properly before starting the installation. Let's make sure that all our nodes are set up right - each node should be able to ping all other nodes.
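The ping check can be scripted so it is easy to repeat on every node. Here is a minimal sketch, assuming the four hostnames from the table above and the Linux iputils ping options -c (count) and -W (timeout in seconds); the function names probe and check_nodes are invented for this example.

```shell
#!/bin/sh
# Hostnames taken from the table in this HOWTO.
NODES="main kitty valinor caladan"

# probe HOST: succeed if HOST answers a single ping within 2 seconds.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# check_nodes: print one status line per node and return non-zero
# if at least one node did not answer.
check_nodes() {
    rc=0
    for n in $NODES; do
        if probe "$n"; then
            echo "$n ok"
        else
            echo "$n UNREACHABLE"
            rc=1
        fi
    done
    return $rc
}
```

Run check_nodes on each machine in turn; every node should report all four hosts as ok before you go on.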

Torque

Build

First, untar your Torque package and cd into the directory. Run ./configure --help to see all the options. By default, Torque builds with GUI support and sets the default server to the hostname of the machine on which you compile it. I will change the default spool directory by setting --with-server-home to /home/PBS_spool, enable the server, monitor and clients, and use scp as the file transfer tool.

Warning: It is imperative that the compiler be POSIX-compliant. GCC works just fine.
Warning: From now on $TORQUECFG=/home/PBS_spool


Code: Torque installation:
# export TORQUECFG=/home/PBS_spool
# tar xvzf torque-2.0.0p8.tar.gz
# cd torque-2.0.0p8
# ./configure --enable-server --enable-monitor\
  --enable-clients --with-server-home=$TORQUECFG\
  --with-scp
# make
# make install


Tip: You could also set --bindir=/usr/bin and --sbindir=/usr/sbin to have the binaries in the default system directory instead of /usr/local/bin and /usr/local/sbin respectively


If you have not followed the tip, check that /usr/local/bin and /usr/local/sbin are in your $PATH. If they aren't, add them (in /etc/profile, for example), then update your environment and source your profile so the new value takes effect:

Code: Updating the environment
# env-update && source /etc/profile
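To see whether the two directories are already in your $PATH, you can test for them directly. This is a small sketch in plain POSIX shell; the helper name in_path is made up for this example.

```shell
#!/bin/sh
# in_path DIR: succeed if DIR is one of the colon-separated entries of $PATH.
# Wrapping both sides in colons avoids matching a mere prefix of an entry.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report on the two directories Torque installs into by default.
for d in /usr/local/bin /usr/local/sbin; do
    if in_path "$d"; then
        echo "$d is in PATH"
    else
        echo "$d is missing from PATH"
    fi
done
```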

Configuration

We need to create the initial database which holds all the configuration, and then set the parameters for pbs_server. To accomplish that, start pbs_server with -t create, and then run qmgr, which gives us a prompt where we can set the parameters:

Code: Configuring pbs_server
# pbs_server -t create
# qmgr
Qmgr: set server operators = root@main
Qmgr: set server operators += pbsuser@main
Qmgr: create queue batch
Qmgr: set queue batch queue_type = Execution
Qmgr: set queue batch started = True
Qmgr: set queue batch enabled = True
Qmgr: set server default_queue = batch
Qmgr: set server resources_default.nodes = 1
Qmgr: set server scheduling = True
Qmgr: quit
#
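If you would rather not type the queue settings at the prompt, qmgr also reads commands from standard input. The sketch below writes the queue-related commands from the session above to a file and feeds them to qmgr; the file location is an arbitrary choice, the operator lines are left out on purpose, and it assumes a freshly created server database in which the batch queue does not exist yet.

```shell
#!/bin/sh
# Where to store the commands; this path is an arbitrary choice for the sketch.
QMGR_INPUT=${QMGR_INPUT:-/tmp/torque-batch-queue.qmgr}

# The quoted 'EOF' keeps the shell from touching the text; one command per line.
cat > "$QMGR_INPUT" <<'EOF'
create queue batch
set queue batch queue_type = Execution
set queue batch started = True
set queue batch enabled = True
set server default_queue = batch
set server resources_default.nodes = 1
set server scheduling = True
EOF

# Feed the file to qmgr if Torque is installed on this machine;
# otherwise just say where the commands were written.
if command -v qmgr >/dev/null 2>&1; then
    qmgr < "$QMGR_INPUT"
else
    echo "qmgr not found; commands saved in $QMGR_INPUT"
fi
```

Keeping the commands in a file also gives you a record of the settings if you ever have to recreate the database with pbs_server -t create.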

Also we need to tell pbs_server which machines it must contact:

File: $TORQUECFG/server_priv/nodes
valinor
caladan
kitty


Tip: If your compute nodes have more than one processor, just add np=X after the name, where X is the number of processors. Example: node001 np=2
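If you are not sure how many processors a node has, you can let the node count them and print its own line for the nodes file. This is a sketch that assumes a Linux /proc/cpuinfo with one line starting with "processor" per CPU; the helper name node_line is invented for this example.

```shell
#!/bin/sh
# node_line NAME [CPUINFO]: print a server_priv/nodes entry for NAME.
# CPUINFO defaults to /proc/cpuinfo; np=X is added only when X > 1.
node_line() {
    name=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # grep -c prints the number of matching lines (0 if there are none).
    n=$(grep -c '^processor' "$cpuinfo" 2>/dev/null)
    if [ "${n:-0}" -gt 1 ]; then
        echo "$name np=$n"
    else
        echo "$name"
    fi
}
```

For example, running node_line "$(hostname)" on a compute node prints the line to paste into $TORQUECFG/server_priv/nodes on the head node.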


OK. Now let's build the packages for the other nodes, copy them to the compute nodes and install them:

Code: Build and copy packages
# cd /tmp/torque-2.0.0p8
# make packages
# pscp compute torque-package-mom-linux-i686.sh
# pscp compute torque-package-clients-linux-i686.sh
# psh compute torque-package-clients-linux-i686.sh --install
# psh compute torque-package-mom-linux-i686.sh --install

It is also a good idea to check that the nodes know which host is the master:

Code:
# cat $TORQUECFG/server_name
main


The last part is to make sure we can stage data between the nodes. For that, the nodes must be able to ssh to each other without being prompted for a password.


Tip: You should read this excellent article about SSH key management, written by Daniel Robbins (drobbins@gentoo.org):
http://www-128.ibm.com/developerworks/library/l-keyc.html
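Once the keys are in place, you can confirm that no password prompt is left. The sketch below relies on the OpenSSH options BatchMode=yes (which makes ssh fail instead of asking for a password) and ConnectTimeout; the hostnames come from the table above, and the function names are made up for this example.

```shell
#!/bin/sh
# Hostnames taken from the table in this HOWTO.
NODES="main kitty valinor caladan"

# passwordless_ok HOST: succeed only if ssh can log in to HOST and run
# "true" without asking for a password.
passwordless_ok() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# report_ssh: print one line per node describing the result.
report_ssh() {
    for n in $NODES; do
        if passwordless_ok "$n" >/dev/null 2>&1; then
            echo "$n: key login works"
        else
            echo "$n: prompts for a password or is unreachable"
        fi
    done
}
```

Since data is staged between nodes, run report_ssh on every node as the user who will submit jobs, not just on the head node.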


Choose one compute node and edit its configuration file, then copy this file to all other nodes:

File: $TORQUECFG/mom_priv/config:
arch            x86
opsys           Gentoo
$logevent       255


Start pbs_mom on all compute nodes, then stop the server with qterm -t quick and restart it by running pbs_server. Wait a few moments and then check node availability with pbsnodes -a. You should see all the nodes listed with their attributes.
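The pbsnodes -a listing gets long once you have more than a few nodes. The filter below boils it down to one "node state" pair per line. It is a sketch that assumes the usual layout, with each node name at the start of a line and its attributes indented below it as "state = free" and so on; the function name node_states is made up for this example.

```shell
#!/bin/sh
# node_states: read `pbsnodes -a` output on standard input and print
# "NODE STATE" for each node.
node_states() {
    awk '/^[^ \t]/ { node = $1 }                      # unindented line: node name
         $1 == "state" && $2 == "=" { print node, $3 }  # indented "state = X" line
    '
}
```

Use it as pbsnodes -a | node_states; a compute node that shows anything other than free usually means pbs_mom is not running there or cannot reach the server.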


There you go! But your scheduler/resource manager installation is not done yet! :) If you submit a job and check the queue with qstat, you will see that the job stays in state Q (queued). This is because we don't have a scheduler yet. We need to set up Maui now, so it can interact with Torque and schedule our jobs.
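To try this out, you can submit a trivial job as a regular user, for example with echo "sleep 60" | qsub (qsub reads the job script from standard input), and then look at its state. The filter below is a sketch that assumes the default qstat listing: two header lines, then one line per job with the state in the second-to-last column. The function name job_states is invented for this example.

```shell
#!/bin/sh
# job_states: read default `qstat` output on standard input and print
# "JOBID STATE" for each job, skipping the two header lines.
job_states() {
    awk 'NR > 2 && NF >= 2 { print $1, $(NF - 1) }'
}
```

Use it as qstat | job_states. Until a scheduler is running, every job should show Q.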


Maui

Build

OK, with the Maui tarball in hand (you have to download it, as mentioned earlier), let's start the build process:

Code: Compiling maui:
# export MAUIDIR=/home/maui
# tar xvzf maui-3.2.6p13.tar.gz
# cd maui-3.2.6p13
# ./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
# make
# make install


If everything compiled OK, we need to add the Maui administrator ($MAUIADMIN, which is root in this setup) as a Torque manager. We are also going to set the default node count to one and the default walltime to 5 minutes (system-wide values):

Code: Add $MAUIADMIN as a manager:
# qmgr
Qmgr: set server managers += root@main
Qmgr: set server resources_default.nodect = 1
Qmgr: set server resources_default.walltime = 00:05:00
Qmgr: quit


Done! Maui is now integrated with Torque! All you have to do is start Maui (/usr/local/maui/sbin/maui) and check that it can show the queue (/usr/local/maui/bin/showq).

Configuration

No further configuration is needed, because the options we passed to the configure script have already taken care of it.

Conclusion

Maui integrates easily with Torque, and together they make a great scheduler/resource manager pair for your cluster running Gentoo! Too bad the portage tree doesn't have the latest releases of both packages: only Maui is up to date, and if you enable its Torque integration (pbs), it tries to emerge Torque, which is an old release.

Now that I have this installed, I will see if I can create an ebuild and submit it to the Gentoo guys.

Update: Torque 2.2.1 is available via portage since 11/24/07 (sys-cluster/torque).

Article created by : Paragao

Last modified: Tue, 19 Aug 2008 11:01:00 +0000