Son of Gridengine

Latest revision as of 10:12, 15 March 2019

Introduction

After Oracle bought Sun and then closed the source of the Sun Grid Engine, Open Grid Engine made a release based on SGEv5 in 2012, which they called GE2011. However, there were no further releases. Then ARC at the University of Liverpool started releasing its "Son of Gridengine" and has been maintaining updates to it at least as far as March 2016.

Until September 2016, the queue manager in the marvin cluster was GE2011, which was getting a bit old, so when the queue manager failed due to a corrupted database it was replaced with Son of Gridengine.

Steps

Administrative host setup

All nodes must be set up as administrative hosts, even though only the master appears to do anything "administrative".


Getting the XML::Simple Perl module

The RPMForge Extra repository is needed for this. RPMForge can be installed via an RPM, and afterwards the Extra branch of the repo must be enabled, as it is not enabled by default.

Note that it is best to disable the Extra branch again after all the RPMs have been installed.
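One way to avoid leaving the branch enabled is to switch it on for a single yum transaction only. A minimal sketch, assuming the branch's repo id is rpmforge-extras and the module is packaged as perl-XML-Simple (both are assumptions, check "yum repolist all" on the node); the function name is made up for illustration:

```shell
# Install packages with an otherwise-disabled repo enabled for this one
# yum transaction only, so nothing has to be disabled again afterwards.
install_from_repo() {
    # usage: install_from_repo REPO_ID PACKAGE...
    repo=$1
    shift
    yum --enablerepo="$repo" install -y "$@"
}

# Example (run as root; repo id and package name are assumptions):
#   install_from_repo rpmforge-extras perl-XML-Simple
```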


Install the Son of Gridengine RPMs

CentOS 7 requires the EPEL repo to be installed:

yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

And the following packages:

yum install jemalloc-3.6.0 lesstif-0.95.2 munge-libs-0.5.11 libdb4-utils-4.8.30


yum install -y gridengine-8.1.9-1.el6.x86_64.rpm gridengine-devel-8.1.9-1.el6.noarch.rpm gridengine-execd-8.1.9-1.el6.x86_64.rpm gridengine-qmaster-8.1.9-1.el6.x86_64.rpm gridengine-qmon-8.1.9-1.el6.x86_64.rpm


The gridengine RPMs come from https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/

For a worker node this is rather excessive, but the binary-check stage of the "install_execd" script requires all of these packages to be present.
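Before running the install script it may be worth confirming that all five packages actually got installed. A minimal sketch, with a made-up function name; it only relies on "rpm -q" exiting non-zero for a package that is not installed:

```shell
# Print the name of every given RPM package that is not installed.
# No output means everything is present.
missing_packages() {
    for pkg in "$@"; do
        rpm -q "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Example, using the package names from the install command above:
#   missing_packages gridengine gridengine-devel gridengine-execd \
#       gridengine-qmaster gridengine-qmon
```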

Copying the default/common directory over to the node

NOTE: The steps below are from Ramon, but when JW set up phylo, /opt/sge/default already existed, with the same dates and file sizes as on other nodes, so they may not be needed. DO CHECK PERMISSIONS. The user ought to be sgeadmin and the group gridware.

First the default directory must be created:

ssh nodeX 'mkdir /opt/sge/default'

Then copy the common directory across:

scp -r common nodeX:/opt/sge/default

Then fix the ownership:

chown -R sgeadmin.gridware /opt/sge
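Since the note above asks for permissions to be checked, a quick way to do that is to list anything under /opt/sge that is not owned by sgeadmin and gridware. A minimal sketch; the function name is made up for illustration:

```shell
# List every file under a directory whose owner or group differs from the
# expected ones. No output means the ownership is as expected.
check_ownership() {
    # usage: check_ownership DIR USER GROUP
    find "$1" \( ! -user "$2" -o ! -group "$3" \) -print
}

# Example, on the new node:
#   check_ownership /opt/sge sgeadmin gridware
```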

Edit /etc/bashrc

Add the following two lines to /etc/bashrc:


SGE_ROOT=/opt/sge; export SGE_ROOT;
PATH=/opt/sge/bin/lx-amd64:$PATH

Note: on the older CentOS 6 nodes the path is /opt/sge/bin/linux-x64, but on the newer CentOS 7 node it is /opt/sge/bin/lx-amd64.
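Because the directory name differs between the two kinds of node, the same /etc/bashrc fragment could be used everywhere if it picks whichever directory exists. A minimal sketch, assuming only the two directory names mentioned above; the function and variable names are made up for illustration:

```shell
# Sketch for /etc/bashrc: use whichever SGE binary directory exists on
# this node instead of hard-coding one of the two names.
SGE_ROOT=/opt/sge
export SGE_ROOT

sge_bin_dir() {
    # Print the first known architecture directory found under $1/bin.
    for arch in lx-amd64 linux-x64; do
        if [ -d "$1/bin/$arch" ]; then
            printf '%s\n' "$1/bin/$arch"
            return 0
        fi
    done
    return 1
}

# Only touch PATH when one of the directories was found.
if sge_bin=$(sge_bin_dir "$SGE_ROOT"); then
    PATH=$sge_bin:$PATH
    export PATH
fi
unset sge_bin
```

If neither directory exists, PATH is left untouched, so the fragment is harmless on a machine without SGE.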

Add the new server to the admin hosts

(following this: https://docs.oracle.com/cd/E19957-01/820-0697/i999062/index.html)

So for phylo, run this on marvin:

qconf -ah phylo

Then check that it has been added:

qconf -sh

The new host should appear in the output.
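The two commands above can be wrapped so that the check is not forgotten. A minimal sketch with a made-up function name; it assumes "qconf -sh" prints one host per line, possibly fully qualified:

```shell
# Add a host as an administrative host and confirm that it is now listed.
# Run on the master (marvin).
add_admin_host() {
    # usage: add_admin_host HOSTNAME
    qconf -ah "$1" || return 1
    # Match the name at the start of a line, followed either by the end of
    # the line or by a dot (for fully qualified names).
    qconf -sh | grep -Eq "^$1(\.|\$)"
}

# Example:
#   add_admin_host phylo && echo "phylo is an admin host"
```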


Move into /opt/sge/ and run install_execd. Follow the instructions above. For phylo all the defaults were accepted.


Don't forget to add the new host to the queue's host list:

qconf -mq interactive.q

and add the hostname to the end of the list of hosts.
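The same change can probably be made without opening an editor, using qconf's -aattr option to append to the queue's hostlist attribute. A minimal sketch; the function names are made up for illustration:

```shell
# Append a host to a queue's hostlist without opening an editor.
add_host_to_queue() {
    # usage: add_host_to_queue HOSTNAME QUEUE
    qconf -aattr queue hostlist "$1" "$2"
}

# Show the first line of a queue's hostlist (long lists may continue on
# following lines in the qconf -sq output).
show_queue_hosts() {
    # usage: show_queue_hosts QUEUE
    qconf -sq "$1" | grep '^hostlist'
}

# Example:
#   add_host_to_queue phylo interactive.q
#   show_queue_hosts interactive.q
```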

Administration

Creating a new parallel environment

  • Copy an existing parallel environment out to a file
  • Edit this file as you wish
  • Execute
qconf -Ap <my_pe_file>

An easy oversight is forgetting that the new parallel environment also needs to be added to the queue's configuration (its pe_list).
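The steps above, including the easily forgotten queue update, can be sketched as one function. This is only a sketch: the function name and all names passed to it are made up, and the only edit it makes to the copied file is the rename, so any further changes would have to go between the dump and the "qconf -Ap" call.

```shell
# Create a new parallel environment from an existing one and attach it to
# a queue.
clone_pe() {
    # usage: clone_pe EXISTING_PE NEW_PE QUEUE
    pe_file=$(mktemp) || return 1
    # Dump the existing PE to a file, changing only its name.
    qconf -sp "$1" | sed "s/^pe_name .*/pe_name $2/" > "$pe_file" || return 1
    # Register the new PE from the file, then add it to the queue's pe_list.
    qconf -Ap "$pe_file" && qconf -aattr queue pe_list "$2" "$3"
    status=$?
    rm -f "$pe_file"
    return "$status"
}

# Example (hypothetical PE names):
#   clone_pe smp smp_big interactive.q
```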

Queues and hostgroups

The dohfq.sh script accepts a rootname and a list of numbers. The rootname becomes the @rootname hostgroup and rootname.q for the queue. The numbers are the nodes to be included in the new queue: 0 is in fact marvin, 1 is node 1, etc.
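To illustrate the naming scheme only, here is a sketch that turns a rootname and a list of numbers into a hostgroup definition. It is not the dohfq.sh script itself: the function names are made up, the nodes are assumed to be called node1, node2 and so on, and creating the rootname.q queue is not shown.

```shell
# Map node numbers to hostnames: 0 is marvin, N is nodeN (assumed naming).
node_names() {
    for n in "$@"; do
        if [ "$n" -eq 0 ]; then
            printf 'marvin\n'
        else
            printf 'node%s\n' "$n"
        fi
    done
}

# Print a hostgroup definition for ROOTNAME and the given node numbers,
# in the two-line format that "qconf -Ahgrp FILE" reads.
hostgroup_definition() {
    # usage: hostgroup_definition ROOTNAME NUMBER...
    root=$1
    shift
    printf 'group_name @%s\n' "$root"
    printf 'hostlist %s\n' "$(node_names "$@" | tr '\n' ' ' | sed 's/ $//')"
}

# Example (hypothetical rootname):
#   hostgroup_definition big 0 1 3 > big.hgrp
#   qconf -Ahgrp big.hgrp
```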