Managing jobs remotely¶

Warning

Experimental code, use at your own risk.

The Manager encapsulates knowledge of how to run a job on a remote system. This basically means the host name, where are the scratch directories, what queuing system is running, etc. It simply uses ssh to communicate with the remote system and thus it is necessary to set up ssh with public key authentication to make this work smoothly.

The manager can move files between the local file system and the

remote scratch directories back and forth (using scp).

It can remotely launch a job (by running qsub).
It can also check on the progress by inspecting (in a primitive fashion) the log file of mdrun on the remote system.

The remote directory name is constructed in the following way:

topdir is stripped from the local working directory to give WDIR
scratchdir/WDIR is the directory on the remote system

get_manager() creates a Manager from a configuration file.

Configuration file¶

See Manager for how the values in the configuration file are used.

Example:

[DEFAULT]
name = leviathan

[local]
topdir = ~

[remote]
hostname = leviathan.petagrid.org
scratchdir = /scratch/username/Projects

[queuing_system]
name = PBS
qscript = leviathan.pbs
walltime = 24.0
start_cwd = True

All entries except walltime and start_cwd are required; walltime can be omitted or set to None.

DEFAULT section¶

name

identifier of the configuration; should also be the name of the configuration file, i.e. name.cfg

local section¶

topdir

path component under which the local files are stored; see below

remote section¶

hostname

fully qualified domain name of the host; used for running ssh hostname or scp FILES hostname:DIR

scratchdir

top level directory on the remote host udner which the working directories are constructed; see below for how this is done

queuing_system section¶

name

identifier for the queuing system (should be a valid python identifier)

qscript

default queuing system script template; store it in ~/.gromacswrapper/qscripts

walltime

maximum allowed run time on the system; job files are written in such a way that Gromacs stops run at 0.99 or walltime. If ommitted then the job runs until it is done (provided the queuing system policy allows that)

start_cwd

Set to True means that the queuing system requires the queuing system script to cd into the job directory; this seems to be a bug in some versions of PBS, which we can work-around in Manager.qsub()

Queuing system Manager¶

The Manager class must be customized for each system such as a cluster or a super computer. It then allows submission and control of jobs remotely (using ssh).

class gromacs.manager.Manager(name, dirname=None, **kwargs)¶

Base class to launch simulations remotely on computers with queuing systems.

Basically, ssh into machine and run job.

Derive a class from Manager and override the attributes

Manager._hostname (hostname of the machine)

Manager._scratchdir (all files and directories will be created under this scratch directory; it must be a path on the remote host)

Manager._qscript (the default queuing system script template)

Manager._walltime (if there is a limit to the run time of a job; in hours)

and implement a specialized Manager.qsub() method if needed.

ssh must be set up (via ~/.ssh/config) to allow access via a commandline such as

ssh <hostname> <command> ...

Typically you want something such as

host <hostname>
     hostname <hostname>.fqdn.org
     user     <remote_user>

in ~/.ssh/config and also set up public-key authentication in order to avoid typing your password all the time.

Set up the manager.

Arguments :	name configuration name (corresponds to a store cfg file) dirname directory component under the remote scratch dir (should be different for different jobs); the default is to strip topdir from the config file from the full path of the current directory prefix identifier for job names [MD]

job_done()¶: alias for get_status()

qstat()¶: alias for get_status()

cat(dirname, prefix='md', cleanup=True)¶

Concatenate parts of a run in dirname.

Always uses gromacs.cbook.cat() with resolve_multi = ‘guess’.

Note

The default is to immediately delete the original files (cleanup = True).

Keywords :	dirname directory to work in prefix prefix (deffnm) of the files [md] cleanup : boolean if `True`, remove all used files [`True`]

get(dirname, checkfile=None, targetdir='.')¶

scp -r dirname from host into targetdir

Arguments :	dirname: dir to download checkfile: raise OSError/ENOENT if targetdir/dirname/checkfile was not found targetdir: put dirname into this directory
Returns :	return code from scp

get_dir(*args)¶: Directory on the remote machine.

get_status(dirname, logfilename='md*.log', silent=False)¶

Check status of remote job by looking into the logfile.

Report on the status of the job and extracts the performance in ns/d if available (which is saved in Manager.performance).

Arguments :	dirname logfilename can be a shell glob pattern [md.log] silent* = True/False; True suppresses log.info messages
Returns :	`True` is job is done, `False` if still running `None` if no log file found to look at

Note

Also returns False if the connection failed.

Warning

This is an important but somewhat fragile method. It needs to be improved to be more robust.

local_get(dirname, checkfile, cattrajectories=True, cleanup=False)¶

Find checkfile locally if possible.

If checkfile is not found in dirname then it is transferred from the remote host.

If needed, the trajectories are concatenated using Manager.cat().

Returns :	local path of checkfile

log_RE = <_sre.SRE_Pattern object at 0x627b630>¶: Regular expression used by Manager.get_status() to parse the logfile from mdrun.

ndependent(runtime, performance=None, walltime=None)¶

Calculate how many dependent (chained) jobs are required.

Uses performance in ns/d (gathered from get_status()) and job max walltime (in hours) from the class unless provided as keywords.

n = ceil(runtime/(performance*0.99*walltime)

Keywords :	runtime length of run in ns performance ns/d with the given setup walltime maximum run length of the script (using 99% of it), in h
Returns :	n or 1 if walltime is unlimited

put(dirname)¶

scp dirname to host.

Arguments :	dirname to be transferred
Returns :	return code from scp

putfile(filename, dirname)¶

scp filename to host in dirname.

Arguments :	filename and dirname to be transferred to
Returns :	return code from scp

qsub(dirname, **kwargs)¶

Submit job remotely on host.

This is the most primitive implementation: it just runs the commands

cd remotedir && qsub qscript

on Manager._hostname. remotedir is dirname under Manager._scratchdir and qscript is the name of the queuing system script in remotedir.

Arguments :

dirname: directory, relative to the current one, under which the all job files reside (typically, this is also were the queuing system script qscript lives)
qscript: name of the queuing system script; defaults to the queuing system script hat was produced from the template Manager._qscript; searched in the current directory (.) and under dirname
remotedir: full path to the job directory on the remote system; in most cases it should be sufficient to let the programme choose the appropriate value based on dirname and the configuration of the manager

remotepath(*args)¶: Directory on the remote machine.

remoteuri(*args)¶: URI of the directory on the remote machine.

setup_MD(jobnumber, struct='MD_POSRES/md.pdb', **kwargs)¶

Set up production and transfer to host.

Arguments :	jobnumber: 1,2 ... struct is the starting structure (default from POSRES run but that is just a guess); kwargs are passed to `gromacs.setup.MD()`

setup_posres(**kwargs)¶

Set up position restraints run and transfer to host.

kwargs are passed to gromacs.setup.MD_restrained()

waitfor(dirname, **kwargs)¶

Wait until the job associated with dirname is done.

Super-primitive, uses a simple while ... sleep for seconds delay

Arguments :	dirname look for log files under the remote dir corresponding to dirname seconds delay in seconds during re-polling

Managing jobs remotely¶

Configuration file¶

DEFAULT section¶

local section¶

remote section¶

queuing_system section¶

Queuing system Manager¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

Managing jobs remotely¶

Configuration file¶

DEFAULT section¶

local section¶

remote section¶

queuing_system section¶

Queuing system Manager¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation