Batch/Queueing software packages for Unix
The effective transition from a network of computers to a unified
cluster providing distributed computing services is achieved through use
of specialized system management software. This document outlines the
cluster management software that is currently available.
More information on high performance and parallel computing can be found
at NHSE (National HPCC Software Exchange). Check their list of
Distributed Processing Tools.
Chart
http://www.erc.msstate.edu/thrusts/cps/hector/main/Compare.html
Commercial Systems
- Argonne National Laboratory's SP Scheduling System
works for clusters of workstations and the IBM SP system, which have different
scheduling requirements.
- Autosys
PLATINUM AutoSys provides powerful job scheduling and management for
distributed UNIX and Windows NT environments. AutoSys delivers event-driven
scheduling, centralized real-time monitoring, and programmable error recovery.
- CODINE
(COmputing in DIstributed Networked Environments) is a cluster management
package aimed at optimal utilization of the compute resources in
heterogeneous networked environments. Two methods of queue selection
are provided. The first is a simple first-come, first-served algorithm,
where the first queue in the list receives the job for execution. The
second method is to schedule by weighted load average within a group so
that the least busy node is selected to run the job. These methods of
selection are under the full control of the administrator. Resources are
managed to ensure that impact on a machine owner is minimal. CODINE can
suspend jobs if the console becomes active or the load average passes
a preset threshold. These jobs can then be migrated to other, less busy,
machines.
Availability :
CODINE may be purchased from:
GENIAS Software GmbH
Erzgebirgstr. 2B
D-93073 Neutraubling, Germany
++49 9401 9200-0
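The two queue-selection methods and the suspension rule described for CODINE can be sketched roughly as follows. This is only an illustration, not CODINE's actual code or interface: the `Queue` class, its `weight` and `free_slots` fields, and the function names are all hypothetical, and the sketch assumes that a queue must have a free slot before it can receive a job.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Queue:
    """Hypothetical view of one queue and the host behind it."""
    name: str
    load_avg: float        # current load average of the host
    weight: float = 1.0    # administrator-chosen weight (assumption)
    free_slots: int = 1    # assumption: a queue needs a free slot to take a job


def select_first_come(queues: List[Queue]) -> Optional[Queue]:
    """Method 1: the first queue in the list that can take the job gets it."""
    for q in queues:
        if q.free_slots > 0:
            return q
    return None


def select_least_loaded(queues: List[Queue]) -> Optional[Queue]:
    """Method 2: pick the queue with the lowest weighted load average."""
    candidates = [q for q in queues if q.free_slots > 0]
    return min(candidates, key=lambda q: q.load_avg * q.weight, default=None)


def should_suspend(console_active: bool, load_avg: float, threshold: float) -> bool:
    """Suspend when the console becomes active or the load passes the threshold."""
    return console_active or load_avg > threshold


queues = [Queue("alpha", 2.0), Queue("beta", 0.5), Queue("gamma", 1.0, weight=0.25)]
first = select_first_come(queues)
lightest = select_least_loaded(queues)
```

With these sample values the first method returns `alpha` simply because it is first in the list, while the second returns `gamma`, whose weighted load (1.0 * 0.25) is the smallest.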
- JP1/NQSEXEC by Hitachi.
JP1/System Base Facility and JP1/NQSEXEC represent a powerful combination,
providing batch job prioritization, queuing, scheduling and logging. Automatic
load balancing further improves efficiency.
- Load Leveler by IBM
features:
Distributed, full-function job scheduler;
Serial/parallel batch and interactive workload;
Workload balancing;
Multivendor UNIX implementations;
Central point of control for workload administration;
Full scalability across processors and jobs;
API to enable alternate scheduling algorithms;
- LSF
(Load Sharing Facility) provides two daemons which handle remote execution
and job scheduling in a heterogeneous UNIX environment. Batch, interactive
and parallel execution functionality are built on top of these daemons. LSF
is aimed at distributing the workload around one or more large clusters of
workstations and operates by moving jobs around the cluster so that each
machine has an even load. Jobs are dispatched to the host that has the
lightest load and also satisfies the job resource requirements set by the
system and/or user. LSF determines the lightest load by examining CPU
utilization, paging rates, number of login sessions, interactive idle
time, available virtual memory, available physical memory, and available
disk space in the /tmp directory.
Availability :
LSF may be purchased from
Platform Computing Corporation
203 College St., Suite 201
Toronto, Ontario M5T 1P9, Canada
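The dispatch rule described for LSF (choose the most lightly loaded host that also satisfies the job's resource requirements) can be illustrated with a small sketch. It does not use LSF's real API. The `HostLoad` and `JobRequirements` types are hypothetical, and the way the load indices are combined here (CPU utilization first, then paging rate, then login sessions) is an assumption made for illustration, since the text does not say how LSF weighs them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostLoad:
    """Hypothetical snapshot of the load indices listed in the text."""
    name: str
    cpu_util: float       # fraction of CPU in use, 0.0 to 1.0
    paging_rate: float    # pages per second
    logins: int           # number of login sessions
    idle_minutes: float   # interactive idle time
    free_swap_mb: int     # available virtual memory
    free_mem_mb: int      # available physical memory
    free_tmp_mb: int      # available disk space in /tmp


@dataclass
class JobRequirements:
    """Hypothetical resource requirements set by the system and/or user."""
    min_mem_mb: int = 0
    min_swap_mb: int = 0
    min_tmp_mb: int = 0


def eligible(host: HostLoad, req: JobRequirements) -> bool:
    """A host qualifies only if it meets every stated requirement."""
    return (host.free_mem_mb >= req.min_mem_mb
            and host.free_swap_mb >= req.min_swap_mb
            and host.free_tmp_mb >= req.min_tmp_mb)


def pick_host(hosts: List[HostLoad], req: JobRequirements) -> Optional[HostLoad]:
    """Return the lightest-loaded eligible host, or None if none qualifies."""
    candidates = [h for h in hosts if eligible(h, req)]
    # Assumed ordering: compare CPU use, then paging, then login sessions.
    return min(candidates,
               key=lambda h: (h.cpu_util, h.paging_rate, h.logins),
               default=None)


hosts = [
    HostLoad("small", 0.10, 0.0, 0, 30.0, 128, 64, 50),
    HostLoad("big", 0.50, 2.0, 1, 5.0, 1024, 512, 500),
]
```

Here `pick_host(hosts, JobRequirements())` selects `small` because it is the least loaded, whereas a job requiring 256 MB of memory is sent to `big`, the only host that qualifies.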
- Maestro by Unison.
Network Workload Scheduling for
UNIX & Microsoft Windows NT.
Features:
Provides rich feature set for initiating jobs & schedules;
Manages single systems or hundreds of systems simultaneously
across a mixed network;
Provides graphical & command line interfaces for monitoring,
including viewing through frameworks;
Integration with leading systems management frameworks &
business applications.
- Network Queueing Environment (NQE)
Cray Network Queuing Environment (tm) is a workload
management environment that provides batch scheduling and
interactive load balancing, allowing customers to maximize the
utilization of their computational investments by managing, scheduling,
and controlling their entire enterprise-wide workload.
- QMASTER
is a client-server process management system designed for queuing distribution,
print spooling, document management, and control of batch processes across an
integrated heterogeneous network.
- System Management Products (JP1/CS1).
The Hitachi offering to Open systems platforms is the JP1 suite of products,
followed up closely with the Clustering Systems management Partner 1 (CS1).
JP1 addresses the lack of adequate and flexible systems management in the Open
Systems environment, specifically for the Sun Solaris, IBM AIX, HP-UX, and
Windows NT platforms. This suite provides load balancing (across the defined
UNIX heterogeneous systems), batch operations, job (and script) scheduling,
smart printer sharing, automatic (and manual) backup and restore, performance
monitoring, and automated operations. The CS1 suite takes load balancing one
step further, and can make the bits and pieces of each system collectively
look like one virtual system.
- Task Broker
is a software tool that distributes computational tasks among heterogeneous
UNIX-system-based computer systems. Task Broker performs its computational
distribution without requiring any changes to the application. Task Broker
will relocate a job and its data according to rules set up at initialization.
Research Systems
- CCS Computing Center Software
The Computing Center Software is a distributed software
package for the management of parallel high performance
computing systems. It provides a seamless environment with
transparent access to a pool of parallel machines.
CCS is implemented as a multi-agent software, operating on
the Unix front-ends of the HPC systems to be managed.
- Condor
The goal of the Condor project is to develop, implement, deploy, and evaluate
mechanisms and policies that support High Throughput Computing (HTC) on
large collections of distributively owned computing resources. Guided by
both the technological and sociological challenges of such a computing
environment, the Condor Team has been building software tools that enable
scientists and engineers to increase their computing throughput.
- Distributed Job Manager (DJM)
is a job scheduling system designed to allow you to use
your massively parallel processor (MPP) more efficiently. DJM has demonstrated
improved MPP utilization of over 50% at some sites.
- DQS
(Distributed Queueing System) is a cluster management package which
provides the user with a batch environment containing different queues
based on architecture and group. All jobs are submitted to individual
queues to await execution. There are two methods of scheduling possible.
The first is to schedule on a first-come, first-served basis, where the first
queue in the list receives the job for execution. The second method is
to schedule by weighted load average so that the least busy node is
selected to run the job. The method used is selected at compile time.
The impact on an owner's machine is minimized by restricting the number
of jobs that can be run on the machine and by suspending jobs when
user activity is detected.
Availability :
DQS is in the public domain and is available for
anonymous ftp or from:
DQS (c/o Tom Green)
Supercomputer Computations Research Institute
Florida State University, B-186
Tallahassee, Florida 32306
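The way DQS limits its impact on a workstation owner (a cap on the number of jobs per machine, plus suspension when user activity is detected) can be sketched as below. This is a hypothetical model, not DQS code: the `Workstation` class and its method names are invented for illustration, and resuming jobs once the user is idle again is an assumption, since the text only mentions suspension.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workstation:
    """Hypothetical model of one owner's machine."""
    name: str
    max_jobs: int                                  # cap set by the administrator
    running: List[str] = field(default_factory=list)
    suspended: List[str] = field(default_factory=list)

    def can_accept(self) -> bool:
        """Suspended jobs still count against the per-machine limit."""
        return len(self.running) + len(self.suspended) < self.max_jobs

    def start(self, job: str) -> bool:
        """Start a job only if the machine is below its job limit."""
        if not self.can_accept():
            return False
        self.running.append(job)
        return True

    def on_user_activity(self, active: bool) -> None:
        """Suspend everything when the owner is active; resume when idle (assumed)."""
        if active:
            self.suspended.extend(self.running)
            self.running.clear()
        else:
            self.running.extend(self.suspended)
            self.suspended.clear()


ws = Workstation("desk01", max_jobs=2)
accepted = [ws.start("job1"), ws.start("job2"), ws.start("job3")]
ws.on_user_activity(True)
```

In this example the third job is refused because the limit of two is reached, and after the owner becomes active both accepted jobs sit in `suspended` until `on_user_activity(False)` is called.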
- Hector (HEterogeneous Computing Task allocatOR)
is designed to run
MPI parallel programs on multiple workstations. By maintaining a database of
the performance of available workstations and by observing the run-time
performance of various tasks, it automatically allocates and migrates tasks to
maximize performance. It supports fault tolerance, job suspension, and many
other useful features.
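The general idea of allocating tasks from a database of observed workstation performance can be sketched as follows. This is not Hector's algorithm, which the text does not describe in detail. The sketch swaps in a simple greedy earliest-finish heuristic, and the `PerformanceDB` class, the smoothing factor `alpha`, and the notion of "work units" are all assumptions made for illustration.

```python
from typing import Dict


class PerformanceDB:
    """Hypothetical store of estimated host speeds (work units per second)."""

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha                 # weight given to the newest observation
        self.speed: Dict[str, float] = {}

    def observe(self, host: str, work_units: float, seconds: float) -> None:
        """Fold one run-time measurement into the estimate for a host."""
        rate = work_units / seconds
        old = self.speed.get(host)
        if old is None:
            self.speed[host] = rate
        else:
            self.speed[host] = self.alpha * rate + (1 - self.alpha) * old


def allocate(tasks: Dict[str, float], db: PerformanceDB) -> Dict[str, str]:
    """Greedy sketch: largest task first, each to the host that finishes it earliest.

    Assumes at least one host has been observed.
    """
    finish = {host: 0.0 for host in db.speed}   # estimated busy time per host
    plan: Dict[str, str] = {}
    for name, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
        host = min(finish, key=lambda h: finish[h] + work / db.speed[h])
        finish[host] += work / db.speed[host]
        plan[name] = host
    return plan


db = PerformanceDB()
db.observe("fast", 100, 10)    # estimated 10 units per second
db.observe("slow", 100, 50)    # estimated 2 units per second
plan = allocate({"t1": 100, "t2": 20}, db)
```

With these measurements the large task `t1` goes to `fast`, and the small task `t2` goes to `slow`, because `fast` is already busy with `t1` and `slow` would finish `t2` sooner. A later `observe` call changes the estimates, which is where a migration decision could be re-evaluated.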
- NetSolve
NetSolve is a project being developed at the University of Tennessee and
at the Oak Ridge National Laboratory.
The motivation behind NetSolve was to devise a fast, efficient,
easy-to-use system to effectively solve large computational problems,
regardless of the type of computer one happens to be using. Issues such as
networking, heterogeneity, portability, numerical computing, fault
tolerance, and load balancing are all dealt with by the system, freeing the
user to focus on other aspects of the application. NetSolve has been
designed to overcome hardware and software restrictions so that resources
can be available to any user anywhere on the network.
- Ninf
is an ongoing global network-wide computing infrastructure
project which allows users to access computational resources
including hardware, software and scientific data distributed across
a wide area network with an easy-to-use interface. Ninf is
intended not only to exploit high performance in network parallel
computing, but also to provide high quality numerical computation
services and access to scientific databases published by other
researchers. Computational resources are shared as Ninf remote
libraries executable at a remote Ninf server. Users can build an
application by calling the libraries with the Ninf Remote
Procedure Call, which is designed to provide a programming
interface similar to conventional function calls in existing
languages, and is tailored for scientific computation. In order to
facilitate location transparency and network-wide parallelism,
the Ninf metaserver maintains global resource information regarding
computational servers and databases, allocating and scheduling
coarse-grained computation to achieve good global load balancing.
Ninf also interfaces with existing network services such as the
WWW for easy accessibility.
- NQS
Generic NQS is one of the world's leading freely-available batch processing
systems. Based on the de facto NQS standards, and inter-operable with
commercial NQS products, Generic NQS combines important features such as
cluster-wide dynamic scheduling with robustness, ease of installation, and
availability across a large number of UNIX-like platforms, including System V
Release 4, Linux, and BSD 4.3.
- Prospero Resource Manager (PRM)
supports the allocation of processing resources in large distributed systems,
enabling users to run sequential and parallel applications on processors
connected by local or wide-area networks. PRM is now being developed as part
of the Scalable Computing Infrastructure (SCOPE) project at the Information
Sciences Institute of the University of Southern California.
- WebSubmit (by NIST)
is an advanced Intranet application that provides a web page interface to
supercomputing applications. It differs from other web applications because it
allows interaction with a user's data files and directories on the target
supercomputer as if the user were logged on. The advantage of a web-based
interface is that it is hardware and software independent; it depends only
on whatever web browser the user has available.
All of the web pages are dynamically generated with CGI scripts written in Tcl.
- Portable Batch System (PBS)
- EASY
- The far Project
has developed a software tool specifically to facilitate the exploitation
of the spare processing capacity of UNIX workstations.
Other info pages