Batch/Queueing software packages for Unix

The effective transition from a network of computers to a unified cluster providing distributed computing services is achieved through use of specialized system management software. This document outlines the cluster management software that is currently available.

More information on high performance and parallel computing can be found at NHSE (National HPCC Software Exchange). Check their list of Distributed Processing Tools.

Chart
http://www.erc.msstate.edu/thrusts/cps/hector/main/Compare.html

Commercial Systems

Argonne National Laboratory's SP Scheduling System works for clusters of workstations and the IBM SP system, which have different scheduling requirements.
AutoSys. PLATINUM AutoSys provides powerful job scheduling and management for distributed UNIX and Windows NT environments. AutoSys delivers event-driven scheduling, centralized real-time monitoring, and programmable error recovery.
CODINE (COmputing in DIstributed Networked Environments) is a cluster management package aimed at optimal utilization of the compute resources in heterogeneous networked environments. Two methods of queue selection are provided. The first is a simple first-come, first-served algorithm, where the first queue in the list receives the job for execution. The second method is to schedule by weighted load average within a group, so that the least busy node is selected to run the job. These methods of selection are under the full control of the administrator. Resources are managed to ensure that the impact on a machine owner is minimal. CODINE can suspend jobs if the console becomes active or the load average passes a preset threshold. These jobs can then be migrated to other, less busy machines.
Availability : CODINE may be purchased from:
```
GENIAS Software GmbH
Erzgebirgstr. 2B
D-93073 Neutraubling, Germany
++49 9401 9200-0
```
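The two queue selection methods described for CODINE can be sketched as follows. This is a minimal illustration, not CODINE's actual implementation: the `Queue` class, its `weight` and `available` fields, and both function names are hypothetical, and the weighting is assumed to be a simple multiplier on the load average.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    load_average: float      # assumed: recent load average of the queue's host
    weight: float = 1.0      # assumed: administrator-set weighting factor
    available: bool = True   # assumed: whether the queue can accept a job


def select_first_come_first_served(queues):
    """Method 1: the first available queue in list order receives the job."""
    for q in queues:
        if q.available:
            return q
    return None


def select_by_weighted_load(queues):
    """Method 2: the available queue with the lowest weighted load receives the job."""
    candidates = [q for q in queues if q.available]
    if not candidates:
        return None
    return min(candidates, key=lambda q: q.load_average * q.weight)


queues = [
    Queue("alpha", load_average=2.5),
    Queue("beta", load_average=0.4),
    Queue("gamma", load_average=1.0, weight=0.2),
]
print(select_first_come_first_served(queues).name)
print(select_by_weighted_load(queues).name)
```

With the sample data, the first method picks the queue at the head of the list regardless of load, while the second picks the queue whose load average times weight is smallest.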
JP1/NQSEXEC by Hitachi. JP1/System Base Facility and JP1/NQSEXEC represent a powerful combination, providing batch job prioritization, queuing, scheduling and logging. Automatic load balancing further improves efficiency.
Load Leveler by IBM features: Distributed, full-function job scheduler; Serial/parallel batch and interactive workload; Workload balancing; Multivendor UNIX implementations; Central point of control for workload administration; Full scalability across processors and jobs; API to enable alternate scheduling algorithms.
LSF (Load Sharing Facility) provides two daemons which handle remote execution and job scheduling in a heterogeneous UNIX environment. Batch, interactive and parallel execution functionality is built on top of these daemons. LSF is aimed at distributing the workload around one or more large clusters of workstations and operates by moving jobs around the cluster so that each machine has an even load. Jobs are dispatched to the host that has the lightest load and also satisfies the job resource requirements set by the system and/or user. LSF determines the lightest load by examining CPU utilization, paging rates, number of login sessions, interactive idle time, available virtual memory, available physical memory, and available disk space in the /tmp directory.
Availability : LSF may be purchased from
```
Platform Computing Corporation
203 College St., Suite 201
Toronto, Ontario M5T 1P9, Canada
```
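The dispatch rule described for LSF (pick the most lightly loaded host among those that satisfy the job's resource requirements) can be sketched as below. This is an illustration under stated assumptions, not LSF's algorithm or API: the `Host` and `JobRequirements` classes and the `pick_host` function are hypothetical, only four of the listed load indices are modelled, and ranking by CPU utilization with paging rate as a tie-breaker is an assumed way of combining them.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_util: float      # fraction of CPU in use, 0.0 to 1.0
    paging_rate: float   # pages per second
    free_mem_mb: int     # available physical memory
    free_tmp_mb: int     # available disk space in /tmp


@dataclass
class JobRequirements:
    min_mem_mb: int = 0
    min_tmp_mb: int = 0


def eligible(host, req):
    """A host qualifies only if it meets every resource requirement."""
    return host.free_mem_mb >= req.min_mem_mb and host.free_tmp_mb >= req.min_tmp_mb


def pick_host(hosts, req):
    """Return the eligible host with the lightest load, or None if none qualifies."""
    candidates = [h for h in hosts if eligible(h, req)]
    if not candidates:
        return None
    # Assumed ordering: CPU utilization first, paging rate as a tie-breaker.
    return min(candidates, key=lambda h: (h.cpu_util, h.paging_rate))


hosts = [
    Host("ws1", cpu_util=0.10, paging_rate=5.0, free_mem_mb=64, free_tmp_mb=200),
    Host("ws2", cpu_util=0.35, paging_rate=1.0, free_mem_mb=512, free_tmp_mb=900),
    Host("ws3", cpu_util=0.80, paging_rate=0.0, free_mem_mb=1024, free_tmp_mb=50),
]
job = JobRequirements(min_mem_mb=256, min_tmp_mb=100)
chosen = pick_host(hosts, job)
print(chosen.name if chosen else "no eligible host")
```

Filtering before ranking matters here: the least loaded host in the sample (ws1) is skipped because it lacks the memory the job asks for.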
Maestro by Unison. Network Workload Scheduling for UNIX & Microsoft Windows NT. Features: Provides rich feature set for initiating jobs & schedules; Manages single systems or hundreds of systems simultaneously across a mixed network; Provides graphical & command line interfaces for monitoring, including viewing through frameworks; Integration with leading systems management frameworks & business applications.
Network Queueing Environment (NQE). Cray Network Queuing Environment (tm) is the workload management environment that provides batch scheduling and interactive load balancing, allowing customers to maximize the utilization of their computational investments by managing, scheduling, and controlling their entire enterprise-wide workload.
QMASTER is a client-server process management system designed for queuing distribution, print spooling, document management, and control of batch processes across an integrated heterogeneous network.
System Management Products (JP1/CS1). The Hitachi offering to Open systems platforms is the JP1 suite of products, followed up closely with the Clustering Systems management Partner 1 (CS1). JP1 addresses the lack of adequate and flexible systems management in the Open Systems environment, specifically for the Sun Solaris, IBM AIX, HP-UX, and Windows NT platforms. This suite provides load balancing (across the defined UNIX heterogeneous systems), batch operations, job (and script) scheduling, smart printer sharing, automatic (and manual) backup and restore, performance monitoring, and automated operations. The CS1 suite takes load balancing one step further, and can make the bits and pieces of each system collectively look like one virtual system.
Task Broker is a software tool that distributes computational tasks among heterogeneous UNIX-system-based computer systems. Task Broker performs its computational distribution without requiring any changes to the application. Task Broker will relocate a job and its data according to rules set up at initialization.

Research Systems

CCS (Computing Center Software) is a distributed software package for the management of parallel high performance computing systems. It provides a seamless environment with transparent access to a pool of parallel machines. CCS is implemented as multi-agent software, operating on the Unix front-ends of the HPC systems to be managed.
Condor. The goal of the Condor project is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput.
Distributed Job Manager (DJM) is a job scheduling system designed to allow you to use your massively parallel processor (MPP) more efficiently. DJM has demonstrated improved MPP utilization of over 50% at some sites.
DQS (Distributed Queueing System) is a cluster management package which provides the user with a batch environment containing different queues based on architecture and group. All jobs are submitted to individual queues to await execution. Two methods of scheduling are possible. The first is to schedule on a first-come, first-served basis, where the first queue in the list receives the job for execution. The second method is to schedule by weighted load average, so that the least busy node is selected to run the job. The method used is selected at compile time. The impact on an owner's machine is minimized by restricting the number of jobs that can be run on the machine and by suspending jobs when user activity is detected.
Availability : DQS is in the public domain and is available for anonymous ftp or from:
```
DQS (c/o Tom Green)
Supercomputer Computations Research Institute
Florida State University, B-186
Tallahassee, Florida 32306
```
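The two ways DQS is said to limit the impact on a workstation owner (a cap on the number of jobs, and suspension when user activity is detected) can be sketched as a small state model. Everything here is hypothetical: the `Workstation` class and its method names are invented for illustration, and resuming jobs once the user is idle again is an assumption that the description above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Workstation:
    max_jobs: int                                  # administrator-set job limit
    running: list = field(default_factory=list)
    suspended: list = field(default_factory=list)

    def can_accept(self):
        """Suspended jobs still count against the limit in this sketch."""
        return len(self.running) + len(self.suspended) < self.max_jobs

    def start(self, job):
        """Start a job if the limit allows; return whether it was started."""
        if not self.can_accept():
            return False
        self.running.append(job)
        return True

    def on_user_activity(self):
        """Suspend every running job when the owner starts using the machine."""
        self.suspended.extend(self.running)
        self.running.clear()

    def on_user_idle(self):
        """Assumed behaviour: resume suspended jobs once the owner is idle."""
        self.running.extend(self.suspended)
        self.suspended.clear()


ws = Workstation(max_jobs=2)
print(ws.start("job-1"), ws.start("job-2"), ws.start("job-3"))
ws.on_user_activity()
print(ws.running, ws.suspended)
```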
Hector (HEterogeneous Computing Task allocatOR) is designed to run MPI parallel programs on multiple workstations. By maintaining a database of the performance of available workstations and by observing the run-time performance of various tasks, it automatically allocates and migrates tasks to maximize performance. It supports fault tolerance, job suspension, and many other useful features.
NetSolve is a project being developed at the University of Tennessee and at the Oak Ridge National Laboratory. The motivation behind NetSolve was to devise a fast, efficient, easy-to-use system to effectively solve large computational problems, regardless of the type of computer one happens to be using. Issues such as networking, heterogeneity, portability, numerical computing, fault tolerance, and load balancing are all dealt with by the system, freeing the user to focus on other aspects of the application. NetSolve has been designed to overcome hardware and software restrictions so that resources can be available to any user anywhere on the network.
Ninf is an ongoing global network-wide computing infrastructure project which allows users to access computational resources, including hardware, software and scientific data, distributed across a wide area network with an easy-to-use interface. Ninf is intended not only to exploit high performance in network parallel computing, but also to provide high quality numerical computation services and access to scientific databases published by other researchers. Computational resources are shared as Ninf remote libraries executable at a remote Ninf server. Users can build an application by calling the libraries with the Ninf Remote Procedure Call, which is designed to provide a programming interface similar to conventional function calls in existing languages, and is tailored for scientific computation. In order to facilitate location transparency and network-wide parallelism, the Ninf metaserver maintains global resource information regarding computational servers and databases, allocating and scheduling coarse-grained computation to achieve good global load balancing. Ninf also interfaces with existing network services such as the WWW for easy accessibility.
NQS. Generic NQS is one of the world's leading freely-available batch processing systems. Based on the de facto NQS standards, and inter-operable with commercial NQS products, Generic NQS combines important features such as cluster-wide dynamic scheduling with robustness, ease of installation, and availability across a large number of UNIX-like platforms, including System 5 Release 4, Linux, and BSD 4.3.
Prospero Resource Manager (PRM) supports the allocation of processing resources in large distributed systems, enabling users to run sequential and parallel applications on processors connected by local or wide-area networks. PRM is now being developed as part of the Scalable Computing Infrastructure (SCOPE) project at the Information Sciences Institute of the University of Southern California.
WebSubmit (by NIST) is an advanced Intranet application that provides a web page interface to supercomputing applications. It differs from other web applications because it allows interaction with a user's data files and directories on the target supercomputer as if the user were logged on. The advantage of a web-based interface is that it is hardware and software independent; it depends only on whatever web browser the user has available. All of the web pages are dynamically generated with CGI scripts written in Tcl.
Portable Batch System (PBS)
EASY
The far Project has developed a software tool specifically to facilitate the exploitation of the spare processing capacity of UNIX workstations.
