Process Load Distribution
Distributed
scheduling is concerned with distributing the load of a system among the
available resources in a manner that improves overall system performance and
maximises resource utilisation.
The
basic idea is to transfer tasks from heavily loaded machines to idle or lightly
loaded ones.
Load
distributing algorithms can be classified as static, dynamic or adaptive. Static means decisions on
assignment of processes to processors are hardwired into the algorithm, using a
priori knowledge, perhaps gained from analysis of a graph model of the
application. Dynamic algorithms gather system state information to make
scheduling decisions and so can exploit under-utilised resources of the system
at run time but at the expense of having to gather system information. An
adaptive algorithm changes the parameters of its algorithm to adapt to system
loading conditions. It may reduce the amount of information required for
scheduling decisions when system load or communication is high.
Most
models assume that the system is fully
connected and that every processor can communicate, perhaps in a number of
hops, with every other processor. In practice, because of communication
latency, load scheduling is more practical on tightly coupled networks of
homogeneous processors when the workload involves communicating tasks. Such
tasks might then also take advantage of multicast or broadcast features of the
communication mechanism.
Processor allocation strategies can be non-migratory (non-preemptive) or
migratory (preemptive).
Process migration involves the transfer of a running process from one host to
another. This is an expensive and potentially difficult operation to implement.
It involves packaging the task's virtual memory, threads and control block, I/O
buffers, messages, signals and other OS resources into messages which are sent
to the remote host, which uses them to reinitiate the process. Non-migratory
algorithms involve the transfer of tasks which have not yet begun, and so do
not have this state information, which reduces the complexities of maintaining
transparency.
Resource
queue lengths and particularly CPU queue
length are good indicators of load as they correlate well with the task
response time; queue length is also very easy to measure. There is a danger,
though, in basing scheduling decisions on it too simplistically. For example, a number of
remote hosts could observe simultaneously that a particular site had a small
CPU queue length and could initiate a number of process transfers. This may
result in that site becoming flooded with processes and its first reaction
might be to try to move them elsewhere. As migration is an expensive operation,
we do not want to waste resources (cpu time and communication bandwidth) by
making poor choices which result in increased migration activity.
A load distributing algorithm typically has four components: transfer,
selection, location and information policies.
Transfer Policy
When
a process is about to be created, it could run on the local machine or be
started elsewhere. Bearing in mind that migration is expensive, a good initial
choice of location for a process could eliminate the need for future system
activity. Many policies operate by using a threshold. If the machine's load is
below the threshold then it acts as a potential receiver for remote tasks. If
the load is above the threshold, then it acts as a sender for new tasks. Local
algorithms using thresholds are simple but may be far from optimal.
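The threshold rule can be sketched in a few lines of Python; the threshold value and the function name are illustrative assumptions, not part of any particular system:

```python
# A minimal sketch of a threshold transfer policy (names and the
# threshold value are assumed for illustration).
THRESHOLD = 4  # e.g. CPU queue length

def transfer_role(local_load: int, threshold: int = THRESHOLD) -> str:
    """Classify this node: below the threshold it is a potential
    receiver of remote tasks; above it, a sender of new tasks."""
    if local_load < threshold:
        return "receiver"
    if local_load > threshold:
        return "sender"
    return "neutral"

print(transfer_role(2))  # receiver
print(transfer_role(7))  # sender
```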
Process Selection Policy
A
selection policy chooses a task for transfer. This decision will be based on
the requirement that the overhead involved in the transfer will be compensated
by an improvement in response time for the task and/or the system. Some means
of knowing that the task is long-lived will be necessary to avoid needless
migration. This could be based perhaps on past history.
A
number of other factors could influence the decision. Transferring the task's
memory space is the dominant cost of migration, so small tasks are better
suited. For efficiency, the number of location dependent calls made by the
chosen task should also be minimal, because these must be mapped home
transparently. Resources such as a window or mouse may only be available at the
task's originating site.
Site Location Policy
Once
the transfer policy has decided to send a particular task, the location policy
must decide where the task is to be sent. This will be based on information
gathered by the information policy.
Polling
is a widely used sender-initiated technique.
A site polls other sites serially or in parallel to determine if they are
suitable sites for a transfer and/or if they are willing to accept a transfer.
Nodes could be selected at random for polling, or chosen more selectively based
on information gathered during previous polls. The number of sites polled may
vary.
A receiver-initiated scheme depends on
idle machines to announce their availability for work. The goal of the idle
site is to find some work to do. An interesting idea is for it to offer to do
work at a price, leaving the sender to make a cost/performance decision in
relation to the task to be migrated.
Information Policy
The
information policy decides what information about the states of other nodes
should be collected and where it should be collected from. There are a number
of approaches:
Demand Driven
A node collects the state of
other nodes only when it wishes to become involved in either
sending or receiving tasks, using sender initiated
or receiver initiated polling schemes.
Demand driven policies are inherently adaptive and
dynamic as their actions depend on the
system state.
Periodic
Nodes exchange information at
fixed intervals. These policies do not adapt their activity to
system state, but each site will have a substantial
history over time of global resource usage to
guide location algorithms. Note that the benefits of
load distribution are minimal at high
system loads and the periodic exchange of
information therefore may be an unnecessary
overhead.
State-Change Driven
Nodes disseminate state
information whenever their state changes by a certain amount. This
state information could be sent to a centralised
load scheduling point or to peers.
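As an illustration of the state-change driven approach, the sketch below disseminates load state only when it drifts by at least a chosen amount from the last reported value. Names are assumed, and the message transport is replaced by a simple list:

```python
class StateChangeReporter:
    """State-change driven policy sketch: disseminate load state only
    when it drifts by at least `delta` from the last reported value.
    (A real system would send to peers or a central scheduling point
    instead of appending to a list.)"""

    def __init__(self, delta: float):
        self.delta = delta
        self.last_reported = None
        self.sent = []  # stand-in for messages to peers

    def observe(self, load: float):
        if (self.last_reported is None
                or abs(load - self.last_reported) >= self.delta):
            self.sent.append(load)   # disseminate the new state
            self.last_reported = load

r = StateChangeReporter(delta=2.0)
for load in [5.0, 5.5, 6.9, 8.0, 8.4]:
    r.observe(load)
print(r.sent)  # [5.0, 8.0] -- only significant changes are disseminated
```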
Design Issues for Processor Allocation Algorithms
The
major design decisions can be summed up as follows:
Deterministic versus Heuristic
Deterministic
algorithms are appropriate when everything about process behaviour is known in
advance. If the computing, memory and communication requirements of all
processes can be established, then a graph may be constructed depicting the
system state. The problem is to partition the graph into a number of subgraphs
according to a stated policy so that each subgraph of processes is mapped onto
one machine. This is achieved subject to the resource constraints imposed by
each machine.
Heuristics
refer to principles used in making decisions when all possibilities cannot be
fully explored. For example, consider a site location algorithm where a machine
sends out a probe to a randomly chosen machine, asking if its workload is below
some threshold. If not, it probes another and if no suitable host is found
within N probes, the algorithm terminates and the process runs on the
originating machine.
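This probing heuristic can be sketched as follows; the peer names, the load lookup and the parameter values are assumptions for illustration:

```python
import random

def find_host(peers, get_load, threshold, max_probes):
    """Probe up to `max_probes` randomly chosen peers; return the first
    whose load is below `threshold`, or None (run locally).  `get_load`
    stands in for an actual probe message/RPC."""
    candidates = list(peers)
    random.shuffle(candidates)          # random selection of probe targets
    for peer in candidates[:max_probes]:
        if get_load(peer) < threshold:  # probe says workload is acceptable
            return peer
    return None  # no suitable host within N probes: execute locally

loads = {"m1": 9, "m2": 3, "m3": 8}
host = find_host(loads, loads.get, threshold=5, max_probes=3)
print(host)  # m2 -- the only peer below the threshold
```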
Centralised versus Distributed Algorithms
Centralised
algorithms gather all information to a single location before reaching a
decision. Although better decisions can be made with access to global
information, the approach is not scaleable because it generates a bottleneck
and is not as resilient to failure as decentralised algorithms. A hierarchical
organisation is one approach to decentralisation.
Optimal versus Suboptimal Algorithms
Optimal
solutions require extensive system information and devote significant time to
analysis. A study of the complexity of this analysis versus the success of
solutions might reveal that a simple suboptimal algorithm can yield acceptable
results and will be easier to implement.
Deterministic
algorithms impose excessive costs on all modules within the policy hierarchy and
are not scaleable, but can achieve optimal results. Heuristic techniques are
invariably less expensive and often demonstrate acceptable but suboptimal
results.
Local versus Global Algorithms
When
a process is being considered for migration and a new destination is being
selected, there is a choice between allowing this decision to be made in
isolation by the current host or, to require some consideration of the status
of the intended destination. It may be better to collect information about load
elsewhere before deciding whether the current host is under greater pressure or
not.
One
globally continuous technique is to associate an 'income' with each process so
that the larger this value is relative to other processes at this site, the
greater the percentage of processor cycles received. This income is adjusted
based on the operational characteristics revealed by the process to the
operating system. For heavily loaded sites the percentage of processor cycles
received makes poor economic sense and processes migrate to where they can
obtain more cycles per current income.
Sender Initiated versus Receiver Initiated Algorithms
Maintaining
a complete database of idle or lightly loaded machines for migration can be a
challenging problem in a distributed system. Various alternatives to this have
been proposed.
With
sender initiated schemes the overloaded machine takes the initiative by random
polling or broadcasting and waiting for replies. With receiver initiated
schemes an idle machine offers itself for work to a group of machines and
accepts tasks from them.
In
some situations a machine can become momentarily idle even though on average it
is reasonably busy. As discussed earlier, care must be taken to coordinate
migration activity so that senders do not capitalise on this transient period
and send a flood of processes to this node. One approach is to assign each site
a local load value and an export load value. The local load value reflects the
true state of the machine while the export load value decreases more slowly to
dampen short-lived fluctuations.
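One way to realise the local/export pair is sketched below, with an assumed decay factor: the export value rises immediately with load but falls only gradually, damping short-lived idle periods:

```python
class LoadValues:
    """Keep a true local load and a slowly decreasing export load to
    dampen transient idle periods (the decay factor is illustrative)."""

    def __init__(self, decay: float = 0.5):
        self.local = 0.0    # true instantaneous load
        self.export = 0.0   # advertised load; falls slowly
        self.decay = decay

    def update(self, measured: float):
        self.local = measured
        if measured >= self.export:
            self.export = measured                     # rises immediately
        else:
            self.export += self.decay * (measured - self.export)  # falls gradually

lv = LoadValues()
for m in [8.0, 0.0, 0.0]:    # machine becomes momentarily idle
    lv.update(m)
print(lv.local, lv.export)   # local is 0.0 but export is still positive
```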
Process Migration
Although
it is difficult to implement, a mechanism for process migration broadens the
scope and flexibility of load sharing algorithms. Decisions on process to
processor assignment can be dynamically reconsidered, in response to changing
system conditions or user or process preferences. Tasks can be preempted and
moved elsewhere rather than being statically assigned to a host until
completion. Some of the motivating situations where such a mechanism could be
useful are given next.
Load Balancing
Empirical
evidence from some studies has shown that a small subset of the processes
running on a multiprocessor system often accounts for much of the load, and a
small amount of effort spent off-loading such processes may yield a big gain in
performance.
Load sharing algorithms avoid idle time
on individual machines when others have non-trivial work queues. Load
balancing algorithms attempt to keep the work queues similar in length
and reduce average response time, but are clearly more complex as they have
more global and deterministic requirements. Commonly, load level is defined as
CPU utilisation of hosts in a catchment area. A more elaborate index could be
developed based on the contention for a number of resources and which employed
a smoothing or dampening technique. An important metric of the load balancing
policy is the anticipated residency period required by the process, to avoid
useless and costly migrations. A competitive
algorithm allows a process to migrate only after it executes for a period
of time that equals or exceeds its predicted migration time. This idea is based
on the probabilistic assumption that a process will exhibit a 'past-repeat'
execution pattern. This approach has been shown to be simple to implement,
provides an adequate method of imposing the residency requirement and adapts
well to the current workload state of the participating processors.
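The competitive rule itself is a one-line test, sketched here with assumed parameter names:

```python
def may_migrate(cpu_time_used: float, predicted_migration_time: float) -> bool:
    """Competitive rule: a process is eligible for migration only after
    it has executed for at least its predicted migration time, on the
    'past-repeat' assumption that it will run at least as long again."""
    return cpu_time_used >= predicted_migration_time

print(may_migrate(1.2, 2.0))  # False -- too young to be worth moving
print(may_migrate(5.0, 2.0))  # True
```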
Communication Performance
Network
saturation can be caused by heavy communication traffic induced by data
transfers from one task to another, residing on separate hosts. The
interprocessor traffic is always the most costly and the least reliable factor
in a loosely coupled distributed system. As a result of the saturation effect,
the incremental increase in computing resources could actually degrade system
throughput.
This
situation can occur when processes perform data reduction on a volume of data
larger than the process's size, e.g. a large database. In these circumstances, it
may be advantageous to move the process
to the data, rather than using network bandwidth, especially between
geographically distant parties. The principle of spatial locality
indicates that strongly interacting entities should be mapped in close proximity.
In other words, the system must endeavour to match the logical locality with the
physical locality.
It
has been observed for many parallel and distributed computations that it
may be possible to determine communication loads at compile time and use this
information to guide migration at
runtime by including additional information in the process descriptor. To
reduce IPC costs, the compiler may analyse the structure and phases of
communication and suggest which processes should reside on the same machines.
One technique is to associate a migration identifier with each process and to
keep processes with equivalent migration identifiers on the one machine. A
second technique (capitalising on compiler intelligence) known as eager
migration can be used when it is known that a process A will soon
migrate to B's location. Portions of process A may be piggybacked, in advance,
on other messages bound for the same destination, to reduce migration time.
Fault Tolerance
Long
running processes may be considered as valuable elements in any system because
of the resource expenditure that has been outlaid already. Failure of mature
processes becomes important if time and resources are precious and cannot be
reassigned. Long running processes are particularly susceptible to transient
hardware faults and may be worth rescuing by rapid evacuation. For example, a policy could be adopted which
prioritised migrations from a machine about to shut down, based on the order of
their computational value to the system. Such policies would require a high
degree of concurrency within the migration mechanism to abandon a site quickly.
Application Concurrency
The
divide and conquer, or crowd approach to problem solving decomposes a problem
into a set of smaller problems, similar in nature, and solves them separately.
The partial solutions are combined to form a final solution to the original
problem. The smaller problems may be solved independently and concurrently in
many cases using a copy or subset of initial data, exemplified by a tree or
graph structure. Evaluating a game position, multiplying matrices or finding
shortest paths can be performed in this way. Applications exhibiting high
concurrency with little inter-task communication can benefit from these tasks
being distributed to execute with true concurrency, on separate hosts.
Processor pool architectures are particularly suited to these applications
where migration may occur before processes establish locally.
Another
situation related to concurrency is gang
scheduling. Consider the situation where a process communicates via a pipe
with another process and one of the processes has been chosen for migration. If
the processes reside on different sites, then their relative scheduling can impede
concurrent performance to the extent that they may benefit more from residing
on the same site.
To
demonstrate this, one process may be executing while the other is in a ready
queue. The executing process fills the communication buffers and must then wait
until they are emptied by the other process before being able to proceed. When
the other process is scheduled it may process the communication buffers and
then must wait until the other process is able to supply more data. This lock
step execution may impede performance of the application and makes bulk
transfers impossible. If both processes were scheduled in coordination using
gang scheduling on the separate hosts, both consumer and producer could operate
concurrently. If gang scheduling is not employed then perhaps no real
concurrency is achieved and the network traffic is incurred for nothing.
The
opposite is true for client/server interactions, where a client is the
triggering process and the server is the reactive process. There may be no
potential benefit to an application to be gained from allocating servers and
clients to separate processors apart from the likelihood that services may be
available on faster machines and are distributed for availability and performance.
Miscellaneous Reasons
Generally,
migration makes it difficult to maintain transparency. In some circumstances
though it can assist in this objective. For example, a resource may not be
accessible remotely, or there may be insufficient resources available locally.
Special purpose hardware such as array processors or graphical devices may be
far too inefficient to use if the process was remote. A machine may have
insufficient memory space for a new task. Without migration, these tasks would
have to be terminated or execute very inefficiently.
Some
systems allow both automated migration and application directed migration. It
is interesting to allow a process to control its own migration activity. In
addition, a process may be able to lock itself to a particular machine to
prevent automated migration, but this means it must be aware of machine names.
This may be to facilitate a user's security preference.
Other
systems consider ownership of user workstations to be important and users
prefer not to have foreign processes executing there when they are using their
workstation. The load average daemons can detect when a user returns
(interactive activity) and can evict processes using migration and send them
home or elsewhere.
Migration Design Interdependencies
The
migration mechanism provides an interface between the machine independent
policy algorithm and the machine dependent process model of a specific kernel.
For reasons of efficiency, flexibility and economy of implementation, the
mechanism operates in close association with other kernel modules to obtain
various statistical information and to manipulate local and remote kernel
structures. Other aspects of system design therefore, can complement
implementation strategies for process migration or make them unworkable.
Levels of Transparency
Transparency
considerations influence the design of all subsystems within a physically
distributed domain. Transparency is analogous to usability and enhances
psychological acceptability of the operating system's interface. Even in
centralised systems, I/O device characteristics are hidden from processes by
encapsulating them in generic I/O abstractions. For process migration, there
are a number of dimensions to transparency which combine to simplify implementation
of relocatable distributed applications.
Access Transparency
Interface
consistency is a general I/O design feature but often, to control the
functional peculiarities of some devices, specialised operations must be
provided. This may restrict access in certain situations to the locality
supporting special resources. For general categories of I/O devices, the
interface should be independent of location and must not create a distinction
between local and remote resource access.
Location Transparency
The
use of resource location identifiers, supplied as interface parameters must be
avoided. With client/server relationships and the dynamic nature of
process/host mapping, the migration of a process can cause a detachment at the
application layer. Recovery may ensue, where a client may try to relocate a
migrated server when communication fails, but this places an extra burden on
the design of distributed applications.
Control Transparency
Control
Transparency refers to automatic operating system control of process/host
configurations. Control is independent of the mechanism and resides in a policy
algorithm outside the kernel and is directed at specific system tuning
strategies. This activity must occur without affecting the execution
environment of any running processes.
Execution Transparency
Execution
transparency affects the correct and continuous operation of the migrant
specifically for the periods during and subsequent to a migration action.
Freezing the migrant for an extensive period magnifies the probability of
missing events, which subsequently can result in nondeterministic operation.
The kernel must implement buffering and modify communication protocols to
accommodate and tolerate delays which are generated by system activity. If
migration is automatic, erratic response times can develop from seemingly
'transparent' system activity, resulting in user dissatisfaction. To avoid
this, the mechanism must be designed efficiently for speed of transfer. In the
event of failure or process crash, the owner of the process must be able to
retrieve the exception information or debug the process whether it is local or
remote. It is not always possible or desirable to shield all details of
transparent execution from the owner.
Residual Dependencies
Processes
are dependent on their host for access to system calls, physical resources such
as memory and process interaction. Some kernel designs, such as unmodified
Unix, make it impossible to detach a process from the complex interdependencies
of its current environment. Even with kernels that do facilitate migration, the
relocation of a process may not completely liberate the former host from
maintaining some responsibility for it, therefore a residual dependency remains
for some kernel functions. Residual dependencies can be deliberate in an effort
to improve the performance of the migration mechanism. However, the more hosts a
process depends on for its execution, the lower its reliability.
Communication Dependency
When
a process changes location, existing interactions must be rerouted. This is the
most prevalent form of dependency. A number of methods for solving this have
been proposed.
[Figure: (a) an alias responds with the destination of the migrant to guide
retransmission; (b) an alias relays messages to the expected location of B;
(c) an agent broadcasts B's new location for a fixed time and then exits]
(a) A client driven scheme where the kernel installs a light
weight alias process at the original location which responds to client
interactions indicating that the process has migrated. The communication
facility, on behalf of the client, must relocate the process with a broadcast
and update caching information at the source kernel. Essentially, the
communication information is used as a hint rather than as a definite address.
The communication protocol uses the hint (or cached location information) by
default for efficiency, but is able to deal with invalid hints transparently by
operating a relocation protocol. Relocation protocols do not execute often as
migration is also infrequent.
(b) A message
forwarding scheme where a more intelligent agent process relays incoming
messages. Continual redirection can be eliminated if the agent sends update
information to source kernels as they reveal themselves.
(c) A server driven scheme where an agent repeatedly broadcasts
the migrant's location for a fixed duration before self termination. A
variation of this approach is to maintain information on all kernels which have
interacted with the migrant and send them update messages.
(d) Associate a home node with every process through which it can
always be located.
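Schemes (a) and (d) can be combined into a simple sketch: cached location hints are used by default, with the home node as the authoritative fallback when a hint proves stale. All names and structures below are assumptions for illustration:

```python
class Locator:
    """Sketch of location-transparent delivery combining a cached hint
    with a home-node fallback (in the spirit of schemes (a) and (d))."""

    def __init__(self, home_registry):
        self.home = home_registry   # authoritative pid -> host map (home node)
        self.cache = {}             # possibly stale location hints

    def send(self, pid, msg, hosts):
        host = self.cache.get(pid)
        if host is None or pid not in hosts[host]:
            host = self.home[pid]   # hint missing or invalid: relocation protocol
            self.cache[pid] = host  # refresh the hint for future sends
        hosts[host].setdefault(pid, []).append(msg)  # deliver the message

# Process 42 currently lives on beta; the first send populates the hint.
hosts = {"alpha": {}, "beta": {42: []}}
loc = Locator(home_registry={42: "beta"})
loc.send(42, "hello", hosts)
print(hosts["beta"][42])  # ['hello']
```

When the process later migrates, the stale hint fails the presence check and the home registry is consulted, so callers never see the move.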
Address Space Dependency
The
longer a process is frozen, awaiting establishment elsewhere, the greater the
chance of failure. The transferal of the migrant's address space is the
bottleneck of the operation. By leaving the address space (or part of it)
behind, the process can be reestablished more quickly. The address space may be
copied on reference from the original host as required by the process. This
involves integrating the migration mechanism more closely with virtual memory
management and the interprocess communication mechanism.
Execution Dependency
Some
systems based on modified Unix kernels employ the concept of a home node which
deals with all kernel specific operations. The purpose is to ensure complete
transparency for the process when it moves from the environment of one kernel
to another. A subset of environment dependent kernel calls are forwarded by the
current host to the home node for execution. The home node may also forward any
signals or messages to its dependant. Processes which make extensive use of
their home kernel will incur performance penalties.
Concurrency
Simultaneous
migrations introduce consistency difficulties among kernels, especially those
managing connection oriented links where both ends of the link are changing.
However, a mechanism with limited concurrency would be simpler to implement but
could limit the flexibility of policy algorithms.
Three
obvious levels of concurrency are conceivable:
One migration
per network guarantees that the peers of the migrant are stationary, so message
redirection is routine. Arbitration will be required to sequence migration
activity. This degree of concurrency is not adequate for rapid evacuation of a
failing host or to implement eviction properly. The arbitration mechanism can
become a bottleneck to scalability.
Rapid
evacuation becomes possible with unlimited migration but generates
the extreme of kernel inconsistency and can lead to thrashing where processes
are continually relocated due to rapidly changing load conditions at each site.
Without a well directed policy to control migration, mechanism stability can be
affected and flooding or abandoning of various sites is possible.
One migration
per machine can mitigate the problems of flooding and simplify kernel state but
does not cater for rapid evacuation.
[Figure: processes migrate concurrently from Alpha and Gamma to Beta;
thrashing results where processes must be moved on again]
Migration Protocol
The
migration protocol describes the sequence of events that occur once a process
is selected for relocation. The protocol decides when the migrant is frozen and
over what duration and at what point the migration is committed. It must
marshal, transfer and demarshal process state and manage any environmental
events that occur while the process is disabled. Some protocols may anticipate
failure and build recovery actions.
There
are three clear phases within a migration protocol, namely negotiation,
transfer and establishment.
Negotiation
The
bulk of negotiation procedures are carried out by the policy modules which
select a candidate for migration and select a new host. Further negotiations by
the mechanism are necessary to check that low-level resources are available
within the structures of the remote kernel. A remote process descriptor, memory
segments and communication resources must be allocated in preparation for
transfer. Again this is a reason why the mechanism is best implemented as a
thread operating in kernel space.
Transfer Phase
The
transfer phase generally takes the longest and is the focus of most of the
experimental techniques. Process state comprises three basic components which
may be shipped separately in arbitrary order, or packaged together. These
elements are, the process descriptor, the address space and messages (or
signals) which occur while suspended. The process must be suspended before its
descriptor is transferred and then reactivated when the minimal amount of state
has been established. Long freeze times complicate IPC delivery semantics.
Short freeze times require leaving residual dependencies and can cause future
unpredictable delays which may affect real time response requirements. An interesting
approach which allows short freeze times and avoids residual dependency is to
precopy the bulk of the address space to the new site before the process is
frozen. The remaining dirty pages can be transferred with the process
descriptor.
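The precopy idea can be sketched as an iterative loop; here the dirty-page tracking is simulated by a supplied list of per-round dirty sets:

```python
def precopy_migrate(pages, dirtied_per_round, max_rounds=3):
    """Sketch of the precopy technique: copy the address space while the
    process keeps running, re-sending pages it dirties each round; only
    the final dirty residue is transferred while the process is frozen.
    `dirtied_per_round` stands in for real dirty-page tracking."""
    pages = set(pages)
    to_send = set(pages)                 # round 0: ship everything
    for dirtied in dirtied_per_round[:max_rounds]:
        # ... transfer `to_send` to the new site (process still running) ...
        to_send = set(dirtied) & pages   # pages redirtied during that transfer
        if not to_send:
            break
    # process is frozen now; the residue travels with the process descriptor
    return sorted(to_send)

dirty_rounds = [{1, 2, 5}, {2}, {2}]            # dirty set shrinks each round
print(precopy_migrate(range(8), dirty_rounds))  # [2]
```

Only one page remains to be moved while the process is frozen, instead of all eight.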
Establishment Phase
The
final phase tidies up the transfer and restores resource connections before
returning the process to ready status. Once responsibility has transferred, the
mechanism reclaims any low-level resources and real memory released by the
migrant and flushes message buffers by retransmitting them or perhaps,
depending on IPC semantics, allowing them to timeout and be retransmitted by
the sender. The protocol may fail at any stage due to network or machine
failure and rescuing processes under all circumstances requires complex
recovery protocols. This can be offset by transferring responsibility for the
migrant as late as possible in the protocol.
Statistics
Migration
Policy is driven by the availability of information relating to resource
utilisation locally and globally. A wide range of policies must be supported to
suit the spectrum of behaviours exhibited by certain types of distributed
application. Many statistics can be gathered from kernel accounting structures
by periodic or event driven sampling. Simplistically, processes must be
categorised as being CPU intensive, memory hungry, having high disk capacity or
being involved in copious (possibly long distance) communication. From these
relative indicators it is possible to design a weighted function to compute an
overall load index to drive a particular policy. The relative power of various
resources may be included in the equation. Mean response times, turnaround
times and standard deviations for processes must be maintained to compare the
effect of dynamic reconfigurations and to tune the host index. The resources
required for the migration operation itself must be minimised and so an
estimate of the size of a process's dirty address space and current system
network throughput could be factors in choosing a migrant. Perhaps an ordered
list of desirable migrants could be maintained to simplify the quick selection
of processes to send off to sites which seek work.
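Such a weighted load index might be sketched as below; the choice of indicators and weights is purely illustrative:

```python
def load_index(cpu_util, mem_pressure, disk_busy, comm_rate,
               weights=(0.5, 0.2, 0.15, 0.15)):
    """Weighted load index over several resource indicators, each
    assumed normalised to [0, 1]; the weights are illustrative only."""
    factors = (cpu_util, mem_pressure, disk_busy, comm_rate)
    return sum(w * f for w, f in zip(weights, factors))

# A CPU-intensive, lightly communicating host:
print(round(load_index(0.9, 0.4, 0.1, 0.2), 3))  # 0.575
```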
In
short, considerable CPU and communication resources will be required by a
migration mechanism to provide adequate information for the implementation of
administrative policies.
Load Scheduling Algorithms
Graph Theoretic Algorithms
Graph
theoretic algorithms refer to systems of processes with known computation and
communication costs. The idea is to perform the assignment of processes to
processors to maximise resource utilisation to achieve either maximal
concurrency or minimal communication. This has been shown to be NP-complete so
heuristic approaches are used which, for example, could focus on tightly
coupled clusters which have little interaction with other clusters.
Example:
Tasks A..I are assigned to three CPUs. The edges of the graph which cut the
dotted lines represent network communication cost units.
[Figure: two alternative assignments, giving network traffic of 30 units and
28 units respectively]
The
precedence process model is suitable
for analysing the task scheduling of a user application where precedence
constraints among tasks are explicitly specified in the program. The primary
objective in assigning tasks to processors is to achieve maximal concurrency
for task execution within the program. The nodes contain information about task
length and the edges indicate the number of message units to be communicated to
each successor task.
The
communication system model can be
used to characterise the underlying architecture on which the tasks are to be
mapped, showing unit communication costs between two processors. Intraprocessor
communication is assumed to be negligible. For various assignments of tasks to
processors the communication cost is easily computed from these models.
Without
considering communication overhead, consider a simple greedy heuristic, used by
the List Scheduling (LS) strategy:- No processor remains idle if there are some
tasks available that it could process. For the precedence graph of the previous
page, the resulting schedule would be as follows:-
The
total completion time is 16 units. The critical path of the graph is the
longest execution path and this defines the lower bound on completion time. It
is useful to compare the performance of our heuristic approach against this and
we see it is optimal in this case.
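As an illustration, the LS rule can be sketched in a few lines. The DAG below is hypothetical (the precedence graph from these notes appears only in a figure); the task names, lengths and the `list_schedule` helper are invented for this sketch.

```python
from heapq import heappush, heappop

def list_schedule(lengths, preds, m):
    """List Scheduling: no processor stays idle while some task is
    ready, i.e. all of its predecessors have finished."""
    indeg = {t: len(preds[t]) for t in lengths}
    ready = [t for t in sorted(lengths) if indeg[t] == 0]
    busy = []                          # min-heap of (finish_time, task)
    free, now = m, 0
    start, finished = {}, set()
    while len(finished) < len(lengths):
        while ready and free > 0:      # greedy rule: start anything ready
            t = ready.pop(0)
            start[t] = now
            heappush(busy, (now + lengths[t], t))
            free -= 1
        now, t = heappop(busy)         # advance to the next completion
        free += 1
        finished.add(t)
        for s in lengths:              # release successors of t
            if t in preds[s]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
    return start, now                  # start times and makespan

# Hypothetical DAG: D needs A and B, E needs C; two processors.
lengths = {'A': 2, 'B': 3, 'C': 4, 'D': 3, 'E': 2}
preds = {'A': [], 'B': [], 'C': [], 'D': ['A', 'B'], 'E': ['C']}
start, makespan = list_schedule(lengths, preds, 2)
```

On this DAG the greedy rule keeps both processors busy, giving a makespan of 8 units.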
With
communication overhead, the Extended List Scheduling algorithm (ELS) first
allocates tasks to processors by applying LS. The communication costs are then
computed from the unit interprocessor costs and the number of message units
between tasks. The result for ELS would be as follows
The
total completion time is 22 units. The ELS strategy is not optimal because the
scheduling decision was made without anticipating the communication. Consider
the Earliest Task First strategy (ETF) where the earliest schedulable task is scheduled first. Tasks A,B,C are schedulable
immediately and so are scheduled onto processors 1,2,3. If communication costs
are taken into account, task E becomes schedulable earliest on processor 1 at
time 6, replacing the original choice of task D which will not be schedulable
until time 8 at the earliest. Task F becomes schedulable on processor 2 at time
7, so task D is assigned to processor 3 at time 8. The earliest time task G can
be scheduled is on processor 3 at time 14. The total completion time is 18
units.
The earliest schedulable time of a task t on a processor p is derived as:-

EST(t, p) = max( avail(p), max over each predecessor q of t of FT(q) + c(q, t) )

where avail(p) is the time at which processor p becomes free, FT(q) is the
finish time of predecessor q, and the message delay c(q, t) is counted only if
q was scheduled on a processor other than p.
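As a sketch, the ETF selection rule can be simulated directly: at each step, compute the earliest start time of every ready task on every processor, charging a message delay only when a predecessor ran elsewhere, and commit the cheapest (task, processor) pair. The task set and delays below are hypothetical, not those of the figures.

```python
def etf(lengths, preds, comm, m):
    """Earliest Task First sketch: repeatedly pick the (ready task,
    processor) pair with the smallest earliest start time, charging a
    message delay only when a predecessor ran on another processor."""
    proc_free = [0] * m                # when each processor becomes free
    start, finish, placed, done = {}, {}, {}, set()
    while len(done) < len(lengths):
        best = None
        for t in lengths:
            if t in done or any(q not in done for q in preds[t]):
                continue               # not ready yet
            for p in range(m):
                est = proc_free[p]
                for q in preds[t]:
                    delay = 0 if placed[q] == p else comm[(q, t)]
                    est = max(est, finish[q] + delay)
                if best is None or est < best[0]:
                    best = (est, t, p)
        est, t, p = best
        start[t], placed[t] = est, p
        finish[t] = est + lengths[t]
        proc_free[p] = finish[t]
        done.add(t)
    return start, placed, max(finish.values())

# Hypothetical tasks: C needs messages from both A and B.
lengths = {'A': 2, 'B': 2, 'C': 3}
preds = {'A': [], 'B': [], 'C': ['A', 'B']}
comm = {('A', 'C'): 1, ('B', 'C'): 4}
start, placed, makespan = etf(lengths, preds, comm, 2)
```

Here C is placed beside B, whose message is the more expensive one to ship, so C starts at time 3 rather than 6.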
The
communication process model is
suitable for analysing processes without precedence constraints but that are
related by a need for communication. The processes do not have explicit
completion times but have a cost associated with communication between them.
The edges of the communication process graph (G) indicate the relative cost of
communication units. There is also an execution cost which is a function of the
processor to which the process is assigned.
The
scheduling of various system processes/services which have been created independently
fits well to this model. The primary objective is to minimise the overall cost
of execution and communication.
More formally, a communication cost c_ij(p_i, p_j) between two processes i and
j executing on separate processors p_i and p_j is proportional to the weight of
the edge connecting i and j. The cost is negligible if p_i = p_j, i.e. if the
two processes reside on the same processor.
The execution cost e_j(p_i) is the cost of executing process j on processor
p_i. The objective is to find an optimal assignment of processes to P
processors with respect to the following function:-

Cost = sum over all processes j of e_j(p(j))
     + sum over all pairs of processes (i, j) of c_ij(p(i), p(j))

where p(j) denotes the processor to which process j is assigned; the second
term contributes only for pairs assigned to different processors.
This
general problem is NP-complete except for a few restricted cases where
polynomial time solutions exist. It is complex because the objectives of
minimising computation and communication costs often conflict, and there may be
other allocation constraints.
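For small instances the optimal assignment can simply be found by exhaustive search over all process-to-processor mappings. The sketch below uses a hypothetical 3-process, 2-processor instance; the names `exec_cost`, `comm` and `total_cost` are invented for the sketch.

```python
from itertools import product

# Hypothetical 3-process, 2-processor instance.
exec_cost = {1: {'A': 5, 'B': 10},
             2: {'A': 2, 'B': 4},
             3: {'A': 4, 'B': 4}}
comm = {(1, 2): 6, (2, 3): 2}     # charged only across processors

def total_cost(assign):
    e = sum(exec_cost[p][assign[p]] for p in exec_cost)
    c = sum(w for (i, j), w in comm.items() if assign[i] != assign[j])
    return e + c

# Enumerate all 2^3 mappings and keep the cheapest.
best = min((dict(zip(exec_cost, combo))
            for combo in product(['A', 'B'], repeat=len(exec_cost))),
           key=total_cost)
```

Here the heavy (1,2) edge pulls everything onto processor A despite process 1's higher execution cost there, illustrating the conflict between the two objectives.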
Example:
Consider
a system of 6 processes (1..6) to be allocated to 2 processors (A and B). A and
B are different processors and so the execution of a process on either has a
different cost. The graph indicates the cost of communication between two
processes if located on different processors.
We
can partition the graph with a line that cuts some edges, giving us two
disjoint subgraphs, one for each processor. The edges intersected by the cut
are called the cut set, and the sum of the weights of these edges represents the
total interprocess communication cost. Minimising the communication cost alone
is trivial as we could map all processes to one processor, but obviously we may
have to satisfy other constraints such as maximum processes allowed per
processor or that some processes must be mapped to certain processors, or that
the loads on each processor should be equal.
In
our 6 process, 2 processor example, let's say that the execution costs of each
process on each processor are given by the following table:-
Process | Cost on A  | Cost on B
1       | 5          | 10
2       | 2          | infinity
3       | 4          | 4
4       | 6          | 3
5       | 5          | 2
6       | infinity   | 4
Note that process 2 is restricted to run on processor A (its cost on B is
infinite) and process 6 must run on processor B.
The
minimum cost cut is given below:-
In the diagram, two new nodes representing processors A and B are included, and
dashed edges are drawn from each of these nodes to all of the processes. The
dashed edges
from processor A are weighted with the amount of execution cost that the
corresponding process would incur on processor B. This is because if the cut
intersects one of these edges then that process has been assigned to processor
B and the cost of executing it on B would apply. Similarly, the dashed edges
from processor B are weighted with the amount of execution cost that the
corresponding process would incur on A.
It
is difficult to generalise the problem beyond two processors. One heuristic
approach is to divide the problem of optimising both communication and
execution costs into two independent phases. We can merge processes with higher
interprocess communication into clusters, to be scheduled onto a single
processor to minimise communication costs. The clusters would then be scheduled
onto the processor which offered the minimal execution time for the cluster.
We
could invent a threshold for communication between processes, which if crossed
causes them to be clustered. We could also invent a threshold cost of execution
for the entire cluster (to prevent burdening of a single processor) which if
exceeded forces the cluster to be divided among a number of processors or
assigned away from the minimal cost processor.
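A minimal sketch of this two-phase heuristic, using the execution costs of the table above but invented communication weights (the actual edge weights appear only in a figure); the function `cluster_and_place` is an invented name.

```python
def cluster_and_place(exec_cost, comm, threshold):
    """Two-phase heuristic: union-find merges processes whose mutual
    communication exceeds the threshold, then each cluster goes to the
    processor with the lowest total execution cost for its members."""
    parent = {p: p for p in exec_cost}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), w in comm.items():
        if w > threshold:              # heavy traffic: keep them together
            parent[find(i)] = find(j)
    clusters = {}
    for p in exec_cost:
        clusters.setdefault(find(p), []).append(p)
    placement = {}
    procs = list(next(iter(exec_cost.values())))
    for members in clusters.values():
        best = min(procs, key=lambda h: sum(exec_cost[m][h] for m in members))
        placement.update({m: best for m in members})
    return placement

inf = float('inf')
# Execution costs from the table; communication weights are invented.
exec_cost = {1: {'A': 5, 'B': 10}, 2: {'A': 2, 'B': inf},
             3: {'A': 4, 'B': 4},  4: {'A': 6, 'B': 3},
             5: {'A': 5, 'B': 2},  6: {'A': inf, 'B': 4}}
comm = {(2, 4): 12, (1, 6): 11, (3, 5): 10, (1, 2): 4, (4, 5): 3}
placement = cluster_and_place(exec_cost, comm, threshold=9)
```

With a threshold of 9 the three heavy edges form the clusters (2,4), (1,6) and (3,5), and each cluster lands on its cheapest feasible processor.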
In the example, if we use a communication threshold of 9 we can identify three
clusters: (2,4), (1,6) and (3,5). (2,4) must be mapped to A because process 2
cannot run on B; (1,6) must be mapped to B because process 6 cannot run on A.
(3,5) could be mapped to either A or B: assigning it to B incurs a higher
communication cost but a lower execution cost.
Assigning
to A results in an execution cost of 17 on A and 14 on B and communication cost
of 10. The total cost therefore is 41, slightly higher than the optimal.
Dynamic Algorithms
The
precedence process model and communication process model both assume some prior
knowledge about process behaviour. This allows us to make centralised and
static scheduling decisions which act to maximise application concurrency or
minimise execution and communication costs.
The
disjoint process model is used when
this prior knowledge about process interaction is not available. Processes are
scheduled as if they were independent of one another with perhaps the
objectives of maximising resource utilisation and system throughput and being
fair to user processes.
The
scheduling algorithms can be dynamic using migration to relocate processes as
required. Various approaches to this have already been discussed.
A sender-initiated, nonpreemptive, distributed heuristic algorithm is given
below. When a new task arrives, the sender sends out a limited number of probes
searching for an idle host. If one is found then the task is transferred to it.
If no site is idle, but sites with shorter queues exist, then the task is sent
to the one with the shortest queue. Otherwise the task is executed locally.
SQ: Sender's queue length
RQ: Queue length of polled receiver
ST: Threshold for migration
PL: Maximum number of probes by the sender
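A sketch of this probe-based sender policy in terms of the parameters above, assuming a `hosts` map that stands in for the replies to probe messages sent to randomly chosen sites:

```python
import random

def sender_initiated_transfer(SQ, hosts, ST=2, PL=3):
    """Probe-based sender policy: if the local queue SQ exceeds the
    threshold ST, poll up to PL random hosts; transfer to an idle one,
    else to the shortest queue found that beats our own, else run
    locally. `hosts` maps host id -> queue length (RQ), standing in
    for the reply to a probe message."""
    if SQ <= ST:
        return 'local'                 # not loaded enough to bother
    best_host, best_rq = None, SQ
    for h in random.sample(list(hosts), min(PL, len(hosts))):
        RQ = hosts[h]                  # poll the receiver's queue length
        if RQ == 0:
            return h                   # idle host found: transfer now
        if RQ < best_rq:
            best_host, best_rq = h, RQ
    return best_host if best_host is not None else 'local'
```

A real implementation would exchange probe and reply messages rather than read a shared map, and would guard against the probed queue lengths being stale by the time the task arrives.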
Hierarchical Algorithms
A hierarchical processor
structure places some processors in control of groups of processors beneath
them in the hierarchy. Hierarchical structures scale well by simply increasing
the number of levels in the hierarchy. Each processor keeps an estimate of the
number of subordinate resources available to it. Tasks can be created at any
level in the hierarchy. If the processor receiving a request decides that the
resources available to it are insufficient, it passes the request up to its
parent. This continues until a level with sufficient processing resources is
reached.
At that point the controlling processor divides the task and assigns its parts
to processors immediately below it in the hierarchy. These in turn pass out the
work among their subordinates until the wave of allocation hits the bottom of
the hierarchy. The number of actual processing resources assigned is passed up
the hierarchy to allow each processor to adjust its estimate of available
resources. Of course, this is greatly complicated if a number of allocation
requests are in progress at the same time, as workload estimates may be out of
date. Another difficulty is developing a means of replacing a failed processor,
perhaps by electing a subordinate to the role, or instead by using a group of
processors at each level to make scheduling decisions.
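The escalate-then-subdivide flow can be sketched with a small tree of managers. For simplicity this sketch recomputes the resource estimate on demand rather than propagating adjustments up the hierarchy, and the `Manager` class and its methods are invented for the sketch.

```python
class Manager:
    """Hierarchical allocation sketch: a manager that cannot satisfy a
    request from its own estimate escalates it to its parent; once a
    level with enough resources is found, the request is subdivided
    among the children."""
    def __init__(self, workers=0, children=()):
        self.workers = workers         # directly managed workers (leaves)
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

    def estimate(self):                # recomputed on demand for simplicity
        return self.workers + sum(c.estimate() for c in self.children)

    def request(self, n):
        if n > self.estimate():        # insufficient: escalate upwards
            return self.parent.request(n) if self.parent else 0
        return self.allocate(n)

    def allocate(self, n):             # subdivide among subordinates
        got = min(self.workers, n)
        self.workers -= got
        for c in self.children:
            if got >= n:
                break
            got += c.allocate(n - got)
        return got

leaf1, leaf2 = Manager(workers=3), Manager(workers=4)
root = Manager(children=[leaf1, leaf2])
granted = leaf1.request(5)             # too big for leaf1: escalates to root
```

In a real system the estimates would be cached and adjusted as allocation counts flow back up, which is exactly where the concurrency and staleness problems described above arise.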